public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
@ 2026-03-07 18:24 Shakeel Butt
  2026-03-09 21:33 ` Roman Gushchin
                   ` (7 more replies)
  0 siblings, 8 replies; 16+ messages in thread
From: Shakeel Butt @ 2026-03-07 18:24 UTC (permalink / raw)
  To: lsf-pc
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel

Over the last couple of weeks, I have been brainstorming on how I would go
about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
focus on existing challenges and issues. This proposal outlines the high-level
direction. Followup emails and patch series will cover and brainstorm the
mechanisms (of course BPF) to achieve these goals.

Memory cgroups provide memory accounting and the ability to control memory usage
of workloads through two categories of limits. Throttling limits (memory.max and
memory.high) cap memory consumption. Protection limits (memory.min and
memory.low) shield a workload's memory from reclaim under external memory
pressure.

Challenges
----------

- Workload owners rarely know their actual memory requirements, leading to
  overprovisioned limits, lower utilization, and higher infrastructure costs.

- Throttling limit enforcement is synchronous in the allocating task's context,
  which can stall latency-sensitive threads.

- The stalled thread may hold shared locks, causing priority inversion -- all
  waiters are blocked regardless of their priority.

- Enforcement is indiscriminate -- there is no way to distinguish a
  performance-critical or latency-critical allocator from a latency-tolerant
  one.

- Protection limits assume a static working set size, forcing owners to either
  overprovision or build complex userspace infrastructure to dynamically adjust
  them.

Feature Wishlist
----------------

Here is the list of features and capabilities I want to enable in the
redesigned memcg limit enforcement world.

Per-Memcg Background Reclaim

In the new memcg world, with the goal of (mostly) eliminating direct synchronous
reclaim for limit enforcement, provide per-memcg background reclaimers which can
scale across CPUs with the allocation rate.

Lock-Aware Throttling

The ability to avoid throttling an allocating task that is holding locks, to
prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
in memcg reclaim, blocking all waiters regardless of their priority or
criticality.
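
To illustrate the idea (the task fields and helper below are hypothetical,
not current kernel structures), the enforcement path could consult a
held-lock count and defer the stall instead of blocking in place:

```c
/* Illustrative sketch of lock-aware throttling: if the allocating task
 * holds locks, defer throttling (e.g. to return-to-userspace) rather
 * than stalling it while waiters queue behind it. The struct and the
 * lock counter are invented for illustration. */
#include <assert.h>

struct task_info {
	int locks_held;        /* e.g. a lockdep-style held-lock count */
	int throttle_deferred; /* throttle later, at a safe point */
};

static inline int should_throttle_now(struct task_info *t, int over_high)
{
	if (!over_high)
		return 0;
	if (t->locks_held > 0) {
		/* Priority-inversion risk: defer instead of stalling here. */
		t->throttle_deferred = 1;
		return 0;
	}
	return 1;
}
```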

Thread-Level Throttling Control

Workloads should be able to indicate at the thread level which threads can be
synchronously throttled and which cannot. For example, while experimenting with
sched_ext, we drastically improved the performance of AI training workloads by
prioritizing threads interacting with the GPU. Similarly, applications can
identify the threads or thread pools on their performance-critical paths and
the memcg enforcement mechanism should not throttle them.
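
A hypothetical per-thread opt-out could be as small as a flag that the
enforcement path checks before throttling (the flag, struct, and helpers
here are invented purely to illustrate the shape of the interface):

```c
/* Illustrative per-thread throttling control: a performance-critical
 * thread (say, one feeding a GPU) marks itself non-throttleable and the
 * memcg enforcement path skips it. Names are hypothetical. */
#include <assert.h>

#define MEMCG_THREAD_NO_THROTTLE 0x1u

struct thread_ctx {
	unsigned int memcg_flags;
};

static inline void mark_no_throttle(struct thread_ctx *tc)
{
	tc->memcg_flags |= MEMCG_THREAD_NO_THROTTLE;
}

static inline int thread_throttleable(const struct thread_ctx *tc)
{
	return !(tc->memcg_flags & MEMCG_THREAD_NO_THROTTLE);
}
```

In practice this would likely be exposed through a prctl-like call or a BPF
hook rather than a raw flag, but the enforcement-side check stays this small.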

Combined Memory and Swap Limits

Some users (Google actually) need the ability to enforce limits based on
combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
ceiling on total memory commitment rather than treating memory and swap
independently.
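
The check itself is trivial; the point is that it is one ceiling over the
sum rather than two independent limits (sketch only, names illustrative):

```c
/* Illustrative memsw-style charge check: a charge succeeds only if the
 * combined memory + swap usage stays under a single ceiling, in the
 * spirit of cgroup v1's memory.memsw.limit_in_bytes. All quantities
 * are in pages; names are made up for the sketch. */
#include <assert.h>

static inline int memsw_try_charge(unsigned long mem_usage,
				   unsigned long swap_usage,
				   unsigned long nr_pages,
				   unsigned long memsw_limit)
{
	/* Bound total memory commitment, not memory and swap separately. */
	return mem_usage + swap_usage + nr_pages <= memsw_limit;
}
```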

Dynamic Protection Limits

Rather than static protection limits, the kernel should support defining
protection based on the actual working set of the workload, leveraging signals
such as working set estimation, PSI, refault rates, or a combination thereof to
automatically adapt to the workload's current memory needs.
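
One illustrative heuristic (the threshold and the 10% step are
placeholders, not a proposal for specific numbers):

```c
/* Illustrative dynamic protection: derive an effective memory.low from
 * a working-set estimate, and grow it when the refault rate signals
 * that the current protection is too tight. Inputs and constants are
 * hypothetical. */
#include <assert.h>

static inline unsigned long dynamic_low(unsigned long wss_estimate,
					unsigned long refault_rate,
					unsigned long refault_thresh)
{
	unsigned long low = wss_estimate;

	/* Refaults above the threshold mean we protected too little. */
	if (refault_rate > refault_thresh)
		low += low / 10;	/* grow protection by 10% */
	return low;
}
```

A fuller version would also shrink protection when refaults stay low, and
fold in PSI so the adjustment reacts to actual stalls.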

Shared Memory Semantics

With more flexibility in limit enforcement, the kernel should be able to
account for memory shared between workloads (cgroups) during enforcement.
Today, enforcement only looks at each workload's memory usage independently.
Sensible shared memory semantics would allow the enforcer to consider
cross-cgroup sharing when making reclaim and throttling decisions.
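
As one illustrative semantic, and not a claim about the right design, a
shared region's charge could be split proportionally across its sharers
instead of being charged in full to the first toucher:

```c
/* Illustrative shared-memory accounting: split the charge of a shared
 * region across its sharers so enforcement sees each cgroup's
 * proportional footprint. Purely a sketch of one possible semantic. */
#include <assert.h>

static inline unsigned long shared_charge(unsigned long region_pages,
					  unsigned int nr_sharers)
{
	if (nr_sharers == 0)
		return 0;
	/* Each sharer is charged its proportional slice, rounded up. */
	return (region_pages + nr_sharers - 1) / nr_sharers;
}
```

The hard part, of course, is recharging when sharers come and go; the
arithmetic above is the easy half.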

Memory Tiering

With a flexible limit enforcement mechanism, the kernel can balance memory
usage of different workloads across memory tiers based on their performance
requirements. Tier accounting and hotness tracking are orthogonal, but the
decisions of when and how to balance memory between tiers should be handled by
the enforcer.
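
A toy version of such a balancing decision, assuming the budget and the
per-page hotness bit come from elsewhere (tier accounting and hotness
tracking being orthogonal, as noted above):

```c
/* Illustrative tier-balancing decision: demote cold pages when a
 * workload exceeds its fast-tier budget, promote hot pages when there
 * is room. The enum and inputs are hypothetical. */
#include <assert.h>

enum tier_action { TIER_KEEP, TIER_DEMOTE, TIER_PROMOTE };

static inline enum tier_action tier_decide(unsigned long fast_usage,
					   unsigned long fast_budget,
					   int page_hot)
{
	/* Over budget: evict cold pages from the fast tier. */
	if (fast_usage > fast_budget && !page_hot)
		return TIER_DEMOTE;
	/* Under budget: pull hot pages (from the slow tier) up. */
	if (fast_usage <= fast_budget && page_hot)
		return TIER_PROMOTE;
	return TIER_KEEP;
}
```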

Collaborative Load Shedding

Many workloads communicate with an external entity for load balancing and rely
on their own usage metrics like RSS or memory pressure to signal whether they
can accept more or less work. This is guesswork. Instead of the
workload guessing, the limit enforcer -- which is actually managing the
workload's memory usage -- should be able to communicate available headroom or
request the workload to shed load or reduce memory usage. This collaborative
load shedding mechanism would allow workloads to make informed decisions rather
than reacting to coarse signals.
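
To sketch what such a notification could carry (the event types and the
struct are invented for illustration, not a proposed ABI):

```c
/* Illustrative enforcer-to-workload message: instead of the workload
 * guessing from RSS, the enforcer reports headroom or asks for load
 * shedding. Layout and event names are hypothetical. */
#include <assert.h>

enum memcg_event { MEMCG_HEADROOM, MEMCG_SHED_LOAD };

struct memcg_notification {
	enum memcg_event event;
	unsigned long bytes;	/* headroom left, or amount to shed */
};

static inline struct memcg_notification
make_notification(unsigned long usage, unsigned long limit)
{
	struct memcg_notification n;

	if (usage < limit) {
		n.event = MEMCG_HEADROOM;
		n.bytes = limit - usage;
	} else {
		n.event = MEMCG_SHED_LOAD;
		n.bytes = usage - limit;
	}
	return n;
}
```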

Cross-Subsystem Collaboration

Finally, the limit enforcement mechanism should collaborate with the CPU
scheduler and other subsystems that can release memory. For example, dirty
memory is not reclaimable and the memory subsystem wakes up flushers to trigger
writeback. However, flushers need CPU to run -- asking the CPU scheduler to
prioritize them ensures the kernel does not lack reclaimable memory under
stressful conditions. Similarly, some subsystems free memory through workqueues
or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
definitely benefit from having visibility into these situations.

Putting It All Together
-----------------------

To illustrate the end goal, here is an example of the scenario I want to
enable. Suppose there is an AI agent controlling the resources of a host. I
should be able to provide the following policy and everything should work out
of the box:

Policy: "keep system-level memory utilization below 95 percent;
avoid priority inversions by not throttling allocators holding locks; trim each
workload's usage to its working set without regressing its relevant performance
metrics; collaborate with workloads on load shedding and memory trimming
decisions; and under extreme memory pressure, collaborate with the OOM killer
and the central job scheduler to kill and clean up a workload."

Initially I added this example for fun, but from [1] it seems like there is a
real need to enable such capabilities.

[1] https://arxiv.org/abs/2602.09345


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
@ 2026-03-09 21:33 ` Roman Gushchin
  2026-03-09 23:09   ` Shakeel Butt
  2026-03-11  4:57 ` Jiayuan Chen
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2026-03-09 21:33 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Hui Zhu, JP Kobryn,
	Muchun Song, Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis,
	David Rientjes, Martin KaFai Lau, Meta kernel team, linux-mm,
	cgroups, bpf, linux-kernel

Shakeel Butt <shakeel.butt@linux.dev> writes:

> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
>
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
>
> Challenges
> ----------
>
> - Workload owners rarely know their actual memory requirements, leading to
>   overprovisioned limits, lower utilization, and higher infrastructure costs.
>
> - Throttling limit enforcement is synchronous in the allocating task's context,
>   which can stall latency-sensitive threads.
>
> - The stalled thread may hold shared locks, causing priority inversion -- all
>   waiters are blocked regardless of their priority.
>
> - Enforcement is indiscriminate -- there is no way to distinguish a
>   performance-critical or latency-critical allocator from a latency-tolerant
>   one.
>
> - Protection limits assume a static working set size, forcing owners to either
>   overprovision or build complex userspace infrastructure to dynamically adjust
>   them.
>
> Feature Wishlist
> ----------------
>
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
>
> Per-Memcg Background Reclaim
>
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.
>
> Lock-Aware Throttling
>
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
>
> Thread-Level Throttling Control
>
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.
>
> Combined Memory and Swap Limits
>
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
>
> Dynamic Protection Limits
>
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.
>
> Shared Memory Semantics
>
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
>
> Memory Tiering
>
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
>
> Collaborative Load Shedding
>
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
>
> Cross-Subsystem Collaboration
>
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely benefit from having visibility into these situations.
>
> Putting It All Together
> -----------------------
>
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
>
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."

This is fantastic, thanks Shakeel!

I'd be very interested to discuss it! I suggested a somewhat
similar/related topic to the bpf track (bpf use cases in mm), we might
think of joining them.

Thanks!


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-09 21:33 ` Roman Gushchin
@ 2026-03-09 23:09   ` Shakeel Butt
  0 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2026-03-09 23:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Hui Zhu, JP Kobryn,
	Muchun Song, Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis,
	David Rientjes, Martin KaFai Lau, Meta kernel team, linux-mm,
	cgroups, bpf, linux-kernel, Vlastimil Babka, willy

On Mon, Mar 09, 2026 at 02:33:22PM -0700, Roman Gushchin wrote:
> Shakeel Butt <shakeel.butt@linux.dev> writes:
> 
[...]
> 
> This is fantastic, thanks Shakeel!
> 
> I'd be very interested to discuss it! I suggested a somewhat
> similar/related topic to the bpf track (bpf use cases in mm), we might
> think of joining them.
> 

Awesome, we can request to have a joint session between MM and BPF to discuss
these topics. I know Emil and Tal are also interested in upstreaming
cached_ext. So, we have multiple topics in the intersection of MM and BPF.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
  2026-03-09 21:33 ` Roman Gushchin
@ 2026-03-11  4:57 ` Jiayuan Chen
  2026-03-11 17:00   ` Shakeel Butt
  2026-03-11  7:19 ` Muchun Song
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Jiayuan Chen @ 2026-03-11  4:57 UTC (permalink / raw)
  To: Shakeel Butt, lsf-pc
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel


On 3/8/26 2:24 AM, Shakeel Butt wrote:
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
>
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
>
> Challenges
> ----------
>
> - Workload owners rarely know their actual memory requirements, leading to
>    overprovisioned limits, lower utilization, and higher infrastructure costs.
>
> - Throttling limit enforcement is synchronous in the allocating task's context,
>    which can stall latency-sensitive threads.
>
> - The stalled thread may hold shared locks, causing priority inversion -- all
>    waiters are blocked regardless of their priority.
>
> - Enforcement is indiscriminate -- there is no way to distinguish a
>    performance-critical or latency-critical allocator from a latency-tolerant
>    one.
>
> - Protection limits assume a static working set size, forcing owners to either
>    overprovision or build complex userspace infrastructure to dynamically adjust
>    them.
>
> Feature Wishlist
> ----------------
>
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
>
> Per-Memcg Background Reclaim
>
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.

This sounds like a very useful approach. I have a few questions I'm 
thinking through:

How would you approach implementing this background reclaim? I'm imagining
something like asynchronous memory.reclaim operations - is that in line
with your thinking?

And regarding cold page identification - do you have a preferred approach?
I'm curious what the most practical way would be to accurately identify
which pages to reclaim.

Would be great to hear your perspective.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
  2026-03-09 21:33 ` Roman Gushchin
  2026-03-11  4:57 ` Jiayuan Chen
@ 2026-03-11  7:19 ` Muchun Song
  2026-03-11 20:39   ` Shakeel Butt
  2026-03-11  7:29 ` Greg Thelen
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Muchun Song @ 2026-03-11  7:19 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis,
	David Rientjes, Martin KaFai Lau, Meta kernel team, linux-mm,
	cgroups, bpf, linux-kernel



> On Mar 8, 2026, at 02:24, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
> 
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
> 
> Challenges
> ----------
> 
> - Workload owners rarely know their actual memory requirements, leading to
>  overprovisioned limits, lower utilization, and higher infrastructure costs.
> 
> - Throttling limit enforcement is synchronous in the allocating task's context,
>  which can stall latency-sensitive threads.
> 
> - The stalled thread may hold shared locks, causing priority inversion -- all
>  waiters are blocked regardless of their priority.
> 
> - Enforcement is indiscriminate -- there is no way to distinguish a
>  performance-critical or latency-critical allocator from a latency-tolerant
>  one.
> 
> - Protection limits assume a static working set size, forcing owners to either
>  overprovision or build complex userspace infrastructure to dynamically adjust
>  them.
> 
> Feature Wishlist
> ----------------
> 
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
> 
> Per-Memcg Background Reclaim
> 
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.

Hi Shakeel,

I'm quite interested in this. Internally, we maintain a private set of
patches implementing asynchronous reclaim, but we are trying to shed as
much of that private code as possible. Therefore, we want to implement a
similar asynchronous reclaim mechanism in user space through the
memory.reclaim interface. However, there is currently no suitable policy
notification mechanism to trigger user threads to reclaim proactively in
advance.
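
For what it's worth, a minimal userspace sketch of such a proactive
reclaimer could look like the following. The memory.current and
memory.reclaim files are the real cgroup v2 interfaces; the helper name,
the target-based policy, and the absence of error handling are all
illustrative only:

```c
/* Userspace proactive-reclaim sketch built on cgroup v2's
 * memory.reclaim interface: if memory.current exceeds a target, ask
 * the kernel to reclaim the excess. The cgroup path is a parameter;
 * error handling is minimal on purpose. */
#include <assert.h>
#include <stdio.h>

/* Returns bytes requested for reclaim, or 0 if under target / on error. */
static unsigned long maybe_reclaim(const char *cgroup_dir,
				   unsigned long target_bytes)
{
	char path[512];
	unsigned long usage = 0, excess;
	FILE *f;

	snprintf(path, sizeof(path), "%s/memory.current", cgroup_dir);
	f = fopen(path, "r");
	if (!f || fscanf(f, "%lu", &usage) != 1) {
		if (f)
			fclose(f);
		return 0;
	}
	fclose(f);

	if (usage <= target_bytes)
		return 0;
	excess = usage - target_bytes;

	snprintf(path, sizeof(path), "%s/memory.reclaim", cgroup_dir);
	f = fopen(path, "w");
	if (!f)
		return 0;
	fprintf(f, "%lu", excess);	/* kernel reclaims up to this much */
	fclose(f);
	return excess;
}
```

A real reclaimer would rate-limit itself and consult PSI before asking for
more, which is exactly where the missing policy notification hurts today.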

> 
> Lock-Aware Throttling
> 
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.

This is a real problem we encountered, especially with the jbd handle
resources of the ext4 file system. Our current attempt is to defer
memory reclamation until returning to user space, in order to solve the
various priority inversion issues caused by held jbd handles. Therefore,
I would be interested in discussing this topic.

Thanks,
Muchun

> 
> Thread-Level Throttling Control
> 
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.
> 
> Combined Memory and Swap Limits
> 
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
> 
> Dynamic Protection Limits
> 
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.
> 
> Shared Memory Semantics
> 
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
> 
> Memory Tiering
> 
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
> 
> Collaborative Load Shedding
> 
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
> 
> Cross-Subsystem Collaboration
> 
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely benefit from having visibility into these situations.
> 
> Putting It All Together
> -----------------------
> 
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
> 
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."
> 
> Initially I added this example for fun, but from [1] it seems like there is a
> real need to enable such capabilities.
> 
> [1] https://arxiv.org/abs/2602.09345



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
                   ` (2 preceding siblings ...)
  2026-03-11  7:19 ` Muchun Song
@ 2026-03-11  7:29 ` Greg Thelen
  2026-03-11 21:35   ` Shakeel Butt
  2026-03-11 13:20 ` Johannes Weiner
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Greg Thelen @ 2026-03-11  7:29 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel

On Sat, Mar 7, 2026 at 10:24 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
>
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
>
> Challenges
> ----------
>
> - Workload owners rarely know their actual memory requirements, leading to
>   overprovisioned limits, lower utilization, and higher infrastructure costs.
>
> - Throttling limit enforcement is synchronous in the allocating task's context,
>   which can stall latency-sensitive threads.
>
> - The stalled thread may hold shared locks, causing priority inversion -- all
>   waiters are blocked regardless of their priority.
>
> - Enforcement is indiscriminate -- there is no way to distinguish a
>   performance-critical or latency-critical allocator from a latency-tolerant
>   one.
>
> - Protection limits assume a static working set size, forcing owners to either
>   overprovision or build complex userspace infrastructure to dynamically adjust
>   them.
>
> Feature Wishlist
> ----------------
>
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
>
> Per-Memcg Background Reclaim
>
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.
>
> Lock-Aware Throttling
>
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
>
> Thread-Level Throttling Control
>
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.
>
> Combined Memory and Swap Limits
>
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
>
> Dynamic Protection Limits
>
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.
>
> Shared Memory Semantics
>
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
>
> Memory Tiering
>
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
>
> Collaborative Load Shedding
>
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
>
> Cross-Subsystem Collaboration
>
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely take advantage by having visibility into these situations.
>
> Putting It All Together
> -----------------------
>
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
>
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."
>
> Initially I added this example for fun, but from [1] it seems like there is a
> real need to enable such capabilities.
>
> [1] https://arxiv.org/abs/2602.09345
>

Very interesting set of topics. A few more come to mind.

I've wondered about preallocating memory or guaranteeing access to
physical memory for a job. Memcg has max limits and min protections,
but no preallocation (i.e. no conceptual memcg free list). So if a job
is configured with 1GB of min working-set protection, that only ensures
the 1GB won't be reclaimed, not that 1GB can be allocated in a
reasonable amount of time. This isn't just a job-startup problem: if a
page is freed with MADV_DONTNEED, a subsequent page fault may take a
long time to handle, even if usage is below min.

Initial allocation policies are controlled by mempolicy/cpuset. Should
we continue to keep allocation policies and resource accounting
separate? It's a little strange that memcg can (1) cap max usage of
tier X memory, and (2) provide minimum protection for tier X usage,
but has no influence on where memory is initially allocated?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
                   ` (3 preceding siblings ...)
  2026-03-11  7:29 ` Greg Thelen
@ 2026-03-11 13:20 ` Johannes Weiner
  2026-03-11 22:47   ` Shakeel Butt
  2026-03-12  3:06 ` hui.zhu
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 16+ messages in thread
From: Johannes Weiner @ 2026-03-11 13:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel

On Sat, Mar 07, 2026 at 10:24:24AM -0800, Shakeel Butt wrote:
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
> 
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
> 
> Challenges
> ----------
> 
> - Workload owners rarely know their actual memory requirements, leading to
>   overprovisioned limits, lower utilization, and higher infrastructure costs.

Is this actually a challenge?

It appears to me proactive reclaim is fairly widespread at this point,
giving workload owners, job schedulers, and capacity planners
real-world, long-term profiles of memory usage.

Workload owners can use this to adjust their limits accordingly, of
course, but even that is less relevant if schedulers and planners go
off of the measured information. The limits become failsafes, no
longer the declarative source of truth for memory size.

> - Throttling limit enforcement is synchronous in the allocating task's context,
>   which can stall latency-sensitive threads.
>
> - The stalled thread may hold shared locks, causing priority inversion -- all
>   waiters are blocked regardless of their priority.
> 
> - Enforcement is indiscriminate -- there is no way to distinguish a
>   performance-critical or latency-critical allocator from a latency-tolerant
>   one.
> 
> - Protection limits assume static working sets size, forcing owners to either
>   overprovision or build complex userspace infrastructure to dynamically adjust
>   them.
> 
> Feature Wishlist
> ----------------
> 
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
> 
> Per-Memcg Background Reclaim
> 
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.

Meta has been carrying this patch for half a decade:

https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/

It sounds like others have carried similar patches.

The relevance of this, too, has somewhat faded with proactive
reclaim. But I think it would still be worthwhile to have. The primary
objection was a lack of attribution of the consumed CPU cycles.

> Lock-Aware Throttling
> 
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
> 
> Thread-Level Throttling Control
> 
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.

I'm struggling to envision this.

CPU and GPU are renewable resources where a bias in access time and
scheduling delays over time is naturally compensated.

With memory access past the limit, though, such a bias adds up over
time. How do you prevent high priority threads from runaway memory
consumption that ends up OOMing the host?

> Combined Memory and Swap Limits
> 
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
> 
> Dynamic Protection Limits
> 
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.

This should be possible with today's interfaces of memory.reclaim,
memory.pressure and memory.low, right?

> Shared Memory Semantics
> 
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.

My understanding is that this hasn't been a problem of implementation,
but one of identifying reasonable, predictable semantics - how exactly
the liability of shared resources are allocated to participating groups.

> Memory Tiering
> 
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
> 
> Collaborative Load Shedding
> 
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
> 
> Cross-Subsystem Collaboration
> 
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely take advantage by having visibility into these situations.

It sounds like the lock holder problem would also fit into this
category: Identifying critical lock holders and allowing them
temporary access past the memory and CPU limits.

But as per above, I'm not sure if blank check exemptions are workable
for memory. It makes sense for allocations in the reclaim path for
example, because it doesn't leave us wondering who will pay for the
excess through a deficit. It's less obvious for a path that is
involved with further expansion of the cgroup's footprint.



* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-11  4:57 ` Jiayuan Chen
@ 2026-03-11 17:00   ` Shakeel Butt
  0 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2026-03-11 17:00 UTC (permalink / raw)
  To: Jiayuan Chen
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel

On Wed, Mar 11, 2026 at 12:57:34PM +0800, Jiayuan Chen wrote:
> 
> On 3/8/26 2:24 AM, Shakeel Butt wrote:
[...]
> > 
> > Per-Memcg Background Reclaim
> > 
> > In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> > reclaim for limit enforcement, provide per-memcg background reclaimers which can
> > scale across CPUs with the allocation rate.
> 
> This sounds like a very useful approach. I have a few questions I'm thinking
> through:
> 
> How would you approach implementing this background reclaim? I'm imagining
> something like asynchronous memory.reclaim operations - is that in line
> with your thinking?

Yes, something similar. I still need to figure out the details of the
mechanism, but it will be calling try_to_free_mem_cgroup_pages(). More
specifically, the execution context needs more thought because we need
to account the CPU consumption of those background reclaimers to the
corresponding cgroup. Whether we will use BPF workqueues or something
else needs more investigation.

> 
> And regarding cold page identification - do you have a preferred approach?
> I'm curious what the most practical way would be to accurately identify
> which pages to reclaim.

That's orthogonal and is the job of the reclaim mechanism, which can be
the traditional LRU or MGLRU.



* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-11  7:19 ` Muchun Song
@ 2026-03-11 20:39   ` Shakeel Butt
  2026-03-12  2:46     ` Muchun Song
  0 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2026-03-11 20:39 UTC (permalink / raw)
  To: Muchun Song
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis,
	David Rientjes, Martin KaFai Lau, Meta kernel team, linux-mm,
	cgroups, bpf, linux-kernel

On Wed, Mar 11, 2026 at 03:19:31PM +0800, Muchun Song wrote:
> 
> 
> > On Mar 8, 2026, at 02:24, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > 

[...]

> > 
> > Per-Memcg Background Reclaim
> > 
> > In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> > reclaim for limit enforcement, provide per-memcg background reclaimers which can
> > scale across CPUs with the allocation rate.
> 
> Hi Shakeel,
> 
> I'm quite interested in this. Internally, we privately maintain a set
> of code to implement asynchronous reclamation, but we're also trying to
> discard these private codes as much as possible. Therefore, we want to
> implement a similar asynchronous reclamation mechanism in user space
> through the memory.reclaim mechanism. However, currently there's a lack
> of suitable policy notification mechanisms to trigger user threads to
> proactively reclaim in advance.

Cool, can you please share what "suitable policy notification mechanisms" you
need for your use-case? This will give me more data on the comparison between
memory.reclaim and the proposed approach.


> 
> > 
> > Lock-Aware Throttling
> > 
> > The ability to avoid throttling an allocating task that is holding locks, to
> > prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> > in memcg reclaim, blocking all waiters regardless of their priority or
> > criticality.
> 
> This is a real problem we encountered, especially with the jbd handler
> resources of the ext4 file system. Our current attempt is to defer
> memory reclamation until returning to user space, in order to solve
> various priority inversion issues caused by the jbd handler. Therefore,
> I would be interested to discuss this topic.

Awesome, do you use both memory.max and memory.high and defer the
reclaim for both? Are you deferring all reclaims or just the ones where
the charging process holds the lock? (I need to look up what the jbd
handler is.)




* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-11  7:29 ` Greg Thelen
@ 2026-03-11 21:35   ` Shakeel Butt
  0 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2026-03-11 21:35 UTC (permalink / raw)
  To: Greg Thelen
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel

Hi Greg,

On Wed, Mar 11, 2026 at 12:29:45AM -0700, Greg Thelen wrote:
> On Sat, Mar 7, 2026 at 10:24 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> >
> 
> Very interesting set of topics. A few more come to mind.

Thanks.

> 
> I've wondered about preallocating memory or guaranteeing access to
> physical memory for a job. Memcg has max limits and min protections,
> but no preallocation (i.e. no conceptual memcg free list). So if a job
> is configured with 1GB min workingset protection that only ensures 1GB
> won't be reclaimed, not that 1GB can be allocated in a reasonable
> amount of time. This isn't just a job startup problem: if a page is
> freed with MADV_DONTNEED a subsequent pgfault may require a lot of
> time to handle, even if usage is below min.

This is indeed correct, i.e. protection limits protect the workload from
external reclaim but do not provide any guarantee that memory can be
allocated reasonably cheaply (without triggering reclaim/compaction).
This is one of the challenges in implementing a userspace oom-killer in
an aggressively overcommitted environment.

However, to me, providing memory allocation guarantees is more of a
system-level feature and orthogonal to memcg. And I see your next
paragraph is about that :)

Anyway, I think if we keep system memory utilization below some value
and guarantee there is always some free memory (this can be done by
giving all workloads a common ancestor that has a limit, or by having
the node controller maintain the invariant that the sum of the limits
of all top-level cgroups stays below some percentage of total memory),
then we might not need a memcg free list or similar mechanisms (most of
the time, I think).

> 
> Initial allocation policies are controlled by mempolicy/cpuset. Should
> we continue to keep allocation policies and resource accounting
> separate? It's a little strange that memcg can (1) cap max usage of
> tier X memory, and (2) provide minimum protection for tier X usage,
> but has no influence on where memory is initially allocated?

I think I understand your point, but the implementation would be too
messy. This is orthogonal to the proposal, but I would say it is a good
topic for LSF/MM/BPF if you want to lead the discussion.



* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-11 13:20 ` Johannes Weiner
@ 2026-03-11 22:47   ` Shakeel Butt
  0 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2026-03-11 22:47 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel

On Wed, Mar 11, 2026 at 09:20:14AM -0400, Johannes Weiner wrote:
> On Sat, Mar 07, 2026 at 10:24:24AM -0800, Shakeel Butt wrote:

[...]

> > 
> > - Workload owners rarely know their actual memory requirements, leading to
> >   overprovisioned limits, lower utilization, and higher infrastructure costs.
> 
> Is this actually a challenge?
> 
> It appears to me proactive reclaim is fairly widespread at this point,
> giving workload owners, job schedulers, and capacity planners
> real-world, long-term profiles of memory usage.
> 
> Workload owners can use this to adjust their limits accordingly, of
> course, but even that is less relevant if schedulers and planners go
> off of the measured information. The limits become failsafes, no
> longer the declarative source of truth for memory size.

Yes, for sophisticated users this is a solved problem, particularly for
workloads with consistent memory usage behavior. I think workloads with
inconsistent or sporadic usage behavior are still a challenge.

> 
> > 
> > Per-Memcg Background Reclaim
> > 
> > In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> > reclaim for limit enforcement, provide per-memcg background reclaimers which can
> > scale across CPUs with the allocation rate.
> 
> Meta has been carrying this patch for half a decade:
> 
> https://lore.kernel.org/linux-mm/20200219181219.54356-1-hannes@cmpxchg.org/
> 
> It sounds like others have carried similar patches.

Yeah ByteDance has something similar too.

> 
> The relevance of this, too, has somewhat faded with proactive
> reclaim. But I think it would still be worthwhile to have. The primary
> objection was a lack of attribution of the consumed CPU cycles.
> 
> > Lock-Aware Throttling
> > 
> > The ability to avoid throttling an allocating task that is holding locks, to
> > prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> > in memcg reclaim, blocking all waiters regardless of their priority or
> > criticality.
> > 
> > Thread-Level Throttling Control
> > 
> > Workloads should be able to indicate at the thread level which threads can be
> > synchronously throttled and which cannot. For example, while experimenting with
> > sched_ext, we drastically improved the performance of AI training workloads by
> > prioritizing threads interacting with the GPU. Similarly, applications can
> > identify the threads or thread pools on their performance-critical paths and
> > the memcg enforcement mechanism should not throttle them.
> 
> I'm struggling to envision this.
> 
> CPU and GPU are renewable resources where a bias in access time and
> scheduling delays over time is naturally compensated.
> 
> With memory access past the limit, though, such a bias adds up over
> time. How do you prevent high priority threads from runaway memory
> consumption that ends up OOMing the host?

Oh, don't consider this feature in isolation. In practice there will
definitely be background reclaimers running here. The way I am
envisioning the scenario for this feature is something like: at some
usage threshold, we start the background reclaimers; at the next
threshold, we start synchronously throttling the threads that the
workload allows; and at the next threshold, we may decide to just kill
the workload.
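
That escalation ladder can be sketched as a pure policy function; the
action names and percentage thresholds below are invented for
illustration, not an existing kernel API.

```c
#include <assert.h>

/* Hypothetical escalation ladder: background reclaim first, then
 * synchronous throttling of permitted threads, then kill as the last
 * resort. Thresholds are illustrative percentages of the memcg limit. */
enum memcg_action {
    MEMCG_OK,           /* below all thresholds, do nothing         */
    MEMCG_BG_RECLAIM,   /* kick per-memcg background reclaimers     */
    MEMCG_THROTTLE,     /* synchronously throttle permitted threads */
    MEMCG_KILL          /* kill the workload                        */
};

static enum memcg_action memcg_policy(unsigned long usage,
                                      unsigned long limit)
{
    unsigned long pct = usage * 100 / limit;

    if (pct >= 95)
        return MEMCG_KILL;
    if (pct >= 85)
        return MEMCG_THROTTLE;
    if (pct >= 70)
        return MEMCG_BG_RECLAIM;
    return MEMCG_OK;
}
```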

> 
> > Combined Memory and Swap Limits
> > 
> > Some users (Google actually) need the ability to enforce limits based on
> > combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> > ceiling on total memory commitment rather than treating memory and swap
> > independently.
> > 
> > Dynamic Protection Limits
> > 
> > Rather than static protection limits, the kernel should support defining
> > protection based on the actual working set of the workload, leveraging signals
> > such as working set estimation, PSI, refault rates, or a combination thereof to
> > automatically adapt to the workload's current memory needs.
> 
> This should be possible with today's interfaces of memory.reclaim,
> memory.pressure and memory.low, right?

Yes, the node controller or the workload can dynamically adjust their
protection limit based on PSI, refaults, or some other metrics.

> 
> > Shared Memory Semantics
> > 
> > With more flexibility in limit enforcement, the kernel should be able to
> > account for memory shared between workloads (cgroups) during enforcement.
> > Today, enforcement only looks at each workload's memory usage independently.
> > Sensible shared memory semantics would allow the enforcer to consider
> > cross-cgroup sharing when making reclaim and throttling decisions.
> 
> My understanding is that this hasn't been a problem of implementation,
> but one of identifying reasonable, predictable semantics - how exactly
> the liability of shared resources are allocated to participating groups.
> 

This particular feature is hand-wavy at the moment, particularly due to
the lack of a mechanism that tells how much memory is really shared.

The high-level idea is that when we know there is shared memory/fs
between different workloads, we can base the throttling decision on
their memory usage excluding the shared usage, i.e. mainly their
exclusive memory usage. Whether this helps or is useful, I need to
brainstorm more.
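
A rough sketch of that exclusive-usage idea, assuming some future
accounting mechanism reports the shared bytes; should_throttle() is a
hypothetical name, not an existing kernel function.

```c
#include <assert.h>

/* Sketch: base the throttling decision on exclusive usage only, i.e.
 * total usage minus the bytes known to be shared with other cgroups.
 * "shared" is assumed to come from accounting that does not exist yet. */
static int should_throttle(unsigned long usage, unsigned long shared,
                           unsigned long high_limit)
{
    unsigned long exclusive = usage > shared ? usage - shared : 0;

    return exclusive > high_limit;
}
```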

> > Memory Tiering
> > 
> > With a flexible limit enforcement mechanism, the kernel can balance memory
> > usage of different workloads across memory tiers based on their performance
> > requirements. Tier accounting and hotness tracking are orthogonal, but the
> > decisions of when and how to balance memory between tiers should be handled by
> > the enforcer.
> > 
> > Collaborative Load Shedding
> > 
> > Many workloads communicate with an external entity for load balancing and rely
> > on their own usage metrics like RSS or memory pressure to signal whether they
> > can accept more or less work. This is guesswork. Instead of the
> > workload guessing, the limit enforcer -- which is actually managing the
> > workload's memory usage -- should be able to communicate available headroom or
> > request the workload to shed load or reduce memory usage. This collaborative
> > load shedding mechanism would allow workloads to make informed decisions rather
> > than reacting to coarse signals.
> > 
> > Cross-Subsystem Collaboration
> > 
> > Finally, the limit enforcement mechanism should collaborate with the CPU
> > scheduler and other subsystems that can release memory. For example, dirty
> > memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> > writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> > prioritize them ensures the kernel does not lack reclaimable memory under
> > stressful conditions. Similarly, some subsystems free memory through workqueues
> > or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> > definitely take advantage by having visibility into these situations.
> 
> It sounds like the lock holder problem would also fit into this
> category: Identifying critical lock holders and allowing them
> temporary access past the memory and CPU limits.
> 
> But as per above, I'm not sure if blank check exemptions are workable
> for memory. It makes sense for allocations in the reclaim path for
> example, because it doesn't leave us wondering who will pay for the
> excess through a deficit. It's less obvious for a path that is
> involved with further expansion of the cgroup's footprint.

No need for a blank check. As with the thread throttling above, the
lock holder that escapes throttling will, in practice, be operating in
the presence of background reclaimers and may get killed if it goes
overboard too much.

Thanks for taking a look and poking holes.



* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-11 20:39   ` Shakeel Butt
@ 2026-03-12  2:46     ` Muchun Song
  2026-03-13  6:17       ` teawater
  0 siblings, 1 reply; 16+ messages in thread
From: Muchun Song @ 2026-03-12  2:46 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis,
	David Rientjes, Martin KaFai Lau, Meta kernel team, linux-mm,
	cgroups, bpf, linux-kernel



> On Mar 12, 2026, at 04:39, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> On Wed, Mar 11, 2026 at 03:19:31PM +0800, Muchun Song wrote:
>> 
>> 
>>> On Mar 8, 2026, at 02:24, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>>> 
> 
> [...]
> 
>>> 
>>> Per-Memcg Background Reclaim
>>> 
>>> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
>>> reclaim for limit enforcement, provide per-memcg background reclaimers which can
>>> scale across CPUs with the allocation rate.
>> 
>> Hi Shakeel,
>> 
>> I'm quite interested in this. Internally, we privately maintain a set
>> of code to implement asynchronous reclamation, but we're also trying to
>> discard these private codes as much as possible. Therefore, we want to
>> implement a similar asynchronous reclamation mechanism in user space
>> through the memory.reclaim mechanism. However, currently there's a lack
>> of suitable policy notification mechanisms to trigger user threads to
>> proactively reclaim in advance.
> 
> Cool, can you please share what "suitable policy notification mechanisms" you
> need for your use-case? This will give me more data on the comparison between
> memory.reclaim and the proposed approach.

If we expect the proactive reclamation to be triggered when the current
memcg's memory usage reaches a certain point, we have to continuously read
memory.current to determine whether it has reached our set watermark value
to trigger asynchronous reclamation. Perhaps we need an event that can notify
user-space threads when the current memory usage reaches a specific
watermark value. Currently, the events supported by memory.events may lack
the capability for custom watermarks.
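
For reference, the polling workaround described above looks roughly
like this in userspace. memory.current and memory.reclaim are the real
cgroup v2 files; the watermark, path handling, and function names are
illustrative, with error handling mostly omitted.

```c
#include <assert.h>
#include <stdio.h>

/* Policy part of the workaround: how many bytes to ask memory.reclaim
 * for once memory.current exceeds the self-chosen watermark. */
static unsigned long reclaim_excess(unsigned long current_bytes,
                                    unsigned long watermark_bytes)
{
    return current_bytes > watermark_bytes ?
           current_bytes - watermark_bytes : 0;
}

/* One iteration of the polling loop. */
static void poll_once(const char *memcg_path, unsigned long watermark)
{
    char path[256];
    unsigned long usage, excess;
    FILE *f, *r;

    snprintf(path, sizeof(path), "%s/memory.current", memcg_path);
    f = fopen(path, "r");
    if (!f)
        return;
    if (fscanf(f, "%lu", &usage) == 1) {
        excess = reclaim_excess(usage, watermark);
        if (excess) {
            snprintf(path, sizeof(path), "%s/memory.reclaim",
                     memcg_path);
            r = fopen(path, "w");
            if (r) {
                fprintf(r, "%lu", excess); /* ask kernel to reclaim */
                fclose(r);
            }
        }
    }
    fclose(f);
}
```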

> 
> 
>> 
>>> 
>>> Lock-Aware Throttling
>>> 
>>> The ability to avoid throttling an allocating task that is holding locks, to
>>> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
>>> in memcg reclaim, blocking all waiters regardless of their priority or
>>> criticality.
>> 
>> This is a real problem we encountered, especially with the jbd handler
>> resources of the ext4 file system. Our current attempt is to defer
>> memory reclamation until returning to user space, in order to solve
>> various priority inversion issues caused by the jbd handler. Therefore,
>> I would be interested to discuss this topic.
> 
> Awesome, do you use memory.max and memory.high both and defer the reclaim for
> both? Are you deferring all the reclaims or just the ones where the charging
> process has the lock? (I need to look what jbd handler is).
> 

We do not use memory.high: although it supports deferring memory
reclamation to user space, it also attempts to throttle the memory
allocation speed, which introduces significant latency. In our
application's case, we would rather accept an OOM under such
circumstances. We previously attempted to address the priority
inversion issue caused by the jbd handler separately (which we
frequently encounter since we use the ext4 file system); see [1]. Of
course, that solution lacks generality, as it requires calling new
interfaces for various lock resources. Therefore, we internally have a
more aggressive idea: defer all reclamation triggered by kernel-space
memory allocation until just before returning to user space. This
should resolve the vast majority of priority inversion problems. The
only potential issue introduced is that kernel-space memory usage may
briefly exceed memory.max.

[1] https://lore.kernel.org/linux-mm/cover.1750234270.git.hezhongkun.hzk@bytedance.com/#r
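
The defer-to-return idea can be sketched in a userspace-testable form.
All names below are hypothetical; a real kernel implementation would
record the debt per task (e.g. via task_work) and settle it on the
return-to-userspace path.

```c
#include <assert.h>

/* Hypothetical per-task state: charges that exceeded the limit while
 * the task was in a critical kernel section (e.g. holding a jbd
 * handler) are recorded as a debt instead of reclaimed synchronously. */
struct task_ctx {
    unsigned long reclaim_debt;  /* bytes to reclaim at the boundary */
    int in_critical_section;     /* e.g. holding a jbd handler       */
};

/* Called on a charge that went 'excess' bytes over the limit. */
static void charge_over_limit(struct task_ctx *t, unsigned long excess)
{
    if (t->in_critical_section)
        t->reclaim_debt += excess;  /* defer, don't stall here */
    /* else: reclaim 'excess' synchronously, as today */
}

/* Called just before returning to user space. */
static unsigned long settle_reclaim_debt(struct task_ctx *t)
{
    unsigned long debt = t->reclaim_debt;

    t->reclaim_debt = 0;  /* reclaim 'debt' bytes here */
    return debt;
}
```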

Thanks,
Muchun




* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
                   ` (4 preceding siblings ...)
  2026-03-11 13:20 ` Johannes Weiner
@ 2026-03-12  3:06 ` hui.zhu
  2026-03-12  3:36 ` hui.zhu
  2026-03-25 18:47 ` Donet Tom
  7 siblings, 0 replies; 16+ messages in thread
From: hui.zhu @ 2026-03-12  3:06 UTC (permalink / raw)
  To: Shakeel Butt, lsf-pc
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, JP Kobryn,
	Muchun Song, Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis,
	David Rientjes, Martin KaFai Lau, Meta kernel team, linux-mm,
	cgroups, bpf, linux-kernel

On Mar 8, 2026, at 02:24, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:


> 
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
> 
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
> 
> Challenges
> ----------
> 
> - Workload owners rarely know their actual memory requirements, leading to
>  overprovisioned limits, lower utilization, and higher infrastructure costs.
> 
> - Throttling limit enforcement is synchronous in the allocating task's context,
>  which can stall latency-sensitive threads.
> 
> - The stalled thread may hold shared locks, causing priority inversion -- all
>  waiters are blocked regardless of their priority.
> 
> - Enforcement is indiscriminate -- there is no way to distinguish a
>  performance-critical or latency-critical allocator from a latency-tolerant
>  one.
> 
> - Protection limits assume static working sets size, forcing owners to either
>  overprovision or build complex userspace infrastructure to dynamically adjust
>  them.
> 
> Feature Wishlist
> ----------------
> 
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.

Thanks for summarizing and categorizing all of this.

> 
> Per-Memcg Background Reclaim
> 
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.
> 
> Lock-Aware Throttling
> 
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
> 
> Thread-Level Throttling Control
> 
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.

Does this mean that different threads within the same memcg could be
selectively exempted from throttling via BPF?

> 
> Combined Memory and Swap Limits
> 
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
> 
> Dynamic Protection Limits
> 
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.

This part is what we are interested in now; here is the RFC for it:
https://www.spinics.net/lists/kernel/msg6037006.html

Best,
Hui

> 
> Shared Memory Semantics
> 
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
> 
> Memory Tiering
> 
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
> 
> Collaborative Load Shedding
> 
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
> 
> Cross-Subsystem Collaboration
> 
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely take advantage by having visibility into these situations.
> 
> Putting It All Together
> -----------------------
> 
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
> 
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."
> 
> Initially I added this example for fun, but from [1] it seems like there is a
> real need to enable such capabilities.
> 
> [1] https://arxiv.org/abs/2602.09345
>



* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
                   ` (5 preceding siblings ...)
  2026-03-12  3:06 ` hui.zhu
@ 2026-03-12  3:36 ` hui.zhu
  2026-03-25 18:47 ` Donet Tom
  7 siblings, 0 replies; 16+ messages in thread
From: hui.zhu @ 2026-03-12  3:36 UTC (permalink / raw)
  To: Shakeel Butt, lsf-pc
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, JP Kobryn,
	Muchun Song, Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis,
	David Rientjes, Martin KaFai Lau, Meta kernel team, linux-mm,
	cgroups, bpf, linux-kernel

On Mar 8, 2026 at 02:24, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:


> 
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
> 
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
> 
> Challenges
> ----------
> 
> - Workload owners rarely know their actual memory requirements, leading to
>  overprovisioned limits, lower utilization, and higher infrastructure costs.
> 
> - Throttling limit enforcement is synchronous in the allocating task's context,
>  which can stall latency-sensitive threads.
> 
> - The stalled thread may hold shared locks, causing priority inversion -- all
>  waiters are blocked regardless of their priority.
> 
> - Enforcement is indiscriminate -- there is no way to distinguish a
>  performance-critical or latency-critical allocator from a latency-tolerant
>  one.
> 
> - Protection limits assume static working sets size, forcing owners to either
>  overprovision or build complex userspace infrastructure to dynamically adjust
>  them.
> 
> Feature Wishlist
> ----------------
> 
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
> 
> Per-Memcg Background Reclaim
> 
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.

I am aware that several companies maintain out-of-tree patches for
asynchronous reclaim, though some have not yet attempted to upstream
them.

Would it be feasible to introduce a generic memcg asynchronous reclaim
framework into the upstream kernel, where eBPF is used to orchestrate
and control the reclaim logic? In this model, the kernel's role would
be to enforce "guardrails" for these operations -- for instance,
restricting a BPF program to initiating only one asynchronous reclaim
pass at a time -- to ensure system safety and predictability.
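
As a rough userspace analogue of the guardrail idea above (memory.current
and memory.reclaim are real cgroup v2 interfaces; the policy function, its
name, and its thresholds are purely illustrative assumptions, not a
proposed kernel or BPF API):

```python
def plan_reclaim(usage: int, watermark: int, floor: int, in_flight: bool) -> int:
    """Bytes to request via a write to memory.reclaim, or 0.

    Guardrails: never run two reclaim passes concurrently, and never
    target a usage below `floor`, so a misbehaving policy cannot
    thrash the cgroup.
    """
    if in_flight or usage <= watermark:
        return 0
    return usage - max(watermark, floor)

# A daemon would poll <cgroup>/memory.current and, whenever plan_reclaim()
# returns a positive count, write that many bytes to <cgroup>/memory.reclaim.
```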

Best,
Hui


> 
> Lock-Aware Throttling
> 
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
> 
> Thread-Level Throttling Control
> 
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.
> 
> Combined Memory and Swap Limits
> 
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
> 
> Dynamic Protection Limits
> 
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.
> 
> Shared Memory Semantics
> 
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
> 
> Memory Tiering
> 
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
> 
> Collaborative Load Shedding
> 
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
> 
> Cross-Subsystem Collaboration
> 
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely take advantage by having visibility into these situations.
> 
> Putting It All Together
> -----------------------
> 
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
> 
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."
> 
> Initially I added this example for fun, but from [1] it seems like there is a
> real need to enable such capabilities.
> 
> [1] https://arxiv.org/abs/2602.09345
>



* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-12  2:46     ` Muchun Song
@ 2026-03-13  6:17       ` teawater
  0 siblings, 0 replies; 16+ messages in thread
From: teawater @ 2026-03-13  6:17 UTC (permalink / raw)
  To: Muchun Song, Shakeel Butt
  Cc: lsf-pc, Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, JP Kobryn,
	Geliang Tang, Sweet Tea Dorminy, Emil Tsalapatis, David Rientjes,
	Martin KaFai Lau, Meta kernel team, linux-mm, cgroups, bpf,
	linux-kernel

> 
> > 
> > On Mar 12, 2026, at 04:39, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >  
> >  On Wed, Mar 11, 2026 at 03:19:31PM +0800, Muchun Song wrote:
> > 
> > > 
> > > 
> > > 
> >  On Mar 8, 2026, at 02:24, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >  
> >  
> >  [...]
> >  
> >  
> >  Per-Memcg Background Reclaim
> >  
> >  In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> >  reclaim for limit enforcement, provide per-memcg background reclaimers which can
> >  scale across CPUs with the allocation rate.
> > 
> > > 
> > > Hi Shakeel,
> > >  
> > >  I'm quite interested in this. Internally, we privately maintain a set
> > >  of code to implement asynchronous reclamation, but we're also trying to
> > >  discard these private codes as much as possible. Therefore, we want to
> > >  implement a similar asynchronous reclamation mechanism in user space
> > >  through the memory.reclaim mechanism. However, currently there's a lack
> > >  of suitable policy notification mechanisms to trigger user threads to
> > >  proactively reclaim in advance.
> > > 
> >  
> >  Cool, can you please share what "suitable policy notification mechanisms" you
> >  need for your use-case? This will give me more data on the comparison between
> >  memory.reclaim and the proposed approach.
> > 
> If we expect the proactive reclamation to be triggered when the current
> memcg's memory usage reaches a certain point, we have to continuously read
> memory.current to determine whether it has reached our set watermark value
> to trigger asynchronous reclamation. Perhaps we need an event that can notify
> user-space threads when the current memory usage reaches a specific
> watermark value. Currently, the events supported by memory.events may lack
> the capability for custom watermarks.

I agree. Even with BPF controlling proactive reclamation, I believe
there needs to be an event reflecting capacity changes to signal when
to stop. Otherwise, the reclamation volume per batch would have to be
set very low, leading to frequent BPF triggers and poor efficiency.
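
To make the batching cost concrete, here is a toy simulation (my own
illustrative assumptions, not kernel behavior): without a stop signal,
every trigger reclaims a small fixed batch, while a capacity-change event
lets a single invocation aim directly for the low watermark.

```python
def triggers_needed(usage: int, high_wm: int, low_wm: int, batch: int = 0) -> int:
    """Count reclaim invocations to bring usage from above high_wm to low_wm.

    batch == 0 models having a capacity event: one pass can target
    low_wm directly. A nonzero batch models blind fixed-size passes.
    """
    count = 0
    while usage > high_wm:
        count += 1
        usage -= batch if batch else (usage - low_wm)
    return count
```

For example, with usage 1000 and watermarks 800/600, one informed pass
suffices, whereas blind batches of 10 take 20 invocations for the same
pressure episode.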

Best,
Hui


> 
> > 
> > > 
> > > 
> > > 
> >  
> >  Lock-Aware Throttling
> >  
> >  The ability to avoid throttling an allocating task that is holding locks, to
> >  prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> >  in memcg reclaim, blocking all waiters regardless of their priority or
> >  criticality.
> > 
> > > 
> > > This is a real problem we encountered, especially with the jbd handler
> > >  resources of the ext4 file system. Our current attempt is to defer
> > >  memory reclamation until returning to user space, in order to solve
> > >  various priority inversion issues caused by the jbd handler. Therefore,
> > >  I would be interested to discuss this topic.
> > > 
> >  
> >  Awesome, do you use memory.max and memory.high both and defer the reclaim for
> >  both? Are you deferring all the reclaims or just the ones where the charging
> >  process has the lock? (I need to look what jbd handler is).
> > 
> We do not use memory.high, although it supports deferring memory reclamation
> to user-space, it also attempts to throttle memory allocation speed, which
> introduces significant latency. In our application's case, we would rather
> accept an OOM under such circumstances. We previously attempted to address
> the priority inversion issue caused by the jbd handler separately (which we
> frequently encounter since we use the ext4 file system), and you can refer
> to this [1]. Of course, this solution lacks generality, as it requires
> calling new interfaces for various lock resources. Therefore, we internally
> have a more aggressive idea: defer all reclamation triggered by kernel-space
> memory allocation until just before returning to user-space. This should
> resolve the vast majority of priority inversion problems. The only potential
> issue introduced is that kernel-space memory usage may briefly exceed memory.max.
> 
> [1] https://lore.kernel.org/linux-mm/cover.1750234270.git.hezhongkun.hzk@bytedance.com/#r
> 
> Muchun,
> Thanks.
>



* Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
  2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
                   ` (6 preceding siblings ...)
  2026-03-12  3:36 ` hui.zhu
@ 2026-03-25 18:47 ` Donet Tom
  7 siblings, 0 replies; 16+ messages in thread
From: Donet Tom @ 2026-03-25 18:47 UTC (permalink / raw)
  To: Shakeel Butt, lsf-pc
  Cc: Andrew Morton, Tejun Heo, Michal Hocko, Johannes Weiner,
	Alexei Starovoitov, Michal Koutný, Roman Gushchin, Hui Zhu,
	JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau,
	Meta kernel team, linux-mm, cgroups, bpf, linux-kernel


On 3/7/26 11:54 PM, Shakeel Butt wrote:
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
>
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
>
> Challenges
> ----------
>
> - Workload owners rarely know their actual memory requirements, leading to
>    overprovisioned limits, lower utilization, and higher infrastructure costs.
>
> - Throttling limit enforcement is synchronous in the allocating task's context,
>    which can stall latency-sensitive threads.
>
> - The stalled thread may hold shared locks, causing priority inversion -- all
>    waiters are blocked regardless of their priority.
>
> - Enforcement is indiscriminate -- there is no way to distinguish a
>    performance-critical or latency-critical allocator from a latency-tolerant
>    one.
>
> - Protection limits assume static working sets size, forcing owners to either
>    overprovision or build complex userspace infrastructure to dynamically adjust
>    them.
>
> Feature Wishlist
> ----------------
>
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
>
> Per-Memcg Background Reclaim
>
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.
>
> Lock-Aware Throttling
>
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
>
> Thread-Level Throttling Control
>
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.
>
> Combined Memory and Swap Limits
>
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
>
> Dynamic Protection Limits
>
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.
>
> Shared Memory Semantics
>
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
>
> Memory Tiering
>
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.


Hi Shakeel


This looks like a good idea. I was thinking along similar lines,
but wasn’t sure about the best way to implement it.

For memcg with memory tiering, the idea is that we set
memory.high and memory.max as the maximum limits. Within
memory.high, a certain percentage (x%) could be backed by
higher-tier memory, with the remaining portion coming from
lower-tier memory.

In this model, an application would get up to
memory.high * x / 100 from higher-tier memory, and the rest
from lower-tier memory.

Once the higher-tier usage reaches its limit, we would start
demoting pages. If the lower-tier usage also reaches its limit,
we would then start swapping out pages from the lower tier.
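
The split model described above can be sketched as follows (hypothetical
helper names; the x% split and per-tier budgets come from the description
above and are not an existing memcg interface):

```python
def tier_budgets(memory_high: int, x_pct: int) -> tuple[int, int]:
    """Split memory.high into (higher-tier, lower-tier) byte budgets."""
    top = memory_high * x_pct // 100
    return top, memory_high - top

def next_action(top_used: int, low_used: int,
                top_budget: int, low_budget: int) -> str:
    """Demote when the higher tier is full; swap when both tiers are full."""
    if top_used >= top_budget and low_used >= low_budget:
        return "swap"    # lower tier also full: swap out from the lower tier
    if top_used >= top_budget:
        return "demote"  # higher tier full: demote pages to the lower tier
    return "none"
```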

What is your opinion on how memory tiering should be handled in memcg?


-Donet

>
> Collaborative Load Shedding
>
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
>
> Cross-Subsystem Collaboration
>
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely take advantage by having visibility into these situations.
>
> Putting It All Together
> -----------------------
>
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
>
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."
>
> Initially I added this example for fun, but from [1] it seems like there is a
> real need to enable such capabilities.
>
> [1] https://arxiv.org/abs/2602.09345
>



end of thread, other threads:[~2026-03-25 18:48 UTC | newest]

Thread overview: 16+ messages
2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
2026-03-09 21:33 ` Roman Gushchin
2026-03-09 23:09   ` Shakeel Butt
2026-03-11  4:57 ` Jiayuan Chen
2026-03-11 17:00   ` Shakeel Butt
2026-03-11  7:19 ` Muchun Song
2026-03-11 20:39   ` Shakeel Butt
2026-03-12  2:46     ` Muchun Song
2026-03-13  6:17       ` teawater
2026-03-11  7:29 ` Greg Thelen
2026-03-11 21:35   ` Shakeel Butt
2026-03-11 13:20 ` Johannes Weiner
2026-03-11 22:47   ` Shakeel Butt
2026-03-12  3:06 ` hui.zhu
2026-03-12  3:36 ` hui.zhu
2026-03-25 18:47 ` Donet Tom

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox