From: hui.zhu@linux.dev
To: "Shakeel Butt" <shakeel.butt@linux.dev>,
lsf-pc@lists.linux-foundation.org
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Tejun Heo" <tj@kernel.org>, "Michal Hocko" <mhocko@suse.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Alexei Starovoitov" <ast@kernel.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"JP Kobryn" <inwardvessel@gmail.com>,
"Muchun Song" <muchun.song@linux.dev>,
"Geliang Tang" <geliang@kernel.org>,
"Sweet Tea Dorminy" <sweettea-kernel@dorminy.me>,
"Emil Tsalapatis" <emil@etsalapatis.com>,
"David Rientjes" <rientjes@google.com>,
"Martin KaFai Lau" <martin.lau@linux.dev>,
"Meta kernel team" <kernel-team@meta.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org, bpf@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
Date: Thu, 12 Mar 2026 03:06:04 +0000 [thread overview]
Message-ID: <e0390b058eb5123c99e6c8a72306efe7a1770411@linux.dev> (raw)
In-Reply-To: <20260307182424.2889780-1-shakeel.butt@linux.dev>
On March 8, 2026 at 02:24, "Shakeel Butt" <shakeel.butt@linux.dev> wrote:
>
> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
>
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits. Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
>
> Challenges
> ----------
>
> - Workload owners rarely know their actual memory requirements, leading to
> overprovisioned limits, lower utilization, and higher infrastructure costs.
>
> - Throttling limit enforcement is synchronous in the allocating task's context,
> which can stall latency-sensitive threads.
>
> - The stalled thread may hold shared locks, causing priority inversion -- all
> waiters are blocked regardless of their priority.
>
> - Enforcement is indiscriminate -- there is no way to distinguish a
> performance-critical or latency-critical allocator from a latency-tolerant
> one.
>
> - Protection limits assume static working set sizes, forcing owners to either
> overprovision or build complex userspace infrastructure to dynamically adjust
> them.
>
> Feature Wishlist
> ----------------
>
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
Thanks for summarizing and categorizing all of this.
>
> Per-Memcg Background Reclaim
>
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.
>
> Lock-Aware Throttling
>
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
>
> Thread-Level Throttling Control
>
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU. Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.
Does this mean that individual threads within the same memcg could be
selectively exempted from throttling via BPF?
>
> Combined Memory and Swap Limits
>
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
>
> Dynamic Protection Limits
>
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.
This is the part we are most interested in right now.
Here is the RFC for it: https://www.spinics.net/lists/kernel/msg6037006.html
Best,
Hui
>
> Shared Memory Semantics
>
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
>
> Memory Tiering
>
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
>
> Collaborative Load Shedding
>
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork. Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
>
> Cross-Subsystem Collaboration
>
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable and the memory subsystem wakes up flushers to trigger
> writeback. However, flushers need CPU to run -- asking the CPU scheduler to
> prioritize them ensures the kernel does not lack reclaimable memory under
> stressful conditions. Similarly, some subsystems free memory through workqueues
> or RCU callbacks. While this may seem orthogonal to limit enforcement, we can
> definitely take advantage of having visibility into these situations.
>
> Putting It All Together
> -----------------------
>
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
>
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."
>
> Initially I added this example for fun, but from [1] it seems like there is a
> real need to enable such capabilities.
>
> [1] https://arxiv.org/abs/2602.09345
>
Thread overview: 16+ messages
2026-03-07 18:24 [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext) Shakeel Butt
2026-03-09 21:33 ` Roman Gushchin
2026-03-09 23:09 ` Shakeel Butt
2026-03-11 4:57 ` Jiayuan Chen
2026-03-11 17:00 ` Shakeel Butt
2026-03-11 7:19 ` Muchun Song
2026-03-11 20:39 ` Shakeel Butt
2026-03-12 2:46 ` Muchun Song
2026-03-13 6:17 ` teawater
2026-03-11 7:29 ` Greg Thelen
2026-03-11 21:35 ` Shakeel Butt
2026-03-11 13:20 ` Johannes Weiner
2026-03-11 22:47 ` Shakeel Butt
2026-03-12 3:06 ` hui.zhu [this message]
2026-03-12 3:36 ` hui.zhu
2026-03-25 18:47 ` Donet Tom