From: Roman Gushchin
To: Shakeel Butt
Cc: lsf-pc@lists.linux-foundation.org, Andrew Morton, Tejun Heo,
	Michal Hocko, Johannes Weiner, Alexei Starovoitov, Michal Koutný,
	Hui Zhu, JP Kobryn, Muchun Song, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau, Meta kernel team,
	linux-mm@kvack.org, cgroups@vger.kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
In-Reply-To: <20260307182424.2889780-1-shakeel.butt@linux.dev> (Shakeel Butt's message of "Sat, 7 Mar 2026 10:24:24 -0800")
References: <20260307182424.2889780-1-shakeel.butt@linux.dev>
Date: Mon, 09 Mar 2026 14:33:22 -0700
Message-ID: <87sea8zo0t.fsf@linux.dev>

Shakeel Butt writes:

> Over the last couple of weeks, I have been brainstorming on how I would go
> about redesigning memcg, taking inspiration from sched_ext and bpfoom, with a
> focus on existing challenges and issues. This proposal outlines the high-level
> direction. Followup emails and patch series will cover and brainstorm the
> mechanisms (of course BPF) to achieve these goals.
>
> Memory cgroups provide memory accounting and the ability to control memory usage
> of workloads through two categories of limits.
> Throttling limits (memory.max and
> memory.high) cap memory consumption. Protection limits (memory.min and
> memory.low) shield a workload's memory from reclaim under external memory
> pressure.
>
> Challenges
> ----------
>
> - Workload owners rarely know their actual memory requirements, leading to
>   overprovisioned limits, lower utilization, and higher infrastructure costs.
>
> - Throttling limit enforcement is synchronous in the allocating task's context,
>   which can stall latency-sensitive threads.
>
> - The stalled thread may hold shared locks, causing priority inversion -- all
>   waiters are blocked regardless of their priority.
>
> - Enforcement is indiscriminate -- there is no way to distinguish a
>   performance-critical or latency-critical allocator from a latency-tolerant
>   one.
>
> - Protection limits assume a static working set size, forcing owners to either
>   overprovision or build complex userspace infrastructure to dynamically adjust
>   them.
>
> Feature Wishlist
> ----------------
>
> Here is the list of features and capabilities I want to enable in the
> redesigned memcg limit enforcement world.
>
> Per-Memcg Background Reclaim
>
> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
> reclaim for limit enforcement, provide per-memcg background reclaimers which can
> scale across CPUs with the allocation rate.
>
> Lock-Aware Throttling
>
> The ability to avoid throttling an allocating task that is holding locks, to
> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
> in memcg reclaim, blocking all waiters regardless of their priority or
> criticality.
>
> Thread-Level Throttling Control
>
> Workloads should be able to indicate at the thread level which threads can be
> synchronously throttled and which cannot. For example, while experimenting with
> sched_ext, we drastically improved the performance of AI training workloads by
> prioritizing threads interacting with the GPU.
> Similarly, applications can
> identify the threads or thread pools on their performance-critical paths and
> the memcg enforcement mechanism should not throttle them.
>
> Combined Memory and Swap Limits
>
> Some users (Google actually) need the ability to enforce limits based on
> combined memory and swap usage, similar to cgroup v1's memsw limit, providing a
> ceiling on total memory commitment rather than treating memory and swap
> independently.
>
> Dynamic Protection Limits
>
> Rather than static protection limits, the kernel should support defining
> protection based on the actual working set of the workload, leveraging signals
> such as working set estimation, PSI, refault rates, or a combination thereof to
> automatically adapt to the workload's current memory needs.
>
> Shared Memory Semantics
>
> With more flexibility in limit enforcement, the kernel should be able to
> account for memory shared between workloads (cgroups) during enforcement.
> Today, enforcement only looks at each workload's memory usage independently.
> Sensible shared memory semantics would allow the enforcer to consider
> cross-cgroup sharing when making reclaim and throttling decisions.
>
> Memory Tiering
>
> With a flexible limit enforcement mechanism, the kernel can balance memory
> usage of different workloads across memory tiers based on their performance
> requirements. Tier accounting and hotness tracking are orthogonal, but the
> decisions of when and how to balance memory between tiers should be handled by
> the enforcer.
>
> Collaborative Load Shedding
>
> Many workloads communicate with an external entity for load balancing and rely
> on their own usage metrics like RSS or memory pressure to signal whether they
> can accept more or less work. This is guesswork.
> Instead of the
> workload guessing, the limit enforcer -- which is actually managing the
> workload's memory usage -- should be able to communicate available headroom or
> request the workload to shed load or reduce memory usage. This collaborative
> load shedding mechanism would allow workloads to make informed decisions rather
> than reacting to coarse signals.
>
> Cross-Subsystem Collaboration
>
> Finally, the limit enforcement mechanism should collaborate with the CPU
> scheduler and other subsystems that can release memory. For example, dirty
> memory is not reclaimable, and the memory subsystem wakes up flushers to
> trigger writeback. However, flushers need CPU to run -- asking the CPU
> scheduler to prioritize them ensures the kernel does not run out of
> reclaimable memory under stressful conditions. Similarly, some subsystems free
> memory through workqueues or RCU callbacks. While this may seem orthogonal to
> limit enforcement, we can definitely benefit from having visibility into these
> situations.
>
> Putting It All Together
> -----------------------
>
> To illustrate the end goal, here is an example of the scenario I want to
> enable. Suppose there is an AI agent controlling the resources of a host. I
> should be able to provide the following policy and everything should work out
> of the box:
>
> Policy: "keep system-level memory utilization below 95 percent;
> avoid priority inversions by not throttling allocators holding locks; trim each
> workload's usage to its working set without regressing its relevant performance
> metrics; collaborate with workloads on load shedding and memory trimming
> decisions; and under extreme memory pressure, collaborate with the OOM killer
> and the central job scheduler to kill and clean up a workload."

This is fantastic, thanks Shakeel! I'd be very interested to discuss it!

I suggested a somewhat similar/related topic to the bpf track (bpf use
cases in mm); we might consider joining them.

Thanks!
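[Editorial aside: as a rough illustration of the "Dynamic Protection Limits"
item quoted above, here is a tiny userspace simulation, in plain Python, of
how an enforcer might walk a memory.low-style protection toward a measured
working set while refusing to shrink it when refaults indicate the protection
is already too tight. The function name `next_protection`, the 10 percent
headroom, and the 25 percent damping step are invented parameters for
illustration only; nothing in the thread proposes these specifics.]

```python
# Userspace simulation only -- not kernel code. `next_protection`, the
# 10% headroom, and the 25% damping step are hypothetical.

GIB = 1 << 30

def next_protection(current_low, working_set, refaults_per_s,
                    headroom=0.10, step=0.25):
    """Move a memory.low-style protection toward working_set * (1 + headroom).

    A nonzero refault rate suggests the protection is already too tight,
    so the value is only allowed to shrink while refaults are quiet;
    growing toward a larger working set is always allowed.
    """
    target = working_set * (1.0 + headroom)
    if target < current_low and refaults_per_s > 0:
        return current_low            # under refault pressure: hold
    # Close only a fraction of the gap per interval to avoid oscillation.
    return current_low + step * (target - current_low)

# Example: start from a static 8 GiB memory.low and feed in
# (working set, refaults/s) samples as a per-interval reclaim loop would.
low = 8 * GIB
for ws, refaults in [(4 * GIB, 0), (4 * GIB, 0), (6 * GIB, 120)]:
    low = next_protection(low, ws, refaults)
    print(f"low -> {low / GIB:.2f} GiB")
```

The feedback shape here (target derived from a working-set signal, damped
steps, refault-gated shrinking) is the kind of per-memcg policy a BPF-driven
enforcer could evaluate on each interval instead of a static limit.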