From: Joshua Hahn <joshua.hahnjy@gmail.com>
To: linux-mm@kvack.org
Cc: Tejun Heo <tj@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>,
"Michal Koutny" <mkoutny@suse.com>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com>,
Muchun Song <muchun.song@linux.dev>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>,
Youngjun Park <youngjun.park@lge.com>,
Qi Zheng <qi.zheng@linux.dev>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
Kaiyang Zhao <kaiyang2@cs.cmu.edu>,
David Rientjes <rientjes@google.com>,
Yiannis Nikolakopoulos <yiannis@zptcorp.com>,
"Rao, Bharata Bhasker" <bharata@amd.com>,
cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
kernel-team@meta.com
Subject: [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware
Date: Thu, 23 Apr 2026 13:34:34 -0700
Message-ID: <20260423203445.2914963-1-joshua.hahnjy@gmail.com>

INTRODUCTION
============
Memory cgroups provide an interface that allows multiple workloads on a host to
co-exist via weak and strong memory isolation guarantees. This works because,
for the most part, all memory has equal utility: isolating a cgroup’s memory
footprint restricts how much it can hurt other workloads competing for memory,
and protects it from other cgroups looking for more memory.
However, on systems with tiered memory (e.g. CXL), memory utility is no longer
homogeneous; toptier and lowtier memory differ in both performance and scarcity,
so a cgroup’s memory footprint no longer accurately represents its consumption
of the system’s limited resources. As an extreme example, a cgroup with 10G of toptier
(e.g. DRAM) memory and a cgroup with 10G of lowtier (e.g. CXL) memory both
appear to be consuming the same amount of system resources from memcg’s
perspective, despite the performance asymmetry between the two workloads.
Therefore, on tiered systems, memory isolation is currently incomplete:
workloads that are well-behaved within their memcg limits may still hurt the
performance of other well-behaved workloads by hogging more than their
“fair share” of toptier memory.
Introduce tier-aware memcg limits, which establish independent toptier limits
that scale with the memory limits and the ratio of toptier:total memory
available on the system.
INTERFACE
=========
This series introduces only one adjustable knob to userspace: a new cgroup mount
option, “memory_tiered_limits”, which toggles whether the cgroup mount will
scale toptier limits. It also introduces four new read-only cgroup interface
files per cgroup: memory.toptier_{min, low, high, max}.
The new toptier memory limits are scaled according to the amounts of toptier
and total memory available on the system, as follows:
memory.toptier_high = (toptier_mem / total_mem) * memory.high
For instance, on a host with 100G of memory, split into 75G toptier and 25G CXL,
the “toptier ratio” would be 75 / 100 = 0.75. A cgroup with memcg limits
{min: 8G, low: 12G, high: 20G, max: 24G} would then see toptier limits of
{min: 6G, low: 9G, high: 15G, max: 18G}.
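As a sanity check, the scaling rule reduces to a one-line computation. Below is
a minimal userspace C sketch of it; the helper name and the GiB-based units are
illustrative assumptions, not code from this series:

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Hypothetical helper (not a symbol from this series):
     * memory.toptier_X = memory.X * toptier_mem / total_mem.
     * Multiply before dividing so the integer ratio is not
     * truncated to zero.
     */
    static uint64_t scale_toptier_limit(uint64_t limit, uint64_t toptier,
                                        uint64_t total)
    {
            return limit * toptier / total;
    }

    int main(void)
    {
            /* all sizes in GiB, matching the example host above */
            unsigned long long min = scale_toptier_limit(8, 75, 100);
            unsigned long long high = scale_toptier_limit(20, 75, 100);

            printf("toptier_min = %lluG, toptier_high = %lluG\n", min, high);
            /* prints: toptier_min = 6G, toptier_high = 15G */
            return 0;
    }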
USE CASES
=========
There are workloads that benefit from tiered memory limits, and those that do
not. In particular, hosts packing multiple workloads with the goal of maximizing
host-level throughput may see a regression, because fairness is not free: it
comes at the cost of underutilized toptier memory, overhead from managing memory
migrations, and host-level memory hotness inversion.
On the other hand, fairness can be a valuable property in a number of
configurations, especially for workloads that want to raise the lower bound on
performance rather than optimize for raw throughput:
- VM hosting services that must provide the strongest possible performance
guarantee (i.e. maximize the performance floor) for any workload on a host.
- Database workloads that want to minimize the worst-case latency of queries
served by the host.
- Hosts running memory-isolated sharded workloads whose overall progress is
blocked until the last shard terminates.
- Any workload that wants to minimize variance, so that performance gains
remain measurable over time.
TESTING
=======
To demonstrate the gains in fairness and in the minimum performance guarantee, I
ran performance tests across two data access patterns. All tests were done on a
1T host with 750G DRAM and 250G CXL, spawning four 220G workloads
{memory.high == memory.max == 220G}. Three of those workloads are “memory hogs”,
which run first and pre-allocate all of their memory. The last workload is the
“victim”, which only gets to run once the other three workloads have already
allocated their memory. Once the victim has allocated its memory as well, we
measure read throughput for the following setups:
1. random memory access in the 220G anon region
2. hot / cold memory access, where the hot region (100G) gets 90% of the reads,
and the cold region (120G) gets 10% of the reads
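For concreteness, a minimal C sketch of the hot / cold reader in setup 2 is
below; the drand48()-based offset selection and the function shape are
illustrative assumptions, with only the sizes and the 90/10 split taken from
above:

    #include <stdint.h>
    #include <stdlib.h>

    #define GiB (1024UL * 1024 * 1024)

    /*
     * Illustrative reader for setup 2: 90% of reads land in the 100G
     * hot region, 10% in the 120G cold region that follows it. buf is
     * assumed to point at the 220G anon mapping.
     */
    static uint64_t hot_cold_reads(volatile const char *buf, long nreads)
    {
            uint64_t sum = 0;

            for (long i = 0; i < nreads; i++) {
                    size_t off;

                    if (drand48() < 0.9)    /* hot: first 100G */
                            off = (size_t)(drand48() * 100 * GiB);
                    else                    /* cold: next 120G */
                            off = 100 * GiB + (size_t)(drand48() * 120 * GiB);
                    /* sum only exists to keep the reads from being elided */
                    sum += (uint64_t)buf[off];
            }
            return sum;
    }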
First, let’s look at what the results look like with NUMAB=2.

Experiment 1 (random access)
Per-cgroup throughput (Mops/s):
Cgroup Baseline Tier-Aware
------ -------- ----------
hog 21.457 17.733
hog 22.773 16.329
hog 22.630 16.549
victim 12.315 16.950
DRAM / CXL distribution (GB):
Cgroup Baseline Tier-Aware
------ -------- ----------
hog 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
hog 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
hog 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
victim 69.3 DRAM / 150.7 CXL 186.7 DRAM / 33.3 CXL
Experiment 2 (hot / cold access)
Per-cgroup throughput (Mops/s):
Cgroup Baseline Tier-Aware
------ -------- ----------
wl0 24.280 17.815
wl1 23.929 15.019
wl2 23.645 15.605
wl3 11.624 15.998
DRAM / CXL distribution (GB):
Cgroup Baseline Tier-Aware
------ -------- ----------
wl0 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
wl1 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
wl2 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
wl3 70.4 DRAM / 149.6 CXL 186.7 DRAM / 33.3 CXL
With NUMAB=0, the pattern remains the same, but overall throughput is higher
and variance is lower.
I believe there is a negative interaction here between NUMA balancing’s
host-level hotness tracking and the tier-aware memcg limits’ push to make
memcg-aware migration decisions (see open questions below).
The results above demonstrate the desired effect: CXL usage is distributed
fairly across the workloads regardless of when they were launched, and
performance variance is minimized.
OPEN QUESTIONS (for mailing list & for LSFMMBPF)
================================================
1. Should memory.toptier_max be enforced? And if so, what should it look like?
In my testing, I have found that enforcing memory.toptier_max in the same way
as memory.max leads to significant throttling, as each allocation above the
toptier limit causes a loop of allocate on toptier --> scan toptier LRU for
victim --> demote victim page --> allocate on toptier...
Thus, in my tests above, I ran with the last patch (memory.toptier_max
enforcement) disabled. Are there use cases for enforcing memory.toptier_max?
For this RFC, I have included it for review, but I feel it makes sense to
drop toptier max enforcement.
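To make the feedback loop concrete, here is a toy userspace simulation of the
behavior described above; it is purely illustrative (made-up units, no real
kernel structures), not code from patch 9:

    #include <stdio.h>

    int main(void)
    {
            long toptier_max = 100;         /* pages; already at the limit */
            long toptier_usage = 100;

            for (int alloc = 1; alloc <= 3; alloc++) {
                    toptier_usage++;        /* allocation lands on toptier */
                    while (toptier_usage > toptier_max) {
                            /* scan toptier LRU, demote one victim page */
                            toptier_usage--;
                            printf("alloc %d: scanned LRU, demoted a page\n",
                                   alloc);
                    }
            }
            /* every allocation over the limit pays a scan + demotion */
            return 0;
    }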
2. This version of the code does its best to generalize the memcg stock system
as much as possible, but still only distinguishes between toptier and lowtier.
Does it make sense to support 3+ tiers? Is there real hardware out there today
that wants fairness enforced at that granularity?
2-1. Should swap be considered its own tier?
3. Should users be able to tune anything? Currently, the only choice users have
is whether to enable the limits. Options for userspace tuning include: explicit
per-cgroup toptier limits; a system-wide toptier:lowtier ratio; per-cgroup
toptier:lowtier ratios.
4. Tiered memcg limits interfere with existing promotion mechanisms like NUMA
balancing (NUMAB=2), which promote memory on a system-wide basis, ignoring
process and memcg context. What kinds of promotion mechanisms would work well
in memcg-aware contexts?
DEPENDENCIES
============
This work is built on top of my recent RFC [1] that moves stocks from the memcg
level to the page_counter level, to make the toptier charging path cheaper. In
addition, this series is limited to LRU folios; kmem memory and memory that is
otherwise not charged on an lruvec basis (i.e. tracked with memcg information
only, without a physical node; see enum memcg_stat_item) is not accounted for.
There are landed and ongoing efforts to introduce per-lruvec accounting for
these as well:
- vmalloc (from Johannes): in mm-stable [2]
- zswap / zswapped / zswap_incompressible [3]
- percpu: in progress [4]
CHANGELOG V1 --> V2
===================
- The toptier:total ratio calculation has been simplified to ignore cpusets and
now exists as a single system-wide ratio. This came from the realization that
having cgroups that opt in and out of CXL co-existing on the same system raises
the question of how the limits should be enforced, and whether such a
configuration is even desirable.
- The simplification above means the toptier page_counter can live per-memcg
rather than in mem_cgroup_per_node.
- Independent memcg stock management for toptier
- Included min / max enforcement (for max, see the open questions above)
- Exported toptier limits as read-only cgroup files
- Turned the build config into a mount option, as suggested by Michal Hocko
Thank you for reading this long cover letter. Have a great day everyone!
[1] https://lore.kernel.org/all/20260410210742.550489-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/all/20260220191035.3703800-1-hannes@cmpxchg.org/
[3] https://lore.kernel.org/all/20260226192936.3190275-1-joshua.hahnjy@gmail.com/
[4] https://lore.kernel.org/all/20260404033844.1892595-1-joshua.hahnjy@gmail.com/
Joshua Hahn (9):
cgroup: Introduce memory_tiered_limits cgroup mount option
mm/memory-tiers: Introduce toptier utility functions
mm/memcontrol: Refactor page_counter charging in try_charge_memcg
mm/memcontrol: charge/uncharge toptier memory to mem_cgroup
mm/memcontrol: Set toptier limits proportional to memory limits
mm/vmscan, memcontrol: Add nodemask to try_to_free_mem_cgroup_pages
mm/memcontrol: Make memory.low and memory.min tier-aware
mm/memcontrol: Make memory.high tier-aware
mm/memcontrol: Make memory.max tier-aware
include/linux/cgroup-defs.h | 5 +
include/linux/memcontrol.h | 35 ++++
include/linux/memory-tiers.h | 17 ++
include/linux/swap.h | 3 +-
kernel/cgroup/cgroup.c | 12 ++
mm/memcontrol-v1.c | 6 +-
mm/memcontrol.c | 306 +++++++++++++++++++++++++++++++++++++----
mm/memory-tiers.c | 46 +++++-
mm/vmscan.c | 11 +-
9 files changed, 402 insertions(+), 39 deletions(-)
--
2.52.0