* [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware
@ 2026-04-23 20:34 Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 1/9 v2] cgroup: Introduce memory_tiered_limits cgroup mount option Joshua Hahn
` (8 more replies)
0 siblings, 9 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Tejun Heo, Johannes Weiner, Michal Koutny, Michal Hocko,
Roman Gushchin, Shakeel Butt, Andrew Morton, David Hildenbrand,
Chris Li, Kairui Song, Muchun Song, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Youngjun Park, Qi Zheng, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Kaiyang Zhao, David Rientjes, Yiannis Nikolakopoulos,
Rao, Bharata Bhasker, cgroups, linux-kernel, kernel-team
INTRODUCTION
============
Memory cgroups provide an interface that allows multiple workloads on a host to
co-exist via weak and strong memory isolation guarantees. This works because,
for the most part, all memory has equal utility. Isolating a cgroup's memory
footprint restricts how much it can hurt other workloads competing for memory,
and protects it from other cgroups looking for more memory.
However, on systems with tiered memory (e.g. CXL), memory utility is no longer
homogeneous; toptier and lowtier memory provide different performance
characteristics and have different scarcity, meaning memory footprint no longer
serves as an accurate representation of a cgroup’s consumption of the system’s
limited resources. As an extreme example, a cgroup with 10G of toptier
(e.g. DRAM) memory and a cgroup with 10G of lowtier (e.g. CXL) memory both
appear to be consuming the same amount of system resources from memcg’s
perspective, despite the performance asymmetry between the two workloads.
Therefore, on tiered systems, memory isolation is currently incomplete:
workloads that are well-behaved within their memcg limits may still hurt the
performance of other well-behaved workloads by hogging more than their
"fair share" of toptier memory.
Introduce tier-aware memcg limits, which establish independent toptier limits
that scale with the memory limits and the ratio of toptier:total memory
available on the system.
INTERFACE
=========
This series introduces only one adjustable knob to userspace: a new cgroup mount
option "memory_tiered_limits", which toggles whether the cgroup mount will scale
toptier limits. It also introduces four new read-only cgroup interface files per
cgroup: memory.toptier_{min, low, high, max}.
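For example, the option can be enabled at mount time and the derived limits
read back. The cgroup name "workload" below is hypothetical, and the output
assumes the 75% toptier ratio from the example that follows:

  # mount -t cgroup2 -o memory_tiered_limits none /sys/fs/cgroup
  # echo 20G > /sys/fs/cgroup/workload/memory.high
  # cat /sys/fs/cgroup/workload/memory.toptier_high
  16106127360

(16106127360 bytes == 15G == 0.75 * 20G.)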
The new toptier memory limits are scaled according to the ratio of toptier
memory to total memory available on the system, as follows:
memory.toptier_high = (toptier_mem / total_mem) * memory.high
For instance, on a host with 100G of memory, 75G toptier and 25G CXL, the
"toptier ratio" would be 75 / 100 = 0.75. A cgroup with the memcg limits
{min: 8G, low: 12G, high: 20G, max: 24G} would see toptier limits scaled
to {min: 6G, low: 9G, high: 15G, max: 18G}.
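For reference, here is a minimal sketch of how the scaling helper used in
patch 5, mt_scale_by_toptier(), could be implemented. Patch 2, which actually
introduces the helper, is not quoted in this excerpt, so the body and the
toptier_pages / total_pages globals below are assumptions for illustration:

  /*
   * Sketch only -- the real mt_scale_by_toptier() is introduced in
   * patch 2 and may differ. toptier_pages / total_pages stand in for
   * the system-wide ratio maintained by the memory-tiers code.
   */
  #include <linux/math64.h>

  static unsigned long toptier_pages, total_pages;

  static unsigned long mt_scale_by_toptier(unsigned long val)
  {
  	/* Multiply in 64 bits first so the product cannot truncate. */
  	return div64_u64((u64)val * toptier_pages, total_pages);
  }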
USE CASES
=========
There are workloads that benefit from tiered memory limits, and those that do
not. Specifically, hosts packing multiple workloads with the goal of maximizing
host-level throughput may see a regression, because fairness is not free: it
comes at the cost of underutilized toptier memory, the overhead of managing
memory migrations, and host-level memory hotness inversion.
On the other hand, fairness can prove to be a valuable resource for a number of
configurations, especially with workloads that want to raise the lower bound on
performance, rather than optimize for raw throughput:
- VM hosting services that must provide the best possible performance guarantee
(i.e. maximize the infimum) for any workload present on the host.
- Database workloads that want to minimize the maximum latency (i.e. minimize
the supremum) for queries served by the host.
- Hosts running memory-isolated sharded workloads that block progress until the
last shard terminates.
- Any workload that wants to minimize variance, so that gains in performance
remain measurable over time.
TESTING
=======
To demonstrate the fairness and minimum performance guarantee improvements, I
performed some performance tests across two data access patterns. All tests
were done on a 1T host with 750G DRAM and 250G CXL, spawning four 220G workloads
{memory.high == memory.max == 220G}. Three of those workloads are "memory hogs",
which run on the host first and pre-allocate all of their memory. The last
workload is the "victim", which only gets to run once the other three workloads
have already allocated their memory. Once the victim has allocated its memory
as well, we measure read throughput for the following setups (a sketch of the
measurement loop follows the list):
1. random memory access in the 220G anon region
2. hot / cold memory access, where the hot region (100G) gets 90% of the reads,
and the cold region (120G) gets 10% of the reads
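Below is a minimal sketch of the kind of read loop the victim runs for setup 1
(an assumption -- the actual harness is not part of this series; the region
size and iteration count are illustrative):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <time.h>
  #include <sys/mman.h>

  int main(void)
  {
  	size_t bytes = 220UL << 30;		/* 220G anon region */
  	size_t nr_words = bytes / sizeof(uint64_t);
  	uint64_t iters = 1UL << 28, i, sum = 0;
  	volatile uint64_t *region;
  	struct timespec t0, t1;
  	double secs;

  	region = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
  		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  	if (region == MAP_FAILED)
  		return 1;
  	memset((void *)region, 1, bytes);	/* pre-fault every page */

  	srand(42);
  	clock_gettime(CLOCK_MONOTONIC, &t0);
  	for (i = 0; i < iters; i++) {
  		/* combine two rand() calls so the index spans 220G */
  		uint64_t r = ((uint64_t)rand() << 31) | (uint64_t)rand();

  		sum += region[r % nr_words];
  	}
  	clock_gettime(CLOCK_MONOTONIC, &t1);

  	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  	printf("%.3f Mops/s (checksum %llu)\n", iters / secs / 1e6,
  	       (unsigned long long)sum);
  	return 0;
  }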
First, let's look at the results with NUMAB=2:
Experiment 1 (random access)
Per-cgroup throughput (Mops/s):
Cgroup Baseline Tier-Aware
------ -------- ----------
hog 21.457 17.733
hog 22.773 16.329
hog 22.630 16.549
victim 12.315 16.950
DRAM / CXL distribution (GB):
Cgroup Baseline Tier-Aware
------ -------- ----------
hog 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
hog 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
hog 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
victim 69.3 DRAM / 150.7 CXL 186.7 DRAM / 33.3 CXL
Experiment 2 (hot / cold access; wl0-wl2 are the hogs, wl3 is the victim)
Per-cgroup throughput (Mops/s):
Cgroup Baseline Tier-Aware
------ -------- ----------
wl0 24.280 17.815
wl1 23.929 15.019
wl2 23.645 15.605
wl3 11.624 15.998
DRAM / CXL distribution (GB):
Cgroup Baseline Tier-Aware
------ -------- ----------
wl0 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
wl1 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
wl2 220.0 DRAM / 0.0 CXL 181.6 DRAM / 38.4 CXL
wl3 70.4 DRAM / 149.6 CXL 186.7 DRAM / 33.3 CXL
With NUMAB=0, the pattern remains the same, but overall throughput increases
and variance decreases.
I believe there is a negative interaction here between NUMA balancing's
host-level hotness tracking and the tier-aware memcg limits' push to make
memcg-aware migration decisions (see open questions below).
The results above demonstrate the desired effect of fairly distributing CXL
usage across the workloads regardless of when they were launched, and minimizing
performance variance.
OPEN QUESTIONS (for mailing list & for LSFMMBPF)
================================================
1. Should memory.toptier_max be enforced? And if so, what should it look like?
In my testing, I have found that enforcing memory.toptier_max in the same way
as memory.max leads to significant throttling, as each allocation above the
toptier limit causes a loop of allocate on toptier --> scan toptier LRU for
victim --> demote victim page --> allocate on toptier...
Thus, in my test above, I ran with the last patch (memory.toptier_max
enforcement) disabled. Are there use-cases for enforcing memory.toptier_max?
For this RFC, I’ve included it for review, but I feel that it makes sense to
drop toptier enforcement.
2. This version of the code does its best to generalize the memcg stock system
as much as possible, but still only makes a distinction between toptier /
lowtier. Does it make sense to support 3+ tiers? Are there currently real
systems / hardware out there that desire to enforce fairness at that scale?
2-1. Should swap be considered its own tier?
3. Should users be able to tune anything? Currently, the only choice is for
users to enable the limits or not. Options for userspace tuning include:
setting cgroup-wide toptier limits; system-wide toptier:lowtier ratios;
cgroup-level toptier:lowtier ratios.
4. Tiered memcg limits interfere with existing promotion mechanisms like NUMA
balancing (NUMAB=2), which promote memory on a system-wide basis, ignoring
process and memcg contexts. What kinds of promotion mechanisms could be made
to work in memcg-aware contexts?
DEPENDENCIES
============
This work is built upon my recent RFC [1] to move stocks from the memcg level to
the page_counter level, to make the toptier charging path cheaper. In addition,
this series is limited to LRU folios; kmem memory and memory that is otherwise
not charged on an lruvec basis (an lruvec carries both physical node and memcg
information; non-lruvec charges are tracked per-memcg as enum memcg_stat_item)
is not accounted for. There are landed and ongoing efforts to introduce
per-lruvec accounting for these as well:
- vmalloc (from Johannes): mm-stable [2]
- zswap / zswapped / zswap_incompressible [3]
- percpu: in progress [4]
CHANGELOG V1 --> V2
===================
- The toptier:total ratio calculation has been simplified to ignore cpusets and
now exists as a system-wide ratio. This came from the realization that having
cgroups that opt in and opt out of CXL co-existing on the system raises the
question of how the limits should be enforced, and whether such a configuration
is even desirable.
- The simplification above means the toptier struct page_counter can be
per-memcg, not per mem_cgroup_per_node.
- Independent memcg stock management for toptier
- Included min / max enforcement (for max, see questions above)
- Exported toptier limits as read-only cgroup files
- Turned the build config into a mount option, as suggested by Michal Hocko
Thank you for reading this long cover letter. Have a great day everyone!
[1] https://lore.kernel.org/all/20260410210742.550489-1-joshua.hahnjy@gmail.com/
[2] https://lore.kernel.org/all/20260220191035.3703800-1-hannes@cmpxchg.org/
[3] https://lore.kernel.org/all/20260226192936.3190275-1-joshua.hahnjy@gmail.com/
[4] https://lore.kernel.org/all/20260404033844.1892595-1-joshua.hahnjy@gmail.com/
Joshua Hahn (9):
cgroup: Introduce memory_tiered_limits cgroup mount option
mm/memory-tiers: Introduce toptier utility functions
mm/memcontrol: Refactor page_counter charging in try_charge_memcg
mm/memcontrol: charge/uncharge toptier memory to mem_cgroup
mm/memcontrol: Set toptier limits proportional to memory limits
mm/vmscan, memcontrol: Add nodemask to try_to_free_mem_cgroup_pages
mm/memcontrol: Make memory.low and memory.min tier-aware
mm/memcontrol: Make memory.high tier-aware
mm/memcontrol: Make memory.max tier-aware
include/linux/cgroup-defs.h | 5 +
include/linux/memcontrol.h | 35 ++++
include/linux/memory-tiers.h | 17 ++
include/linux/swap.h | 3 +-
kernel/cgroup/cgroup.c | 12 ++
mm/memcontrol-v1.c | 6 +-
mm/memcontrol.c | 306 +++++++++++++++++++++++++++++++++++++----
mm/memory-tiers.c | 46 +++++-
mm/vmscan.c | 11 +-
9 files changed, 402 insertions(+), 39 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH 1/9 v2] cgroup: Introduce memory_tiered_limits cgroup mount option
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 3/9 v2] mm/memcontrol: Refactor page_counter charging in try_charge_memcg Joshua Hahn
` (7 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Tejun Heo, Johannes Weiner, Michal Koutný, Michal Hocko,
Roman Gushchin, Shakeel Butt, Muchun Song, cgroups, linux-kernel,
kernel-team
Introduce a cgroup mount option memory_tiered_limits to enable
tier-proportional scaling of the memory cgroup controller limits
memory.{min, low, high, max}.
The mount option currently does not have any effect.
Later commits will scale memcg limits proportional to the system's
toptier:total capacity ratio.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/cgroup-defs.h | 5 +++++
include/linux/memcontrol.h | 14 ++++++++++++++
kernel/cgroup/cgroup.c | 12 ++++++++++++
3 files changed, 31 insertions(+)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index bb92f5c169ca2..0b6861f4faece 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -128,6 +128,11 @@ enum {
* Enable legacy local pids.events.
*/
CGRP_ROOT_PIDS_LOCAL_EVENTS = (1 << 20),
+
+ /*
+ * Enable tier-proportional scaling of limits for the memory controller.
+ */
+ CGRP_ROOT_MEMORY_TIERED_LIMITS = (1 << 21),
};
/* cftype->flags */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dc3fa687759b4..be45641e890e4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -533,6 +533,15 @@ static inline bool mem_cgroup_disabled(void)
return !cgroup_subsys_enabled(memory_cgrp_subsys);
}
+static inline bool mem_cgroup_tiered_limits(void)
+{
+#ifdef CONFIG_NUMA
+ return cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_TIERED_LIMITS;
+#else
+ return false;
+#endif
+}
+
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
@@ -1084,6 +1093,11 @@ static inline bool mem_cgroup_disabled(void)
return true;
}
+static inline bool mem_cgroup_tiered_limits(void)
+{
+ return false;
+}
+
static inline void memcg_memory_event(struct mem_cgroup *memcg,
enum memcg_memory_event event)
{
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index babf7b4560488..6a34d0e179dc5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1989,6 +1989,7 @@ enum cgroup2_param {
Opt_memory_recursiveprot,
Opt_memory_hugetlb_accounting,
Opt_pids_localevents,
+ Opt_memory_tiered_limits,
nr__cgroup2_params
};
@@ -1999,6 +2000,7 @@ static const struct fs_parameter_spec cgroup2_fs_parameters[] = {
fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot),
fsparam_flag("memory_hugetlb_accounting", Opt_memory_hugetlb_accounting),
fsparam_flag("pids_localevents", Opt_pids_localevents),
+ fsparam_flag("memory_tiered_limits", Opt_memory_tiered_limits),
{}
};
@@ -2031,6 +2033,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
case Opt_pids_localevents:
ctx->flags |= CGRP_ROOT_PIDS_LOCAL_EVENTS;
return 0;
+ case Opt_memory_tiered_limits:
+ ctx->flags |= CGRP_ROOT_MEMORY_TIERED_LIMITS;
+ return 0;
}
return -EINVAL;
}
@@ -2072,6 +2077,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags)
cgrp_dfl_root.flags |= CGRP_ROOT_PIDS_LOCAL_EVENTS;
else
cgrp_dfl_root.flags &= ~CGRP_ROOT_PIDS_LOCAL_EVENTS;
+
+ if (root_flags & CGRP_ROOT_MEMORY_TIERED_LIMITS)
+ cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_TIERED_LIMITS;
+ else
+ cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_TIERED_LIMITS;
}
}
@@ -2089,6 +2099,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root
seq_puts(seq, ",memory_hugetlb_accounting");
if (cgrp_dfl_root.flags & CGRP_ROOT_PIDS_LOCAL_EVENTS)
seq_puts(seq, ",pids_localevents");
+ if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_TIERED_LIMITS)
+ seq_puts(seq, ",memory_tiered_limits");
return 0;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC PATCH 3/9 v2] mm/memcontrol: Refactor page_counter charging in try_charge_memcg
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 1/9 v2] cgroup: Introduce memory_tiered_limits cgroup mount option Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 4/9 v2] mm/memcontrol: charge/uncharge toptier memory to mem_cgroup Joshua Hahn
` (6 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Muchun Song, cgroups, linux-kernel, kernel-team
In preparation for adding charging and uncharging of a new page_counter
"toptier" to try_charge_memcg, refactor the code so that it is easier to
roll back partial charges when any one of the three page_counters
fails to charge.
No functional changes intended.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/memcontrol.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7de23ecd7cef6..8f7bedb55dbb1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2385,18 +2385,22 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
retry:
reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
- if (!do_memsw_account() ||
- page_counter_try_charge(&memcg->memsw, nr_pages, &counter)) {
- if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
- goto done_restock;
- if (do_memsw_account())
- page_counter_uncharge(&memcg->memsw, nr_pages);
- mem_over_limit = mem_cgroup_from_counter(counter, memory);
- } else {
+
+ if (do_memsw_account() &&
+ !page_counter_try_charge(&memcg->memsw, nr_pages, &counter)) {
mem_over_limit = mem_cgroup_from_counter(counter, memsw);
reclaim_options &= ~MEMCG_RECLAIM_MAY_SWAP;
+ goto reclaim;
}
+ if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
+ goto done_restock;
+
+ if (do_memsw_account())
+ page_counter_uncharge(&memcg->memsw, nr_pages);
+ mem_over_limit = mem_cgroup_from_counter(counter, memory);
+
+reclaim:
/*
* Prevent unbounded recursion when reclaim operations need to
* allocate memory. This might exceed the limits temporarily,
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC PATCH 4/9 v2] mm/memcontrol: charge/uncharge toptier memory to mem_cgroup
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 1/9 v2] cgroup: Introduce memory_tiered_limits cgroup mount option Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 3/9 v2] mm/memcontrol: Refactor page_counter charging in try_charge_memcg Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 5/9 v2] mm/memcontrol: Set toptier limits proportional to memory limits Joshua Hahn
` (5 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Muchun Song, cgroups, linux-kernel, kernel-team
Memory cgroup limits currently offer a way to isolate memory as a
resource, but treat the cost/value of all memory as equal,
regardless of whether it resides in a toptier node or not.
To better capture the asymmetric utility of toptier memory relative to
"lowtier" memory, account toptier memory usage in parallel to existing
memory accounting mechanisms. To do this, introduce a new page_counter
"toptier" to mem_cgroup.
From a simplified perspective, we can achieve this by checking the
physical location of a folio when the memory page_counter is updated, and
deciding whether to also account it to toptier. Add a new "toptier"
parameter to try_charge_memcg(), which callers must determine.
However, as of this RFC, this simplified model only works on LRU folios
(callers of try_charge_memcg() from charge_memcg()). The other two
sites, obj_cgroup_charge_pages() and mem_cgroup_sk_charge(), will be
addressed in future patches that transition the relevant enum
memcg_stat_item entries to per-lruvec counters.
Enforcement mechanisms are not present at this point; failing the
toptier limit check has no effect yet, but the charges are still
accumulated.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 63 ++++++++++++++++++++++++++++++++++----
2 files changed, 58 insertions(+), 6 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index be45641e890e4..0cdb6cd1955dc 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -206,6 +206,7 @@ struct mem_cgroup {
/* Accounted resources */
struct page_counter memory; /* Both v1 & v2 */
+ struct page_counter toptier; /* v2 only */
union {
struct page_counter swap; /* v2 only */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8f7bedb55dbb1..d891cf77cf6d6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -53,6 +53,7 @@
#include <linux/seq_file.h>
#include <linux/vmpressure.h>
#include <linux/memremap.h>
+#include <linux/memory-tiers.h>
#include <linux/mm_inline.h>
#include <linux/swap_cgroup.h>
#include <linux/cpu.h>
@@ -2096,6 +2097,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
for_each_mem_cgroup(memcg) {
page_counter_drain_cpu(&memcg->memory, cpu);
+ page_counter_drain_cpu(&memcg->toptier, cpu);
page_counter_drain_cpu(&memcg->memsw, cpu);
}
@@ -2370,7 +2372,7 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
}
static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages)
+ unsigned int nr_pages, bool toptier)
{
int nr_retries = MAX_RECLAIM_RETRIES;
struct mem_cgroup *mem_over_limit;
@@ -2382,9 +2384,11 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool raised_max_event = false;
unsigned long pflags;
bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
+ bool toptier_charged;
retry:
reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
+ toptier_charged = false;
if (do_memsw_account() &&
!page_counter_try_charge(&memcg->memsw, nr_pages, &counter)) {
@@ -2393,11 +2397,18 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
goto reclaim;
}
+ if (toptier &&
+ page_counter_try_charge(&memcg->toptier, nr_pages, &counter))
+ toptier_charged = true;
+
if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
goto done_restock;
+ if (toptier_charged)
+ page_counter_uncharge(&memcg->toptier, nr_pages);
if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
+
mem_over_limit = mem_cgroup_from_counter(counter, memory);
reclaim:
@@ -2490,6 +2501,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
* being freed very soon. Allow memory usage go over the limit
* temporarily by force charging it.
*/
+ if (toptier)
+ page_counter_charge(&memcg->toptier, nr_pages);
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
@@ -2559,7 +2572,7 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_is_root(memcg))
return 0;
- return try_charge_memcg(memcg, gfp_mask, nr_pages);
+ return try_charge_memcg(memcg, gfp_mask, nr_pages, false);
}
static void commit_charge(struct folio *folio, struct obj_cgroup *objcg)
@@ -2859,7 +2872,7 @@ static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
memcg = get_mem_cgroup_from_objcg(objcg);
- ret = try_charge_memcg(memcg, gfp, nr_pages);
+ ret = try_charge_memcg(memcg, gfp, nr_pages, false);
if (ret)
goto out;
@@ -2888,6 +2901,11 @@ static void page_set_objcg(struct page *page, const struct obj_cgroup *objcg)
page->memcg_data = (unsigned long)objcg | MEMCG_DATA_KMEM;
}
+static bool should_charge_toptier(struct folio *folio)
+{
+ return mem_cgroup_tiered_limits() && node_is_toptier(folio_nid(folio));
+}
+
/**
* __memcg_kmem_charge_page: charge a kmem page to the current memory cgroup
* @page: page to charge
@@ -3760,6 +3778,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
static void mem_cgroup_free(struct mem_cgroup *memcg)
{
page_counter_free_stock(&memcg->memory);
+ page_counter_free_stock(&memcg->toptier);
page_counter_free_stock(&memcg->memsw);
lru_gen_exit_memcg(memcg);
memcg_wb_domain_exit(memcg);
@@ -3866,6 +3885,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
WRITE_ONCE(memcg->swappiness, mem_cgroup_swappiness(parent));
page_counter_init(&memcg->memory, &parent->memory, memcg_on_dfl);
+ page_counter_init(&memcg->toptier, &parent->toptier, memcg_on_dfl);
page_counter_init(&memcg->swap, &parent->swap, false);
#ifdef CONFIG_MEMCG_V1
memcg->memory.track_failcnt = !memcg_on_dfl;
@@ -3877,6 +3897,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
init_memcg_stats();
init_memcg_events();
page_counter_init(&memcg->memory, NULL, true);
+ page_counter_init(&memcg->toptier, NULL, true);
page_counter_init(&memcg->swap, NULL, false);
#ifdef CONFIG_MEMCG_V1
page_counter_init(&memcg->kmem, NULL, false);
@@ -3936,6 +3957,7 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
/* failure is nonfatal, charges fall back to direct hierarchy */
page_counter_enable_stock(&memcg->memory, MEMCG_CHARGE_BATCH);
+ page_counter_enable_stock(&memcg->toptier, MEMCG_CHARGE_BATCH);
if (do_memsw_account())
page_counter_enable_stock(&memcg->memsw, MEMCG_CHARGE_BATCH);
@@ -4013,6 +4035,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
drain_all_stock(memcg);
page_counter_disable_stock(&memcg->memory);
+ page_counter_disable_stock(&memcg->toptier);
page_counter_disable_stock(&memcg->memsw);
mem_cgroup_private_id_put(memcg, 1);
@@ -4825,7 +4848,8 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
objcg = get_obj_cgroup_from_memcg(memcg);
/* Do not account at the root objcg level. */
if (!obj_cgroup_is_root(objcg))
- ret = try_charge_memcg(memcg, gfp, folio_nr_pages(folio));
+ ret = try_charge_memcg(memcg, gfp, folio_nr_pages(folio),
+ should_charge_toptier(folio));
if (ret) {
obj_cgroup_put(objcg);
return ret;
@@ -4922,6 +4946,7 @@ struct uncharge_gather {
unsigned long nr_memory;
unsigned long pgpgout;
unsigned long nr_kmem;
+ unsigned long nr_toptier;
int nid;
};
@@ -4942,6 +4967,8 @@ static void uncharge_batch(const struct uncharge_gather *ug)
mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem);
memcg1_account_kmem(memcg, -ug->nr_kmem);
}
+ if (ug->nr_toptier)
+ page_counter_uncharge(&memcg->toptier, ug->nr_toptier);
memcg1_oom_recover(memcg);
}
@@ -4987,8 +5014,11 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
ug->nr_kmem += nr_pages;
} else {
/* LRU pages aren't accounted at the root level */
- if (!obj_cgroup_is_root(objcg))
+ if (!obj_cgroup_is_root(objcg)) {
ug->nr_memory += nr_pages;
+ if (should_charge_toptier(folio))
+ ug->nr_toptier += nr_pages;
+ }
ug->pgpgout++;
WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
@@ -5063,6 +5093,10 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
+
+ /* old folio's toptier usage will be uncharged on free */
+ if (should_charge_toptier(new))
+ page_counter_charge(&memcg->toptier, nr_pages);
}
obj_cgroup_get(objcg);
@@ -5105,6 +5139,23 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
if (!objcg)
return;
+ if (!obj_cgroup_is_root(objcg)) {
+ struct mem_cgroup *memcg;
+ unsigned long nr_pages = folio_nr_pages(old);
+ bool old_toptier, new_toptier;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ old_toptier = should_charge_toptier(old);
+ new_toptier = should_charge_toptier(new);
+
+ if (old_toptier && !new_toptier)
+ page_counter_uncharge(&memcg->toptier, nr_pages);
+ else if (!old_toptier && new_toptier)
+ page_counter_charge(&memcg->toptier, nr_pages);
+ rcu_read_unlock();
+ }
+
/* Transfer the charge and the objcg ref */
commit_charge(new, objcg);
@@ -5180,7 +5231,7 @@ bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return memcg1_charge_skmem(memcg, nr_pages, gfp_mask);
- if (try_charge_memcg(memcg, gfp_mask, nr_pages) == 0) {
+ if (try_charge_memcg(memcg, gfp_mask, nr_pages, false) == 0) {
mod_memcg_state(memcg, MEMCG_SOCK, nr_pages);
return true;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC PATCH 5/9 v2] mm/memcontrol: Set toptier limits proportional to memory limits
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
` (2 preceding siblings ...)
2026-04-23 20:34 ` [RFC PATCH 4/9 v2] mm/memcontrol: charge/uncharge toptier memory to mem_cgroup Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 6/9 v2] mm/vmscan, memcontrol: Add nodemask to try_to_free_mem_cgroup_pages Joshua Hahn
` (4 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, David Hildenbrand, Muchun Song, Lorenzo Stoakes,
Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, cgroups, linux-kernel, kernel-team
Compute proportional toptier limits based on memory limits when users
write to the memory limit cgroup files, or when memory hotplug shifts
the toptier capacity / total capacity ratio.
Also introduce new read-only cgroup files memory.toptier_{min,low,high,max}
to expose the derived toptier limits.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/memcontrol.h | 12 +++++
mm/memcontrol.c | 93 ++++++++++++++++++++++++++++++++++++++
mm/memory-tiers.c | 8 +++-
3 files changed, 111 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0cdb6cd1955dc..6bcb866440075 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -543,6 +543,14 @@ static inline bool mem_cgroup_tiered_limits(void)
#endif
}
+#ifdef CONFIG_NUMA
+void update_memcg_toptier_limits(void);
+#else
+static inline void update_memcg_toptier_limits(void)
+{
+}
+#endif
+
static inline void mem_cgroup_protection(struct mem_cgroup *root,
struct mem_cgroup *memcg,
unsigned long *min,
@@ -1099,6 +1107,10 @@ static inline bool mem_cgroup_tiered_limits(void)
return false;
}
+static inline void update_memcg_toptier_limits(void)
+{
+}
+
static inline void memcg_memory_event(struct mem_cgroup *memcg,
enum memcg_memory_event event)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d891cf77cf6d6..3acb06388405c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3875,6 +3875,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
return ERR_CAST(memcg);
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
+ page_counter_set_high(&memcg->toptier, PAGE_COUNTER_MAX);
memcg1_soft_limit_reset(memcg);
#ifdef CONFIG_ZSWAP
memcg->zswap_max = PAGE_COUNTER_MAX;
@@ -4092,6 +4093,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
page_counter_set_max(&memcg->memory, PAGE_COUNTER_MAX);
+ page_counter_set_max(&memcg->toptier, PAGE_COUNTER_MAX);
page_counter_set_max(&memcg->swap, PAGE_COUNTER_MAX);
#ifdef CONFIG_MEMCG_V1
page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX);
@@ -4100,6 +4102,9 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
page_counter_set_min(&memcg->memory, 0);
page_counter_set_low(&memcg->memory, 0);
page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
+ page_counter_set_min(&memcg->toptier, 0);
+ page_counter_set_low(&memcg->toptier, 0);
+ page_counter_set_high(&memcg->toptier, PAGE_COUNTER_MAX);
memcg1_soft_limit_reset(memcg);
page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
memcg_wb_domain_size_changed(memcg);
@@ -4438,12 +4443,51 @@ static ssize_t memory_peak_write(struct kernfs_open_file *of, char *buf,
#undef OFP_PEAK_UNSET
+static inline unsigned long page_counter_max_or_scale(unsigned long val)
+{
+ return val == PAGE_COUNTER_MAX ? PAGE_COUNTER_MAX :
+ mt_scale_by_toptier(val);
+}
+
+void update_memcg_toptier_limits(void)
+{
+ struct mem_cgroup *memcg;
+
+ if (!mem_cgroup_tiered_limits())
+ return;
+
+ for_each_mem_cgroup(memcg) {
+ unsigned long old_min = READ_ONCE(memcg->memory.min);
+ unsigned long old_low = READ_ONCE(memcg->memory.low);
+ unsigned long old_high = READ_ONCE(memcg->memory.high);
+ unsigned long old_max = READ_ONCE(memcg->memory.max);
+
+ if (memcg == root_mem_cgroup)
+ continue;
+
+ page_counter_set_min(&memcg->toptier,
+ page_counter_max_or_scale(old_min));
+ page_counter_set_low(&memcg->toptier,
+ page_counter_max_or_scale(old_low));
+ page_counter_set_high(&memcg->toptier,
+ page_counter_max_or_scale(old_high));
+ xchg(&memcg->toptier.max,
+ page_counter_max_or_scale(old_max));
+ }
+}
+
static int memory_min_show(struct seq_file *m, void *v)
{
return seq_puts_memcg_tunable(m,
READ_ONCE(mem_cgroup_from_seq(m)->memory.min));
}
+static int toptier_min_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->toptier.min));
+}
+
static ssize_t memory_min_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -4457,6 +4501,9 @@ static ssize_t memory_min_write(struct kernfs_open_file *of,
return err;
page_counter_set_min(&memcg->memory, min);
+ if (mem_cgroup_tiered_limits())
+ page_counter_set_min(&memcg->toptier,
+ page_counter_max_or_scale(min));
return nbytes;
}
@@ -4467,6 +4514,12 @@ static int memory_low_show(struct seq_file *m, void *v)
READ_ONCE(mem_cgroup_from_seq(m)->memory.low));
}
+static int toptier_low_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->toptier.low));
+}
+
static ssize_t memory_low_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -4480,6 +4533,9 @@ static ssize_t memory_low_write(struct kernfs_open_file *of,
return err;
page_counter_set_low(&memcg->memory, low);
+ if (mem_cgroup_tiered_limits())
+ page_counter_set_low(&memcg->toptier,
+ page_counter_max_or_scale(low));
return nbytes;
}
@@ -4490,6 +4546,12 @@ static int memory_high_show(struct seq_file *m, void *v)
READ_ONCE(mem_cgroup_from_seq(m)->memory.high));
}
+static int toptier_high_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->toptier.high));
+}
+
static ssize_t memory_high_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -4505,6 +4567,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
return err;
page_counter_set_high(&memcg->memory, high);
+ if (mem_cgroup_tiered_limits())
+ page_counter_set_high(&memcg->toptier,
+ page_counter_max_or_scale(high));
if (of->file->f_flags & O_NONBLOCK)
goto out;
@@ -4542,6 +4607,12 @@ static int memory_max_show(struct seq_file *m, void *v)
READ_ONCE(mem_cgroup_from_seq(m)->memory.max));
}
+static int toptier_max_show(struct seq_file *m, void *v)
+{
+ return seq_puts_memcg_tunable(m,
+ READ_ONCE(mem_cgroup_from_seq(m)->toptier.max));
+}
+
static ssize_t memory_max_write(struct kernfs_open_file *of,
char *buf, size_t nbytes, loff_t off)
{
@@ -4557,6 +4628,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return err;
xchg(&memcg->memory.max, max);
+ if (mem_cgroup_tiered_limits())
+ xchg(&memcg->toptier.max, page_counter_max_or_scale(max));
if (of->file->f_flags & O_NONBLOCK)
goto out;
@@ -4762,6 +4835,26 @@ static struct cftype memory_files[] = {
.seq_show = memory_max_show,
.write = memory_max_write,
},
+ {
+ .name = "toptier_min",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = toptier_min_show,
+ },
+ {
+ .name = "toptier_low",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = toptier_low_show,
+ },
+ {
+ .name = "toptier_high",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = toptier_high_show,
+ },
+ {
+ .name = "toptier_max",
+ .flags = CFTYPE_NOT_ON_ROOT,
+ .seq_show = toptier_max_show,
+ },
{
.name = "events",
.flags = CFTYPE_NOT_ON_ROOT,
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index acc02679e312d..ddcc11e3919da 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -924,15 +924,19 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
switch (action) {
case NODE_REMOVED_LAST_MEMORY:
mutex_lock(&memory_tier_lock);
- if (clear_node_memory_tier(nn->nid))
+ if (clear_node_memory_tier(nn->nid)) {
establish_demotion_targets();
+ update_memcg_toptier_limits();
+ }
mutex_unlock(&memory_tier_lock);
break;
case NODE_ADDED_FIRST_MEMORY:
mutex_lock(&memory_tier_lock);
memtier = set_node_memory_tier(nn->nid);
- if (!IS_ERR(memtier))
+ if (!IS_ERR(memtier)) {
establish_demotion_targets();
+ update_memcg_toptier_limits();
+ }
mutex_unlock(&memory_tier_lock);
break;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC PATCH 6/9 v2] mm/vmscan, memcontrol: Add nodemask to try_to_free_mem_cgroup_pages
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
` (3 preceding siblings ...)
2026-04-23 20:34 ` [RFC PATCH 5/9 v2] mm/memcontrol: Set toptier limits proportional to memory limits Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 7/9 v2] mm/memcontrol: Make memory.low and memory.min tier-aware Joshua Hahn
` (3 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Chris Li, Kairui Song, Johannes Weiner,
Michal Hocko, Roman Gushchin, Shakeel Butt, Kemeng Shi, Nhat Pham,
Baoquan He, Barry Song, Youngjun Park, Muchun Song, Qi Zheng,
Axel Rasmussen, Yuanchu Xie, Wei Xu, David Hildenbrand,
Lorenzo Stoakes, cgroups, linux-kernel, kernel-team
Add a new nodemask parameter to try_to_free_mem_cgroup_pages to allow
selective reclaim on certain nodes. This new function signature can be
used in future patches to selectively perform reclaim on toptier and
place downward pressure when toptier limits are breached but memcg-wide
limits are not yet breached.
All callers pass NULL to the new nodemask, so there are no functional
changes with this patch.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/swap.h | 3 ++-
mm/memcontrol-v1.c | 6 ++++--
mm/memcontrol.c | 11 +++++++----
mm/vmscan.c | 11 ++++++-----
4 files changed, 19 insertions(+), 12 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1930f81e6be4d..493dd99f3165a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -367,7 +367,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
unsigned int reclaim_options,
- int *swappiness);
+ int *swappiness,
+ nodemask_t *allowed);
extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
gfp_t gfp_mask, bool noswap,
pg_data_t *pgdat,
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 433bba9dfe715..03df1cc71842c 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1500,7 +1500,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
}
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
+ memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
+ NULL, NULL)) {
ret = -EBUSY;
break;
}
@@ -1532,7 +1533,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
return -EINTR;
if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
- MEMCG_RECLAIM_MAY_SWAP, NULL))
+ MEMCG_RECLAIM_MAY_SWAP,
+ NULL, NULL))
nr_retries--;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3acb06388405c..3fb1ee1d18603 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2123,7 +2123,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
MEMCG_RECLAIM_MAY_SWAP,
- NULL);
+ NULL, NULL);
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2432,7 +2432,8 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
- gfp_mask, reclaim_options, NULL);
+ gfp_mask, reclaim_options,
+ NULL, NULL);
psi_memstall_leave(&pflags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -4591,7 +4592,8 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
}
reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ NULL, NULL);
if (!reclaimed && !nr_retries--)
break;
@@ -4651,7 +4653,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
if (nr_reclaims) {
if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
- GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
+ GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
+ NULL, NULL))
nr_reclaims--;
continue;
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a8c8fcccbfc9..615aa0c899dad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6807,7 +6807,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
unsigned int reclaim_options,
- int *swappiness)
+ int *swappiness, nodemask_t *allowed)
{
unsigned long nr_reclaimed;
unsigned int noreclaim_flag;
@@ -6823,6 +6823,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
.may_unmap = 1,
.may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP),
.proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE),
+ .nodemask = allowed,
};
/*
* Traverse the ZONELIST_FALLBACK zonelist of the current node to put
@@ -6848,7 +6849,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
unsigned long nr_pages,
gfp_t gfp_mask,
unsigned int reclaim_options,
- int *swappiness)
+ int *swappiness, nodemask_t *allowed)
{
return 0;
}
@@ -7964,9 +7965,9 @@ int user_proactive_reclaim(char *buf,
reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
MEMCG_RECLAIM_PROACTIVE;
reclaimed = try_to_free_mem_cgroup_pages(memcg,
- batch_size, gfp_mask,
- reclaim_options,
- swappiness == -1 ? NULL : &swappiness);
+ batch_size, gfp_mask, reclaim_options,
+ swappiness == -1 ? NULL : &swappiness,
+ NULL);
} else {
struct scan_control sc = {
.gfp_mask = current_gfp_context(gfp_mask),
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC PATCH 7/9 v2] mm/memcontrol: Make memory.low and memory.min tier-aware
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
` (4 preceding siblings ...)
2026-04-23 20:34 ` [RFC PATCH 6/9 v2] mm/vmscan, memcontrol: Add nodemask to try_to_free_mem_cgroup_pages Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 8/9 v2] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
` (2 subsequent siblings)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Muchun Song, cgroups, linux-kernel, kernel-team
On machines serving multiple workloads whose memory is isolated via
the memory cgroup controller, it is currently impossible to enforce a
fair distribution of toptier memory among the workloads, as the only
enforceable limits concern total memory footprint, not where that
memory resides.
This makes ensuring consistent baseline performance difficult, as each
workload's performance is heavily impacted by workload-external
factors such as which other workloads are co-located on the same host,
and the order in which different workloads are started.
Extend the existing memory.{low, min} protection to be tier-aware in
order to enforce proportional best-effort and guaranteed memory
protection of toptier memory.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/memcontrol.h | 8 ++++++++
mm/memcontrol.c | 3 +++
2 files changed, 11 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6bcb866440075..2222b390ebf10 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -624,6 +624,10 @@ static inline bool mem_cgroup_below_low(struct mem_cgroup *target,
if (mem_cgroup_unprotected(target, memcg))
return false;
+ if (mem_cgroup_tiered_limits() && READ_ONCE(memcg->toptier.elow) >=
+ page_counter_read(&memcg->toptier))
+ return true;
+
return READ_ONCE(memcg->memory.elow) >=
page_counter_read(&memcg->memory);
}
@@ -634,6 +638,10 @@ static inline bool mem_cgroup_below_min(struct mem_cgroup *target,
if (mem_cgroup_unprotected(target, memcg))
return false;
+ if (mem_cgroup_tiered_limits() && READ_ONCE(memcg->toptier.emin) >=
+ page_counter_read(&memcg->toptier))
+ return true;
+
return READ_ONCE(memcg->memory.emin) >=
page_counter_read(&memcg->memory);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3fb1ee1d18603..b115ff40e268d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4933,6 +4933,9 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
root = root_mem_cgroup;
page_counter_calculate_protection(&root->memory, &memcg->memory, recursive_protection);
+ if (mem_cgroup_tiered_limits())
+ page_counter_calculate_protection(&root->toptier,
+ &memcg->toptier, recursive_protection);
}
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC PATCH 8/9 v2] mm/memcontrol: Make memory.high tier-aware
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
` (5 preceding siblings ...)
2026-04-23 20:34 ` [RFC PATCH 7/9 v2] mm/memcontrol: Make memory.low and memory.min tier-aware Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 9/9 v2] mm/memcontrol: Make memory.max tier-aware Joshua Hahn
2026-05-11 15:56 ` [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware David Hildenbrand (Arm)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Muchun Song, cgroups, linux-kernel, kernel-team
On machines serving multiple workloads whose memory is isolated via the
memory cgroup controller, it is currently impossible to enforce a fair
distribution of toptier memory among the workloads, as the limits only
enforce total memory footprint, but not where that memory resides.
This makes ensuring consistent baseline performance difficult, as each
workload's performance is heavily impacted by workload-external factors
such as which other workloads are co-located on the same host, and the
order in which the workloads are started.
Extend the existing memory.high limit to be tier-aware.
Depending on the combination of limit breaches, selectively reclaim on
toptier nodes: when memory.high is breached, perform reclaim on all
nodes. When memory.high is safe but toptier.high is breached, perform
targeted reclaim on toptier nodes only.
Also throttle allocations when the toptier limit is breached, making sure
not to double-penalize when both the toptier and memory limits are breached.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/memcontrol.c | 82 +++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 72 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b115ff40e268d..e5f39830d250d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2112,10 +2112,25 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
do {
unsigned long pflags;
+ nodemask_t toptier_nodes;
+ nodemask_t *reclaim_targets = NULL;
if (page_counter_read(&memcg->memory) <=
- READ_ONCE(memcg->memory.high))
- continue;
+ READ_ONCE(memcg->memory.high)) {
+ if (!mem_cgroup_tiered_limits())
+ continue;
+
+ /*
+ * Even if the memcg is under the memory limit, toptier
+ * may have breached the toptier limit. Engage
+ * targeted reclaim on toptier nodes if so.
+ */
+ if (page_counter_read(&memcg->toptier) <=
+ READ_ONCE(memcg->toptier.high))
+ continue;
+ get_toptier_nodemask(&toptier_nodes);
+ reclaim_targets = &toptier_nodes;
+ }
memcg_memory_event(memcg, MEMCG_HIGH);
@@ -2123,7 +2138,7 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
gfp_mask,
MEMCG_RECLAIM_MAY_SWAP,
- NULL, NULL);
+ NULL, reclaim_targets);
psi_memstall_leave(&pflags);
} while ((memcg = parent_mem_cgroup(memcg)) &&
!mem_cgroup_is_root(memcg));
@@ -2224,6 +2239,23 @@ static u64 mem_find_max_overage(struct mem_cgroup *memcg)
return max_overage;
}
+static u64 toptier_find_max_overage(struct mem_cgroup *memcg)
+{
+ u64 overage, max_overage = 0;
+
+ if (!mem_cgroup_tiered_limits())
+ return 0;
+
+ do {
+ overage = calculate_overage(page_counter_read(&memcg->toptier),
+ READ_ONCE(memcg->toptier.high));
+ max_overage = max(overage, max_overage);
+ } while ((memcg = parent_mem_cgroup(memcg)) &&
+ !mem_cgroup_is_root(memcg));
+
+ return max_overage;
+}
+
static u64 swap_find_max_overage(struct mem_cgroup *memcg)
{
u64 overage, max_overage = 0;
@@ -2326,6 +2358,14 @@ void __mem_cgroup_handle_over_high(gfp_t gfp_mask)
penalty_jiffies = calculate_high_delay(memcg, nr_pages,
mem_find_max_overage(memcg));
+ /*
+ * Don't double-penalize for toptier high overage if memory.high
+ * overage penalization has already been accounted for.
+ */
+ if (!penalty_jiffies)
+ penalty_jiffies += calculate_high_delay(memcg, nr_pages,
+ toptier_find_max_overage(memcg));
+
penalty_jiffies += calculate_high_delay(memcg, nr_pages,
swap_find_max_overage(memcg));
@@ -2522,22 +2562,26 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
do {
bool mem_high, swap_high;
+ bool toptier_high = false;
mem_high = page_counter_read(&memcg->memory) >
READ_ONCE(memcg->memory.high);
swap_high = page_counter_read(&memcg->swap) >
READ_ONCE(memcg->swap.high);
+ toptier_high = mem_cgroup_tiered_limits() &&
+ page_counter_read(&memcg->toptier) >
+ READ_ONCE(memcg->toptier.high);
/* Don't bother a random interrupted task */
if (!in_task()) {
- if (mem_high) {
+ if (mem_high || toptier_high) {
schedule_work(&memcg->high_work);
break;
}
continue;
}
- if (mem_high || swap_high) {
+ if (mem_high || swap_high || toptier_high) {
/*
* The allocating tasks in this cgroup will need to do
* reclaim or be throttled to prevent further growth
@@ -4577,10 +4621,28 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
for (;;) {
unsigned long nr_pages = page_counter_read(&memcg->memory);
- unsigned long reclaimed;
+ unsigned long reclaimed, charge;
+ nodemask_t toptier_nodes;
+ nodemask_t *reclaim_targets = NULL;
- if (nr_pages <= high)
- break;
+ if (nr_pages <= high) {
+ unsigned long toptier_nr_pages, toptier_high;
+
+ if (!mem_cgroup_tiered_limits())
+ break;
+
+ toptier_nr_pages = page_counter_read(&memcg->toptier);
+ toptier_high = READ_ONCE(memcg->toptier.high);
+
+ if (toptier_nr_pages <= toptier_high)
+ break;
+
+ get_toptier_nodemask(&toptier_nodes);
+ reclaim_targets = &toptier_nodes;
+ charge = toptier_nr_pages - toptier_high;
+ } else {
+ charge = nr_pages - high;
+ }
if (signal_pending(current))
break;
@@ -4591,9 +4653,9 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
continue;
}
- reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
+ reclaimed = try_to_free_mem_cgroup_pages(memcg, charge,
GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
- NULL, NULL);
+ NULL, reclaim_targets);
if (!reclaimed && !nr_retries--)
break;
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [RFC PATCH 9/9 v2] mm/memcontrol: Make memory.max tier-aware
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
` (6 preceding siblings ...)
2026-04-23 20:34 ` [RFC PATCH 8/9 v2] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
@ 2026-04-23 20:34 ` Joshua Hahn
2026-05-11 15:56 ` [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware David Hildenbrand (Arm)
8 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-04-23 20:34 UTC (permalink / raw)
To: linux-mm
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
Andrew Morton, Muchun Song, cgroups, linux-kernel, kernel-team
On machines serving multiple workloads whose memory is isolated via the
memory cgroup controller, it is currently impossible to enforce a fair
distribution of toptier memory among the workloads, as the limits only
enforce total memory footprint, but not where that memory resides.
This makes ensuring consistent baseline performance difficult, as each
workload's performance is heavily impacted by workload-external factors
such as which other workloads are co-located on the same host, and the
order in which the workloads are started.
Extend the existing memory.max limit to be tier-aware.
Depending on the combination of limit breaches, selectively reclaim on
toptier nodes: when memory.max is breached, perform reclaim on all
nodes. When memory.max is safe but toptier.max is breached, perform
targeted reclaim on toptier nodes only.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
mm/memcontrol.c | 56 ++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 44 insertions(+), 12 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e5f39830d250d..d8d67ada993ff 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1518,6 +1518,15 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
if (count < limit)
margin = limit - count;
+ if (mem_cgroup_tiered_limits()) {
+ count = page_counter_read(&memcg->toptier);
+ limit = READ_ONCE(memcg->toptier.max);
+ if (count < limit)
+ margin = min(margin, limit - count);
+ else
+ margin = 0;
+ }
+
if (do_memsw_account()) {
count = page_counter_read(&memcg->memsw);
limit = READ_ONCE(memcg->memsw.max);
@@ -2424,11 +2433,12 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
bool raised_max_event = false;
unsigned long pflags;
bool allow_spinning = gfpflags_allow_spinning(gfp_mask);
- bool toptier_charged;
+ nodemask_t toptier_nodes;
+ nodemask_t *reclaim_nodes;
retry:
reclaim_options = MEMCG_RECLAIM_MAY_SWAP;
- toptier_charged = false;
+ reclaim_nodes = NULL;
if (do_memsw_account() &&
!page_counter_try_charge(&memcg->memsw, nr_pages, &counter)) {
@@ -2438,13 +2448,20 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
}
if (toptier &&
- page_counter_try_charge(&memcg->toptier, nr_pages, &counter))
- toptier_charged = true;
+ !page_counter_try_charge(&memcg->toptier, nr_pages, &counter)) {
+ get_toptier_nodemask(&toptier_nodes);
+ reclaim_nodes = &toptier_nodes;
+ mem_over_limit = mem_cgroup_from_counter(counter, toptier);
+
+ if (do_memsw_account())
+ page_counter_uncharge(&memcg->memsw, nr_pages);
+ goto reclaim;
+ }
if (page_counter_try_charge(&memcg->memory, nr_pages, &counter))
goto done_restock;
- if (toptier_charged)
+ if (toptier)
page_counter_uncharge(&memcg->toptier, nr_pages);
if (do_memsw_account())
page_counter_uncharge(&memcg->memsw, nr_pages);
@@ -2473,7 +2490,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
psi_memstall_enter(&pflags);
nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
gfp_mask, reclaim_options,
- NULL, NULL);
+ NULL, reclaim_nodes);
psi_memstall_leave(&pflags);
if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -4683,7 +4700,8 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
unsigned int nr_reclaims = MAX_RECLAIM_RETRIES;
bool drained = false;
- unsigned long max;
+ unsigned long max, toptier_max = PAGE_COUNTER_MAX;
+ nodemask_t toptier_nodes;
int err;
buf = strstrip(buf);
@@ -4692,16 +4710,30 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
return err;
xchg(&memcg->memory.max, max);
- if (mem_cgroup_tiered_limits())
- xchg(&memcg->toptier.max, page_counter_max_or_scale(max));
+ if (mem_cgroup_tiered_limits()) {
+ toptier_max = page_counter_max_or_scale(max);
+ xchg(&memcg->toptier.max, toptier_max);
+ get_toptier_nodemask(&toptier_nodes);
+ }
if (of->file->f_flags & O_NONBLOCK)
goto out;
for (;;) {
unsigned long nr_pages = page_counter_read(&memcg->memory);
+ unsigned long nr_toptier = page_counter_read(&memcg->toptier);
+ unsigned long to_reclaim = 0;
+ nodemask_t *reclaim_nodes = NULL;
+
+ if (nr_pages > max) {
+ to_reclaim = nr_pages - max;
+ } else if (mem_cgroup_tiered_limits() &&
+ nr_toptier > toptier_max) {
+ to_reclaim = nr_toptier - toptier_max;
+ reclaim_nodes = &toptier_nodes;
+ }
- if (nr_pages <= max)
+ if (!to_reclaim)
break;
if (signal_pending(current))
@@ -4714,9 +4746,9 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
}
if (nr_reclaims) {
- if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
+ if (!try_to_free_mem_cgroup_pages(memcg, to_reclaim,
GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP,
- NULL, NULL))
+ NULL, reclaim_nodes))
nr_reclaims--;
continue;
}
--
2.52.0
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
` (7 preceding siblings ...)
2026-04-23 20:34 ` [RFC PATCH 9/9 v2] mm/memcontrol: Make memory.max tier-aware Joshua Hahn
@ 2026-05-11 15:56 ` David Hildenbrand (Arm)
2026-05-11 20:03 ` Joshua Hahn
8 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Arm) @ 2026-05-11 15:56 UTC (permalink / raw)
To: Joshua Hahn, linux-mm
Cc: Tejun Heo, Johannes Weiner, Michal Koutny, Michal Hocko,
Roman Gushchin, Shakeel Butt, Andrew Morton, Chris Li,
Kairui Song, Muchun Song, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, Youngjun Park, Qi Zheng,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Kaiyang Zhao, David Rientjes,
Yiannis Nikolakopoulos, Rao, Bharata Bhasker, cgroups,
linux-kernel, kernel-team
On 4/23/26 22:34, Joshua Hahn wrote:
> [... cover letter trimmed ...]
Assume you have a deeper hierarchy (HBM, DRAM, CXL), or assume you have
multiple NUMA nodes with a hierarchy each.

Your proposal doesn't really seem very versatile, or am I wrong?
--
Cheers,
David
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware
2026-05-11 15:56 ` [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware David Hildenbrand (Arm)
@ 2026-05-11 20:03 ` Joshua Hahn
0 siblings, 0 replies; 11+ messages in thread
From: Joshua Hahn @ 2026-05-11 20:03 UTC (permalink / raw)
To: David Hildenbrand (Arm)
Cc: linux-mm, Tejun Heo, Johannes Weiner, Michal Koutny, Michal Hocko,
Roman Gushchin, Shakeel Butt, Andrew Morton, Chris Li,
Kairui Song, Muchun Song, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, Youngjun Park, Qi Zheng,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Kaiyang Zhao, David Rientjes,
Yiannis Nikolakopoulos, Rao, Bharata Bhasker, cgroups,
linux-kernel, kernel-team
On Mon, May 11, 2026 at 5:56 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 4/23/26 22:34, Joshua Hahn wrote:
> > [... cover letter trimmed ...]
Hi David!!
It was great seeing you at LSFMMBPF. I didn't get a chance to talk with you
in Zagreb, but hopefully I will be less shy and say hello at the next
conference :-)
> Assume you have a deeper hierarchy (HBM, DRAM, CXL), or assume you have
> multiple NUMA nodes with a hierarchy each.
>
> Your proposal doesn't really seem very versatile, or am I wrong?
Let me address these comments separately!
First, the multiple-NUMA-nodes-per-tier case: I think this is already handled
well by the series. Once we see that a memcg is consuming too much memory
from a tier, we trigger reclaim on that memcg via
try_to_free_mem_cgroup_pages(), which as far as I can tell already handles
multiple nodes per memcg. Other than restricting the scan_control nodemask
to the nodes of that tier, I don't think there's anything else to be done.
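Concretely, the targeted call ends up looking roughly like this (a sketch:
nr_to_reclaim stands in for whatever excess we computed, and the trailing
nodemask argument is the one patch 6 threads through to scan_control):

	nodemask_t toptier_nodes;

	/*
	 * get_toptier_nodemask() can return any number of NUMA nodes;
	 * vmscan walks every node in the scan_control nodemask, so a
	 * tier spanning multiple nodes needs no special casing.
	 */
	get_toptier_nodemask(&toptier_nodes);
	try_to_free_mem_cgroup_pages(memcg, nr_to_reclaim, GFP_KERNEL,
				     MEMCG_RECLAIM_MAY_SWAP, NULL,
				     &toptier_nodes);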
Next, the 3+ tier case: I think this is a lot more scalable than it seems at
first. This series depends on another RFC that I sent out [1], which pushes
the concept of a "stock" from memcg down into page_counter, making it cheap
to add more page counters to each memcg. Each additional tier would then
just need its own page_counter to track its usage, and we would trigger
selective reclaim on the targeted tier via the scan_control nodemask
introduced in this series.
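To sketch what that generalization could look like (purely hypothetical:
MAX_MEMORY_TIERS, the tier[] array and charge_tier() don't exist anywhere,
and this series hardcodes a single toptier counter):

#define MAX_MEMORY_TIERS	4	/* made-up bound, for illustration */

/* hypothetical: one counter per tier instead of a single 'toptier' */
struct mem_cgroup_tier_counters {
	struct page_counter tier[MAX_MEMORY_TIERS];
};

/* charge a page's tier; on failure the caller reclaims from that tier */
static bool charge_tier(struct mem_cgroup_tier_counters *tc, int tier_idx,
			unsigned long nr_pages)
{
	struct page_counter *fail;

	return page_counter_try_charge(&tc->tier[tier_idx], nr_pages, &fail);
}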
At my LSFMMBPF talk, Usama noted that the user-visible API should remain
stable no matter how the internals evolve. The way I have currently laid out
the memcg files isn't really scalable, so he suggested turning the
"memory.toptier_XXX" files into "memory.tiered_XXX", each containing a
newline-separated list of space-separated per-tier limits. Something like:
$ cat memory.tiered_max
tier_0 20971520
tier_1 31457280
...
That gives us a way to keep the user-facing side stable while making the
internals more scalable.
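A show handler for such a file could then be as simple as this (again
hypothetical; memcg->tiers assumes the per-tier counters sketched above):

static int memory_tiered_max_show(struct seq_file *m, void *v)
{
	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
	int i;

	/* one "tier_<N> <limit>" line per tier, as in the example above */
	for (i = 0; i < MAX_MEMORY_TIERS; i++)
		seq_printf(m, "tier_%d %lu\n", i,
			   READ_ONCE(memcg->tiers.tier[i].max));
	return 0;
}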
With that said, I've opted to keep the internals at two tiers for now -- I
think it is not too late to add the generalization once we start seeing
3+ tier systems out in the wild. My goal here is to introduce tier
awareness; generalization can come in follow-up work.
On that note, it seems like mm in general is aware of 3+ tiers, but most of
the existing work revolves around distinguishing toptier from everything
else. I got this impression from reading mm/memory-tiers.c -- please correct
me if I have the wrong idea here :-) So perhaps the generalization work
should start by introducing more general tier awareness (not just toptier
vs. the rest) in memory-tiers.c.
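For instance, today's boolean check could grow into a tier index along these
lines (a sketch: only node_is_toptier() exists in memory-tiers.c today,
node_tier_index() is made up):

/* hypothetical: generalize the boolean toptier test to a tier index */
static int node_tier_index(int node)
{
	/* with only two tiers, this collapses to today's behavior */
	return node_is_toptier(node) ? 0 : 1;
}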
What do you think? Does this approach of introducing toptier restrictions
now and generalizing in future work make sense to you?
Thanks again for your interest. Have a great day!
Joshua
[1] https://lore.kernel.org/all/20260410210742.550489-1-joshua.hahnjy@gmail.com/
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2026-05-11 20:03 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-23 20:34 [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 1/9 v2] cgroup: Introduce memory_tiered_limits cgroup mount option Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 3/9 v2] mm/memcontrol: Refactor page_counter charging in try_charge_memcg Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 4/9 v2] mm/memcontrol: charge/uncharge toptier memory to mem_cgroup Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 5/9 v2] mm/memcontrol: Set toptier limits proportional to memory limits Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 6/9 v2] mm/vmscan, memcontrol: Add nodemask to try_to_free_mem_cgroup_pages Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 7/9 v2] mm/memcontrol: Make memory.low and memory.min tier-aware Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 8/9 v2] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
2026-04-23 20:34 ` [RFC PATCH 9/9 v2] mm/memcontrol: Make memory.max tier-aware Joshua Hahn
2026-05-11 15:56 ` [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware David Hildenbrand (Arm)
2026-05-11 20:03 ` Joshua Hahn