Subject: Re: [RFC PATCH 0/9 v2] mm/memcontrol: Make memory cgroup limits tier-aware
From: Joshua Hahn @ 2026-05-11 20:03 UTC
To: David Hildenbrand (Arm)
Cc: linux-mm, Tejun Heo, Johannes Weiner, Michal Koutny, Michal Hocko,
Roman Gushchin, Shakeel Butt, Andrew Morton, Chris Li,
Kairui Song, Muchun Song, Lorenzo Stoakes, Liam R. Howlett,
Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Kemeng Shi,
Nhat Pham, Baoquan He, Barry Song, Youngjun Park, Qi Zheng,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Kaiyang Zhao, David Rientjes,
Yiannis Nikolakopoulos, Rao, Bharata Bhasker, cgroups,
linux-kernel, kernel-team
On Mon, May 11, 2026 at 5:56 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 4/23/26 22:34, Joshua Hahn wrote:
> > INTRODUCTION
> > ============
> > Memory cgroups provide an interface that allows multiple workloads on a
> > host to co-exist via weak and strong memory isolation guarantees. This
> > works because, for the most part, all memory has equal utility. Isolating
> > a cgroup’s memory footprint restricts how much it can hurt other workloads
> > competing for memory, or protects it from other cgroups looking for more
> > memory.
> >
> > However, on systems with tiered memory (e.g. CXL), memory utility is no longer
> > homogeneous; toptier and lowtier memory provide different performance
> > characteristics and have different scarcity, meaning memory footprint no longer
> > serves as an accurate representation of a cgroup’s consumption of the system’s
> > limited resources. As an extreme example, a cgroup with 10G of toptier
> > (e.g. DRAM) memory and a cgroup with 10G of lowtier (e.g. CXL) memory both
> > appear to be consuming the same amount of system resources from memcg’s
> > perspective, despite the performance asymmetry between the two workloads.
> >
> > Therefore, on tiered systems, memory isolation cannot currently happen:
> > workloads that are well-behaved within their memcg limits may still hurt
> > the performance of other well-behaved workloads by hogging more than their
> > “fair share” of toptier memory.
> >
> > Introduce tier-aware memcg limits, which establish independent toptier limits
> > that scale with the memory limits and the ratio of toptier:total memory
> > available on the system.
> >
> > INTERFACE
> > =========
> > This series introduces only one adjustable knob to userspace: a new cgroup
> > mount option “memory_tiered_limits” which toggles whether the cgroup mount
> > will scale toptier limits. It also introduces four new read-only cgroup
> > interface files per cgroup: memory.toptier_{min, low, high, max}.
> >
> > The new toptier memory limits are scaled by the ratio of toptier memory
> > to total memory available on the system, as follows:
> >
> > memory.toptier_high = (toptier_mem / total_mem) * memory.high
> >
> > For instance, on a host with 100GB memory, with 75G toptier and 25G CXL, the
> > “toptier ratio” would be 75 / 100 = 0.75. A cgroup with the following memcg
> > limits {min: 8G, low: 12G, high: 20G, max: 24G} might see toptier limits scaled
> > at {min: 6G, low: 9G, high: 15G, max: 18G}.
Hi David!!

It was great seeing you at LSF/MM/BPF. I didn't get a chance to have a
conversation with you in Zagreb, but hopefully I will be less shy and say
hello at the next conference :-)

> Assume you have a bigger hierarchy (HBM, DRAM, CXL), or assume you have
> multiple NUMA nodes with a hierarchy each.
>
> Your proposal doesn't really seem to be very versatile, or am I wrong?

Let me address these comments separately!

First, for the multiple-NUMA-nodes-per-tier case, I think this is already
pretty well handled by my series. Once we realize that a memcg is consuming
too much memory from a tier, we trigger reclaim from that memcg via
try_to_free_mem_cgroup_pages(), which as far as I can tell already handles
multiple nodes per memcg. Other than restricting the scan_control's nodemask
to target the nodes from that tier, I don't think there's anything else to
be done; see the sketch below.
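
To make that concrete, here is a rough, untested sketch of the targeting.
node_is_toptier(), for_each_online_node(), and MEMCG_RECLAIM_MAY_SWAP are
existing kernel APIs; the nodemask parameter on
try_to_free_mem_cgroup_pages() is the extension this series adds (upstream
takes no nodemask today), and the helper name is made up:

/* Reclaim nr_pages worth of memory from @memcg, toptier nodes only. */
static unsigned long reclaim_toptier_pages(struct mem_cgroup *memcg,
					   unsigned long nr_pages)
{
	nodemask_t toptier_nodes = NODE_MASK_NONE;
	int nid;

	/* Collect every online node that belongs to the top tier. */
	for_each_online_node(nid)
		if (node_is_toptier(nid))
			node_set(nid, toptier_nodes);

	/* Scan only those nodes; the nodemask arg is this series' addition. */
	return try_to_free_mem_cgroup_pages(memcg, nr_pages, GFP_KERNEL,
					    MEMCG_RECLAIM_MAY_SWAP,
					    &toptier_nodes);
}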

Next, for the 3+ tier case, I think this is a lot more scalable than it
seems at first. This series depends on another RFC I sent out [1], which
pushes the concept of "stock" from memcg down to page_counter, making it
cheap to add more page counters to each memcg. Each tier would then only
need its own page_counter to track usage, and we trigger selective reclaim
on the targeted tier via the scan_control nodemask introduced in this
series.
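
Roughly, the charge path would then check a per-tier counter next to the
existing one. To be clear, memcg->tier_counter[] and node_tier_id() below
are invented names for illustration; page_counter_try_charge() is the
existing API:

/*
 * Sketch: charge the counter of @nid's tier; the caller still charges
 * the regular memcg counter as usual.
 */
static int memcg_charge_tier(struct mem_cgroup *memcg, int nid,
			     unsigned long nr_pages)
{
	struct page_counter *tc = &memcg->tier_counter[node_tier_id(nid)];
	struct page_counter *fail;

	if (!page_counter_try_charge(tc, nr_pages, &fail))
		return -ENOMEM;	/* over this tier's limit: reclaim from it */

	return 0;
}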

At my talk at LSF/MM/BPF, Usama noted that the user-visible API should
probably remain the same no matter what. The way I have currently laid out
the memcg files isn't really scalable, so he suggested turning the
"memory.toptier_XXX" files into "memory.tiered_XXX" files, each containing
a newline-separated list of per-tier limits. Something like:

$ cat memory.tiered_max
tier_0 20971520
tier_1 31457280
...

So we have a way to keep the user-facing side stable while also making the
internals more scalable.
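
Emitting that format on the kernel side would be straightforward. A sketch,
assuming the per-tier counters above (nr_memory_tiers and the tier_counter
field are again invented; mem_cgroup_from_seq() and seq_printf() are real):

static int memory_tiered_max_show(struct seq_file *m, void *v)
{
	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
	int tier;

	/*
	 * Print limits in bytes; the real file would also special-case
	 * "max" for unlimited counters, like memory.max does.
	 */
	for (tier = 0; tier < nr_memory_tiers; tier++)
		seq_printf(m, "tier_%d %llu\n", tier,
			   (u64)memcg->tier_counter[tier].max * PAGE_SIZE);

	return 0;
}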

With that said, I've opted to keep the internals at 2 tiers for now -- I
think it won't be too late to add the generalization once we start seeing
3+ tier systems out in the wild. My goal here was to introduce tieredness;
we can work towards generalization in future work.

On that note, it seems like mm in general is aware of 3+ tiers, but most of
the existing work revolves around distinguishing toptier from everything
else. I got this impression from reading mm/memory-tiers.c -- but please
feel free to correct me if you feel like I have the wrong idea here :-)

So perhaps the generalization work would benefit from first introducing more
general tier awareness (not just toptier vs. rest) in memory-tiers.c; see
the strawman below.
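
As a strawman, that could be a dense tier index to complement
node_is_toptier(). The function name and body below are invented (and
locking is elided), but the memory_tiers list sorted by abstract distance
and __node_get_memory_tier() do exist in mm/memory-tiers.c:

/*
 * Return a dense index for @nid's tier: 0 for the fastest tier, 1 for
 * the next, and so on. Purely illustrative.
 */
int node_tier_index(int nid)
{
	struct memory_tier *memtier = __node_get_memory_tier(nid);
	struct memory_tier *pos;
	int index = 0;

	list_for_each_entry(pos, &memory_tiers, list) {
		if (pos == memtier)
			return index;
		index++;
	}

	return -1;	/* node not assigned to any tier */
}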

What do you think? Does this approach of introducing toptier restrictions
for now, then generalizing in future work, make sense to you?

Thanks again for your interest. Have a great day!
Joshua
[1] https://lore.kernel.org/all/20260410210742.550489-1-joshua.hahnjy@gmail.com/