From: Donet Tom <donettom@linux.ibm.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware
Date: Tue, 24 Mar 2026 21:36:22 +0530
Message-ID: <537ea1c6-e631-4d13-8169-1a1b96834762@linux.ibm.com>
In-Reply-To: <20260324154414.195150-1-joshua.hahnjy@gmail.com>
On 3/24/26 9:14 PM, Joshua Hahn wrote:
> On Tue, 24 Mar 2026 16:21:06 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
>
>> On 2/24/26 4:08 AM, Joshua Hahn wrote:
>>> On machines serving multiple workloads whose memory is isolated via the
>>> memory cgroup controller, it is currently impossible to enforce a fair
>>> distribution of toptier memory among the workloads, as the only
>>> enforceable limits have to do with total memory footprint, but not where
>>> that memory resides.
>>>
>>> This makes ensuring a consistent and baseline performance difficult, as
>>> each workload's performance is heavily impacted by workload-external
>>> factors such as which other workloads are co-located on the same host,
>>> and the order in which different workloads are started.
>>>
>>> Extend the existing memory.high protection to be tier-aware in the
>>> charging and enforcement to limit toptier-hogging for workloads.
>>>
>>> Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
>>> which can be used to selectively reclaim from memory at the
>>> memcg-tier intersection of a cgroup.
>>>
>>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> ---
>>> include/linux/swap.h | 3 +-
>>> mm/memcontrol-v1.c | 6 ++--
>>> mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++++++++-------
>>> mm/vmscan.c | 11 +++---
>>> 4 files changed, 84 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0effe3cc50f5..c6037ac7bf6e 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>> unsigned long nr_pages,
>>> gfp_t gfp_mask,
>>> unsigned int reclaim_options,
>>> - int *swappiness);
>>> + int *swappiness,
>>> + nodemask_t *allowed);
>>> extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>>> gfp_t gfp_mask, bool noswap,
>>> pg_data_t *pgdat,
>>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>>> index 0b39ba608109..29630c7f3567 100644
>>> --- a/mm/memcontrol-v1.c
>>> +++ b/mm/memcontrol-v1.c
>>> @@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>>> }
>>>
>>> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>>> - memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
>>> + memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
>>> + NULL, NULL)) {
>>> ret = -EBUSY;
>>> break;
>>> }
>>> @@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>>> return -EINTR;
>>>
>>> if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>>> - MEMCG_RECLAIM_MAY_SWAP, NULL))
>>> + MEMCG_RECLAIM_MAY_SWAP,
>>> + NULL, NULL))
>>> nr_retries--;
>>> }
>>>
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 8aa7ae361a73..ebd4a1b73c51 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>>>
>>> do {
>>> unsigned long pflags;
>>> -
>>> - if (page_counter_read(&memcg->memory) <=
>>> - READ_ONCE(memcg->memory.high))
>>> + nodemask_t toptier_nodes, *reclaim_nodes;
>>> + bool mem_high_ok, toptier_high_ok;
>>> +
>>> + mt_get_toptier_nodemask(&toptier_nodes, NULL);
>>> + mem_high_ok = page_counter_read(&memcg->memory) <=
>>> + READ_ONCE(memcg->memory.high);
>>> + toptier_high_ok = !(tier_aware_memcg_limits &&
>>> + mem_cgroup_toptier_usage(memcg) >
>>> + page_counter_toptier_high(&memcg->memory));
>>> + if (mem_high_ok && toptier_high_ok)
>>> continue;
>>>
>>> + if (mem_high_ok && !toptier_high_ok)
>>> + reclaim_nodes = &toptier_nodes;
>>> + else
>>> + reclaim_nodes = NULL;
>>
>> IIUC, the intent of this patch is to partition cgroup memory such that
>> 0 → toptier_high is backed by higher-tier memory, and
>> toptier_high → max is backed by lower-tier memory.
>>
>> Based on this:
>>
>> 1. If top-tier usage exceeds toptier_high, pages should be
>> demoted to the lower tier.
>>
>> 2. If lower-tier usage exceeds (max - toptier_high), pages
>> should be swapped out.
>>
>> 3. If total memory usage exceeds max, demotion should be
>> avoided and reclaim should directly swap out pages.
>>
>> I think we are only handling case (1) in this patch. When
>> mem_high_ok && !toptier_high_ok, we are reclaiming pages (demotion first).
>>
>> However, if !mem_high_ok, the memcg reclaim path works as if
>> there is no memory tiering in cgroup. This can lead to more demotion
>> and may eventually result in OOM.
>>
>> Should we also handle cases (2) and (3) in this patch?
> Hello Donet! I hope you are doing well.
>
> For the second condition, should pages be swapped out? If a workload
> is using 0 toptier memory (extreme case, let's say they haven't set
> memory.low) then lower-tier should be able to use all the way up to
> max memory.
>
> Maybe you mean if lowtier_usage exceeds (max - toptier_usage) pages
> should be swapped out? But if we rearrange this
>
> lowtier_usage >= max - toptier_usage
> lowtier_usage + toptier_usage >= max
> total_usage >= max
>
> And this is just the memory.max check and is already handled by
> existing reclaim semantics :-)
>
> I think case 3 is a bit more nuanced. If we directly swap out from
> high tier and skip demotions, this is introducing a priority inversion
> since memory in toptier should be hotter than memory in lowtier, so
> we should prefer to swap out the colder memory in lowtier before
> swapping out memory in toptier.
>
> The idea was discussed at length at [1]. It also feels like an orthogonal
> discussion since the behavior isn't related to toptier high or low
> behaviors.
>
> Please let me know what you think. Thank you, I hope you have a great day!
Thanks, Joshua, for the clarification.

[1] disabled demotion from memcg. With memcg limits now being
tier-aware, I was thinking about how to handle the demotion
issue. You are right that this is a separate topic, unrelated to
this patch.

[1] https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@google.com/
> Joshua
>
> [1] https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@google.com/
>
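For anyone following the thread, the reclaim-target choice in the quoted
reclaim_high() hunk, and the rearrangement showing that case (2) collapses
into the memory.max check, can be modelled in a small userspace sketch.
This is only an illustration under stated assumptions, not kernel code:
struct usage_model, enum reclaim_target and every function name below are
hypothetical, and plain integers stand in for page_counter_read(),
mem_cgroup_toptier_usage() and page_counter_toptier_high().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of what the patched reclaim_high() loop decides. */
enum reclaim_target {
	RECLAIM_NONE,		/* both limits satisfied: skip this memcg */
	RECLAIM_TOPTIER_ONLY,	/* reclaim_nodes = &toptier_nodes */
	RECLAIM_ALL_NODES,	/* reclaim_nodes = NULL: untiered reclaim */
};

/* Hypothetical snapshot of one cgroup's counters, in pages. */
struct usage_model {
	unsigned long toptier_usage;
	unsigned long lowtier_usage;
	unsigned long high;		/* memory.high */
	unsigned long toptier_high;	/* toptier share of memory.high */
	bool tier_aware;		/* models tier_aware_memcg_limits */
};

enum reclaim_target pick_reclaim_target(const struct usage_model *u)
{
	unsigned long total = u->toptier_usage + u->lowtier_usage;
	bool mem_high_ok = total <= u->high;
	bool toptier_high_ok = !(u->tier_aware &&
				 u->toptier_usage > u->toptier_high);

	if (mem_high_ok && toptier_high_ok)
		return RECLAIM_NONE;
	if (mem_high_ok && !toptier_high_ok)
		return RECLAIM_TOPTIER_ONLY;
	/* !mem_high_ok: falls back to reclaim with no node restriction. */
	return RECLAIM_ALL_NODES;
}

/* lowtier_usage >= max - toptier_usage, guarded against underflow. */
bool lowtier_over_remaining(unsigned long low, unsigned long top,
			    unsigned long max)
{
	return top >= max || low >= max - top;
}

/* total_usage >= max, i.e. the ordinary memory.max condition. */
bool total_over_max(unsigned long low, unsigned long top, unsigned long max)
{
	return low + top >= max;
}

/* A few sample points where the two formulations agree. */
void self_check(void)
{
	assert(lowtier_over_remaining(60, 40, 100) == total_over_max(60, 40, 100));
	assert(lowtier_over_remaining(10, 40, 100) == total_over_max(10, 40, 100));
	assert(lowtier_over_remaining(0, 120, 100) == total_over_max(0, 120, 100));
}
```

In this model only the mem_high_ok && !toptier_high_ok combination restricts
reclaim to toptier nodes, which corresponds to case (1) discussed above; any
breach of the overall high limit takes the unrestricted path.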