Re: [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware

From: Donet Tom <donettom@linux.ibm.com>
To: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@meta.com
Subject: Re: [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware
Date: Tue, 24 Mar 2026 21:36:22 +0530
Message-ID: <537ea1c6-e631-4d13-8169-1a1b96834762@linux.ibm.com>
In-Reply-To: <20260324154414.195150-1-joshua.hahnjy@gmail.com>


On 3/24/26 9:14 PM, Joshua Hahn wrote:
> On Tue, 24 Mar 2026 16:21:06 +0530 Donet Tom <donettom@linux.ibm.com> wrote:
>
>> On 2/24/26 4:08 AM, Joshua Hahn wrote:
>>> On machines serving multiple workloads whose memory is isolated via the
>>> memory cgroup controller, it is currently impossible to enforce a fair
>>> distribution of toptier memory among the workloads, as the only
>>> enforceable limits have to do with total memory footprint, but not where
>>> that memory resides.
>>>
>>> This makes ensuring a consistent and baseline performance difficult, as
>>> each workload's performance is heavily impacted by workload-external
>>> factors such as which other workloads are co-located in the same host,
>>> and the order in which different workloads are started.
>>>
>>> Extend the existing memory.high protection to be tier-aware in the
>>> charging and enforcement to limit toptier-hogging for workloads.
>>>
>>> Also, add a new nodemask parameter to try_to_free_mem_cgroup_pages,
>>> which can be used to selectively reclaim from memory at the
>>> memcg-tier intersection of a cgroup.
>>>
>>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> ---
>>>    include/linux/swap.h |  3 +-
>>>    mm/memcontrol-v1.c   |  6 ++--
>>>    mm/memcontrol.c      | 85 +++++++++++++++++++++++++++++++++++++-------
>>>    mm/vmscan.c          | 11 +++---
>>>    4 files changed, 84 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index 0effe3cc50f5..c6037ac7bf6e 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -368,7 +368,8 @@ extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>>>    						  unsigned long nr_pages,
>>>    						  gfp_t gfp_mask,
>>>    						  unsigned int reclaim_options,
>>> -						  int *swappiness);
>>> +						  int *swappiness,
>>> +						  nodemask_t *allowed);
>>>    extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>>>    						gfp_t gfp_mask, bool noswap,
>>>    						pg_data_t *pgdat,
>>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>>> index 0b39ba608109..29630c7f3567 100644
>>> --- a/mm/memcontrol-v1.c
>>> +++ b/mm/memcontrol-v1.c
>>> @@ -1497,7 +1497,8 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>>>    		}
>>>    
>>>    		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>>> -				memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
>>> +				memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP,
>>> +				NULL, NULL)) {
>>>    			ret = -EBUSY;
>>>    			break;
>>>    		}
>>> @@ -1529,7 +1530,8 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>>>    			return -EINTR;
>>>    
>>>    		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
>>> -						  MEMCG_RECLAIM_MAY_SWAP, NULL))
>>> +						  MEMCG_RECLAIM_MAY_SWAP,
>>> +						  NULL, NULL))
>>>    			nr_retries--;
>>>    	}
>>>    
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index 8aa7ae361a73..ebd4a1b73c51 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -2184,18 +2184,30 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>>>    
>>>    	do {
>>>    		unsigned long pflags;
>>> -
>>> -		if (page_counter_read(&memcg->memory) <=
>>> -		    READ_ONCE(memcg->memory.high))
>>> +		nodemask_t toptier_nodes, *reclaim_nodes;
>>> +		bool mem_high_ok, toptier_high_ok;
>>> +
>>> +		mt_get_toptier_nodemask(&toptier_nodes, NULL);
>>> +		mem_high_ok = page_counter_read(&memcg->memory) <=
>>> +			      READ_ONCE(memcg->memory.high);
>>> +		toptier_high_ok = !(tier_aware_memcg_limits &&
>>> +				    mem_cgroup_toptier_usage(memcg) >
>>> +				    page_counter_toptier_high(&memcg->memory));
>>> +		if (mem_high_ok && toptier_high_ok)
>>>    			continue;
>>>    
>>> +		if (mem_high_ok && !toptier_high_ok)
>>> +			reclaim_nodes = &toptier_nodes;
>>> +		else
>>> +			reclaim_nodes = NULL;
>>
>> IIUC The intent of this patch is to partition cgroup memory such that
>> 0 → toptier_high is backed by higher-tier memory, and
>> toptier_high → max is backed by lower-tier memory.
>>
>> Based on this:
>>
>> 1. If top-tier usage exceeds toptier_high, pages should be
>>     demoted to the lower tier.
>>
>> 2. If lower-tier usage exceeds (max - toptier_high), pages
>>     should be swapped out.
>>
>> 3. If total memory usage exceeds max, demotion should be
>>     avoided and reclaim should directly swap out pages.
>>
>> I think we are only handling case (1) in this patch. When
>> mem_high_ok && !toptier_high_ok, we are reclaiming pages (demotion first).
>>
>> However, if !mem_high_ok, the memcg reclaim path works as if
>> there is no memory tiering in the cgroup. This can lead to more demotion
>> and may eventually result in OOM.
>>
>> Should we also handle cases (2) and (3) in this patch?
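
For reference, the node selection in the quoted reclaim_high() hunk can be
modeled in userspace. This is only a minimal sketch of the branch logic as
it reads in the diff, not the patch's code: struct memcg_model, enum
reclaim_target and pick_reclaim_target() are hypothetical names, and the
tier_aware flag stands in for tier_aware_memcg_limits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum reclaim_target {
	RECLAIM_NONE,		/* both limits respected, nothing to reclaim */
	RECLAIM_TOPTIER_ONLY,	/* patch passes the toptier nodemask */
	RECLAIM_ALL_NODES,	/* patch passes a NULL nodemask */
};

struct memcg_model {
	unsigned long usage;		/* total charged pages */
	unsigned long high;		/* memory.high */
	unsigned long toptier_usage;	/* pages charged on toptier nodes */
	unsigned long toptier_high;	/* toptier share of memory.high */
};

static enum reclaim_target pick_reclaim_target(const struct memcg_model *m,
					       bool tier_aware)
{
	bool mem_high_ok, toptier_high_ok;

	assert(m != NULL);

	mem_high_ok = m->usage <= m->high;
	/* With tier awareness off, the toptier check always passes. */
	toptier_high_ok = !(tier_aware &&
			    m->toptier_usage > m->toptier_high);

	if (mem_high_ok && toptier_high_ok)
		return RECLAIM_NONE;
	if (mem_high_ok)	/* only the toptier limit is exceeded */
		return RECLAIM_TOPTIER_ONLY;
	/* Total usage is over memory.high: reclaim is not tier-restricted. */
	return RECLAIM_ALL_NODES;
}
```

In this model, the last branch is the !mem_high_ok situation raised above:
once total usage is over memory.high, the toptier state no longer affects
which nodes are targeted.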
> Hello Donet! I hope you are doing well.
>
> For the second condition, should pages be swapped out? If a workload
> is using 0 toptier memory (extreme case, let's say they haven't set
> memory.low) then lower-tier should be able to use all the way up to
> max memory.
>
> Maybe you mean if lowtier_usage exceeds (max - toptier_usage) pages
> should be swapped out? But if we rearrange this
>
>                  lowtier_usage >= max - toptier_usage
> lowtier_usage + toptier_usage >= max
>                    total_usage >= max
>
> And this is just the memory.max check and is already handled by
> existing reclaim semantics :-)
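
The rearrangement above can be checked with a small userspace sketch. The
helper names below are hypothetical and the values are plain page counts;
the only assumptions are that toptier usage does not exceed max (otherwise
the unsigned subtraction would wrap) and that the sum does not overflow.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Literal form of the condition: lowtier_usage >= max - toptier_usage.
 * Assumes toptier <= max; with unsigned arithmetic the subtraction
 * would wrap around otherwise.
 */
static bool lowtier_over_remaining(unsigned long lowtier,
				   unsigned long toptier,
				   unsigned long max)
{
	assert(toptier <= max);
	return lowtier >= max - toptier;
}

/*
 * Rearranged form: total_usage >= max. Assumes the sum of the two
 * page counts does not overflow unsigned long.
 */
static bool total_over_max(unsigned long lowtier,
			   unsigned long toptier,
			   unsigned long max)
{
	return lowtier + toptier >= max;
}
```

Under those assumptions the two predicates agree for every input, which is
why the second case reduces to the existing check on total usage.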
>
> I think case 3 is a bit more nuanced. If we directly swap out from
> high tier and skip demotions, this is introducing a priority inversion
> since memory in toptier should be hotter than memory in lowtier, so
> we should prefer to swap out the colder memory in lowtier before
> swapping out memory in toptier.
>
> The idea was discussed at length at [1]. It also feels like an orthogonal
> discussion since the behavior isn't related to toptier high or low
> behaviors.
>
> Please let me know what you think. Thank you, I hope you have a great day!


Thanks, Joshua, for your clarification.

[1] disabled demotion from memcg. With memcg limits now being
tier-aware, I was thinking about how to handle the demotion
issue. You are right that this is a separate topic, not related to this patch.

[1] https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@google.com/


> Joshua
>
> [1] https://lore.kernel.org/linux-mm/20260317230720.990329-3-bingjiao@google.com/
>

Thread overview: 22+ messages
2026-02-23 22:38 [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 1/6] mm/memory-tiers: Introduce tier-aware memcg limit sysfs Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 2/6] mm/page_counter: Introduce tiered memory awareness to page_counter Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 3/6] mm/memory-tiers, memcontrol: Introduce toptier capacity updates Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 4/6] mm/memcontrol: Charge and uncharge from toptier Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 5/6] mm/memcontrol, page_counter: Make memory.low tier-aware Joshua Hahn
2026-02-23 22:38 ` [RFC PATCH 6/6] mm/memcontrol: Make memory.high tier-aware Joshua Hahn
2026-03-11 22:05   ` Bing Jiao
2026-03-12 19:44     ` Joshua Hahn
2026-03-24 10:51   ` Donet Tom
2026-03-24 15:23     ` Gregory Price
2026-03-24 15:46       ` Donet Tom
2026-03-24 15:44     ` Joshua Hahn
2026-03-24 16:06       ` Donet Tom [this message]
2026-02-24 11:27 ` [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware Michal Hocko
2026-02-24 16:13   ` Joshua Hahn
2026-02-24 18:49     ` Gregory Price
2026-02-24 20:03       ` Kaiyang Zhao
2026-02-26  8:04     ` Michal Hocko
2026-02-26 16:08       ` Joshua Hahn
2026-03-24 10:30 ` Donet Tom
2026-03-24 14:58   ` Joshua Hahn
