From: Qi Zheng <qi.zheng@linux.dev>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: hannes@cmpxchg.org, hughd@google.com, mhocko@suse.com,
	roman.gushchin@linux.dev, muchun.song@linux.dev,
	david@kernel.org, lorenzo.stoakes@oracle.com, ziy@nvidia.com,
	harry.yoo@oracle.com, yosry.ahmed@linux.dev,
	imran.f.khan@oracle.com, kamalesh.babulal@oracle.com,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	chenridong@huaweicloud.com, mkoutny@suse.com,
	akpm@linux-foundation.org, hamzamahfooz@linux.microsoft.com,
	apais@linux.microsoft.com, lance.yang@linux.dev, bhe@redhat.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org, Qi Zheng <zhengqi.arch@bytedance.com>
Subject: Re: [PATCH v4 29/31] mm: memcontrol: prepare for reparenting non-hierarchical stats
Date: Tue, 10 Feb 2026 14:47:51 +0800
Message-ID: <0673b72c-8d7c-4bfb-a8b2-da5ae5bb5f00@linux.dev>
In-Reply-To: <aYabQii_-9EVdgub@linux.dev>



On 2/7/26 10:19 AM, Shakeel Butt wrote:
> On Thu, Feb 05, 2026 at 05:01:48PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> To resolve the dying memcg issue, we need to reparent LRU folios of child
>> memcg to its parent memcg. This could cause problems for non-hierarchical
>> stats.
>>
>> As Yosry Ahmed pointed out:
>>
>> ```
>> In short, if memory is charged to a dying cgroup at the time of
>> reparenting, when the memory gets uncharged the stats updates will occur
>> at the parent. This will update both hierarchical and non-hierarchical
>> stats of the parent, which would corrupt the parent's non-hierarchical
>> stats (because those counters were never incremented when the memory was
>> charged).
>> ```
>>
>> Now we have the following two types of non-hierarchical stats, and they
>> are only used in CONFIG_MEMCG_V1:
>>
>> a. memcg->vmstats->state_local[i]
>> b. pn->lruvec_stats->state_local[i]
>>
>> To ensure that these non-hierarchical stats work properly, we need to
>> reparent these non-hierarchical stats after reparenting LRU folios. To
>> this end, this commit makes the following preparations:
>>
>> 1. implement reparent_state_local() to reparent non-hierarchical stats
>> 2. make css_killed_work_fn() to be called in rcu work, and implement
>>     get_non_dying_memcg_start() and get_non_dying_memcg_end() to avoid race
>>     between mod_memcg_state()/mod_memcg_lruvec_state()
>>     and reparent_state_local()
>> 3. change these non-hierarchical stats to atomic_long_t type to avoid race
>>     between mem_cgroup_stat_aggregate() and reparent_state_local()
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Overall looks good just a couple of comments.
> 
>> ---
>>   include/linux/memcontrol.h |   4 ++
>>   kernel/cgroup/cgroup.c     |   8 +--
>>   mm/memcontrol-v1.c         |  16 ++++++
>>   mm/memcontrol-v1.h         |   3 +
>>   mm/memcontrol.c            | 113 ++++++++++++++++++++++++++++++++++---
>>   5 files changed, 132 insertions(+), 12 deletions(-)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 3970c102fe741..a4f6ab7eb98d6 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -957,12 +957,16 @@ static inline void mod_memcg_page_state(struct page *page,
>>   
>>   unsigned long memcg_events(struct mem_cgroup *memcg, int event);
>>   unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
>> +void reparent_memcg_state_local(struct mem_cgroup *memcg,
>> +				struct mem_cgroup *parent, int idx);
> 
> Put the above in mm/memcontrol-v1.h file.

OK.

> 
>>   unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
>>   bool memcg_stat_item_valid(int idx);
>>   bool memcg_vm_event_item_valid(enum vm_event_item idx);
>>   unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
>>   unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>>   				      enum node_stat_item idx);
>> +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
>> +				       struct mem_cgroup *parent, int idx);
> 
> Put the above in mm/memcontrol-v1.h file.

OK.

> 
>>   
>>   void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>>   void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 94788bd1fdf0e..dbf94a77018e6 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -6043,8 +6043,8 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
>>    */
>>   static void css_killed_work_fn(struct work_struct *work)
>>   {
>> -	struct cgroup_subsys_state *css =
>> -		container_of(work, struct cgroup_subsys_state, destroy_work);
>> +	struct cgroup_subsys_state *css = container_of(to_rcu_work(work),
>> +				struct cgroup_subsys_state, destroy_rwork);
>>   
>>   	cgroup_lock();
>>   
>> @@ -6065,8 +6065,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
>>   		container_of(ref, struct cgroup_subsys_state, refcnt);
>>   
>>   	if (atomic_dec_and_test(&css->online_cnt)) {
>> -		INIT_WORK(&css->destroy_work, css_killed_work_fn);
>> -		queue_work(cgroup_offline_wq, &css->destroy_work);
>> +		INIT_RCU_WORK(&css->destroy_rwork, css_killed_work_fn);
>> +		queue_rcu_work(cgroup_offline_wq, &css->destroy_rwork);
>>   	}
>>   }
>>   
>> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
>> index c6078cd7f7e53..a427bb205763b 100644
>> --- a/mm/memcontrol-v1.c
>> +++ b/mm/memcontrol-v1.c
>> @@ -1887,6 +1887,22 @@ static const unsigned int memcg1_events[] = {
>>   	PGMAJFAULT,
>>   };
>>   
>> +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++)
>> +		reparent_memcg_state_local(memcg, parent, memcg1_stats[i]);
>> +}
>> +
>> +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> +	int i;
>> +
>> +	for (i = 0; i < NR_LRU_LISTS; i++)
>> +		reparent_memcg_lruvec_state_local(memcg, parent, i);
>> +}
>> +
>>   void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>>   {
>>   	unsigned long memory, memsw;
>> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
>> index eb3c3c1056574..45528195d3578 100644
>> --- a/mm/memcontrol-v1.h
>> +++ b/mm/memcontrol-v1.h
>> @@ -41,6 +41,7 @@ static inline bool do_memsw_account(void)
>>   
>>   unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
>>   unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
>> +void mod_memcg_page_state_local(struct mem_cgroup *memcg, int idx, unsigned long val);
>>   unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
>>   bool memcg1_alloc_events(struct mem_cgroup *memcg);
>>   void memcg1_free_events(struct mem_cgroup *memcg);
>> @@ -73,6 +74,8 @@ void memcg1_uncharge_batch(struct mem_cgroup *memcg, unsigned long pgpgout,
>>   			   unsigned long nr_memory, int nid);
>>   
>>   void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
>> +void reparent_memcg1_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
>> +void reparent_memcg1_lruvec_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent);
>>   
>>   void memcg1_account_kmem(struct mem_cgroup *memcg, int nr_pages);
>>   static inline bool memcg1_tcpmem_active(struct mem_cgroup *memcg)
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index c9b5dfd822d0a..e7d4e4ff411b6 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -225,6 +225,26 @@ static inline struct obj_cgroup *__memcg_reparent_objcgs(struct mem_cgroup *memc
>>   	return objcg;
>>   }
>>   
>> +#ifdef CONFIG_MEMCG_V1
>> +static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force);
>> +
>> +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> +	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>> +		return;
>> +
>> +	__mem_cgroup_flush_stats(memcg, true);
>> +
>> +	/* The following counts are all non-hierarchical and need to be reparented. */
>> +	reparent_memcg1_state_local(memcg, parent);
>> +	reparent_memcg1_lruvec_state_local(memcg, parent);
>> +}
>> +#else
>> +static inline void reparent_state_local(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>> +{
>> +}
>> +#endif
>> +
>>   static inline void reparent_locks(struct mem_cgroup *memcg, struct mem_cgroup *parent)
>>   {
>>   	spin_lock_irq(&objcg_lock);
>> @@ -407,7 +427,7 @@ struct lruvec_stats {
>>   	long state[NR_MEMCG_NODE_STAT_ITEMS];
>>   
>>   	/* Non-hierarchical (CPU aggregated) state */
>> -	long state_local[NR_MEMCG_NODE_STAT_ITEMS];
>> +	atomic_long_t state_local[NR_MEMCG_NODE_STAT_ITEMS];
>>   
>>   	/* Pending child counts during tree propagation */
>>   	long state_pending[NR_MEMCG_NODE_STAT_ITEMS];
>> @@ -450,7 +470,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>>   		return 0;
>>   
>>   	pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
>> -	x = READ_ONCE(pn->lruvec_stats->state_local[i]);
>> +	x = atomic_long_read(&(pn->lruvec_stats->state_local[i]));
>>   #ifdef CONFIG_SMP
>>   	if (x < 0)
>>   		x = 0;
>> @@ -458,6 +478,27 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>>   	return x;
>>   }
>>   
> 
> Please put the following function under CONFIG_MEMCG_V1. Just move it in
> the same block as reparent_state_local().

OK, will try to do it.
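
For readers following along, the counter transfer done by the function
quoted below can be modeled in user space roughly as follows. This is
only a sketch under stated assumptions: the toy_* names, NR_NODES and
NR_ITEMS are hypothetical, C11 atomics stand in for atomic_long_t, and
the negative-value clamp is applied unconditionally rather than only
under CONFIG_SMP.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NR_NODES 2	/* hypothetical node count for the model */
#define NR_ITEMS 3	/* hypothetical number of stat items */

/* Toy stand-in for struct lruvec_stats: only the non-hierarchical part. */
struct toy_lruvec_stats {
	atomic_long state_local[NR_ITEMS];
};

/* Toy stand-in for struct mem_cgroup with one stats block per node. */
struct toy_memcg {
	struct toy_memcg *parent;
	struct toy_lruvec_stats node[NR_NODES];
};

/* Modeled on lruvec_page_state_local(): read, clamp negatives to 0. */
static unsigned long toy_state_local(struct toy_memcg *memcg, int nid, int i)
{
	long x = atomic_load(&memcg->node[nid].state_local[i]);

	return x < 0 ? 0 : (unsigned long)x;
}

/* Add the child's per-node local count for item i to the parent. */
static void toy_reparent_state_local(struct toy_memcg *memcg,
				     struct toy_memcg *parent, int i)
{
	/* Stand-in for the BAD_STAT_IDX() check in the real code. */
	assert(i >= 0 && i < NR_ITEMS);

	for (int nid = 0; nid < NR_NODES; nid++) {
		unsigned long value = toy_state_local(memcg, nid, i);

		atomic_fetch_add(&parent->node[nid].state_local[i],
				 (long)value);
	}
}
```

As in the quoted patch, the child's counters are only read, not
cleared; the parent's counters simply absorb them.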

> 
>> +void reparent_memcg_lruvec_state_local(struct mem_cgroup *memcg,
>> +				       struct mem_cgroup *parent, int idx)
>> +{
>> +	int i = memcg_stats_index(idx);
>> +	int nid;
>> +
>> +	if (WARN_ONCE(BAD_STAT_IDX(i), "%s: missing stat item %d\n", __func__, idx))
>> +		return;
>> +
>> +	for_each_node(nid) {
>> +		struct lruvec *child_lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>> +		struct lruvec *parent_lruvec = mem_cgroup_lruvec(parent, NODE_DATA(nid));
>> +		struct mem_cgroup_per_node *parent_pn;
>> +		unsigned long value = lruvec_page_state_local(child_lruvec, idx);
>> +
>> +		parent_pn = container_of(parent_lruvec, struct mem_cgroup_per_node, lruvec);
>> +
>> +		atomic_long_add(value, &(parent_pn->lruvec_stats->state_local[i]));
>> +	}
>> +}
>> +
> 
> [...]
> 
>>   
>> +#ifdef CONFIG_MEMCG_V1
>> +/*
>> + * Used in mod_memcg_state() and mod_memcg_lruvec_state() to avoid race with
>> + * reparenting of non-hierarchical state_locals.
>> + */
>> +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
>> +{
>> +	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>> +		return memcg;
>> +
>> +	rcu_read_lock();
>> +
>> +	while (memcg_is_dying(memcg))
>> +		memcg = parent_mem_cgroup(memcg);
>> +
>> +	return memcg;
>> +}
>> +
>> +static inline void get_non_dying_memcg_end(void)
>> +{
>> +	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
>> +		return;
>> +
>> +	rcu_read_unlock();
>> +}
>> +#else
>> +static inline struct mem_cgroup *get_non_dying_memcg_start(struct mem_cgroup *memcg)
>> +{
>> +	return memcg;
>> +}
>> +
>> +static inline void get_non_dying_memcg_end(void)
>> +{
>> +}
>> +#endif
> 
> Add the usage of these start and end functions in mod_memcg_state() and
> mod_memcg_lruvec_state() in this patch.

Using these two functions would change the behavior of mod_memcg_state()
and mod_memcg_lruvec_state(), but at this point in the series the LRU
folios have not yet been reparented.

To keep this patch correct on its own, I chose to place the usage of
these two functions in patch #30.
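
To illustrate the behavior change in question, here is a rough
user-space sketch of what happens once the helpers are wired into the
stat update path: an update aimed at a dying memcg is redirected to its
nearest non-dying ancestor. The toy_* names are hypothetical, the RCU
read-side section is omitted, and the model assumes the root is never
dying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct mem_cgroup. */
struct toy_cg {
	struct toy_cg *parent;	/* NULL for the root */
	bool dying;
	long state_local;	/* one non-hierarchical counter */
};

/* Modeled on get_non_dying_memcg_start(), without rcu_read_lock(). */
static struct toy_cg *toy_non_dying(struct toy_cg *memcg)
{
	while (memcg->dying) {
		assert(memcg->parent);	/* model assumption: root never dies */
		memcg = memcg->parent;
	}
	return memcg;
}

/* Apply a stat update to the nearest non-dying ancestor. */
static void toy_mod_state(struct toy_cg *memcg, long val)
{
	memcg = toy_non_dying(memcg);
	memcg->state_local += val;
}
```

With such a redirect in place, updates skip the dying memcg entirely,
which only makes sense once its folios (and stats) really live in the
parent.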

Thanks,
Qi

> 

