* [PATCH RFC 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-17 14:35 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 02/28] mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock holding Muchun Song
` (29 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
The non-hierarchical mode was deprecated by commit bef8620cd8e0 ("mm:
memcg: deprecate the non-hierarchical mode"). As a result,
parent_mem_cgroup() will not return NULL except when passed the root
memcg, and the root memcg cannot be offlined. Hence, it's safe to remove
the check on the return value of parent_mem_cgroup(). Remove the
corresponding dead code.
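For reference, parent_mem_cgroup() is only a thin wrapper (roughly, as
defined in include/linux/memcontrol.h), which is why NULL is confined to
the root memcg:

static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
        /* css.parent is NULL only for the root memcg */
        return mem_cgroup_from_css(memcg->css.parent);
}

None of the callers changed below can reach this with the root memcg, so
the NULL fallback is unreachable.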
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
mm/memcontrol.c | 5 -----
mm/shrinker.c | 6 +-----
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 421740f1bcdc..61488e45cab2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3196,9 +3196,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
return;
parent = parent_mem_cgroup(memcg);
- if (!parent)
- parent = root_mem_cgroup;
-
memcg_reparent_list_lrus(memcg, parent);
/*
@@ -3489,8 +3486,6 @@ struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
break;
}
memcg = parent_mem_cgroup(memcg);
- if (!memcg)
- memcg = root_mem_cgroup;
}
return memcg;
}
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 4a93fd433689..e8e092a2f7f4 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -286,14 +286,10 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
{
int nid, index, offset;
long nr;
- struct mem_cgroup *parent;
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct shrinker_info *child_info, *parent_info;
struct shrinker_info_unit *child_unit, *parent_unit;
- parent = parent_mem_cgroup(memcg);
- if (!parent)
- parent = root_mem_cgroup;
-
/* Prevent from concurrent shrinker_info expand */
mutex_lock(&shrinker_mutex);
for_each_node(nid) {
--
2.20.1
* Re: [PATCH RFC 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup
2025-04-15 2:45 ` [PATCH RFC 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Muchun Song
@ 2025-04-17 14:35 ` Johannes Weiner
0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2025-04-17 14:35 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:05AM +0800, Muchun Song wrote:
> The non-hierarchical mode was deprecated by commit bef8620cd8e0 ("mm:
> memcg: deprecate the non-hierarchical mode"). As a result,
> parent_mem_cgroup() will not return NULL except when passed the root
> memcg, and the root memcg cannot be offlined. Hence, it's safe to
> remove the check on the return value of parent_mem_cgroup(). Remove the
> corresponding dead code.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH RFC 02/28] mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock holding
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
2025-04-15 2:45 ` [PATCH RFC 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-17 14:48 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 03/28] mm: workingset: use folio_lruvec() in workingset_refault() Muchun Song
` (28 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
If a folio isn't charged to the memory cgroup, holding an rcu read lock
is needless. Users only want to know its charge status, so use
folio_memcg_charged() here.
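folio_memcg_charged() never dereferences the memcg pointer; it only
checks whether the folio has one at all, so no RCU protection is needed
for the early-exit path. Conceptually (a sketch, not the exact
definition):

static inline bool folio_memcg_charged(struct folio *folio)
{
        /* charged folios have a non-zero memcg_data word */
        return folio->memcg_data != 0;
}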
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/memcontrol.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 61488e45cab2..0fc76d50bc23 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -797,20 +797,17 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
int val)
{
- struct mem_cgroup *memcg;
pg_data_t *pgdat = folio_pgdat(folio);
struct lruvec *lruvec;
- rcu_read_lock();
- memcg = folio_memcg(folio);
- /* Untracked pages have no memcg, no lruvec. Update only the node */
- if (!memcg) {
- rcu_read_unlock();
+ if (!folio_memcg_charged(folio)) {
+ /* Untracked pages have no memcg, no lruvec. Update only the node */
__mod_node_page_state(pgdat, idx, val);
return;
}
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ rcu_read_lock();
+ lruvec = mem_cgroup_lruvec(folio_memcg(folio), pgdat);
__mod_lruvec_state(lruvec, idx, val);
rcu_read_unlock();
}
--
2.20.1
* Re: [PATCH RFC 02/28] mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock holding
2025-04-15 2:45 ` [PATCH RFC 02/28] mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock holding Muchun Song
@ 2025-04-17 14:48 ` Johannes Weiner
2025-04-18 2:38 ` Muchun Song
0 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2025-04-17 14:48 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:06AM +0800, Muchun Song wrote:
> If a folio isn't charged to the memory cgroup, holding an rcu read lock
> is needless. Users only want to know its charge status, so use
> folio_memcg_charged() here.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> mm/memcontrol.c | 11 ++++-------
> 1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 61488e45cab2..0fc76d50bc23 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -797,20 +797,17 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
> int val)
> {
> - struct mem_cgroup *memcg;
> pg_data_t *pgdat = folio_pgdat(folio);
> struct lruvec *lruvec;
>
> - rcu_read_lock();
> - memcg = folio_memcg(folio);
> - /* Untracked pages have no memcg, no lruvec. Update only the node */
> - if (!memcg) {
> - rcu_read_unlock();
> + if (!folio_memcg_charged(folio)) {
> + /* Untracked pages have no memcg, no lruvec. Update only the node */
> __mod_node_page_state(pgdat, idx, val);
> return;
> }
>
> - lruvec = mem_cgroup_lruvec(memcg, pgdat);
> + rcu_read_lock();
> + lruvec = mem_cgroup_lruvec(folio_memcg(folio), pgdat);
> __mod_lruvec_state(lruvec, idx, val);
> rcu_read_unlock();
Hm, but untracked pages are the rare exception. It would seem better
for that case to take the rcu_read_lock() unnecessarily, than it is to
look up folio->memcg_data twice in the fast path?
* Re: [PATCH RFC 02/28] mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock holding
2025-04-17 14:48 ` Johannes Weiner
@ 2025-04-18 2:38 ` Muchun Song
0 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-18 2:38 UTC (permalink / raw)
To: Johannes Weiner
Cc: Muchun Song, mhocko, roman.gushchin, shakeel.butt, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
> On Apr 17, 2025, at 22:48, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 15, 2025 at 10:45:06AM +0800, Muchun Song wrote:
>> If a folio isn't charged to the memory cgroup, holding an rcu read lock
>> is needless. Users only want to know its charge status, so use
>> folio_memcg_charged() here.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>> mm/memcontrol.c | 11 ++++-------
>> 1 file changed, 4 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 61488e45cab2..0fc76d50bc23 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -797,20 +797,17 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>> void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx,
>> int val)
>> {
>> - struct mem_cgroup *memcg;
>> pg_data_t *pgdat = folio_pgdat(folio);
>> struct lruvec *lruvec;
>>
>> - rcu_read_lock();
>> - memcg = folio_memcg(folio);
>> - /* Untracked pages have no memcg, no lruvec. Update only the node */
>> - if (!memcg) {
>> - rcu_read_unlock();
>> + if (!folio_memcg_charged(folio)) {
>> + /* Untracked pages have no memcg, no lruvec. Update only the node */
>> __mod_node_page_state(pgdat, idx, val);
>> return;
>> }
>>
>> - lruvec = mem_cgroup_lruvec(memcg, pgdat);
>> + rcu_read_lock();
>> + lruvec = mem_cgroup_lruvec(folio_memcg(folio), pgdat);
>> __mod_lruvec_state(lruvec, idx, val);
>> rcu_read_unlock();
>
> Hm, but untracked pages are the rare exception. It would seem better
> for that case to take the rcu_read_lock() unnecessarily, than it is to
> look up folio->memcg_data twice in the fast path?
Yep, you are right. I'll drop this in the next version. Thanks.
Muchun,
Thanks.
* [PATCH RFC 03/28] mm: workingset: use folio_lruvec() in workingset_refault()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
2025-04-15 2:45 ` [PATCH RFC 01/28] mm: memcontrol: remove dead code of checking parent memory cgroup Muchun Song
2025-04-15 2:45 ` [PATCH RFC 02/28] mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock holding Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-17 14:52 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 04/28] mm: rename unlock_page_lruvec_irq and its variants Muchun Song
` (27 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
Use folio_lruvec() to simplify the code.
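folio_lruvec() bundles exactly the lookups that workingset_refault()
open-codes today; roughly:

static inline struct lruvec *folio_lruvec(struct folio *folio)
{
        struct mem_cgroup *memcg = folio_memcg(folio);

        VM_WARN_ON_ONCE_FOLIO(!memcg && !mem_cgroup_disabled(), folio);
        return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
}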
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/workingset.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 4841ae8af411..ebafc0eaafba 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -534,8 +534,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
void workingset_refault(struct folio *folio, void *shadow)
{
bool file = folio_is_file_lru(folio);
- struct pglist_data *pgdat;
- struct mem_cgroup *memcg;
struct lruvec *lruvec;
bool workingset;
long nr;
@@ -557,10 +555,7 @@ void workingset_refault(struct folio *folio, void *shadow)
* locked to guarantee folio_memcg() stability throughout.
*/
nr = folio_nr_pages(folio);
- memcg = folio_memcg(folio);
- pgdat = folio_pgdat(folio);
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
+ lruvec = folio_lruvec(folio);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
--
2.20.1
* Re: [PATCH RFC 03/28] mm: workingset: use folio_lruvec() in workingset_refault()
2025-04-15 2:45 ` [PATCH RFC 03/28] mm: workingset: use folio_lruvec() in workingset_refault() Muchun Song
@ 2025-04-17 14:52 ` Johannes Weiner
0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2025-04-17 14:52 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:07AM +0800, Muchun Song wrote:
> Use folio_lruvec() to simplify the code.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH RFC 04/28] mm: rename unlock_page_lruvec_irq and its variants
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (2 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 03/28] mm: workingset: use folio_lruvec() in workingset_refault() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-17 14:53 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 05/28] mm: thp: replace folio_memcg() with folio_memcg_charged() Muchun Song
` (26 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
It is inappropriate to use folio_lruvec_lock() variants in conjunction with
unlock_page_lruvec() variants, as this involves the inconsistent operation of
locking a folio while unlocking a page. To rectify this, the functions
unlock_page_lruvec{_irq, _irqrestore} are renamed to lruvec_unlock{_irq,
_irqrestore}.
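After the rename, the lock and unlock sides pair up consistently, as in
the folio_activate() hunk below:

        lruvec = folio_lruvec_lock_irq(folio);
        lru_activate(lruvec, folio);
        lruvec_unlock_irq(lruvec);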
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/memcontrol.h | 10 +++++-----
mm/compaction.c | 14 +++++++-------
mm/huge_memory.c | 2 +-
mm/mlock.c | 2 +-
mm/swap.c | 12 ++++++------
mm/vmscan.c | 4 ++--
6 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..a045819bcf40 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1510,17 +1510,17 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}
-static inline void unlock_page_lruvec(struct lruvec *lruvec)
+static inline void lruvec_unlock(struct lruvec *lruvec)
{
spin_unlock(&lruvec->lru_lock);
}
-static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+static inline void lruvec_unlock_irq(struct lruvec *lruvec)
{
spin_unlock_irq(&lruvec->lru_lock);
}
-static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
unsigned long flags)
{
spin_unlock_irqrestore(&lruvec->lru_lock, flags);
@@ -1542,7 +1542,7 @@ static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio,
if (folio_matches_lruvec(folio, locked_lruvec))
return locked_lruvec;
- unlock_page_lruvec_irq(locked_lruvec);
+ lruvec_unlock_irq(locked_lruvec);
}
return folio_lruvec_lock_irq(folio);
@@ -1556,7 +1556,7 @@ static inline void folio_lruvec_relock_irqsave(struct folio *folio,
if (folio_matches_lruvec(folio, *lruvecp))
return;
- unlock_page_lruvec_irqrestore(*lruvecp, *flags);
+ lruvec_unlock_irqrestore(*lruvecp, *flags);
}
*lruvecp = folio_lruvec_lock_irqsave(folio, flags);
diff --git a/mm/compaction.c b/mm/compaction.c
index 139f00c0308a..ce45d633ddad 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -946,7 +946,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (!(low_pfn % COMPACT_CLUSTER_MAX)) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -997,7 +997,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
}
/* for alloc_contig case */
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -1089,7 +1089,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (unlikely(__PageMovable(page)) &&
!PageIsolated(page)) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -1194,7 +1194,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
/* If we already hold the lock, we can skip some rechecking */
if (lruvec != locked) {
if (locked)
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
locked = lruvec;
@@ -1262,7 +1262,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_fail_put:
/* Avoid potential deadlock in freeing page under lru_lock */
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
folio_put(folio);
@@ -1278,7 +1278,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (nr_isolated) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
putback_movable_pages(&cc->migratepages);
@@ -1310,7 +1310,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_abort:
if (locked)
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
if (folio) {
folio_set_lru(folio);
folio_put(folio);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47682d1ab7..df66aa4bc4c2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3605,7 +3605,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
folio_ref_unfreeze(origin_folio, 1 +
((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
- unlock_page_lruvec(lruvec);
+ lruvec_unlock(lruvec);
if (swap_cache)
xa_unlock(&swap_cache->i_pages);
diff --git a/mm/mlock.c b/mm/mlock.c
index 3cb72b579ffd..86cad963edb7 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -205,7 +205,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
}
if (lruvec)
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
folios_put(fbatch);
}
diff --git a/mm/swap.c b/mm/swap.c
index 77b2d5997873..ee19e171857d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -91,7 +91,7 @@ static void page_cache_release(struct folio *folio)
__page_cache_release(folio, &lruvec, &flags);
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
}
void __folio_put(struct folio *folio)
@@ -171,7 +171,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
folios_put(fbatch);
}
@@ -343,7 +343,7 @@ void folio_activate(struct folio *folio)
lruvec = folio_lruvec_lock_irq(folio);
lru_activate(lruvec, folio);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
folio_set_lru(folio);
}
#endif
@@ -953,7 +953,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (folio_is_zone_device(folio)) {
if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
lruvec = NULL;
}
if (folio_ref_sub_and_test(folio, nr_refs))
@@ -967,7 +967,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
/* hugetlb has its own memcg */
if (folio_test_hugetlb(folio)) {
if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
lruvec = NULL;
}
free_huge_folio(folio);
@@ -981,7 +981,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
j++;
}
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
if (!j) {
folio_batch_reinit(folios);
return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b620d74b0f66..a76b3cee043d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1847,7 +1847,7 @@ bool folio_isolate_lru(struct folio *folio)
folio_get(folio);
lruvec = folio_lruvec_lock_irq(folio);
lruvec_del_folio(lruvec, folio);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
ret = true;
}
@@ -7681,7 +7681,7 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
if (lruvec) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
} else if (pgscanned) {
count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
}
--
2.20.1
* Re: [PATCH RFC 04/28] mm: rename unlock_page_lruvec_irq and its variants
2025-04-15 2:45 ` [PATCH RFC 04/28] mm: rename unlock_page_lruvec_irq and its variants Muchun Song
@ 2025-04-17 14:53 ` Johannes Weiner
0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2025-04-17 14:53 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:08AM +0800, Muchun Song wrote:
> It is inappropriate to use folio_lruvec_lock() variants in conjunction with
> unlock_page_lruvec() variants, as this involves the inconsistent operation of
> locking a folio while unlocking a page. To rectify this, the functions
> unlock_page_lruvec{_irq, _irqrestore} are renamed to lruvec_unlock{_irq,
> _irqrestore}.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH RFC 05/28] mm: thp: replace folio_memcg() with folio_memcg_charged()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (3 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 04/28] mm: rename unlock_page_lruvec_irq and its variants Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-17 14:54 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants Muchun Song
` (25 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
folio_memcg_charged() is intended for use when the user is unconcerned
about the returned memcg pointer. It is more efficient than folio_memcg().
Therefore, replace folio_memcg() with folio_memcg_charged().
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/huge_memory.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index df66aa4bc4c2..a81e89987ca2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4048,7 +4048,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
bool unqueued = false;
WARN_ON_ONCE(folio_ref_count(folio));
- WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
+ WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
ds_queue = get_deferred_split_queue(folio);
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
--
2.20.1
* Re: [PATCH RFC 05/28] mm: thp: replace folio_memcg() with folio_memcg_charged()
2025-04-15 2:45 ` [PATCH RFC 05/28] mm: thp: replace folio_memcg() with folio_memcg_charged() Muchun Song
@ 2025-04-17 14:54 ` Johannes Weiner
0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2025-04-17 14:54 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:09AM +0800, Muchun Song wrote:
> folio_memcg_charged() is intended for use when the user is unconcerned
> about the returned memcg pointer. It is more efficient than folio_memcg().
> Therefore, replace folio_memcg() with folio_memcg_charged().
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (4 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 05/28] mm: thp: replace folio_memcg() with folio_memcg_charged() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-17 14:58 ` Johannes Weiner
2025-04-18 19:50 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Muchun Song
` (24 subsequent siblings)
30 siblings, 2 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
Once this series reparents LRU folios during memcg offline, the binding
between a folio and its memcg may change, so the per-memcg split queue
lock cannot be assumed to remain the folio's split queue lock while held.
A new approach is required to reparent the split queue to its parent
memcg. This patch starts by introducing a unified way to acquire the
split queue lock, in preparation for that work.
It's a code-only refactoring with no functional changes.
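The calling convention being introduced, modelled on
__folio_unqueue_deferred_split() after this patch (a condensed sketch):

        ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
        if (!list_empty(&folio->_deferred_list)) {
                ds_queue->split_queue_len--;
                list_del_init(&folio->_deferred_list);
        }
        split_queue_unlock_irqrestore(ds_queue, flags);

The caller names only the folio; whether the per-memcg or the per-node
split queue backs it is resolved inside the helper.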
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/memcontrol.h | 10 ++++
mm/huge_memory.c | 100 +++++++++++++++++++++++++++----------
2 files changed, 83 insertions(+), 27 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a045819bcf40..bb4f203733f3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1639,6 +1639,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg);
void free_shrinker_info(struct mem_cgroup *memcg);
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
void reparent_shrinker_deferred(struct mem_cgroup *memcg);
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+ return shrinker->id;
+}
#else
#define mem_cgroup_sockets_enabled 0
static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
@@ -1652,6 +1657,11 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
int nid, int shrinker_id)
{
}
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+ return -1;
+}
#endif
#ifdef CONFIG_MEMCG
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a81e89987ca2..70820fa75c1f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1059,26 +1059,75 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
#ifdef CONFIG_MEMCG
static inline
-struct deferred_split *get_deferred_split_queue(struct folio *folio)
+struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
+ struct deferred_split *queue)
+{
+ if (mem_cgroup_disabled())
+ return NULL;
+ if (&NODE_DATA(folio_nid(folio))->deferred_split_queue == queue)
+ return NULL;
+ return container_of(queue, struct mem_cgroup, deferred_split_queue);
+}
+
+static inline struct deferred_split *folio_memcg_split_queue(struct folio *folio)
{
struct mem_cgroup *memcg = folio_memcg(folio);
- struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
- if (memcg)
- return &memcg->deferred_split_queue;
- else
- return &pgdat->deferred_split_queue;
+ return memcg ? &memcg->deferred_split_queue : NULL;
}
#else
static inline
-struct deferred_split *get_deferred_split_queue(struct folio *folio)
+struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
+ struct deferred_split *queue)
{
- struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
+ return NULL;
+}
- return &pgdat->deferred_split_queue;
+static inline struct deferred_split *folio_memcg_split_queue(struct folio *folio)
+{
+ return NULL;
}
#endif
+static struct deferred_split *folio_split_queue(struct folio *folio)
+{
+ struct deferred_split *queue = folio_memcg_split_queue(folio);
+
+ return queue ? : &NODE_DATA(folio_nid(folio))->deferred_split_queue;
+}
+
+static struct deferred_split *folio_split_queue_lock(struct folio *folio)
+{
+ struct deferred_split *queue;
+
+ queue = folio_split_queue(folio);
+ spin_lock(&queue->split_queue_lock);
+
+ return queue;
+}
+
+static struct deferred_split *
+folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
+{
+ struct deferred_split *queue;
+
+ queue = folio_split_queue(folio);
+ spin_lock_irqsave(&queue->split_queue_lock, *flags);
+
+ return queue;
+}
+
+static inline void split_queue_unlock(struct deferred_split *queue)
+{
+ spin_unlock(&queue->split_queue_lock);
+}
+
+static inline void split_queue_unlock_irqrestore(struct deferred_split *queue,
+ unsigned long flags)
+{
+ spin_unlock_irqrestore(&queue->split_queue_lock, flags);
+}
+
static inline bool is_transparent_hugepage(const struct folio *folio)
{
if (!folio_test_large(folio))
@@ -3723,7 +3772,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
struct page *split_at, struct page *lock_at,
struct list_head *list, bool uniform_split)
{
- struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+ struct deferred_split *ds_queue;
XA_STATE(xas, &folio->mapping->i_pages, folio->index);
bool is_anon = folio_test_anon(folio);
struct address_space *mapping = NULL;
@@ -3857,7 +3906,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
}
/* Prevent deferred_split_scan() touching ->_refcount */
- spin_lock(&ds_queue->split_queue_lock);
+ ds_queue = folio_split_queue_lock(folio);
if (folio_ref_freeze(folio, 1 + extra_pins)) {
if (folio_order(folio) > 1 &&
!list_empty(&folio->_deferred_list)) {
@@ -3875,7 +3924,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
*/
list_del_init(&folio->_deferred_list);
}
- spin_unlock(&ds_queue->split_queue_lock);
+ split_queue_unlock(ds_queue);
if (mapping) {
int nr = folio_nr_pages(folio);
@@ -3896,7 +3945,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
split_at, lock_at, list, end, &xas, mapping,
uniform_split);
} else {
- spin_unlock(&ds_queue->split_queue_lock);
+ split_queue_unlock(ds_queue);
fail:
if (mapping)
xas_unlock(&xas);
@@ -4050,8 +4099,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
WARN_ON_ONCE(folio_ref_count(folio));
WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
- ds_queue = get_deferred_split_queue(folio);
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
if (!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
if (folio_test_partially_mapped(folio)) {
@@ -4062,7 +4110,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
list_del_init(&folio->_deferred_list);
unqueued = true;
}
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ split_queue_unlock_irqrestore(ds_queue, flags);
return unqueued; /* useful for debug warnings */
}
@@ -4070,10 +4118,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
/* partially_mapped=false won't clear PG_partially_mapped folio flag */
void deferred_split_folio(struct folio *folio, bool partially_mapped)
{
- struct deferred_split *ds_queue = get_deferred_split_queue(folio);
-#ifdef CONFIG_MEMCG
- struct mem_cgroup *memcg = folio_memcg(folio);
-#endif
+ struct deferred_split *ds_queue;
unsigned long flags;
/*
@@ -4096,7 +4141,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
if (folio_test_swapcache(folio))
return;
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
if (partially_mapped) {
if (!folio_test_partially_mapped(folio)) {
folio_set_partially_mapped(folio);
@@ -4111,15 +4156,16 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
}
if (list_empty(&folio->_deferred_list)) {
+ struct mem_cgroup *memcg;
+
+ memcg = folio_split_queue_memcg(folio, ds_queue);
list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
ds_queue->split_queue_len++;
-#ifdef CONFIG_MEMCG
if (memcg)
set_shrinker_bit(memcg, folio_nid(folio),
- deferred_split_shrinker->id);
-#endif
+ shrinker_id(deferred_split_shrinker));
}
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ split_queue_unlock_irqrestore(ds_queue, flags);
}
static unsigned long deferred_split_count(struct shrinker *shrink,
@@ -4202,7 +4248,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
if (!--sc->nr_to_scan)
break;
}
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ split_queue_unlock_irqrestore(ds_queue, flags);
list_for_each_entry_safe(folio, next, &list, _deferred_list) {
bool did_split = false;
@@ -4251,7 +4297,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
list_splice_tail(&list, &ds_queue->split_queue);
ds_queue->split_queue_len -= removed;
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ split_queue_unlock_irqrestore(ds_queue, flags);
if (prev)
folio_put(prev);
--
2.20.1
* Re: [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants
2025-04-15 2:45 ` [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants Muchun Song
@ 2025-04-17 14:58 ` Johannes Weiner
2025-04-18 19:50 ` Johannes Weiner
1 sibling, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2025-04-17 14:58 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:10AM +0800, Muchun Song wrote:
> Once this series reparents LRU folios during memcg offline, the binding
> between a folio and its memcg may change, so the per-memcg split queue
> lock cannot be assumed to remain the folio's split queue lock while held.
>
> A new approach is required to reparent the split queue to its parent
> memcg. This patch starts by introducing a unified way to acquire the
> split queue lock, in preparation for that work.
>
> It's a code-only refactoring with no functional changes.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants
2025-04-15 2:45 ` [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants Muchun Song
2025-04-17 14:58 ` Johannes Weiner
@ 2025-04-18 19:50 ` Johannes Weiner
2025-04-19 14:20 ` Muchun Song
1 sibling, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2025-04-18 19:50 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:10AM +0800, Muchun Song wrote:
> @@ -4202,7 +4248,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> if (!--sc->nr_to_scan)
> break;
> }
> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> + split_queue_unlock_irqrestore(ds_queue, flags);
>
> list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> bool did_split = false;
> @@ -4251,7 +4297,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> list_splice_tail(&list, &ds_queue->split_queue);
> ds_queue->split_queue_len -= removed;
> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> + split_queue_unlock_irqrestore(ds_queue, flags);
These just tripped up in my testing. You use the new helpers for
unlock, but not for the lock path. That's fine in this patch, but when
"mm: thp: prepare for reparenting LRU pages for split queue lock" adds
the rcu locking to the helpers, it results in missing rcu read locks:
[ 108.814880]
[ 108.816378] =====================================
[ 108.821069] WARNING: bad unlock balance detected!
[ 108.825762] 6.15.0-rc2-00028-g570c8034f057 #192 Not tainted
[ 108.831323] -------------------------------------
[ 108.836016] cc1/2031 is trying to release lock (rcu_read_lock) at:
[ 108.842181] [<ffffffff815f9d05>] deferred_split_scan+0x235/0x4b0
[ 108.848179] but there are no more locks to release!
[ 108.853046]
[ 108.853046] other info that might help us debug this:
[ 108.859553] 2 locks held by cc1/2031:
[ 108.863211] #0: ffff88801ddbbd88 (vm_lock){....}-{0:0}, at: do_user_addr_fault+0x19c/0x6b0
[ 108.871544] #1: ffffffff83042400 (fs_reclaim){....}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x337/0xf20
[ 108.881511]
[ 108.881511] stack backtrace:
[ 108.885862] CPU: 4 UID: 0 PID: 2031 Comm: cc1 Not tainted 6.15.0-rc2-00028-g570c8034f057 #192 PREEMPT(voluntary)
[ 108.885865] Hardware name: Micro-Star International Co., Ltd. MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.80 12/25/2019
[ 108.885866] Call Trace:
[ 108.885867] <TASK>
[ 108.885868] dump_stack_lvl+0x57/0x80
[ 108.885871] ? deferred_split_scan+0x235/0x4b0
[ 108.885874] print_unlock_imbalance_bug.part.0+0xfb/0x110
[ 108.885877] ? deferred_split_scan+0x235/0x4b0
[ 108.885878] lock_release+0x258/0x3e0
[ 108.885880] ? deferred_split_scan+0x85/0x4b0
[ 108.885881] deferred_split_scan+0x23a/0x4b0
[ 108.885885] ? find_held_lock+0x32/0x80
[ 108.885886] ? local_clock_noinstr+0x9/0xd0
[ 108.885887] ? lock_release+0x17e/0x3e0
[ 108.885889] do_shrink_slab+0x155/0x480
[ 108.885891] shrink_slab+0x33c/0x480
[ 108.885892] ? shrink_slab+0x1c1/0x480
[ 108.885893] shrink_node+0x324/0x840
[ 108.885895] do_try_to_free_pages+0xdf/0x550
[ 108.885897] try_to_free_pages+0xeb/0x260
[ 108.885899] __alloc_pages_slowpath.constprop.0+0x35c/0xf20
[ 108.885901] __alloc_frozen_pages_noprof+0x339/0x360
[ 108.885903] __folio_alloc_noprof+0x10/0x90
[ 108.885904] __handle_mm_fault+0xca5/0x1930
[ 108.885906] handle_mm_fault+0xb6/0x310
[ 108.885908] do_user_addr_fault+0x21e/0x6b0
[ 108.885910] exc_page_fault+0x62/0x1d0
[ 108.885911] asm_exc_page_fault+0x22/0x30
[ 108.885912] RIP: 0033:0xf64890
[ 108.885914] Code: 4e 64 31 d2 b9 01 00 00 00 31 f6 4c 89 45 98 e8 66 b3 88 ff 4c 8b 45 98 bf 28 00 00 00 b9 08 00 00 00 49 8b 70 18 48 8b 56 58 <48> 89 10 48 8b 13 48 89 46 58 c7 46 60 00 00 00 00 e9 62 01 00 00
[ 108.885915] RSP: 002b:00007ffcf3c7d920 EFLAGS: 00010206
[ 108.885916] RAX: 00007f7bf07c5000 RBX: 00007ffcf3c7d9a0 RCX: 0000000000000008
[ 108.885917] RDX: 00007f7bf06aa000 RSI: 00007f7bf09dd400 RDI: 0000000000000028
[ 108.885917] RBP: 00007ffcf3c7d990 R08: 00007f7bf080c540 R09: 0000000000000007
[ 108.885918] R10: 000000000000009a R11: 000000003e969900 R12: 00007f7bf07bbe70
[ 108.885918] R13: 0000000000000000 R14: 00007f7bf07bbec0 R15: 00007ffcf3c7d930
[ 108.885920] </TASK>
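In short (a sketch, assuming that later patch makes the helpers take and
release the RCU read lock), deferred_split_scan() ends up doing:

        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);  /* no rcu_read_lock() */
        /* ... */
        split_queue_unlock_irqrestore(ds_queue, flags);         /* rcu_read_unlock() */

which releases an RCU read lock that was never taken, hence the
imbalance above.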
* Re: [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants
2025-04-18 19:50 ` Johannes Weiner
@ 2025-04-19 14:20 ` Muchun Song
0 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-19 14:20 UTC (permalink / raw)
To: Johannes Weiner
Cc: Muchun Song, mhocko, roman.gushchin, shakeel.butt, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
> On Apr 19, 2025, at 03:50, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 15, 2025 at 10:45:10AM +0800, Muchun Song wrote:
>> @@ -4202,7 +4248,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> if (!--sc->nr_to_scan)
>> break;
>> }
>> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> + split_queue_unlock_irqrestore(ds_queue, flags);
>>
>> list_for_each_entry_safe(folio, next, &list, _deferred_list) {
>> bool did_split = false;
>> @@ -4251,7 +4297,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> list_splice_tail(&list, &ds_queue->split_queue);
>> ds_queue->split_queue_len -= removed;
>> - spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
>> + split_queue_unlock_irqrestore(ds_queue, flags);
>
> These just tripped up in my testing. You use the new helpers for
> unlock, but not for the lock path. That's fine in this patch, but when
> "mm: thp: prepare for reparenting LRU pages for split queue lock" adds
> the rcu locking to the helpers, it results in missing rcu read locks:
Good catch! Thanks for pointing this out. You are right, I shouldn't use the
new unlock helpers here without the corresponding new lock helpers. I'll
revert this change in this function.
Muchun,
Thanks.
>
> [ 108.814880]
> [ 108.816378] =====================================
> [ 108.821069] WARNING: bad unlock balance detected!
> [ 108.825762] 6.15.0-rc2-00028-g570c8034f057 #192 Not tainted
> [ 108.831323] -------------------------------------
> [ 108.836016] cc1/2031 is trying to release lock (rcu_read_lock) at:
> [ 108.842181] [<ffffffff815f9d05>] deferred_split_scan+0x235/0x4b0
> [ 108.848179] but there are no more locks to release!
> [ 108.853046]
> [ 108.853046] other info that might help us debug this:
> [ 108.859553] 2 locks held by cc1/2031:
> [ 108.863211] #0: ffff88801ddbbd88 (vm_lock){....}-{0:0}, at: do_user_addr_fault+0x19c/0x6b0
> [ 108.871544] #1: ffffffff83042400 (fs_reclaim){....}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0x337/0xf20
> [ 108.881511]
> [ 108.881511] stack backtrace:
> [ 108.885862] CPU: 4 UID: 0 PID: 2031 Comm: cc1 Not tainted 6.15.0-rc2-00028-g570c8034f057 #192 PREEMPT(voluntary)
> [ 108.885865] Hardware name: Micro-Star International Co., Ltd. MS-7B98/Z390-A PRO (MS-7B98), BIOS 1.80 12/25/2019
> [ 108.885866] Call Trace:
> [ 108.885867] <TASK>
> [ 108.885868] dump_stack_lvl+0x57/0x80
> [ 108.885871] ? deferred_split_scan+0x235/0x4b0
> [ 108.885874] print_unlock_imbalance_bug.part.0+0xfb/0x110
> [ 108.885877] ? deferred_split_scan+0x235/0x4b0
> [ 108.885878] lock_release+0x258/0x3e0
> [ 108.885880] ? deferred_split_scan+0x85/0x4b0
> [ 108.885881] deferred_split_scan+0x23a/0x4b0
> [ 108.885885] ? find_held_lock+0x32/0x80
> [ 108.885886] ? local_clock_noinstr+0x9/0xd0
> [ 108.885887] ? lock_release+0x17e/0x3e0
> [ 108.885889] do_shrink_slab+0x155/0x480
> [ 108.885891] shrink_slab+0x33c/0x480
> [ 108.885892] ? shrink_slab+0x1c1/0x480
> [ 108.885893] shrink_node+0x324/0x840
> [ 108.885895] do_try_to_free_pages+0xdf/0x550
> [ 108.885897] try_to_free_pages+0xeb/0x260
> [ 108.885899] __alloc_pages_slowpath.constprop.0+0x35c/0xf20
> [ 108.885901] __alloc_frozen_pages_noprof+0x339/0x360
> [ 108.885903] __folio_alloc_noprof+0x10/0x90
> [ 108.885904] __handle_mm_fault+0xca5/0x1930
> [ 108.885906] handle_mm_fault+0xb6/0x310
> [ 108.885908] do_user_addr_fault+0x21e/0x6b0
> [ 108.885910] exc_page_fault+0x62/0x1d0
> [ 108.885911] asm_exc_page_fault+0x22/0x30
> [ 108.885912] RIP: 0033:0xf64890
> [ 108.885914] Code: 4e 64 31 d2 b9 01 00 00 00 31 f6 4c 89 45 98 e8 66 b3 88 ff 4c 8b 45 98 bf 28 00 00 00 b9 08 00 00 00 49 8b 70 18 48 8b 56 58 <48> 89 10 48 8b 13 48 89 46 58 c7 46 60 00 00 00 00 e9 62 01 00 00
> [ 108.885915] RSP: 002b:00007ffcf3c7d920 EFLAGS: 00010206
> [ 108.885916] RAX: 00007f7bf07c5000 RBX: 00007ffcf3c7d9a0 RCX: 0000000000000008
> [ 108.885917] RDX: 00007f7bf06aa000 RSI: 00007f7bf09dd400 RDI: 0000000000000028
> [ 108.885917] RBP: 00007ffcf3c7d990 R08: 00007f7bf080c540 R09: 0000000000000007
> [ 108.885918] R10: 000000000000009a R11: 000000003e969900 R12: 00007f7bf07bbe70
> [ 108.885918] R13: 0000000000000000 R14: 00007f7bf07bbec0 R15: 00007ffcf3c7d930
> [ 108.885920] </TASK>
* [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (5 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 06/28] mm: thp: introduce folio_split_queue_lock and its variants Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-30 14:37 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 08/28] mm: vmscan: refactor move_folios_to_lru() Muchun Song
` (23 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
The maintenance of the folio->_deferred_list is intricate because it's
reused in a local list.
Here are some peculiarities:
1) When a folio is removed from its split queue and added to a local
on-stack list in deferred_split_scan(), the ->split_queue_len isn't
updated, leading to an inconsistency between it and the actual
number of folios in the split queue.
2) When the folio is split via split_folio() later, it's removed from
the local list while holding the split queue lock. At this time,
this lock protects the local list, not the split queue.
3) To handle the race condition with a third-party freeing or migrating
the preceding folio, we must ensure there's always one safe (with
raised refcount) folio before by delaying its folio_put(). More
details can be found in commit e66f3185fa04. It's rather tricky.
We can use the folio_batch infrastructure to handle this clearly. In this
case, ->split_queue_len will be consistent with the real number of folios
in the split queue. If list_empty(&folio->_deferred_list) returns false,
it's clear the folio must be in its split queue (not in a local list
anymore).
In the future, we aim to reparent LRU folios during memcg offline to
eliminate dying memory cgroups. This patch prepares for using
folio_split_queue_lock_irqsave() as folio memcg may change then.
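The resulting shape of the scan loop (a condensed sketch of
deferred_split_scan() below):

        folio_batch_init(&fbatch);
        /* Under split_queue_lock: pin folios and take them off the queue. */
        list_for_each_entry_safe(folio, next, &ds_queue->split_queue, _deferred_list) {
                if (folio_try_get(folio))
                        folio_batch_add(&fbatch, folio);
                list_del_init(&folio->_deferred_list);
                ds_queue->split_queue_len--;    /* stays in sync with the queue */
                if (!folio_batch_space(&fbatch))
                        break;                  /* batch full: process, then retry */
        }
        /* After unlocking: split what we can, re-queue folios that remain
         * partially mapped, then drop all pinned references at once. */
        folios_put(&fbatch);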
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/huge_memory.c | 69 +++++++++++++++++++++---------------------------
1 file changed, 30 insertions(+), 39 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 70820fa75c1f..d2bc943a40e8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4220,40 +4220,47 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
struct pglist_data *pgdata = NODE_DATA(sc->nid);
struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
unsigned long flags;
- LIST_HEAD(list);
- struct folio *folio, *next, *prev = NULL;
- int split = 0, removed = 0;
+ struct folio *folio, *next;
+ int split = 0, i;
+ struct folio_batch fbatch;
+ bool done;
#ifdef CONFIG_MEMCG
if (sc->memcg)
ds_queue = &sc->memcg->deferred_split_queue;
#endif
-
+ folio_batch_init(&fbatch);
+retry:
+ done = true;
spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
/* Take pin on all head pages to avoid freeing them under us */
list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
_deferred_list) {
if (folio_try_get(folio)) {
- list_move(&folio->_deferred_list, &list);
- } else {
+ folio_batch_add(&fbatch, folio);
+ } else if (folio_test_partially_mapped(folio)) {
/* We lost race with folio_put() */
- if (folio_test_partially_mapped(folio)) {
- folio_clear_partially_mapped(folio);
- mod_mthp_stat(folio_order(folio),
- MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
- }
- list_del_init(&folio->_deferred_list);
- ds_queue->split_queue_len--;
+ folio_clear_partially_mapped(folio);
+ mod_mthp_stat(folio_order(folio),
+ MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
+ list_del_init(&folio->_deferred_list);
+ ds_queue->split_queue_len--;
if (!--sc->nr_to_scan)
break;
+ if (folio_batch_space(&fbatch) == 0) {
+ done = false;
+ break;
+ }
}
split_queue_unlock_irqrestore(ds_queue, flags);
- list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+ for (i = 0; i < folio_batch_count(&fbatch); i++) {
bool did_split = false;
bool underused = false;
+ struct deferred_split *fqueue;
+ folio = fbatch.folios[i];
if (!folio_test_partially_mapped(folio)) {
underused = thp_underused(folio);
if (!underused)
@@ -4269,39 +4276,23 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
}
folio_unlock(folio);
next:
+ if (did_split || !folio_test_partially_mapped(folio))
+ continue;
/*
- * split_folio() removes folio from list on success.
* Only add back to the queue if folio is partially mapped.
* If thp_underused returns false, or if split_folio fails
* in the case it was underused, then consider it used and
* don't add it back to split_queue.
*/
- if (did_split) {
- ; /* folio already removed from list */
- } else if (!folio_test_partially_mapped(folio)) {
- list_del_init(&folio->_deferred_list);
- removed++;
- } else {
- /*
- * That unlocked list_del_init() above would be unsafe,
- * unless its folio is separated from any earlier folios
- * left on the list (which may be concurrently unqueued)
- * by one safe folio with refcount still raised.
- */
- swap(folio, prev);
- }
- if (folio)
- folio_put(folio);
+ fqueue = folio_split_queue_lock_irqsave(folio, &flags);
+ list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
+ fqueue->split_queue_len++;
+ split_queue_unlock_irqrestore(fqueue, flags);
}
+ folios_put(&fbatch);
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
- list_splice_tail(&list, &ds_queue->split_queue);
- ds_queue->split_queue_len -= removed;
- split_queue_unlock_irqrestore(ds_queue, flags);
-
- if (prev)
- folio_put(prev);
-
+ if (!done)
+ goto retry;
/*
* Stop shrinker if we didn't split any page, but the queue is empty.
* This can happen if pages were freed under us.
--
2.20.1
* Re: [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
2025-04-15 2:45 ` [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Muchun Song
@ 2025-04-30 14:37 ` Johannes Weiner
2025-05-06 6:44 ` Hugh Dickins
0 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2025-04-30 14:37 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais, Hugh Dickins
On Tue, Apr 15, 2025 at 10:45:11AM +0800, Muchun Song wrote:
> The maintenance of the folio->_deferred_list is intricate because it's
> reused in a local list.
>
> Here are some peculiarities:
>
> 1) When a folio is removed from its split queue and added to a local
> on-stack list in deferred_split_scan(), the ->split_queue_len isn't
> updated, leading to an inconsistency between it and the actual
> number of folios in the split queue.
>
> 2) When the folio is split via split_folio() later, it's removed from
> the local list while holding the split queue lock. At this time,
> this lock protects the local list, not the split queue.
>
> 3) To handle the race condition with a third-party freeing or migrating
> the preceding folio, we must ensure there's always one safe (with
> raised refcount) folio before by delaying its folio_put(). More
> details can be found in commit e66f3185fa04. It's rather tricky.
>
> We can use the folio_batch infrastructure to handle this clearly. In this
> case, ->split_queue_len will be consistent with the real number of folios
> in the split queue. If list_empty(&folio->_deferred_list) returns false,
> it's clear the folio must be in its split queue (not in a local list
> anymore).
>
> In the future, we aim to reparent LRU folios during memcg offline to
> eliminate dying memory cgroups. This patch prepares for using
> folio_split_queue_lock_irqsave() as folio memcg may change then.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
This is a very nice simplification. And getting rid of the stack list
and its subtle implication on all the various current and future
list_empty(&folio->_deferred_list) checks should be much more robust.
However, I think there is one snag related to this:
> ---
> mm/huge_memory.c | 69 +++++++++++++++++++++---------------------------
> 1 file changed, 30 insertions(+), 39 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 70820fa75c1f..d2bc943a40e8 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4220,40 +4220,47 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> struct pglist_data *pgdata = NODE_DATA(sc->nid);
> struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
> unsigned long flags;
> - LIST_HEAD(list);
> - struct folio *folio, *next, *prev = NULL;
> - int split = 0, removed = 0;
> + struct folio *folio, *next;
> + int split = 0, i;
> + struct folio_batch fbatch;
> + bool done;
>
> #ifdef CONFIG_MEMCG
> if (sc->memcg)
> ds_queue = &sc->memcg->deferred_split_queue;
> #endif
> -
> + folio_batch_init(&fbatch);
> +retry:
> + done = true;
> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> /* Take pin on all head pages to avoid freeing them under us */
> list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
> _deferred_list) {
> if (folio_try_get(folio)) {
> - list_move(&folio->_deferred_list, &list);
> - } else {
> + folio_batch_add(&fbatch, folio);
> + } else if (folio_test_partially_mapped(folio)) {
> /* We lost race with folio_put() */
> - if (folio_test_partially_mapped(folio)) {
> - folio_clear_partially_mapped(folio);
> - mod_mthp_stat(folio_order(folio),
> - MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> - }
> - list_del_init(&folio->_deferred_list);
> - ds_queue->split_queue_len--;
> + folio_clear_partially_mapped(folio);
> + mod_mthp_stat(folio_order(folio),
> + MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> }
> + list_del_init(&folio->_deferred_list);
> + ds_queue->split_queue_len--;
> if (!--sc->nr_to_scan)
> break;
> + if (folio_batch_space(&fbatch) == 0) {
> + done = false;
> + break;
> + }
> }
> split_queue_unlock_irqrestore(ds_queue, flags);
>
> - list_for_each_entry_safe(folio, next, &list, _deferred_list) {
> + for (i = 0; i < folio_batch_count(&fbatch); i++) {
> bool did_split = false;
> bool underused = false;
> + struct deferred_split *fqueue;
>
> + folio = fbatch.folios[i];
> if (!folio_test_partially_mapped(folio)) {
> underused = thp_underused(folio);
> if (!underused)
> @@ -4269,39 +4276,23 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> }
> folio_unlock(folio);
> next:
> + if (did_split || !folio_test_partially_mapped(folio))
> + continue;
There IS a list_empty() check in the splitting code that we actually
relied on, for cleaning up the partially_mapped state and counter:
!list_empty(&folio->_deferred_list)) {
ds_queue->split_queue_len--;
if (folio_test_partially_mapped(folio)) {
folio_clear_partially_mapped(folio);
mod_mthp_stat(folio_order(folio),
MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
}
/*
* Reinitialize page_deferred_list after removing the
* page from the split_queue, otherwise a subsequent
* split will see list corruption when checking the
* page_deferred_list.
*/
list_del_init(&folio->_deferred_list);
With the folios isolated up front, it looks like you need to handle
this from the shrinker.
Otherwise this looks correct to me. But this code is subtle, so I would
feel much better if Hugh (CC-ed) could take a look as well.
Thanks!
> /*
> - * split_folio() removes folio from list on success.
> * Only add back to the queue if folio is partially mapped.
> * If thp_underused returns false, or if split_folio fails
> * in the case it was underused, then consider it used and
> * don't add it back to split_queue.
> */
> - if (did_split) {
> - ; /* folio already removed from list */
> - } else if (!folio_test_partially_mapped(folio)) {
> - list_del_init(&folio->_deferred_list);
> - removed++;
> - } else {
> - /*
> - * That unlocked list_del_init() above would be unsafe,
> - * unless its folio is separated from any earlier folios
> - * left on the list (which may be concurrently unqueued)
> - * by one safe folio with refcount still raised.
> - */
> - swap(folio, prev);
> - }
> - if (folio)
> - folio_put(folio);
> + fqueue = folio_split_queue_lock_irqsave(folio, &flags);
> + list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
> + fqueue->split_queue_len++;
> + split_queue_unlock_irqrestore(fqueue, flags);
> }
> + folios_put(&fbatch);
>
> - spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> - list_splice_tail(&list, &ds_queue->split_queue);
> - ds_queue->split_queue_len -= removed;
> - split_queue_unlock_irqrestore(ds_queue, flags);
> -
> - if (prev)
> - folio_put(prev);
> -
> + if (!done)
> + goto retry;
> /*
> * Stop shrinker if we didn't split any page, but the queue is empty.
> * This can happen if pages were freed under us.
> --
> 2.20.1
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
2025-04-30 14:37 ` Johannes Weiner
@ 2025-05-06 6:44 ` Hugh Dickins
2025-05-06 21:44 ` Hugh Dickins
0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2025-05-06 6:44 UTC (permalink / raw)
To: Johannes Weiner, Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais, Hugh Dickins
On Wed, 30 Apr 2025, Johannes Weiner wrote:
> On Tue, Apr 15, 2025 at 10:45:11AM +0800, Muchun Song wrote:
> > The maintenance of the folio->_deferred_list is intricate because it's
> > reused in a local list.
> >
> > Here are some peculiarities:
> >
> > 1) When a folio is removed from its split queue and added to a local
> > on-stack list in deferred_split_scan(), the ->split_queue_len isn't
> > updated, leading to an inconsistency between it and the actual
> > number of folios in the split queue.
> >
> > 2) When the folio is split via split_folio() later, it's removed from
> > the local list while holding the split queue lock. At this time,
> > this lock protects the local list, not the split queue.
> >
> > 3) To handle the race condition with a third-party freeing or migrating
> > the preceding folio, we must ensure there's always one safe (with
> > raised refcount) folio before by delaying its folio_put(). More
> > details can be found in commit e66f3185fa04. It's rather tricky.
> >
> > We can use the folio_batch infrastructure to handle this clearly. In this
> > case, ->split_queue_len will be consistent with the real number of folios
> > in the split queue. If list_empty(&folio->_deferred_list) returns false,
> > it's clear the folio must be in its split queue (not in a local list
> > anymore).
> >
> > In the future, we aim to reparent LRU folios during memcg offline to
> > eliminate dying memory cgroups. This patch prepares for using
> > folio_split_queue_lock_irqsave() as folio memcg may change then.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>
> This is a very nice simplification. And getting rid of the stack list
> and its subtle implication on all the various current and future
> list_empty(&folio->_deferred_list) checks should be much more robust.
>
> However, I think there is one snag related to this:
>...
> There IS a list_empty() check in the splitting code that we actually
> relied on, for cleaning up the partially_mapped state and counter:
>
> !list_empty(&folio->_deferred_list)) {
> ds_queue->split_queue_len--;
> if (folio_test_partially_mapped(folio)) {
> folio_clear_partially_mapped(folio);
> mod_mthp_stat(folio_order(folio),
> MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> }
> /*
> * Reinitialize page_deferred_list after removing the
> * page from the split_queue, otherwise a subsequent
> * split will see list corruption when checking the
> * page_deferred_list.
> */
> list_del_init(&folio->_deferred_list);
>
> With the folios isolated up front, it looks like you need to handle
> this from the shrinker.
Good catch. I loaded up patches 01-07/28 on top of 6.15-rc5 yesterday,
and after a good run of 12 hours on this laptop, indeed I can see
vmstat nr_anon_partially_mapped 78299, whereas it usually ends up at 0.
>
> Otherwise this looks correct to me. But this code is subtle, I would
> feel much better if Hugh (CC-ed) could take a look as well.
However... I was intending to run it for 12 hours on the workstation,
but after 11 hours and 35 minutes, that crashed with list_del corruption,
kernel BUG at lib/list_debug.c:65! from deferred_split_scan()'s
list_del_init().
I've not yet put together the explanation: I am deeply suspicious of
the change to when list_empty() becomes true (the block Hannes shows
above is not the only such: (__)folio_unqueue_deferred_split() and
migrate_pages_batch() consult it too), but each time I think I have
the explanation, it's ruled out by folio_try_get()'s reference.
And aside from the crash (I don't suppose 6.15-rc5 is responsible,
or that patches 08-28/28 would fix it), I'm not so sure that this
patch is really an improvement (folio reference held for longer, and
list lock taken more often when split fails: maybe not important, but
I'm also not so keen on adding in fbatch myself). I didn't spend very
long looking through the patches, but maybe this 07/28 is not essential?
Let me try again to work out what's wrong tomorrow,
Hugh
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
2025-05-06 6:44 ` Hugh Dickins
@ 2025-05-06 21:44 ` Hugh Dickins
2025-05-07 3:30 ` Muchun Song
0 siblings, 1 reply; 69+ messages in thread
From: Hugh Dickins @ 2025-05-06 21:44 UTC (permalink / raw)
To: Muchun Song
Cc: Johannes Weiner, mhocko, roman.gushchin, shakeel.butt,
muchun.song, akpm, david, zhengqi.arch, yosry.ahmed, nphamcs,
chengming.zhou, linux-kernel, cgroups, linux-mm, hamzamahfooz,
apais, Hugh Dickins
On Mon, 5 May 2025, Hugh Dickins wrote:
...
>
> However... I was intending to run it for 12 hours on the workstation,
> but after 11 hours and 35 minutes, that crashed with list_del corruption,
> kernel BUG at lib/list_debug.c:65! from deferred_split_scan()'s
> list_del_init().
>
> I've not yet put together the explanation: I am deeply suspicious of
> the change to when list_empty() becomes true (the block Hannes shows
> above is not the only such: (__)folio_unqueue_deferred_split() and
> migrate_pages_batch() consult it too), but each time I think I have
> the explanation, it's ruled out by folio_try_get()'s reference.
>
> And aside from the crash (I don't suppose 6.15-rc5 is responsible,
> or that patches 08-28/28 would fix it), I'm not so sure that this
> patch is really an improvement (folio reference held for longer, and
> list lock taken more often when split fails: maybe not important, but
> I'm also not so keen on adding in fbatch myself). I didn't spend very
> long looking through the patches, but maybe this 07/28 is not essential?
The BUG would be explained by deferred_split_folio(): that is still using
list_empty(&folio->_deferred_list) to decide whether the folio needs to be
added to the _deferred_list (else is already there). With the 07/28 mods,
it's liable to add THP to the _deferred_list while deferred_split_scan()
holds that THP in its local fbatch. I haven't tried to go through all the
ways in which that may go horribly wrong (or be harmless), but one of them
is deferred_split_scan() after failed split doing a second list_add_tail()
on that THP: no! I won't think about fixes, I'll move on to other tasks.
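Sketching the interleaving described above (hypothetical timing, using only
calls quoted elsewhere in this thread):

	CPU0 (deferred_split_scan)                  CPU1 (deferred_split_folio)
	lock ds_queue->split_queue_lock
	folio_try_get(folio), add to local fbatch
	list_del_init(&folio->_deferred_list)
	unlock
	                                            takes the split queue lock
	                                            list_empty(&folio->_deferred_list) is true
	                                            list_add_tail() -> folio queued again
	                                            drops the lock
	split_folio() fails for that folio
	folio_split_queue_lock_irqsave(folio)
	list_add_tail(&folio->_deferred_list, ...)  <- second add of an already
	split_queue_unlock_irqrestore(fqueue)          queued entry: list corruption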
Or does that get changed in 08-28/28? I've not looked.
Hugh
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
2025-05-06 21:44 ` Hugh Dickins
@ 2025-05-07 3:30 ` Muchun Song
0 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-05-07 3:30 UTC (permalink / raw)
To: Hugh Dickins
Cc: Muchun Song, Johannes Weiner, mhocko, roman.gushchin,
shakeel.butt, akpm, david, zhengqi.arch, yosry.ahmed, nphamcs,
chengming.zhou, linux-kernel, cgroups, linux-mm, hamzamahfooz,
apais
> On May 7, 2025, at 05:44, Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 5 May 2025, Hugh Dickins wrote:
> ...
>>
>> However... I was intending to run it for 12 hours on the workstation,
>> but after 11 hours and 35 minutes, that crashed with list_del corruption,
>> kernel BUG at lib/list_debug.c:65! from deferred_split_scan()'s
>> list_del_init().
>>
>> I've not yet put together the explanation: I am deeply suspicious of
>> the change to when list_empty() becomes true (the block Hannes shows
>> above is not the only such: (__)folio_unqueue_deferred_split() and
>> migrate_pages_batch() consult it too), but each time I think I have
>> the explanation, it's ruled out by folio_try_get()'s reference.
>>
>> And aside from the crash (I don't suppose 6.15-rc5 is responsible,
>> or that patches 08-28/28 would fix it), I'm not so sure that this
>> patch is really an improvement (folio reference held for longer, and
>> list lock taken more often when split fails: maybe not important, but
>> I'm also not so keen on adding in fbatch myself). I didn't spend very
>> long looking through the patches, but maybe this 07/28 is not essential?
Hi Hugh,
Thank you for taking the time to look at this patch. 07/28 is actually a
necessary change in this series.
>
> The BUG would be explained by deferred_split_folio(): that is still using
> list_empty(&folio->_deferred_list) to decide whether the folio needs to be
> added to the _deferred_list (else is already there). With the 07/28 mods,
> it's liable to add THP to the _deferred_list while deferred_split_scan()
> holds that THP in its local fbatch. I haven't tried to go through all the
> ways in which that may go horribly wrong (or be harmless), but one of them
> is deferred_split_scan() after failed split doing a second list_add_tail()
> on that THP: no! I won't think about fixes, I'll move on to other tasks.
Thanks for your analysis. I'll look into it more deeply.
>
> Or does that get changed in 08-28/28? I've not looked.
No. 08-28/28 did not change anything related to THP _deferred_list.
Muchun,
Thanks.
>
> Hugh
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH RFC 08/28] mm: vmscan: refactor move_folios_to_lru()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (6 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 07/28] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-30 14:49 ` Johannes Weiner
2025-04-15 2:45 ` [PATCH RFC 09/28] mm: memcontrol: allocate object cgroup for non-kmem case Muchun Song
` (22 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In a subsequent patch, we'll reparent the LRU folios. The folios that are
moved to the appropriate LRU list can undergo reparenting during the
move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
a lruvec lock. Instead, we should utilize the more general interface of
folio_lruvec_relock_irq() to obtain the correct lruvec lock.
This patch involves only code refactoring and doesn't introduce any
functional changes.
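For context, a minimal sketch of the relock pattern relied on below
(illustrative only; the helper already exists in the memcg/LRU code and its
exact behaviour may differ in detail):

	/*
	 * Keep the currently held lock if the folio still belongs to the
	 * locked lruvec; otherwise drop it and take the lock of the folio's
	 * current lruvec. Passing a NULL lruvec simply takes the folio's lock.
	 */
	lruvec = folio_lruvec_relock_irq(folio, lruvec);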
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/vmscan.c | 51 +++++++++++++++++++++++++--------------------------
1 file changed, 25 insertions(+), 26 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a76b3cee043d..eac5e6e70660 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1901,24 +1901,27 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
/*
* move_folios_to_lru() moves folios from private @list to appropriate LRU list.
*
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate lruvec.
+ *
+ * Note: The caller must not hold any lruvec lock.
*/
-static unsigned int move_folios_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+static unsigned int move_folios_to_lru(struct list_head *list)
{
int nr_pages, nr_moved = 0;
+ struct lruvec *lruvec = NULL;
struct folio_batch free_folios;
folio_batch_init(&free_folios);
while (!list_empty(list)) {
struct folio *folio = lru_to_folio(list);
+ lruvec = folio_lruvec_relock_irq(folio, lruvec);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
list_del(&folio->lru);
if (unlikely(!folio_evictable(folio))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
folio_putback_lru(folio);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
continue;
}
@@ -1940,19 +1943,15 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
}
continue;
}
- /*
- * All pages were isolated from the same lruvec (and isolation
- * inhibits memcg migration).
- */
VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
@@ -1961,11 +1960,12 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
workingset_age_nonresident(lruvec, nr_pages);
}
+ if (lruvec)
+ lruvec_unlock_irq(lruvec);
+
if (free_folios.nr) {
- spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
- spin_lock_irq(&lruvec->lru_lock);
}
return nr_moved;
@@ -2033,9 +2033,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false);
- spin_lock_irq(&lruvec->lru_lock);
- move_folios_to_lru(lruvec, &folio_list);
+ move_folios_to_lru(&folio_list);
+ local_irq_disable();
__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
stat.nr_demoted);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -2044,7 +2044,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
__count_vm_events(item, nr_reclaimed);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();
lru_note_cost(lruvec, file, stat.nr_pageout, nr_scanned - nr_reclaimed);
@@ -2183,16 +2183,15 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move folios back to the lru list.
*/
- spin_lock_irq(&lruvec->lru_lock);
-
- nr_activate = move_folios_to_lru(lruvec, &l_active);
- nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
+ nr_activate = move_folios_to_lru(&l_active);
+ nr_deactivate = move_folios_to_lru(&l_inactive);
+ local_irq_disable();
__count_vm_events(PGDEACTIVATE, nr_deactivate);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();
if (nr_rotated)
lru_note_cost(lruvec, file, 0, nr_rotated);
@@ -4723,14 +4722,15 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
set_mask_bits(&folio->flags, LRU_REFS_FLAGS, BIT(PG_active));
}
- spin_lock_irq(&lruvec->lru_lock);
-
- move_folios_to_lru(lruvec, &list);
+ move_folios_to_lru(&list);
+ local_irq_disable();
walk = current->reclaim_state->mm_walk;
if (walk && walk->batched) {
walk->lruvec = lruvec;
+ spin_lock(&lruvec->lru_lock);
reset_batch_size(walk);
+ spin_unlock(&lruvec->lru_lock);
}
__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
@@ -4741,8 +4741,7 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap
__count_vm_events(item, reclaimed);
__count_memcg_events(memcg, item, reclaimed);
__count_vm_events(PGSTEAL_ANON + type, reclaimed);
-
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();
list_splice_init(&clean, &list);
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 08/28] mm: vmscan: refactor move_folios_to_lru()
2025-04-15 2:45 ` [PATCH RFC 08/28] mm: vmscan: refactor move_folios_to_lru() Muchun Song
@ 2025-04-30 14:49 ` Johannes Weiner
0 siblings, 0 replies; 69+ messages in thread
From: Johannes Weiner @ 2025-04-30 14:49 UTC (permalink / raw)
To: Muchun Song
Cc: mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:12AM +0800, Muchun Song wrote:
> In a subsequent patch, we'll reparent the LRU folios. The folios that are
> moved to the appropriate LRU list can undergo reparenting during the
> move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
> a lruvec lock. Instead, we should utilize the more general interface of
> folio_lruvec_relock_irq() to obtain the correct lruvec lock.
>
> This patch involves only code refactoring and doesn't introduce any
> functional changes.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH RFC 09/28] mm: memcontrol: allocate object cgroup for non-kmem case
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (7 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 08/28] mm: vmscan: refactor move_folios_to_lru() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 10/28] mm: memcontrol: return root object cgroup for root memory cgroup Muchun Song
` (21 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
Pagecache pages are charged at allocation time and hold a reference
to the original memory cgroup until reclaimed. Depending on memory
pressure, page sharing patterns between different cgroups and cgroup
creation/destruction rates, many dying memory cgroups can be pinned
by pagecache pages, reducing page reclaim efficiency and wasting
memory. Converting LRU folios and most other raw memory cgroup pins
over to object cgroups can fix this long-standing problem.
As a result, the objcg infrastructure is no longer solely applicable
to the kmem case. This patch extends its scope beyond kmem so that
LRU folios can reuse it for folio charging.
It should be noted that LRU folios are not accounted for at the root
level, yet folio->memcg_data still points to the root_mem_cgroup. Hence,
folio->memcg_data of an LRU folio always holds a valid pointer.
However, the root_mem_cgroup does not possess an object cgroup.
Therefore, we also allocate an object cgroup for the root_mem_cgroup.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/memcontrol.c | 50 +++++++++++++++++++++++--------------------------
1 file changed, 23 insertions(+), 27 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0fc76d50bc23..a6362d11b46c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -193,10 +193,10 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
- struct mem_cgroup *parent)
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
@@ -3156,30 +3156,17 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
return val;
}
-static int memcg_online_kmem(struct mem_cgroup *memcg)
+static void memcg_online_kmem(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg;
-
if (mem_cgroup_kmem_disabled())
- return 0;
+ return;
if (unlikely(mem_cgroup_is_root(memcg)))
- return 0;
-
- objcg = obj_cgroup_alloc();
- if (!objcg)
- return -ENOMEM;
-
- objcg->memcg = memcg;
- rcu_assign_pointer(memcg->objcg, objcg);
- obj_cgroup_get(objcg);
- memcg->orig_objcg = objcg;
+ return;
static_branch_enable(&memcg_kmem_online_key);
memcg->kmemcg_id = memcg->id.id;
-
- return 0;
}
static void memcg_offline_kmem(struct mem_cgroup *memcg)
@@ -3194,12 +3181,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
parent = parent_mem_cgroup(memcg);
memcg_reparent_list_lrus(memcg, parent);
-
- /*
- * Objcg's reparenting must be after list_lru's, make sure list_lru
- * helpers won't use parent's list_lru until child is drained.
- */
- memcg_reparent_objcgs(memcg, parent);
}
#ifdef CONFIG_CGROUP_WRITEBACK
@@ -3711,9 +3692,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct obj_cgroup *objcg;
- if (memcg_online_kmem(memcg))
- goto remove_id;
+ memcg_online_kmem(memcg);
/*
* A memcg must be visible for expand_shrinker_info()
@@ -3723,6 +3704,15 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (alloc_shrinker_info(memcg))
goto offline_kmem;
+ objcg = obj_cgroup_alloc();
+ if (!objcg)
+ goto free_shrinker;
+
+ objcg->memcg = memcg;
+ rcu_assign_pointer(memcg->objcg, objcg);
+ obj_cgroup_get(objcg);
+ memcg->orig_objcg = objcg;
+
if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
FLUSH_TIME);
@@ -3745,9 +3735,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
xa_store(&mem_cgroup_ids, memcg->id.id, memcg, GFP_KERNEL);
return 0;
+free_shrinker:
+ free_shrinker_info(memcg);
offline_kmem:
memcg_offline_kmem(memcg);
-remove_id:
mem_cgroup_id_remove(memcg);
return -ENOMEM;
}
@@ -3764,6 +3755,11 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
zswap_memcg_offline_cleanup(memcg);
memcg_offline_kmem(memcg);
+ /*
+ * Objcg's reparenting must be after list_lru's above, make sure list_lru
+ * helpers won't use parent's list_lru until child is drained.
+ */
+ memcg_reparent_objcgs(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
lru_gen_offline_memcg(memcg);
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 10/28] mm: memcontrol: return root object cgroup for root memory cgroup
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (8 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 09/28] mm: memcontrol: allocate object cgroup for non-kmem case Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-06-28 3:09 ` Chen Ridong
2025-04-15 2:45 ` [PATCH RFC 11/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Muchun Song
` (20 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
Memory cgroup functions such as get_mem_cgroup_from_folio() and
get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
even for the root memory cgroup. In contrast, the situation for
object cgroups has been different.
Previously, the root object cgroup couldn't be returned because
it didn't exist. Now that a valid root object cgroup exists, for
the sake of consistency, it's necessary to align the behavior of
object-cgroup-related operations with that of memory cgroup APIs.
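A minimal sketch of the intended caller-side effect (illustrative; the hunks
below are the authoritative changes):

	objcg = current_obj_cgroup();		/* now never returns NULL */
	if (obj_cgroup_is_root(objcg))		/* replaces the old "if (!objcg)" check */
		return true;			/* skip accounting at the root level */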
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/memcontrol.h | 29 ++++++++++++++++++-------
mm/memcontrol.c | 44 ++++++++++++++++++++------------------
mm/percpu.c | 2 +-
3 files changed, 45 insertions(+), 30 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bb4f203733f3..e74922d5755d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -319,6 +319,7 @@ struct mem_cgroup {
#define MEMCG_CHARGE_BATCH 64U
extern struct mem_cgroup *root_mem_cgroup;
+extern struct obj_cgroup *root_obj_cgroup;
enum page_memcg_data_flags {
/* page->memcg_data is a pointer to an slabobj_ext vector */
@@ -528,6 +529,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
}
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+ return objcg == root_obj_cgroup;
+}
+
static inline bool mem_cgroup_disabled(void)
{
return !cgroup_subsys_enabled(memory_cgrp_subsys);
@@ -752,23 +758,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
{
+ if (obj_cgroup_is_root(objcg))
+ return true;
return percpu_ref_tryget(&objcg->refcnt);
}
-static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
+ unsigned long nr)
{
- percpu_ref_get(&objcg->refcnt);
+ if (!obj_cgroup_is_root(objcg))
+ percpu_ref_get_many(&objcg->refcnt, nr);
}
-static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
- unsigned long nr)
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
{
- percpu_ref_get_many(&objcg->refcnt, nr);
+ obj_cgroup_get_many(objcg, 1);
}
static inline void obj_cgroup_put(struct obj_cgroup *objcg)
{
- if (objcg)
+ if (objcg && !obj_cgroup_is_root(objcg))
percpu_ref_put(&objcg->refcnt);
}
@@ -1101,6 +1110,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return true;
}
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+ return true;
+}
+
static inline bool mem_cgroup_disabled(void)
{
return true;
@@ -1684,8 +1698,7 @@ static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
{
struct obj_cgroup *objcg = current_obj_cgroup();
- if (objcg)
- obj_cgroup_get(objcg);
+ obj_cgroup_get(objcg);
return objcg;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a6362d11b46c..4aadc1b87db3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -81,6 +81,7 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly;
EXPORT_SYMBOL(memory_cgrp_subsys);
struct mem_cgroup *root_mem_cgroup __read_mostly;
+struct obj_cgroup *root_obj_cgroup __read_mostly;
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
@@ -2525,15 +2526,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg = NULL;
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
- for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
- objcg = rcu_dereference(memcg->objcg);
if (likely(objcg && obj_cgroup_tryget(objcg)))
- break;
- objcg = NULL;
+ return objcg;
}
- return objcg;
+
+ return NULL;
}
static struct obj_cgroup *current_objcg_update(void)
@@ -2604,18 +2604,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
* Objcg reference is kept by the task, so it's safe
* to use the objcg by the current task.
*/
- return objcg;
+ return objcg ? : root_obj_cgroup;
}
memcg = this_cpu_read(int_active_memcg);
if (unlikely(memcg))
goto from_memcg;
- return NULL;
+ return root_obj_cgroup;
from_memcg:
- objcg = NULL;
- for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
/*
* Memcg pointer is protected by scope (see set_active_memcg())
* and is pinning the corresponding objcg, so objcg can't go
@@ -2624,10 +2623,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
*/
objcg = rcu_dereference_check(memcg->objcg, 1);
if (likely(objcg))
- break;
+ return objcg;
}
- return objcg;
+ return root_obj_cgroup;
}
struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
@@ -2641,14 +2640,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
objcg = __folio_objcg(folio);
obj_cgroup_get(objcg);
} else {
- struct mem_cgroup *memcg;
-
rcu_read_lock();
- memcg = __folio_memcg(folio);
- if (memcg)
- objcg = __get_obj_cgroup_from_memcg(memcg);
- else
- objcg = NULL;
+ objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
rcu_read_unlock();
}
return objcg;
@@ -2733,7 +2726,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
int ret = 0;
objcg = current_obj_cgroup();
- if (objcg) {
+ if (!obj_cgroup_is_root(objcg)) {
ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
if (!ret) {
obj_cgroup_get(objcg);
@@ -3036,7 +3029,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
* obj_cgroup_get() is used to get a permanent reference.
*/
objcg = current_obj_cgroup();
- if (!objcg)
+ if (obj_cgroup_is_root(objcg))
return true;
/*
@@ -3708,6 +3701,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (!objcg)
goto free_shrinker;
+ if (unlikely(mem_cgroup_is_root(memcg)))
+ root_obj_cgroup = objcg;
+
objcg->memcg = memcg;
rcu_assign_pointer(memcg->objcg, objcg);
obj_cgroup_get(objcg);
@@ -5302,6 +5298,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ if (obj_cgroup_is_root(objcg))
+ return;
+
VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
/* PF_MEMALLOC context, charging must succeed */
@@ -5329,6 +5328,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ if (obj_cgroup_is_root(objcg))
+ return;
+
obj_cgroup_uncharge(objcg, size);
rcu_read_lock();
diff --git a/mm/percpu.c b/mm/percpu.c
index b35494c8ede2..3e54c6fca9bd 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
return true;
objcg = current_obj_cgroup();
- if (!objcg)
+ if (obj_cgroup_is_root(objcg))
return true;
if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 10/28] mm: memcontrol: return root object cgroup for root memory cgroup
2025-04-15 2:45 ` [PATCH RFC 10/28] mm: memcontrol: return root object cgroup for root memory cgroup Muchun Song
@ 2025-06-28 3:09 ` Chen Ridong
2025-06-30 7:16 ` Muchun Song
0 siblings, 1 reply; 69+ messages in thread
From: Chen Ridong @ 2025-06-28 3:09 UTC (permalink / raw)
To: Muchun Song, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, akpm, david, zhengqi.arch, yosry.ahmed, nphamcs,
chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On 2025/4/15 10:45, Muchun Song wrote:
> Memory cgroup functions such as get_mem_cgroup_from_folio() and
> get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
> even for the root memory cgroup. In contrast, the situation for
> object cgroups has been different.
>
> Previously, the root object cgroup couldn't be returned because
> it didn't exist. Now that a valid root object cgroup exists, for
> the sake of consistency, it's necessary to align the behavior of
> object-cgroup-related operations with that of memory cgroup APIs.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> include/linux/memcontrol.h | 29 ++++++++++++++++++-------
> mm/memcontrol.c | 44 ++++++++++++++++++++------------------
> mm/percpu.c | 2 +-
> 3 files changed, 45 insertions(+), 30 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index bb4f203733f3..e74922d5755d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -319,6 +319,7 @@ struct mem_cgroup {
> #define MEMCG_CHARGE_BATCH 64U
>
> extern struct mem_cgroup *root_mem_cgroup;
> +extern struct obj_cgroup *root_obj_cgroup;
>
> enum page_memcg_data_flags {
> /* page->memcg_data is a pointer to an slabobj_ext vector */
> @@ -528,6 +529,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> return (memcg == root_mem_cgroup);
> }
>
> +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
> +{
> + return objcg == root_obj_cgroup;
> +}
> +
> static inline bool mem_cgroup_disabled(void)
> {
> return !cgroup_subsys_enabled(memory_cgrp_subsys);
> @@ -752,23 +758,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
>
> static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
> {
> + if (obj_cgroup_is_root(objcg))
> + return true;
> return percpu_ref_tryget(&objcg->refcnt);
> }
>
> -static inline void obj_cgroup_get(struct obj_cgroup *objcg)
> +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
> + unsigned long nr)
> {
> - percpu_ref_get(&objcg->refcnt);
> + if (!obj_cgroup_is_root(objcg))
> + percpu_ref_get_many(&objcg->refcnt, nr);
> }
>
> -static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
> - unsigned long nr)
> +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
> {
> - percpu_ref_get_many(&objcg->refcnt, nr);
> + obj_cgroup_get_many(objcg, 1);
> }
>
> static inline void obj_cgroup_put(struct obj_cgroup *objcg)
> {
> - if (objcg)
> + if (objcg && !obj_cgroup_is_root(objcg))
> percpu_ref_put(&objcg->refcnt);
> }
>
> @@ -1101,6 +1110,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> return true;
> }
>
> +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
> +{
> + return true;
> +}
> +
> static inline bool mem_cgroup_disabled(void)
> {
> return true;
> @@ -1684,8 +1698,7 @@ static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> {
> struct obj_cgroup *objcg = current_obj_cgroup();
>
> - if (objcg)
> - obj_cgroup_get(objcg);
> + obj_cgroup_get(objcg);
>
> return objcg;
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a6362d11b46c..4aadc1b87db3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -81,6 +81,7 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly;
> EXPORT_SYMBOL(memory_cgrp_subsys);
>
> struct mem_cgroup *root_mem_cgroup __read_mostly;
> +struct obj_cgroup *root_obj_cgroup __read_mostly;
>
> /* Active memory cgroup to use from an interrupt context */
> DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> @@ -2525,15 +2526,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
>
> static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> {
> - struct obj_cgroup *objcg = NULL;
> + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> + struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
>
> - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> - objcg = rcu_dereference(memcg->objcg);
> if (likely(objcg && obj_cgroup_tryget(objcg)))
> - break;
> - objcg = NULL;
> + return objcg;
> }
> - return objcg;
> +
> + return NULL;
> }
>
It appears that the return NULL statement might be dead code in this
context. Would it be preferable to return root_obj_cgroup instead?
Best regards,
Ridong
> static struct obj_cgroup *current_objcg_update(void)
> @@ -2604,18 +2604,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
> * Objcg reference is kept by the task, so it's safe
> * to use the objcg by the current task.
> */
> - return objcg;
> + return objcg ? : root_obj_cgroup;
> }
>
> memcg = this_cpu_read(int_active_memcg);
> if (unlikely(memcg))
> goto from_memcg;
>
> - return NULL;
> + return root_obj_cgroup;
>
> from_memcg:
> - objcg = NULL;
> - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> /*
> * Memcg pointer is protected by scope (see set_active_memcg())
> * and is pinning the corresponding objcg, so objcg can't go
> @@ -2624,10 +2623,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
> */
> objcg = rcu_dereference_check(memcg->objcg, 1);
> if (likely(objcg))
> - break;
> + return objcg;
> }
>
> - return objcg;
> + return root_obj_cgroup;
> }
>
> struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
> @@ -2641,14 +2640,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
> objcg = __folio_objcg(folio);
> obj_cgroup_get(objcg);
> } else {
> - struct mem_cgroup *memcg;
> -
> rcu_read_lock();
> - memcg = __folio_memcg(folio);
> - if (memcg)
> - objcg = __get_obj_cgroup_from_memcg(memcg);
> - else
> - objcg = NULL;
> + objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
> rcu_read_unlock();
> }
> return objcg;
> @@ -2733,7 +2726,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> int ret = 0;
>
> objcg = current_obj_cgroup();
> - if (objcg) {
> + if (!obj_cgroup_is_root(objcg)) {
> ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
> if (!ret) {
> obj_cgroup_get(objcg);
> @@ -3036,7 +3029,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> * obj_cgroup_get() is used to get a permanent reference.
> */
> objcg = current_obj_cgroup();
> - if (!objcg)
> + if (obj_cgroup_is_root(objcg))
> return true;
>
> /*
> @@ -3708,6 +3701,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> if (!objcg)
> goto free_shrinker;
>
> + if (unlikely(mem_cgroup_is_root(memcg)))
> + root_obj_cgroup = objcg;
> +
> objcg->memcg = memcg;
> rcu_assign_pointer(memcg->objcg, objcg);
> obj_cgroup_get(objcg);
> @@ -5302,6 +5298,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return;
>
> + if (obj_cgroup_is_root(objcg))
> + return;
> +
> VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
>
> /* PF_MEMALLOC context, charging must succeed */
> @@ -5329,6 +5328,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
> if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> return;
>
> + if (obj_cgroup_is_root(objcg))
> + return;
> +
> obj_cgroup_uncharge(objcg, size);
>
> rcu_read_lock();
> diff --git a/mm/percpu.c b/mm/percpu.c
> index b35494c8ede2..3e54c6fca9bd 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> return true;
>
> objcg = current_obj_cgroup();
> - if (!objcg)
> + if (obj_cgroup_is_root(objcg))
> return true;
>
> if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: Re: [PATCH RFC 10/28] mm: memcontrol: return root object cgroup for root memory cgroup
2025-06-28 3:09 ` Chen Ridong
@ 2025-06-30 7:16 ` Muchun Song
0 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-06-30 7:16 UTC (permalink / raw)
To: Chen Ridong
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
muchun.song, Andrew Morton, Dave Chinner, Qi Zheng, yosry.ahmed,
Nhat Pham, chengming.zhou, LKML, Cgroups,
Linux Memory Management List, hamzamahfooz, apais
On Sat, Jun 28, 2025 at 11:09 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
>
>
> On 2025/4/15 10:45, Muchun Song wrote:
> > Memory cgroup functions such as get_mem_cgroup_from_folio() and
> > get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
> > even for the root memory cgroup. In contrast, the situation for
> > object cgroups has been different.
> >
> > Previously, the root object cgroup couldn't be returned because
> > it didn't exist. Now that a valid root object cgroup exists, for
> > the sake of consistency, it's necessary to align the behavior of
> > object-cgroup-related operations with that of memory cgroup APIs.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> > include/linux/memcontrol.h | 29 ++++++++++++++++++-------
> > mm/memcontrol.c | 44 ++++++++++++++++++++------------------
> > mm/percpu.c | 2 +-
> > 3 files changed, 45 insertions(+), 30 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index bb4f203733f3..e74922d5755d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -319,6 +319,7 @@ struct mem_cgroup {
> > #define MEMCG_CHARGE_BATCH 64U
> >
> > extern struct mem_cgroup *root_mem_cgroup;
> > +extern struct obj_cgroup *root_obj_cgroup;
> >
> > enum page_memcg_data_flags {
> > /* page->memcg_data is a pointer to an slabobj_ext vector */
> > @@ -528,6 +529,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> > return (memcg == root_mem_cgroup);
> > }
> >
> > +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
> > +{
> > + return objcg == root_obj_cgroup;
> > +}
> > +
> > static inline bool mem_cgroup_disabled(void)
> > {
> > return !cgroup_subsys_enabled(memory_cgrp_subsys);
> > @@ -752,23 +758,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
> >
> > static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
> > {
> > + if (obj_cgroup_is_root(objcg))
> > + return true;
> > return percpu_ref_tryget(&objcg->refcnt);
> > }
> >
> > -static inline void obj_cgroup_get(struct obj_cgroup *objcg)
> > +static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
> > + unsigned long nr)
> > {
> > - percpu_ref_get(&objcg->refcnt);
> > + if (!obj_cgroup_is_root(objcg))
> > + percpu_ref_get_many(&objcg->refcnt, nr);
> > }
> >
> > -static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
> > - unsigned long nr)
> > +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
> > {
> > - percpu_ref_get_many(&objcg->refcnt, nr);
> > + obj_cgroup_get_many(objcg, 1);
> > }
> >
> > static inline void obj_cgroup_put(struct obj_cgroup *objcg)
> > {
> > - if (objcg)
> > + if (objcg && !obj_cgroup_is_root(objcg))
> > percpu_ref_put(&objcg->refcnt);
> > }
> >
> > @@ -1101,6 +1110,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
> > return true;
> > }
> >
> > +static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
> > +{
> > + return true;
> > +}
> > +
> > static inline bool mem_cgroup_disabled(void)
> > {
> > return true;
> > @@ -1684,8 +1698,7 @@ static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> > {
> > struct obj_cgroup *objcg = current_obj_cgroup();
> >
> > - if (objcg)
> > - obj_cgroup_get(objcg);
> > + obj_cgroup_get(objcg);
> >
> > return objcg;
> > }
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index a6362d11b46c..4aadc1b87db3 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -81,6 +81,7 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly;
> > EXPORT_SYMBOL(memory_cgrp_subsys);
> >
> > struct mem_cgroup *root_mem_cgroup __read_mostly;
> > +struct obj_cgroup *root_obj_cgroup __read_mostly;
> >
> > /* Active memory cgroup to use from an interrupt context */
> > DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
> > @@ -2525,15 +2526,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
> >
> > static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
> > {
> > - struct obj_cgroup *objcg = NULL;
> > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > + struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
> >
> > - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> > - objcg = rcu_dereference(memcg->objcg);
> > if (likely(objcg && obj_cgroup_tryget(objcg)))
> > - break;
> > - objcg = NULL;
> > + return objcg;
> > }
> > - return objcg;
> > +
> > + return NULL;
> > }
> >
>
> It appears that the return NULL statement might be dead code in this
> context. And would it be preferable to use return root_obj_cgroup instead?
I do not think so. The @memcg parameter can be NULL when passed from
current_objcg_update(), so returning NULL in that case makes sense to me.
Returning root_obj_cgroup for a NULL memcg does not seem reasonable to me.
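For illustration, with the hunk above a NULL @memcg never enters the loop,
so the final return is reachable:

	/* "for (; memcg; ...)" is not entered, so NULL is returned */
	objcg = __get_obj_cgroup_from_memcg(NULL);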
Muchun,
Thanks.
>
> Best regards,
> Ridong
>
> > static struct obj_cgroup *current_objcg_update(void)
> > @@ -2604,18 +2604,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
> > * Objcg reference is kept by the task, so it's safe
> > * to use the objcg by the current task.
> > */
> > - return objcg;
> > + return objcg ? : root_obj_cgroup;
> > }
> >
> > memcg = this_cpu_read(int_active_memcg);
> > if (unlikely(memcg))
> > goto from_memcg;
> >
> > - return NULL;
> > + return root_obj_cgroup;
> >
> > from_memcg:
> > - objcg = NULL;
> > - for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> > + for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> > /*
> > * Memcg pointer is protected by scope (see set_active_memcg())
> > * and is pinning the corresponding objcg, so objcg can't go
> > @@ -2624,10 +2623,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
> > */
> > objcg = rcu_dereference_check(memcg->objcg, 1);
> > if (likely(objcg))
> > - break;
> > + return objcg;
> > }
> >
> > - return objcg;
> > + return root_obj_cgroup;
> > }
> >
> > struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
> > @@ -2641,14 +2640,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
> > objcg = __folio_objcg(folio);
> > obj_cgroup_get(objcg);
> > } else {
> > - struct mem_cgroup *memcg;
> > -
> > rcu_read_lock();
> > - memcg = __folio_memcg(folio);
> > - if (memcg)
> > - objcg = __get_obj_cgroup_from_memcg(memcg);
> > - else
> > - objcg = NULL;
> > + objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
> > rcu_read_unlock();
> > }
> > return objcg;
> > @@ -2733,7 +2726,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
> > int ret = 0;
> >
> > objcg = current_obj_cgroup();
> > - if (objcg) {
> > + if (!obj_cgroup_is_root(objcg)) {
> > ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
> > if (!ret) {
> > obj_cgroup_get(objcg);
> > @@ -3036,7 +3029,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
> > * obj_cgroup_get() is used to get a permanent reference.
> > */
> > objcg = current_obj_cgroup();
> > - if (!objcg)
> > + if (obj_cgroup_is_root(objcg))
> > return true;
> >
> > /*
> > @@ -3708,6 +3701,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
> > if (!objcg)
> > goto free_shrinker;
> >
> > + if (unlikely(mem_cgroup_is_root(memcg)))
> > + root_obj_cgroup = objcg;
> > +
> > objcg->memcg = memcg;
> > rcu_assign_pointer(memcg->objcg, objcg);
> > obj_cgroup_get(objcg);
> > @@ -5302,6 +5298,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
> > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > return;
> >
> > + if (obj_cgroup_is_root(objcg))
> > + return;
> > +
> > VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
> >
> > /* PF_MEMALLOC context, charging must succeed */
> > @@ -5329,6 +5328,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
> > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > return;
> >
> > + if (obj_cgroup_is_root(objcg))
> > + return;
> > +
> > obj_cgroup_uncharge(objcg, size);
> >
> > rcu_read_lock();
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index b35494c8ede2..3e54c6fca9bd 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
> > return true;
> >
> > objcg = current_obj_cgroup();
> > - if (!objcg)
> > + if (obj_cgroup_is_root(objcg))
> > return true;
> >
> > if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH RFC 11/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (9 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 10/28] mm: memcontrol: return root object cgroup for root memory cgroup Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 12/28] buffer: prevent memory cgroup release in folio_alloc_buffers() Muchun Song
` (19 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, the memory cgroup returned by
folio_memcg() may only be used while holding the rcu read lock or after
acquiring a reference to it, thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in get_mem_cgroup_from_folio().
This serves as a preparatory measure for the reparenting of the
LRU pages.
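A minimal sketch of the two safe usage patterns this refers to (illustrative
only; both helpers appear elsewhere in this series):

	/* short, non-sleeping access: RCU keeps the memcg from being released */
	rcu_read_lock();
	memcg = folio_memcg(folio);
	/* ... use memcg ... */
	rcu_read_unlock();

	/* longer-lived access: take and later drop a reference */
	memcg = get_mem_cgroup_from_folio(folio);
	/* ... use memcg, may sleep ... */
	mem_cgroup_put(memcg);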
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/memcontrol.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4aadc1b87db3..4802ce1f49a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -983,14 +983,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
*/
struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
if (mem_cgroup_disabled())
return NULL;
+ if (!folio_memcg_charged(folio))
+ return root_mem_cgroup;
+
rcu_read_lock();
- if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
- memcg = root_mem_cgroup;
+retry:
+ memcg = folio_memcg(folio);
+ if (unlikely(!css_tryget(&memcg->css)))
+ goto retry;
rcu_read_unlock();
return memcg;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 12/28] buffer: prevent memory cgroup release in folio_alloc_buffers()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (10 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 11/28] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 13/28] writeback: prevent memory cgroup release in writeback module Muchun Song
` (18 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, the memory cgroup returned by
folio_memcg() may only be used while holding the rcu read lock or after
acquiring a reference to it, thereby preventing it from being released.
In the current patch, the function get_mem_cgroup_from_folio() is
employed to safeguard against the release of the memory cgroup.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
fs/buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index c7abb4a029dc..d8dca9bf5e38 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -914,8 +914,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
long offset;
struct mem_cgroup *memcg, *old_memcg;
- /* The folio lock pins the memcg */
- memcg = folio_memcg(folio);
+ memcg = get_mem_cgroup_from_folio(folio);
old_memcg = set_active_memcg(memcg);
head = NULL;
@@ -936,6 +935,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
}
out:
set_active_memcg(old_memcg);
+ mem_cgroup_put(memcg);
return head;
/*
* In case anything failed, we just free everything we got.
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 13/28] writeback: prevent memory cgroup release in writeback module
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (11 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 12/28] buffer: prevent memory cgroup release in folio_alloc_buffers() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 14/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Muchun Song
` (17 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, the memory cgroup returned by
folio_memcg() may only be used while holding the rcu read lock or after
acquiring a reference to it, thereby preventing it from being released.
In the current patch, the function get_mem_cgroup_css_from_folio()
and the rcu read lock are employed to safeguard against the release
of the memory cgroup.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
fs/fs-writeback.c | 22 +++++++++++-----------
include/linux/memcontrol.h | 9 +++++++--
include/trace/events/writeback.h | 3 +++
mm/memcontrol.c | 14 ++++++++------
4 files changed, 29 insertions(+), 19 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index cc57367fb641..e3561d486bdb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -269,15 +269,13 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio)
if (inode_cgwb_enabled(inode)) {
struct cgroup_subsys_state *memcg_css;
- if (folio) {
- memcg_css = mem_cgroup_css_from_folio(folio);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- } else {
- /* must pin memcg_css, see wb_get_create() */
+ /* must pin memcg_css, see wb_get_create() */
+ if (folio)
+ memcg_css = get_mem_cgroup_css_from_folio(folio);
+ else
memcg_css = task_get_css(current, memory_cgrp_id);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- css_put(memcg_css);
- }
+ wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+ css_put(memcg_css);
}
if (!wb)
@@ -929,16 +927,16 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
if (!wbc->wb || wbc->no_cgroup_owner)
return;
- css = mem_cgroup_css_from_folio(folio);
+ css = get_mem_cgroup_css_from_folio(folio);
/* dead cgroups shouldn't contribute to inode ownership arbitration */
if (!(css->flags & CSS_ONLINE))
- return;
+ goto out;
id = css->id;
if (id == wbc->wb_id) {
wbc->wb_bytes += bytes;
- return;
+ goto out;
}
if (id == wbc->wb_lcand_id)
@@ -951,6 +949,8 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
wbc->wb_tcand_bytes += bytes;
else
wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
+out:
+ css_put(css);
}
EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e74922d5755d..a9ef2087c735 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -874,7 +874,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
return match;
}
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio);
ino_t page_cgroup_ino(struct page *page);
static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
@@ -1594,9 +1594,14 @@ static inline void mem_cgroup_track_foreign_dirty(struct folio *folio,
if (mem_cgroup_disabled())
return;
+ if (!folio_memcg_charged(folio))
+ return;
+
+ rcu_read_lock();
memcg = folio_memcg(folio);
- if (unlikely(memcg && &memcg->css != wb->memcg_css))
+ if (unlikely(&memcg->css != wb->memcg_css))
mem_cgroup_track_foreign_dirty_slowpath(folio, wb);
+ rcu_read_unlock();
}
void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 0ff388131fc9..99665c79856b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -266,7 +266,10 @@ TRACE_EVENT(track_foreign_dirty,
__entry->ino = inode ? inode->i_ino : 0;
__entry->memcg_id = wb->memcg_css->id;
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
+
+ rcu_read_lock();
__entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);
+ rcu_read_unlock();
),
TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4802ce1f49a4..09ecb5cb78f2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -229,7 +229,7 @@ DEFINE_STATIC_KEY_FALSE(memcg_bpf_enabled_key);
EXPORT_SYMBOL(memcg_bpf_enabled_key);
/**
- * mem_cgroup_css_from_folio - css of the memcg associated with a folio
+ * get_mem_cgroup_css_from_folio - acquire a css of the memcg associated with a folio
* @folio: folio of interest
*
* If memcg is bound to the default hierarchy, css of the memcg associated
@@ -239,14 +239,16 @@ EXPORT_SYMBOL(memcg_bpf_enabled_key);
* If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
* is returned.
*/
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio)
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
- if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
- memcg = root_mem_cgroup;
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return &root_mem_cgroup->css;
- return &memcg->css;
+ memcg = get_mem_cgroup_from_folio(folio);
+
+ return memcg ? &memcg->css : &root_mem_cgroup->css;
}
/**
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 14/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (12 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 13/28] writeback: prevent memory cgroup release in writeback module Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 15/28] mm: page_io: prevent memory cgroup release in page_io module Muchun Song
` (16 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in count_memcg_folio_events().
This serves as a preparatory measure for the reparenting of the
LRU pages.
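
For illustration only (this sketch is not part of the diff below), the two
access patterns that remain valid once a folio stops pinning its memcg look
roughly like the following, where get_mem_cgroup_from_folio() is assumed to
return a referenced memcg or NULL for an uncharged folio:

```
struct mem_cgroup *memcg;

/* Option 1: short, non-sleeping access under the rcu read lock. */
if (folio_memcg_charged(folio)) {
	rcu_read_lock();
	memcg = folio_memcg(folio);
	count_memcg_events(memcg, idx, nr);
	rcu_read_unlock();
}

/* Option 2: take a reference when the memcg must outlive an rcu section. */
memcg = get_mem_cgroup_from_folio(folio);
if (memcg) {
	count_memcg_events(memcg, idx, nr);
	css_put(&memcg->css);
}
```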
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/memcontrol.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a9ef2087c735..01239147eb11 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -978,10 +978,15 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
static inline void count_memcg_folio_events(struct folio *folio,
enum vm_event_item idx, unsigned long nr)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
- if (memcg)
- count_memcg_events(memcg, idx, nr);
+ if (!folio_memcg_charged(folio))
+ return;
+
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
+ count_memcg_events(memcg, idx, nr);
+ rcu_read_unlock();
}
static inline void count_memcg_events_mm(struct mm_struct *mm,
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 15/28] mm: page_io: prevent memory cgroup release in page_io module
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (13 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 14/28] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 16/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Muchun Song
` (15 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in swap_writepage() and
bio_associate_blkg_from_page().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/page_io.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index 4bce19df557b..5894e2ff97ef 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -280,10 +280,14 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
folio_unlock(folio);
return 0;
}
+
+ rcu_read_lock();
if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+ rcu_read_unlock();
folio_mark_dirty(folio);
return AOP_WRITEPAGE_ACTIVATE;
}
+ rcu_read_unlock();
__swap_writepage(folio, wbc);
return 0;
@@ -308,11 +312,11 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg;
- memcg = folio_memcg(folio);
- if (!memcg)
+ if (!folio_memcg_charged(folio))
return;
rcu_read_lock();
+ memcg = folio_memcg(folio);
css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 16/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (14 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 15/28] mm: page_io: prevent memory cgroup release in page_io module Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 17/28] mm: mglru: prevent memory cgroup release in mglru Muchun Song
` (14 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in folio_migrate_mapping().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/migrate.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/migrate.c b/mm/migrate.c
index f3ee6d8d5e2e..2ff1eaf39a9e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -565,6 +565,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
struct lruvec *old_lruvec, *new_lruvec;
struct mem_cgroup *memcg;
+ rcu_read_lock();
memcg = folio_memcg(folio);
old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
@@ -592,6 +593,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
__mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
}
+ rcu_read_unlock();
}
local_irq_enable();
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 17/28] mm: mglru: prevent memory cgroup release in mglru
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (15 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 16/28] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 18/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Muchun Song
` (13 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mglru.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/vmscan.c | 17 +++++++++++++----
1 file changed, 13 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eac5e6e70660..fbba14094c6d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3451,8 +3451,10 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
if (folio_nid(folio) != pgdat->node_id)
return NULL;
+ rcu_read_lock();
if (folio_memcg(folio) != memcg)
- return NULL;
+ folio = NULL;
+ rcu_read_unlock();
return folio;
}
@@ -4194,10 +4196,10 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long addr = pvmw->address;
struct vm_area_struct *vma = pvmw->vma;
struct folio *folio = pfn_folio(pvmw->pfn);
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
struct pglist_data *pgdat = folio_pgdat(folio);
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
+ struct lruvec *lruvec;
+ struct lru_gen_mm_state *mm_state;
DEFINE_MAX_SEQ(lruvec);
int gen = lru_gen_from_seq(max_seq);
@@ -4234,6 +4236,11 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
}
}
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ mm_state = get_mm_state(lruvec);
+
arch_enter_lazy_mmu_mode();
pte -= (addr - start) / PAGE_SIZE;
@@ -4270,6 +4277,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
arch_leave_lazy_mmu_mode();
+ rcu_read_unlock();
+
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 18/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (16 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 17/28] mm: mglru: prevent memory cgroup release in mglru Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 19/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Muchun Song
` (12 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mem_cgroup_swap_full().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/memcontrol.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 09ecb5cb78f2..694f19017699 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5097,17 +5097,21 @@ bool mem_cgroup_swap_full(struct folio *folio)
if (do_memsw_account())
return false;
- memcg = folio_memcg(folio);
- if (!memcg)
+ if (!folio_memcg_charged(folio))
return false;
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
unsigned long usage = page_counter_read(&memcg->swap);
if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
- usage * 2 >= READ_ONCE(memcg->swap.max))
+ usage * 2 >= READ_ONCE(memcg->swap.max)) {
+ rcu_read_unlock();
return true;
+ }
}
+ rcu_read_unlock();
return false;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 19/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (17 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 18/28] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 20/28] mm: workingset: prevent lruvec release in workingset_refault() Muchun Song
` (11 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in lru_gen_eviction().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/workingset.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index ebafc0eaafba..e14b9e33f161 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -241,11 +241,14 @@ static void *lru_gen_eviction(struct folio *folio)
int refs = folio_lru_refs(folio);
bool workingset = folio_test_workingset(folio);
int tier = lru_tier_from_refs(refs, workingset);
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
struct pglist_data *pgdat = folio_pgdat(folio);
+ unsigned short memcg_id;
BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
min_seq = READ_ONCE(lrugen->min_seq[type]);
@@ -253,8 +256,10 @@ static void *lru_gen_eviction(struct folio *folio)
hist = lru_hist_from_seq(min_seq);
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+ memcg_id = mem_cgroup_id(memcg);
+ rcu_read_unlock();
- return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+ return pack_shadow(memcg_id, pgdat, token, workingset);
}
/*
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 20/28] mm: workingset: prevent lruvec release in workingset_refault()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (18 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 19/28] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 21/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Muchun Song
` (10 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_refault().
This serves as a preparatory measure for the reparenting of the
LRU pages.
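
To make the rule concrete, here is a rough sketch (illustration only, not part
of the diff below) of the two usage modes this series ends up with; the
lock/unlock helpers are the ones used in later patches of the series:

```
/*
 * Statistics-only access: the rcu read lock keeps the lruvec alive,
 * but the folio may still be reparented concurrently.
 */
rcu_read_lock();
lruvec = folio_lruvec(folio);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
rcu_read_unlock();

/*
 * LRU list manipulation: folio_lruvec_lock*() additionally keeps the
 * folio -> lruvec binding stable until the matching unlock.
 */
lruvec = folio_lruvec_lock_irq(folio);
lruvec_del_folio(lruvec, folio);
lruvec_unlock_irq(lruvec);
```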
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/workingset.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index e14b9e33f161..ef89d18cb8cf 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
* locked to guarantee folio_memcg() stability throughout.
*/
nr = folio_nr_pages(folio);
+ rcu_read_lock();
lruvec = folio_lruvec(folio);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
- return;
+ goto out;
folio_set_active(folio);
workingset_age_nonresident(lruvec, nr);
@@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
lru_note_cost_refault(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
+out:
+ rcu_read_unlock();
}
/**
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 21/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (19 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 20/28] mm: workingset: prevent lruvec release in workingset_refault() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-17 17:39 ` Nhat Pham
2025-04-18 2:36 ` Chengming Zhou
2025-04-15 2:45 ` [PATCH RFC 22/28] mm: swap: prevent lruvec release in swap module Muchun Song
` (9 subsequent siblings)
30 siblings, 2 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in zswap_folio_swapin().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/zswap.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/zswap.c b/mm/zswap.c
index 204fb59da33c..4a41c2371f3d 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -752,8 +752,10 @@ void zswap_folio_swapin(struct folio *folio)
struct lruvec *lruvec;
if (folio) {
+ rcu_read_lock();
lruvec = folio_lruvec(folio);
atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins);
+ rcu_read_unlock();
}
}
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 21/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
2025-04-15 2:45 ` [PATCH RFC 21/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Muchun Song
@ 2025-04-17 17:39 ` Nhat Pham
2025-04-18 2:36 ` Chengming Zhou
1 sibling, 0 replies; 69+ messages in thread
From: Nhat Pham @ 2025-04-17 17:39 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais
On Mon, Apr 14, 2025 at 7:47 PM Muchun Song <songmuchun@bytedance.com> wrote:
>
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
>
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in zswap_folio_swapin().
>
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
No objections from my end. AFAICT, wrapping this in rcu should not
break things, and we're in the slow path (disk swapping) anyway, so
should not be a problem.
Anyway:
Acked-by: Nhat Pham <nphamcs@gmail.com>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 21/28] mm: zswap: prevent lruvec release in zswap_folio_swapin()
2025-04-15 2:45 ` [PATCH RFC 21/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Muchun Song
2025-04-17 17:39 ` Nhat Pham
@ 2025-04-18 2:36 ` Chengming Zhou
1 sibling, 0 replies; 69+ messages in thread
From: Chengming Zhou @ 2025-04-18 2:36 UTC (permalink / raw)
To: Muchun Song, hannes, mhocko, roman.gushchin, shakeel.butt,
muchun.song, akpm, david, zhengqi.arch, yosry.ahmed, nphamcs
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On 2025/4/15 10:45, Muchun Song wrote:
> In the near future, a folio will no longer pin its corresponding
> memory cgroup. So an lruvec returned by folio_lruvec() could be
> released without the rcu read lock or a reference to its memory
> cgroup.
>
> In the current patch, the rcu read lock is employed to safeguard
> against the release of the lruvec in zswap_folio_swapin().
>
> This serves as a preparatory measure for the reparenting of the
> LRU pages.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
It should be rare to race with the folio reparenting process, so
it seems ok not to "reparent" this counter "nr_disk_swapins".
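Just to make the trade-off concrete: "reparenting" it would only amount to
something like the hypothetical sketch below in the relocate step
(src_lruvec/dst_lruvec standing for the child and parent lruvecs), so it
could still be added later if the race ever turns out to matter:

```
/* Fold the child's pending count into the parent's lruvec. */
long nr = atomic_long_xchg(&src_lruvec->zswap_lruvec_state.nr_disk_swapins, 0);

atomic_long_add(nr, &dst_lruvec->zswap_lruvec_state.nr_disk_swapins);
```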
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Thanks.
> ---
> mm/zswap.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 204fb59da33c..4a41c2371f3d 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -752,8 +752,10 @@ void zswap_folio_swapin(struct folio *folio)
> struct lruvec *lruvec;
>
> if (folio) {
> + rcu_read_lock();
> lruvec = folio_lruvec(folio);
> atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins);
> + rcu_read_unlock();
> }
> }
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH RFC 22/28] mm: swap: prevent lruvec release in swap module
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (20 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 21/28] mm: zswap: prevent lruvec release in zswap_folio_swapin() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 23/28] mm: workingset: prevent lruvec release in workingset_activation() Muchun Song
` (8 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in lru_note_cost_refault() and
lru_activate().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/swap.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index ee19e171857d..fbf887578dbe 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -291,8 +291,10 @@ void lru_note_cost(struct lruvec *lruvec, bool file,
void lru_note_cost_refault(struct folio *folio)
{
+ rcu_read_lock();
lru_note_cost(folio_lruvec(folio), folio_is_file_lru(folio),
folio_nr_pages(folio), 0);
+ rcu_read_unlock();
}
static void lru_activate(struct lruvec *lruvec, struct folio *folio)
@@ -406,18 +408,20 @@ static void lru_gen_inc_refs(struct folio *folio)
static bool lru_gen_clear_refs(struct folio *folio)
{
- struct lru_gen_folio *lrugen;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
+ unsigned long seq;
if (gen < 0)
return true;
set_mask_bits(&folio->flags, LRU_REFS_FLAGS | BIT(PG_workingset), 0);
- lrugen = &folio_lruvec(folio)->lrugen;
+ rcu_read_lock();
+ seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]);
+ rcu_read_unlock();
/* whether can do without shuffling under the LRU lock */
- return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
+ return gen == lru_gen_from_seq(seq);
}
#else /* !CONFIG_LRU_GEN */
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 23/28] mm: workingset: prevent lruvec release in workingset_activation()
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (21 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 22/28] mm: swap: prevent lruvec release in swap module Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 24/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Muchun Song
` (7 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_activation().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/workingset.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index ef89d18cb8cf..ec625eb7db69 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -595,8 +595,11 @@ void workingset_activation(struct folio *folio)
* Filter non-memcg pages here, e.g. unmap can call
* mark_page_accessed() on VDSO pages.
*/
- if (mem_cgroup_disabled() || folio_memcg_charged(folio))
+ if (mem_cgroup_disabled() || folio_memcg_charged(folio)) {
+ rcu_read_lock();
workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
+ rcu_read_unlock();
+ }
}
/*
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 24/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (22 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 23/28] mm: workingset: prevent lruvec release in workingset_activation() Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 25/28] mm: thp: prepare for reparenting LRU pages for split queue lock Muchun Song
` (6 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
The following pseudocode illustrates how to ensure the safety of the folio
lruvec lock when LRU folios undergo reparenting.
In the folio_lruvec_lock(folio) function:
```
rcu_read_lock();
retry:
lruvec = folio_lruvec(folio);
/* There is a possibility of folio reparenting at this point. */
spin_lock(&lruvec->lru_lock);
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
/*
* The wrong lruvec lock was acquired, and a retry is required.
* This is because the folio resides on the parent memcg lruvec
* list.
*/
spin_unlock(&lruvec->lru_lock);
goto retry;
}
/* Reaching here indicates that folio_memcg() is stable. */
```
In the memcg_reparent_objcgs(memcg) function:
```
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);
/* Transfer folios from the lruvec list to the parent's. */
spin_unlock(&lruvec_parent->lru_lock);
spin_unlock(&lruvec->lru_lock);
```
After acquiring the lruvec lock, it is necessary to verify whether
the folio has been reparented. If reparenting has occurred, the new
lruvec lock must be reacquired. During the LRU folio reparenting
process, the lruvec lock will also be acquired (this will be
implemented in a subsequent patch). Therefore, folio_memcg() remains
unchanged while the lruvec lock is held.
Given that lruvec_memcg(lruvec) is always equal to folio_memcg(folio)
after the lruvec lock is acquired, the lruvec_memcg_debug() check is
redundant. Hence, it is removed.
This patch serves as a preparation for the reparenting of LRU folios.
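
From a caller's point of view, the resulting contract is roughly the
following (illustration only):

```
lruvec = folio_lruvec_lock_irqsave(folio, &flags);
/*
 * From here until the unlock, folio_memcg(folio) and
 * lruvec_memcg(lruvec) are guaranteed to agree, even if a
 * reparenting started concurrently.
 */
lruvec_del_folio(lruvec, folio);
lruvec_unlock_irqrestore(lruvec, flags);	/* also drops the rcu read lock */
```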
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/memcontrol.h | 23 ++++++-----------
mm/compaction.c | 29 ++++++++++++++++-----
mm/memcontrol.c | 53 +++++++++++++++++++-------------------
3 files changed, 58 insertions(+), 47 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 01239147eb11..27b23e464229 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -719,7 +719,11 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
* folio_lruvec - return lruvec for isolating/putting an LRU folio
* @folio: Pointer to the folio.
*
- * This function relies on folio->mem_cgroup being stable.
+ * The caller should hold the rcu read lock to protect the lruvec associated
+ * with the folio from being released. Note that holding the rcu lock does
+ * not prevent the folio's binding from being changed to a parent or ancestor
+ * lruvec (unlike folio_lruvec_lock(), which holds the LRU lock to prevent
+ * such a change).
*/
static inline struct lruvec *folio_lruvec(struct folio *folio)
{
@@ -742,15 +746,6 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
unsigned long *flags);
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio);
-#else
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-#endif
-
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1211,11 +1206,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio)
return &pgdat->__lruvec;
}
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-
static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
return NULL;
@@ -1532,17 +1522,20 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
static inline void lruvec_unlock(struct lruvec *lruvec)
{
spin_unlock(&lruvec->lru_lock);
+ rcu_read_unlock();
}
static inline void lruvec_unlock_irq(struct lruvec *lruvec)
{
spin_unlock_irq(&lruvec->lru_lock);
+ rcu_read_unlock();
}
static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
unsigned long flags)
{
spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ rcu_read_unlock();
}
/* Test requires a stable folio->memcg binding, see folio_memcg() */
diff --git a/mm/compaction.c b/mm/compaction.c
index ce45d633ddad..4abd1481d5de 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -551,6 +551,24 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
return true;
}
+static struct lruvec *
+compact_folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags,
+ struct compact_control *cc)
+{
+ struct lruvec *lruvec;
+
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
+ compact_lock_irqsave(&lruvec->lru_lock, flags, cc);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
+
+ return lruvec;
+}
+
/*
* Compaction requires the taking of some coarse locks that are potentially
* very heavily contended. The lock should be periodically unlocked to avoid
@@ -872,7 +890,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
{
pg_data_t *pgdat = cc->zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long flags = 0;
struct lruvec *locked = NULL;
struct folio *folio = NULL;
@@ -1189,18 +1207,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!folio_test_clear_lru(folio))
goto isolate_fail_put;
- lruvec = folio_lruvec(folio);
+ if (locked)
+ lruvec = folio_lruvec(folio);
/* If we already hold the lock, we can skip some rechecking */
- if (lruvec != locked) {
+ if (lruvec != locked || !locked) {
if (locked)
lruvec_unlock_irqrestore(locked, flags);
- compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+ lruvec = compact_folio_lruvec_lock_irqsave(folio, &flags, cc);
locked = lruvec;
- lruvec_memcg_debug(lruvec, folio);
-
/*
* Try get exclusive access under lock. If marked for
* skip, the scan is aborted unless the current context
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 694f19017699..1f0c6e7b69cc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1196,23 +1196,6 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
}
}
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
- struct mem_cgroup *memcg;
-
- if (mem_cgroup_disabled())
- return;
-
- memcg = folio_memcg(folio);
-
- if (!memcg)
- VM_BUG_ON_FOLIO(!mem_cgroup_is_root(lruvec_memcg(lruvec)), folio);
- else
- VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio);
-}
-#endif
-
/**
* folio_lruvec_lock - Lock the lruvec for a folio.
* @folio: Pointer to the folio.
@@ -1222,14 +1205,20 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
* - folio_test_lru false
* - folio frozen (refcount of 0)
*
- * Return: The lruvec this folio is on with its lock held.
+ * Return: The lruvec this folio is on with its lock held and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock(struct folio *folio)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock(&lruvec->lru_lock);
+ goto retry;
+ }
return lruvec;
}
@@ -1244,14 +1233,20 @@ struct lruvec *folio_lruvec_lock(struct folio *folio)
* - folio frozen (refcount of 0)
*
* Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock_irq(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ goto retry;
+ }
return lruvec;
}
@@ -1267,15 +1262,21 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
* - folio frozen (refcount of 0)
*
* Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
unsigned long *flags)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
return lruvec;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 25/28] mm: thp: prepare for reparenting LRU pages for split queue lock
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (23 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 24/28] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:45 ` [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops Muchun Song
` (5 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
Analogous to the mechanism employed for the lruvec lock, we adopt
an identical strategy to ensure the safety of the split queue lock
during the reparenting process of LRU folios.
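
For context, the recheck in the lock helpers below is only meaningful because
the reparenting side (added later in this series) is expected to switch
folio_memcg() while holding both split queue locks, roughly (src_queue and
dst_queue being the child's and parent's deferred_split queues):

```
spin_lock(&src_queue->split_queue_lock);
spin_lock_nested(&dst_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
/*
 * Move entries from src_queue->split_queue to dst_queue->split_queue
 * and repoint folio_memcg() of the affected folios to the parent.
 */
spin_unlock(&dst_queue->split_queue_lock);
spin_unlock(&src_queue->split_queue_lock);
```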
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
mm/huge_memory.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d2bc943a40e8..813334994f84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1100,8 +1100,14 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
{
struct deferred_split *queue;
+ rcu_read_lock();
+retry:
queue = folio_split_queue(folio);
spin_lock(&queue->split_queue_lock);
+ if (unlikely(folio_split_queue_memcg(folio, queue) != folio_memcg(folio))) {
+ spin_unlock(&queue->split_queue_lock);
+ goto retry;
+ }
return queue;
}
@@ -1111,8 +1117,14 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
{
struct deferred_split *queue;
+ rcu_read_lock();
+retry:
queue = folio_split_queue(folio);
spin_lock_irqsave(&queue->split_queue_lock, *flags);
+ if (unlikely(folio_split_queue_memcg(folio, queue) != folio_memcg(folio))) {
+ spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
+ goto retry;
+ }
return queue;
}
@@ -1120,12 +1132,14 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
static inline void split_queue_unlock(struct deferred_split *queue)
{
spin_unlock(&queue->split_queue_lock);
+ rcu_read_unlock();
}
static inline void split_queue_unlock_irqrestore(struct deferred_split *queue,
unsigned long flags)
{
spin_unlock_irqrestore(&queue->split_queue_lock, flags);
+ rcu_read_unlock();
}
static inline bool is_transparent_hugepage(const struct folio *folio)
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (24 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 25/28] mm: thp: prepare for reparenting LRU pages for split queue lock Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-06-30 12:47 ` Harry Yoo
2025-04-15 2:45 ` [PATCH RFC 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Muchun Song
` (4 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
In the previous patch, we established a method to ensure the safety of the
lruvec lock and the split queue lock during the reparenting of LRU folios.
The process involves the following steps:
memcg_reparent_objcgs(memcg)
1) lock
// lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);
2) relocate from current memcg to its parent
// Move all the pages from the lruvec list to the parent lruvec list.
3) unlock
spin_unlock(&lruvec_parent->lru_lock);
spin_unlock(&lruvec->lru_lock);
In addition to the folio lruvec lock, the deferred split queue lock
(specific to THP) also requires a similar approach. Therefore, we abstract
the three essential steps from the memcg_reparent_objcgs() function.
memcg_reparent_objcgs(memcg)
1) lock
memcg_reparent_ops->lock(memcg, parent);
2) relocate
memcg_reparent_ops->relocate(memcg, parent);
3) unlock
memcg_reparent_ops->unlock(memcg, parent);
Currently, two distinct locks (such as the lruvec lock and the deferred
split queue lock) need to utilize this infrastructure. In the subsequent
patch, we will employ these APIs to ensure the safety of these locks
during the reparenting of LRU folios.
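
As a sketch of how a second user is expected to plug into this infrastructure
(the lruvec callback names below are assumed; the real ones are added in a
later patch):

```
static void lruvec_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst);
static void lruvec_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst);
static void lruvec_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst);
static DEFINE_MEMCG_REPARENT_OPS(lruvec);

static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
	&memcg_objcg_reparent_ops,
	&memcg_lruvec_reparent_ops,
};
```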
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/memcontrol.h | 20 ++++++++++++
mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++--------
2 files changed, 69 insertions(+), 13 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 27b23e464229..0e450623f8fa 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -311,6 +311,26 @@ struct mem_cgroup {
struct mem_cgroup_per_node *nodeinfo[];
};
+struct memcg_reparent_ops {
+ /*
+ * Note that interrupt is disabled before calling those callbacks,
+ * so the interrupt should remain disabled when leaving those callbacks.
+ */
+ void (*lock)(struct mem_cgroup *src, struct mem_cgroup *dst);
+ void (*relocate)(struct mem_cgroup *src, struct mem_cgroup *dst);
+ void (*unlock)(struct mem_cgroup *src, struct mem_cgroup *dst);
+};
+
+#define DEFINE_MEMCG_REPARENT_OPS(name) \
+ const struct memcg_reparent_ops memcg_##name##_reparent_ops = { \
+ .lock = name##_reparent_lock, \
+ .relocate = name##_reparent_relocate, \
+ .unlock = name##_reparent_unlock, \
+ }
+
+#define DECLARE_MEMCG_REPARENT_OPS(name) \
+ extern const struct memcg_reparent_ops memcg_##name##_reparent_ops
+
/*
* size of first charge trial.
* TODO: maybe necessary to use big numbers in big irons or dynamic based of the
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f0c6e7b69cc..3fac51179186 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -194,24 +194,60 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
+static void objcg_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ spin_lock(&objcg_lock);
+}
+
+static void objcg_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst)
{
struct obj_cgroup *objcg, *iter;
- struct mem_cgroup *parent = parent_mem_cgroup(memcg);
- objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
+ objcg = rcu_replace_pointer(src->objcg, NULL, true);
+ /* 1) Ready to reparent active objcg. */
+ list_add(&objcg->list, &src->objcg_list);
+ /* 2) Reparent active objcg and already reparented objcgs to dst. */
+ list_for_each_entry(iter, &src->objcg_list, list)
+ WRITE_ONCE(iter->memcg, dst);
+ /* 3) Move already reparented objcgs to the dst's list */
+ list_splice(&src->objcg_list, &dst->objcg_list);
+}
- spin_lock_irq(&objcg_lock);
+static void objcg_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ spin_unlock(&objcg_lock);
+}
- /* 1) Ready to reparent active objcg. */
- list_add(&objcg->list, &memcg->objcg_list);
- /* 2) Reparent active objcg and already reparented objcgs to parent. */
- list_for_each_entry(iter, &memcg->objcg_list, list)
- WRITE_ONCE(iter->memcg, parent);
- /* 3) Move already reparented objcgs to the parent's list */
- list_splice(&memcg->objcg_list, &parent->objcg_list);
-
- spin_unlock_irq(&objcg_lock);
+static DEFINE_MEMCG_REPARENT_OPS(objcg);
+
+static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
+ &memcg_objcg_reparent_ops,
+};
+
+#define DEFINE_MEMCG_REPARENT_FUNC(phase) \
+ static void memcg_reparent_##phase(struct mem_cgroup *src, \
+ struct mem_cgroup *dst) \
+ { \
+ int i; \
+ \
+ for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) \
+ memcg_reparent_ops[i]->phase(src, dst); \
+ }
+
+DEFINE_MEMCG_REPARENT_FUNC(lock)
+DEFINE_MEMCG_REPARENT_FUNC(relocate)
+DEFINE_MEMCG_REPARENT_FUNC(unlock)
+
+static void memcg_reparent_objcgs(struct mem_cgroup *src)
+{
+ struct mem_cgroup *dst = parent_mem_cgroup(src);
+ struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
+
+ local_irq_disable();
+ memcg_reparent_lock(src, dst);
+ memcg_reparent_relocate(src, dst);
+ memcg_reparent_unlock(src, dst);
+ local_irq_enable();
percpu_ref_kill(&objcg->refcnt);
}
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops
2025-04-15 2:45 ` [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops Muchun Song
@ 2025-06-30 12:47 ` Harry Yoo
2025-07-01 22:12 ` Harry Yoo
0 siblings, 1 reply; 69+ messages in thread
From: Harry Yoo @ 2025-06-30 12:47 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:30AM +0800, Muchun Song wrote:
> In the previous patch, we established a method to ensure the safety of the
> lruvec lock and the split queue lock during the reparenting of LRU folios.
> The process involves the following steps:
>
> memcg_reparent_objcgs(memcg)
> 1) lock
> // lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
> spin_lock(&lruvec->lru_lock);
> spin_lock(&lruvec_parent->lru_lock);
>
> 2) relocate from current memcg to its parent
> // Move all the pages from the lruvec list to the parent lruvec list.
>
> 3) unlock
> spin_unlock(&lruvec_parent->lru_lock);
> spin_unlock(&lruvec->lru_lock);
>
> In addition to the folio lruvec lock, the deferred split queue lock
> (specific to THP) also requires a similar approach. Therefore, we abstract
> the three essential steps from the memcg_reparent_objcgs() function.
>
> memcg_reparent_objcgs(memcg)
> 1) lock
> memcg_reparent_ops->lock(memcg, parent);
>
> 2) relocate
> memcg_reparent_ops->relocate(memcg, parent);
>
> 3) unlock
> memcg_reparent_ops->unlock(memcg, parent);
>
> Currently, two distinct locks (such as the lruvec lock and the deferred
> split queue lock) need to utilize this infrastructure. In the subsequent
> patch, we will employ these APIs to ensure the safety of these locks
> during the reparenting of LRU folios.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> include/linux/memcontrol.h | 20 ++++++++++++
> mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++--------
> 2 files changed, 69 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 27b23e464229..0e450623f8fa 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -311,6 +311,26 @@ struct mem_cgroup {
> struct mem_cgroup_per_node *nodeinfo[];
> };
>
> +struct memcg_reparent_ops {
> + /*
> + * Note that interrupt is disabled before calling those callbacks,
> + * so the interrupt should remain disabled when leaving those callbacks.
> + */
> + void (*lock)(struct mem_cgroup *src, struct mem_cgroup *dst);
> + void (*relocate)(struct mem_cgroup *src, struct mem_cgroup *dst);
> + void (*unlock)(struct mem_cgroup *src, struct mem_cgroup *dst);
> +};
> +
> +#define DEFINE_MEMCG_REPARENT_OPS(name) \
> + const struct memcg_reparent_ops memcg_##name##_reparent_ops = { \
> + .lock = name##_reparent_lock, \
> + .relocate = name##_reparent_relocate, \
> + .unlock = name##_reparent_unlock, \
> + }
> +
> +#define DECLARE_MEMCG_REPARENT_OPS(name) \
> + extern const struct memcg_reparent_ops memcg_##name##_reparent_ops
> +
> /*
> * size of first charge trial.
> * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1f0c6e7b69cc..3fac51179186 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -194,24 +194,60 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
> return objcg;
> }
>
> -static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> +static void objcg_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst)
> +{
> + spin_lock(&objcg_lock);
> +}
> +
> +static void objcg_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst)
> {
> struct obj_cgroup *objcg, *iter;
> - struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>
> - objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> + objcg = rcu_replace_pointer(src->objcg, NULL, true);
> + /* 1) Ready to reparent active objcg. */
> + list_add(&objcg->list, &src->objcg_list);
> + /* 2) Reparent active objcg and already reparented objcgs to dst. */
> + list_for_each_entry(iter, &src->objcg_list, list)
> + WRITE_ONCE(iter->memcg, dst);
> + /* 3) Move already reparented objcgs to the dst's list */
> + list_splice(&src->objcg_list, &dst->objcg_list);
> +}
>
> - spin_lock_irq(&objcg_lock);
> +static void objcg_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst)
> +{
> + spin_unlock(&objcg_lock);
> +}
>
> - /* 1) Ready to reparent active objcg. */
> - list_add(&objcg->list, &memcg->objcg_list);
> - /* 2) Reparent active objcg and already reparented objcgs to parent. */
> - list_for_each_entry(iter, &memcg->objcg_list, list)
> - WRITE_ONCE(iter->memcg, parent);
> - /* 3) Move already reparented objcgs to the parent's list */
> - list_splice(&memcg->objcg_list, &parent->objcg_list);
> -
> - spin_unlock_irq(&objcg_lock);
> +static DEFINE_MEMCG_REPARENT_OPS(objcg);
> +
> +static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
> + &memcg_objcg_reparent_ops,
> +};
> +
> +#define DEFINE_MEMCG_REPARENT_FUNC(phase) \
> + static void memcg_reparent_##phase(struct mem_cgroup *src, \
> + struct mem_cgroup *dst) \
> + { \
> + int i; \
> + \
> + for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) \
> + memcg_reparent_ops[i]->phase(src, dst); \
> + }
> +
> +DEFINE_MEMCG_REPARENT_FUNC(lock)
> +DEFINE_MEMCG_REPARENT_FUNC(relocate)
> +DEFINE_MEMCG_REPARENT_FUNC(unlock)
> +
> +static void memcg_reparent_objcgs(struct mem_cgroup *src)
> +{
> + struct mem_cgroup *dst = parent_mem_cgroup(src);
> + struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
> +
> + local_irq_disable();
> + memcg_reparent_lock(src, dst);
> + memcg_reparent_relocate(src, dst);
> + memcg_reparent_unlock(src, dst);
> + local_irq_enable();
Hi,
It seems unnecessarily complicated to 1) acquire objcg, lruvec and
thp_sq locks, 2) call their ->relocate() callbacks, and
3) release those locks.
Why not simply do the following instead?
for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) {
local_irq_disable();
memcg_reparent_ops[i]->lock(src, dst);
memcg_reparent_ops[i]->relocate(src, dst);
memcg_reparent_ops[i]->unlock(src, dst);
local_irq_enable();
}
As there is no actual lock dependency between the three.
Or am I missing something important about the locking requirements?
--
Cheers,
Harry / Hyeonggon
>
> percpu_ref_kill(&objcg->refcnt);
> }
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops
2025-06-30 12:47 ` Harry Yoo
@ 2025-07-01 22:12 ` Harry Yoo
2025-07-07 9:29 ` [External] " Muchun Song
0 siblings, 1 reply; 69+ messages in thread
From: Harry Yoo @ 2025-07-01 22:12 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Mon, Jun 30, 2025 at 09:47:25PM +0900, Harry Yoo wrote:
> On Tue, Apr 15, 2025 at 10:45:30AM +0800, Muchun Song wrote:
> > In the previous patch, we established a method to ensure the safety of the
> > lruvec lock and the split queue lock during the reparenting of LRU folios.
> > The process involves the following steps:
> >
> > memcg_reparent_objcgs(memcg)
> > 1) lock
> > // lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
> > spin_lock(&lruvec->lru_lock);
> > spin_lock(&lruvec_parent->lru_lock);
> >
> > 2) relocate from current memcg to its parent
> > // Move all the pages from the lruvec list to the parent lruvec list.
> >
> > 3) unlock
> > spin_unlock(&lruvec_parent->lru_lock);
> > spin_unlock(&lruvec->lru_lock);
> >
> > In addition to the folio lruvec lock, the deferred split queue lock
> > (specific to THP) also requires a similar approach. Therefore, we abstract
> > the three essential steps from the memcg_reparent_objcgs() function.
> >
> > memcg_reparent_objcgs(memcg)
> > 1) lock
> > memcg_reparent_ops->lock(memcg, parent);
> >
> > 2) relocate
> > memcg_reparent_ops->relocate(memcg, reparent);
> >
> > 3) unlock
> > memcg_reparent_ops->unlock(memcg, reparent);
> >
> > Currently, two distinct locks (such as the lruvec lock and the deferred
> > split queue lock) need to utilize this infrastructure. In the subsequent
> > patch, we will employ these APIs to ensure the safety of these locks
> > during the reparenting of LRU folios.
> >
> > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > ---
> > include/linux/memcontrol.h | 20 ++++++++++++
> > mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++--------
> > 2 files changed, 69 insertions(+), 13 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 27b23e464229..0e450623f8fa 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -311,6 +311,26 @@ struct mem_cgroup {
> > struct mem_cgroup_per_node *nodeinfo[];
> > };
> >
> > +struct memcg_reparent_ops {
> > + /*
> > + * Note that interrupt is disabled before calling those callbacks,
> > + * so the interrupt should remain disabled when leaving those callbacks.
> > + */
> > + void (*lock)(struct mem_cgroup *src, struct mem_cgroup *dst);
> > + void (*relocate)(struct mem_cgroup *src, struct mem_cgroup *dst);
> > + void (*unlock)(struct mem_cgroup *src, struct mem_cgroup *dst);
> > +};
> > +
> > +#define DEFINE_MEMCG_REPARENT_OPS(name) \
> > + const struct memcg_reparent_ops memcg_##name##_reparent_ops = { \
> > + .lock = name##_reparent_lock, \
> > + .relocate = name##_reparent_relocate, \
> > + .unlock = name##_reparent_unlock, \
> > + }
> > +
> > +#define DECLARE_MEMCG_REPARENT_OPS(name) \
> > + extern const struct memcg_reparent_ops memcg_##name##_reparent_ops
> > +
> > /*
> > * size of first charge trial.
> > * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1f0c6e7b69cc..3fac51179186 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -194,24 +194,60 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
> > return objcg;
> > }
> >
> > -static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> > +static void objcg_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst)
> > +{
> > + spin_lock(&objcg_lock);
> > +}
> > +
> > +static void objcg_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst)
> > {
> > struct obj_cgroup *objcg, *iter;
> > - struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> >
> > - objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> > + objcg = rcu_replace_pointer(src->objcg, NULL, true);
> > + /* 1) Ready to reparent active objcg. */
> > + list_add(&objcg->list, &src->objcg_list);
> > + /* 2) Reparent active objcg and already reparented objcgs to dst. */
> > + list_for_each_entry(iter, &src->objcg_list, list)
> > + WRITE_ONCE(iter->memcg, dst);
> > + /* 3) Move already reparented objcgs to the dst's list */
> > + list_splice(&src->objcg_list, &dst->objcg_list);
> > +}
> >
> > - spin_lock_irq(&objcg_lock);
> > +static void objcg_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst)
> > +{
> > + spin_unlock(&objcg_lock);
> > +}
> >
> > - /* 1) Ready to reparent active objcg. */
> > - list_add(&objcg->list, &memcg->objcg_list);
> > - /* 2) Reparent active objcg and already reparented objcgs to parent. */
> > - list_for_each_entry(iter, &memcg->objcg_list, list)
> > - WRITE_ONCE(iter->memcg, parent);
> > - /* 3) Move already reparented objcgs to the parent's list */
> > - list_splice(&memcg->objcg_list, &parent->objcg_list);
> > -
> > - spin_unlock_irq(&objcg_lock);
> > +static DEFINE_MEMCG_REPARENT_OPS(objcg);
> > +
> > +static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
> > + &memcg_objcg_reparent_ops,
> > +};
> > +
> > +#define DEFINE_MEMCG_REPARENT_FUNC(phase) \
> > + static void memcg_reparent_##phase(struct mem_cgroup *src, \
> > + struct mem_cgroup *dst) \
> > + { \
> > + int i; \
> > + \
> > + for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) \
> > + memcg_reparent_ops[i]->phase(src, dst); \
> > + }
> > +
> > +DEFINE_MEMCG_REPARENT_FUNC(lock)
> > +DEFINE_MEMCG_REPARENT_FUNC(relocate)
> > +DEFINE_MEMCG_REPARENT_FUNC(unlock)
> > +
> > +static void memcg_reparent_objcgs(struct mem_cgroup *src)
> > +{
> > + struct mem_cgroup *dst = parent_mem_cgroup(src);
> > + struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
> > +
> > + local_irq_disable();
> > + memcg_reparent_lock(src, dst);
> > + memcg_reparent_relocate(src, dst);
> > + memcg_reparent_unlock(src, dst);
> > + local_irq_enable();
>
> Hi,
>
> It seems unnecessarily complicated to 1) acquire objcg, lruvec and
> thp_sq locks, 2) call their ->relocate() callbacks, and
> 3) release those locks.
>
> Why not simply do the following instead?
>
> for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) {
> local_irq_disable();
> memcg_reparent_ops[i]->lock(src, dst);
> memcg_reparent_ops[i]->relocate(src, dst);
> memcg_reparent_ops[i]->unlock(src, dst);
> local_irq_enable();
> }
>
> As there is no actual lock dependency between the three.
>
> Or am I missing something important about the locking requirements?
Hmm... looks like I was missing some important requirements!
It seems like:
1) objcg should be reparented under lruvec locks, otherwise
users can observe folio_memcg(folio) != lruvec_memcg(lruvec)
2) Similarly, lruvec_reparent_relocate() should reparent all folios
at once under lruvec locks, otherwise users can observe
folio_memcg(folio) != lruvec_memcg(lruvec) for some folios.
IoW, lruvec_reparent_relocate() cannot do something like this:
while (lruvec is not empty) {
move some pages;
unlock lruvec locks;
cond_resched();
lock lruvec locks;
}
Failing to satisfy 1) and 2) means users can't rely on a stable binding
between a folio and a memcg, which is a no-go.
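For reference, the reader side that depends on 1) and 2) is the retry loop
from the cover letter's folio_lruvec_lock() refactoring; a minimal sketch of
it (taken from the cover letter, not new code, kernel context assumed):

struct lruvec *folio_lruvec_lock(struct folio *folio)
{
	struct lruvec *lruvec;

	rcu_read_lock();
retry:
	lruvec = folio_lruvec(folio);
	spin_lock(&lruvec->lru_lock);
	/*
	 * If reparenting raced with us, the folio may already belong to the
	 * parent's lruvec; drop the wrong lock and retry against the new one.
	 */
	if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
		spin_unlock(&lruvec->lru_lock);
		goto retry;
	}
	return lruvec;
}

This check only works if the folio's memcg binding and its position on the
LRU move together under both lru_locks, which is what 1) and 2) guarantee.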
Also, 2) makes it quite undesirable to iterate over folios and move each
one to the right generation in MGLRU as this will certainly introduce
soft lockups as the memcg size grows...
Is my reasoning correct?
If so, adding a brief comment about 1 and 2 wouldn't hurt ;)
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [External] Re: [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops
2025-07-01 22:12 ` Harry Yoo
@ 2025-07-07 9:29 ` Muchun Song
2025-07-09 0:14 ` Harry Yoo
0 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-07-07 9:29 UTC (permalink / raw)
To: Harry Yoo
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Wed, Jul 2, 2025 at 6:13 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Jun 30, 2025 at 09:47:25PM +0900, Harry Yoo wrote:
> > On Tue, Apr 15, 2025 at 10:45:30AM +0800, Muchun Song wrote:
> > > In the previous patch, we established a method to ensure the safety of the
> > > lruvec lock and the split queue lock during the reparenting of LRU folios.
> > > The process involves the following steps:
> > >
> > > memcg_reparent_objcgs(memcg)
> > > 1) lock
> > > // lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
> > > spin_lock(&lruvec->lru_lock);
> > > spin_lock(&lruvec_parent->lru_lock);
> > >
> > > 2) relocate from current memcg to its parent
> > > // Move all the pages from the lruvec list to the parent lruvec list.
> > >
> > > 3) unlock
> > > spin_unlock(&lruvec_parent->lru_lock);
> > > spin_unlock(&lruvec->lru_lock);
> > >
> > > In addition to the folio lruvec lock, the deferred split queue lock
> > > (specific to THP) also requires a similar approach. Therefore, we abstract
> > > the three essential steps from the memcg_reparent_objcgs() function.
> > >
> > > memcg_reparent_objcgs(memcg)
> > > 1) lock
> > > memcg_reparent_ops->lock(memcg, parent);
> > >
> > > 2) relocate
> > > memcg_reparent_ops->relocate(memcg, reparent);
> > >
> > > 3) unlock
> > > memcg_reparent_ops->unlock(memcg, reparent);
> > >
> > > Currently, two distinct locks (such as the lruvec lock and the deferred
> > > split queue lock) need to utilize this infrastructure. In the subsequent
> > > patch, we will employ these APIs to ensure the safety of these locks
> > > during the reparenting of LRU folios.
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > ---
> > > include/linux/memcontrol.h | 20 ++++++++++++
> > > mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++--------
> > > 2 files changed, 69 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 27b23e464229..0e450623f8fa 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -311,6 +311,26 @@ struct mem_cgroup {
> > > struct mem_cgroup_per_node *nodeinfo[];
> > > };
> > >
> > > +struct memcg_reparent_ops {
> > > + /*
> > > + * Note that interrupt is disabled before calling those callbacks,
> > > + * so the interrupt should remain disabled when leaving those callbacks.
> > > + */
> > > + void (*lock)(struct mem_cgroup *src, struct mem_cgroup *dst);
> > > + void (*relocate)(struct mem_cgroup *src, struct mem_cgroup *dst);
> > > + void (*unlock)(struct mem_cgroup *src, struct mem_cgroup *dst);
> > > +};
> > > +
> > > +#define DEFINE_MEMCG_REPARENT_OPS(name) \
> > > + const struct memcg_reparent_ops memcg_##name##_reparent_ops = { \
> > > + .lock = name##_reparent_lock, \
> > > + .relocate = name##_reparent_relocate, \
> > > + .unlock = name##_reparent_unlock, \
> > > + }
> > > +
> > > +#define DECLARE_MEMCG_REPARENT_OPS(name) \
> > > + extern const struct memcg_reparent_ops memcg_##name##_reparent_ops
> > > +
> > > /*
> > > * size of first charge trial.
> > > * TODO: maybe necessary to use big numbers in big irons or dynamic based of the
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 1f0c6e7b69cc..3fac51179186 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -194,24 +194,60 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
> > > return objcg;
> > > }
> > >
> > > -static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> > > +static void objcg_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst)
> > > +{
> > > + spin_lock(&objcg_lock);
> > > +}
> > > +
> > > +static void objcg_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst)
> > > {
> > > struct obj_cgroup *objcg, *iter;
> > > - struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> > >
> > > - objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> > > + objcg = rcu_replace_pointer(src->objcg, NULL, true);
> > > + /* 1) Ready to reparent active objcg. */
> > > + list_add(&objcg->list, &src->objcg_list);
> > > + /* 2) Reparent active objcg and already reparented objcgs to dst. */
> > > + list_for_each_entry(iter, &src->objcg_list, list)
> > > + WRITE_ONCE(iter->memcg, dst);
> > > + /* 3) Move already reparented objcgs to the dst's list */
> > > + list_splice(&src->objcg_list, &dst->objcg_list);
> > > +}
> > >
> > > - spin_lock_irq(&objcg_lock);
> > > +static void objcg_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst)
> > > +{
> > > + spin_unlock(&objcg_lock);
> > > +}
> > >
> > > - /* 1) Ready to reparent active objcg. */
> > > - list_add(&objcg->list, &memcg->objcg_list);
> > > - /* 2) Reparent active objcg and already reparented objcgs to parent. */
> > > - list_for_each_entry(iter, &memcg->objcg_list, list)
> > > - WRITE_ONCE(iter->memcg, parent);
> > > - /* 3) Move already reparented objcgs to the parent's list */
> > > - list_splice(&memcg->objcg_list, &parent->objcg_list);
> > > -
> > > - spin_unlock_irq(&objcg_lock);
> > > +static DEFINE_MEMCG_REPARENT_OPS(objcg);
> > > +
> > > +static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
> > > + &memcg_objcg_reparent_ops,
> > > +};
> > > +
> > > +#define DEFINE_MEMCG_REPARENT_FUNC(phase) \
> > > + static void memcg_reparent_##phase(struct mem_cgroup *src, \
> > > + struct mem_cgroup *dst) \
> > > + { \
> > > + int i; \
> > > + \
> > > + for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) \
> > > + memcg_reparent_ops[i]->phase(src, dst); \
> > > + }
> > > +
> > > +DEFINE_MEMCG_REPARENT_FUNC(lock)
> > > +DEFINE_MEMCG_REPARENT_FUNC(relocate)
> > > +DEFINE_MEMCG_REPARENT_FUNC(unlock)
> > > +
> > > +static void memcg_reparent_objcgs(struct mem_cgroup *src)
> > > +{
> > > + struct mem_cgroup *dst = parent_mem_cgroup(src);
> > > + struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
> > > +
> > > + local_irq_disable();
> > > + memcg_reparent_lock(src, dst);
> > > + memcg_reparent_relocate(src, dst);
> > > + memcg_reparent_unlock(src, dst);
> > > + local_irq_enable();
> >
> > Hi,
> >
> > It seems unnecessarily complicated to 1) acquire objcg, lruvec and
> > thp_sq locks, 2) call their ->relocate() callbacks, and
> > 3) release those locks.
> >
> > Why not simply do the following instead?
> >
> > for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) {
> > local_irq_disable();
> > memcg_reparent_ops[i]->lock(src, dst);
> > memcg_reparent_ops[i]->relocate(src, dst);
> > memcg_reparent_ops[i]->unlock(src, dst);
> > local_irq_enable();
> > }
> >
> > As there is no actual lock dependency between the three.
> >
> > Or am I missing something important about the locking requirements?
>
> Hmm... looks like I was missing some important requirements!
>
> It seems like:
>
> 1) objcg should be reparented under lruvec locks, otherwise
> users can observe folio_memcg(folio) != lruvec_memcg(lruvec)
>
> 2) Similarly, lruvec_reparent_relocate() should reparent all folios
> at once under lruvec locks, otherwise users can observe
> folio_memcg(folio) != lruvec_memcg(lruvec) for some folios.
>
> IoW, lruvec_reparent_relocate() cannot do something like this:
> while (lruvec is not empty) {
> move some pages;
> unlock lruvec locks;
> cond_resched();
> lock lruvec locks;
> }
>
> Failing to satisfy 1) and 2) means user can't rely on a stable binding
> between a folio and a memcg, which is a no-go.
>
> Also, 2) makes it quite undesirable to iterate over folios and move each
> one to the right generation in MGLRU as this will certainly introduce
> soft lockups as the memcg size grows...
Sorry for the late reply. Yes, you are right. We should iterate each folio
without holding any spinlocks. So I am checking whether we could do some
preparation work to make it suitable for reparenting without holding any
spinlock. Then we could reparent those folios like the non-MGLRU case.
Thanks.
>
> Is my reasoning correct?
> If so, adding a brief comment about 1 and 2 wouldn't hurt ;)
OK.
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [External] Re: [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops
2025-07-07 9:29 ` [External] " Muchun Song
@ 2025-07-09 0:14 ` Harry Yoo
0 siblings, 0 replies; 69+ messages in thread
From: Harry Yoo @ 2025-07-09 0:14 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Mon, Jul 07, 2025 at 05:29:22PM +0800, Muchun Song wrote:
> On Wed, Jul 2, 2025 at 6:13 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > Hi,
> > >
> > > It seems unnecessarily complicated to 1) acquire objcg, lruvec and
> > > thp_sq locks, 2) call their ->relocate() callbacks, and
> > > 3) release those locks.
> > >
> > > Why not simply do the following instead?
> > >
> > > for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++) {
> > > local_irq_disable();
> > > memcg_reparent_ops[i]->lock(src, dst);
> > > memcg_reparent_ops[i]->relocate(src, dst);
> > > memcg_reparent_ops[i]->unlock(src, dst);
> > > local_irq_enable();
> > > }
> > >
> > > As there is no actual lock dependency between the three.
> > >
> > > Or am I missing something important about the locking requirements?
> >
> > Hmm... looks like I was missing some important requirements!
> >
> > It seems like:
> >
> > 1) objcg should be reparented under lruvec locks, otherwise
> > users can observe folio_memcg(folio) != lruvec_memcg(lruvec)
> >
> > 2) Similarly, lruvec_reparent_relocate() should reparent all folios
> > at once under lruvec locks, otherwise users can observe
> > folio_memcg(folio) != lruvec_memcg(lruvec) for some folios.
> >
> > IoW, lruvec_reparent_relocate() cannot do something like this:
> > while (lruvec is not empty) {
> > move some pages;
> > unlock lruvec locks;
> > cond_resched();
> > lock lruvec locks;
> > }
> >
> > Failing to satisfy 1) and 2) means user can't rely on a stable binding
> > between a folio and a memcg, which is a no-go.
> >
> > Also, 2) makes it quite undesirable to iterate over folios and move each
> > one to the right generation in MGLRU as this will certainly introduce
> > soft lockups as the memcg size grows...
>
> Sorry for the late reply.
No problem, thanks for replying!
> Yes, you are right. We should iterate each folio
> without holding any spinlocks.
Wait, did you decide to iterate each folio and update the generation
anyway? I assumed you're still in the move-folios-at-once-without-iteration
camp.
> So I am checking whether we could do some
> preparation work to make it suitable for reparenting without holding any
> spinlock. Then we could reparent those folios like the non-MGLRU case.
You mean something like taking folios off the LRU under the spinlock and
then iterating over them without it?
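Something like this rough sketch, just to make sure we're talking about the
same idea (the helper name and flow are hypothetical, not taken from the
series):

static void detach_lru_for_reparent(struct lruvec *src, enum lru_list lru,
				    struct list_head *detached)
{
	/* Take the folios off the source LRU while holding its lock... */
	spin_lock_irq(&src->lru_lock);
	list_splice_init(&src->lists[lru], detached);
	spin_unlock_irq(&src->lru_lock);

	/*
	 * ...then the detached folios could be walked without any lru_lock
	 * held (e.g. to fix up MGLRU generation counters) before being
	 * spliced into the parent's lruvec.
	 */
}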
Anyway, I'd be more than happy to review future revisions of the series.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH RFC 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (25 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 26/28] mm: memcontrol: introduce memcg_reparent_ops Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-05-20 11:27 ` Harry Yoo
2025-04-15 2:45 ` [PATCH RFC 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Muchun Song
` (3 subsequent siblings)
30 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
Pagecache pages are charged at allocation time and hold a reference
to the original memory cgroup until reclaimed. Depending on memory
pressure, page sharing patterns between different cgroups and cgroup
creation/destruction rates, many dying memory cgroups can be pinned
by pagecache pages, reducing page reclaim efficiency and wasting
memory. Converting LRU folios and most other raw memory cgroup pins
over to object cgroups can fix this long-standing problem.
Finally, folio->memcg_data of LRU folios and kmem folios will always
point to an object cgroup pointer. The folio->memcg_data of slab
folios will point to a vector of object cgroups.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
---
include/linux/memcontrol.h | 78 +++++--------
mm/huge_memory.c | 33 ++++++
mm/memcontrol-v1.c | 15 ++-
mm/memcontrol.c | 228 +++++++++++++++++++++++++------------
4 files changed, 222 insertions(+), 132 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0e450623f8fa..7b1279963c0c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -368,9 +368,6 @@ enum objext_flags {
#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
#ifdef CONFIG_MEMCG
-
-static inline bool folio_memcg_kmem(struct folio *folio);
-
/*
* After the initialization objcg->memcg is always pointing at
* a valid memcg, but can be atomically swapped to the parent memcg.
@@ -384,43 +381,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
}
/*
- * __folio_memcg - Get the memory cgroup associated with a non-kmem folio
- * @folio: Pointer to the folio.
- *
- * Returns a pointer to the memory cgroup associated with the folio,
- * or NULL. This function assumes that the folio is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * kmem folios.
- */
-static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
-{
- unsigned long memcg_data = folio->memcg_data;
-
- VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
-
- return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
-}
-
-/*
- * __folio_objcg - get the object cgroup associated with a kmem folio.
+ * folio_objcg - get the object cgroup associated with a folio.
* @folio: Pointer to the folio.
*
* Returns a pointer to the object cgroup associated with the folio,
* or NULL. This function assumes that the folio is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * LRU folios.
+ * proper object cgroup pointer.
*/
-static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
+static inline struct obj_cgroup *folio_objcg(struct folio *folio)
{
unsigned long memcg_data = folio->memcg_data;
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
- VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
@@ -434,21 +407,31 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
* proper memory cgroup pointer. It's not safe to call this function
* against some type of folios, e.g. slab folios or ex-slab folios.
*
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * For a folio any of the following ensures folio and objcg binding stability:
*
* - the folio lock
* - LRU isolation
* - exclusive reference
*
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * Based on the stable binding of folio and objcg, for a folio any of the
+ * following ensures folio and memcg binding stability:
+ *
+ * - cgroup_mutex
+ * - the lruvec lock
+ * - the split queue lock (only THP page)
+ *
+ * If the caller only want to ensure that the page counters of memcg are
+ * updated correctly, ensure that the binding stability of folio and objcg
+ * is sufficient.
+ *
+ * Note: The caller should hold an rcu read lock or cgroup_mutex to protect
+ * memcg associated with a folio from being released.
*/
static inline struct mem_cgroup *folio_memcg(struct folio *folio)
{
- if (folio_memcg_kmem(folio))
- return obj_cgroup_memcg(__folio_objcg(folio));
- return __folio_memcg(folio);
+ struct obj_cgroup *objcg = folio_objcg(folio);
+
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
/*
@@ -472,15 +455,10 @@ static inline bool folio_memcg_charged(struct folio *folio)
* has an associated memory cgroup pointer or an object cgroups vector or
* an object cgroup.
*
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * The page and objcg or memcg binding rules can refer to folio_memcg().
*
- * - the folio lock
- * - LRU isolation
- * - exclusive reference
- *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * A caller should hold an rcu read lock to protect memcg associated with a
+ * page from being released.
*/
static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
{
@@ -489,18 +467,14 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
* for slabs, READ_ONCE() should be used here.
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);
+ struct obj_cgroup *objcg;
if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;
- if (memcg_data & MEMCG_DATA_KMEM) {
- struct obj_cgroup *objcg;
-
- objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
- return obj_cgroup_memcg(objcg);
- }
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
- return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 813334994f84..0236020de5b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1075,6 +1075,39 @@ static inline struct deferred_split *folio_memcg_split_queue(struct folio *folio
return memcg ? &memcg->deferred_split_queue : NULL;
}
+
+static void thp_sq_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ spin_lock(&src->deferred_split_queue.split_queue_lock);
+ spin_lock_nested(&dst->deferred_split_queue.split_queue_lock,
+ SINGLE_DEPTH_NESTING);
+}
+
+static void thp_sq_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ int nid;
+ struct deferred_split *src_queue, *dst_queue;
+
+ src_queue = &src->deferred_split_queue;
+ dst_queue = &dst->deferred_split_queue;
+
+ if (!src_queue->split_queue_len)
+ return;
+
+ list_splice_tail_init(&src_queue->split_queue, &dst_queue->split_queue);
+ dst_queue->split_queue_len += src_queue->split_queue_len;
+ src_queue->split_queue_len = 0;
+
+ for_each_node(nid)
+ set_shrinker_bit(dst, nid, deferred_split_shrinker->id);
+}
+
+static void thp_sq_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ spin_unlock(&dst->deferred_split_queue.split_queue_lock);
+ spin_unlock(&src->deferred_split_queue.split_queue_lock);
+}
+DEFINE_MEMCG_REPARENT_OPS(thp_sq);
#else
static inline
struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 8660908850dc..fb060e5c28ca 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -591,6 +591,7 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
void memcg1_swapout(struct folio *folio, swp_entry_t entry)
{
struct mem_cgroup *memcg, *swap_memcg;
+ struct obj_cgroup *objcg;
unsigned int nr_entries;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
@@ -602,12 +603,13 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
if (!do_memsw_account())
return;
- memcg = folio_memcg(folio);
-
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
+ objcg = folio_objcg(folio);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+ if (!objcg)
return;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
/*
* In case the memcg owning these pages has been offlined and doesn't
* have an ID allocated to it anymore, charge the closest online
@@ -625,7 +627,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
folio_unqueue_deferred_split(folio);
folio->memcg_data = 0;
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
page_counter_uncharge(&memcg->memory, nr_entries);
if (memcg != swap_memcg) {
@@ -646,7 +648,8 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
preempt_enable_nested();
memcg1_check_events(memcg, folio_nid(folio));
- css_put(&memcg->css);
+ rcu_read_unlock();
+ obj_cgroup_put(objcg);
}
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3fac51179186..1381a9e97ec5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -220,8 +220,78 @@ static void objcg_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst
static DEFINE_MEMCG_REPARENT_OPS(objcg);
+static void lruvec_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ int nid, nest = 0;
+
+ for_each_node(nid) {
+ spin_lock_nested(&mem_cgroup_lruvec(src,
+ NODE_DATA(nid))->lru_lock, nest++);
+ spin_lock_nested(&mem_cgroup_lruvec(dst,
+ NODE_DATA(nid))->lru_lock, nest++);
+ }
+}
+
+static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
+ enum lru_list lru)
+{
+ int zid;
+ struct mem_cgroup_per_node *mz_src, *mz_dst;
+
+ mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
+ mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
+
+ if (lru != LRU_UNEVICTABLE)
+ list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
+ mz_src->lru_zone_size[zid][lru] = 0;
+ }
+}
+
+static void lruvec_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ int nid;
+
+ for_each_node(nid) {
+ enum lru_list lru;
+ struct lruvec *src_lruvec, *dst_lruvec;
+
+ src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
+ dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
+
+ dst_lruvec->anon_cost += src_lruvec->anon_cost;
+ dst_lruvec->file_cost += src_lruvec->file_cost;
+
+ for_each_lru(lru)
+ lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
+ }
+}
+
+static void lruvec_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ int nid;
+
+ for_each_node(nid) {
+ spin_unlock(&mem_cgroup_lruvec(dst, NODE_DATA(nid))->lru_lock);
+ spin_unlock(&mem_cgroup_lruvec(src, NODE_DATA(nid))->lru_lock);
+ }
+}
+
+static DEFINE_MEMCG_REPARENT_OPS(lruvec);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+DECLARE_MEMCG_REPARENT_OPS(thp_sq);
+#endif
+
+/* The lock order depends on the order of elements in this array. */
static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
&memcg_objcg_reparent_ops,
+ &memcg_lruvec_reparent_ops,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ &memcg_thp_sq_reparent_ops,
+#endif
};
#define DEFINE_MEMCG_REPARENT_FUNC(phase) \
@@ -1018,6 +1088,8 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
/**
* get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
* @folio: folio from which memcg should be extracted.
+ *
+ * The page and objcg or memcg binding rules can refer to folio_memcg().
*/
struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
{
@@ -2489,17 +2561,17 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return try_charge_memcg(memcg, gfp_mask, nr_pages);
}
-static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
+static void commit_charge(struct folio *folio, struct obj_cgroup *objcg)
{
VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio);
/*
- * Any of the following ensures page's memcg stability:
+ * Any of the following ensures page's objcg stability:
*
* - the page lock
* - LRU isolation
* - exclusive reference
*/
- folio->memcg_data = (unsigned long)memcg;
+ folio->memcg_data = (unsigned long)objcg;
}
static inline void __mod_objcg_mlstate(struct obj_cgroup *objcg,
@@ -2580,6 +2652,17 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
return NULL;
}
+static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+ struct obj_cgroup *objcg;
+
+ rcu_read_lock();
+ objcg = __get_obj_cgroup_from_memcg(memcg);
+ rcu_read_unlock();
+
+ return objcg;
+}
+
static struct obj_cgroup *current_objcg_update(void)
{
struct mem_cgroup *memcg;
@@ -2677,17 +2760,10 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
{
struct obj_cgroup *objcg;
- if (!memcg_kmem_online())
- return NULL;
-
- if (folio_memcg_kmem(folio)) {
- objcg = __folio_objcg(folio);
+ objcg = folio_objcg(folio);
+ if (objcg)
obj_cgroup_get(objcg);
- } else {
- rcu_read_lock();
- objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
- rcu_read_unlock();
- }
+
return objcg;
}
@@ -3168,7 +3244,7 @@ void folio_split_memcg_refs(struct folio *folio, unsigned old_order,
return;
new_refs = (1 << (old_order - new_order)) - 1;
- css_get_many(&__folio_memcg(folio)->css, new_refs);
+ obj_cgroup_get_many(folio_objcg(folio), new_refs);
}
unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
@@ -4616,16 +4692,20 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
- int ret;
-
- ret = try_charge(memcg, gfp, folio_nr_pages(folio));
- if (ret)
- goto out;
+ int ret = 0;
+ struct obj_cgroup *objcg;
- css_get(&memcg->css);
- commit_charge(folio, memcg);
+ objcg = get_obj_cgroup_from_memcg(memcg);
+ /* Do not account at the root objcg level. */
+ if (!obj_cgroup_is_root(objcg))
+ ret = try_charge(memcg, gfp, folio_nr_pages(folio));
+ if (ret) {
+ obj_cgroup_put(objcg);
+ return ret;
+ }
+ commit_charge(folio, objcg);
memcg1_commit_charge(folio, memcg);
-out:
+
return ret;
}
@@ -4711,7 +4791,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
}
struct uncharge_gather {
- struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
unsigned long nr_memory;
unsigned long pgpgout;
unsigned long nr_kmem;
@@ -4725,60 +4805,54 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)
static void uncharge_batch(const struct uncharge_gather *ug)
{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(ug->objcg);
if (ug->nr_memory) {
- page_counter_uncharge(&ug->memcg->memory, ug->nr_memory);
+ page_counter_uncharge(&memcg->memory, ug->nr_memory);
if (do_memsw_account())
- page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory);
+ page_counter_uncharge(&memcg->memsw, ug->nr_memory);
if (ug->nr_kmem) {
- mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem);
- memcg1_account_kmem(ug->memcg, -ug->nr_kmem);
+ mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem);
+ memcg1_account_kmem(memcg, -ug->nr_kmem);
}
- memcg1_oom_recover(ug->memcg);
+ memcg1_oom_recover(memcg);
}
- memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+ memcg1_uncharge_batch(memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+ rcu_read_unlock();
/* drop reference from uncharge_folio */
- css_put(&ug->memcg->css);
+ obj_cgroup_put(ug->objcg);
}
static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
{
long nr_pages;
- struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
/*
* Nobody should be changing or seriously looking at
- * folio memcg or objcg at this point, we have fully
- * exclusive access to the folio.
+ * folio objcg at this point, we have fully exclusive
+ * access to the folio.
*/
- if (folio_memcg_kmem(folio)) {
- objcg = __folio_objcg(folio);
- /*
- * This get matches the put at the end of the function and
- * kmem pages do not hold memcg references anymore.
- */
- memcg = get_mem_cgroup_from_objcg(objcg);
- } else {
- memcg = __folio_memcg(folio);
- }
-
- if (!memcg)
+ objcg = folio_objcg(folio);
+ if (!objcg)
return;
- if (ug->memcg != memcg) {
- if (ug->memcg) {
+ if (ug->objcg != objcg) {
+ if (ug->objcg) {
uncharge_batch(ug);
uncharge_gather_clear(ug);
}
- ug->memcg = memcg;
+ ug->objcg = objcg;
ug->nid = folio_nid(folio);
- /* pairs with css_put in uncharge_batch */
- css_get(&memcg->css);
+ /* pairs with obj_cgroup_put in uncharge_batch */
+ obj_cgroup_get(objcg);
}
nr_pages = folio_nr_pages(folio);
@@ -4786,20 +4860,17 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
if (folio_memcg_kmem(folio)) {
ug->nr_memory += nr_pages;
ug->nr_kmem += nr_pages;
-
- folio->memcg_data = 0;
- obj_cgroup_put(objcg);
} else {
/* LRU pages aren't accounted at the root level */
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
ug->nr_memory += nr_pages;
ug->pgpgout++;
WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
- folio->memcg_data = 0;
}
- css_put(&memcg->css);
+ folio->memcg_data = 0;
+ obj_cgroup_put(objcg);
}
void __mem_cgroup_uncharge(struct folio *folio)
@@ -4823,7 +4894,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
uncharge_gather_clear(&ug);
for (i = 0; i < folios->nr; i++)
uncharge_folio(folios->folios[i], &ug);
- if (ug.memcg)
+ if (ug.objcg)
uncharge_batch(&ug);
}
@@ -4840,6 +4911,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
{
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
long nr_pages = folio_nr_pages(new);
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
@@ -4854,21 +4926,24 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
if (folio_memcg_charged(new))
return;
- memcg = folio_memcg(old);
- VM_WARN_ON_ONCE_FOLIO(!memcg, old);
- if (!memcg)
+ objcg = folio_objcg(old);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, old);
+ if (!objcg)
return;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
/* Force-charge the new page. The old one will be freed soon */
- if (!mem_cgroup_is_root(memcg)) {
+ if (!obj_cgroup_is_root(objcg)) {
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
}
- css_get(&memcg->css);
- commit_charge(new, memcg);
+ obj_cgroup_get(objcg);
+ commit_charge(new, objcg);
memcg1_commit_charge(new, memcg);
+ rcu_read_unlock();
}
/**
@@ -4884,7 +4959,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
*/
void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
- struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
@@ -4895,18 +4970,18 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
if (mem_cgroup_disabled())
return;
- memcg = folio_memcg(old);
+ objcg = folio_objcg(old);
/*
- * Note that it is normal to see !memcg for a hugetlb folio.
+ * Note that it is normal to see !objcg for a hugetlb folio.
* For e.g, it could have been allocated when memory_hugetlb_accounting
* was not selected.
*/
- VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
- if (!memcg)
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !objcg, old);
+ if (!objcg)
return;
- /* Transfer the charge and the css ref */
- commit_charge(new, memcg);
+ /* Transfer the charge and the objcg ref */
+ commit_charge(new, objcg);
/* Warning should never happen, so don't worry about refcount non-0 */
WARN_ON_ONCE(folio_unqueue_deferred_split(old));
@@ -5049,22 +5124,27 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
unsigned int nr_pages = folio_nr_pages(folio);
struct page_counter *counter;
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
if (do_memsw_account())
return 0;
- memcg = folio_memcg(folio);
-
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
+ objcg = folio_objcg(folio);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+ if (!objcg)
return 0;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
if (!entry.val) {
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
+ rcu_read_unlock();
return 0;
}
memcg = mem_cgroup_id_get_online(memcg);
+ /* memcg is pinned by memcg ID. */
+ rcu_read_unlock();
if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2025-04-15 2:45 ` [PATCH RFC 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Muchun Song
@ 2025-05-20 11:27 ` Harry Yoo
2025-05-22 2:31 ` Muchun Song
0 siblings, 1 reply; 69+ messages in thread
From: Harry Yoo @ 2025-05-20 11:27 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:31AM +0800, Muchun Song wrote:
> Pagecache pages are charged at allocation time and hold a reference
> to the original memory cgroup until reclaimed. Depending on memory
> pressure, page sharing patterns between different cgroups and cgroup
> creation/destruction rates, many dying memory cgroups can be pinned
> by pagecache pages, reducing page reclaim efficiency and wasting
> memory. Converting LRU folios and most other raw memory cgroup pins
> to the object cgroup direction can fix this long-living problem.
>
> Finally, folio->memcg_data of LRU folios and kmem folios will always
> point to an object cgroup pointer. The folio->memcg_data of slab
> folios will point to an vector of object cgroups.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> ---
> include/linux/memcontrol.h | 78 +++++--------
> mm/huge_memory.c | 33 ++++++
> mm/memcontrol-v1.c | 15 ++-
> mm/memcontrol.c | 228 +++++++++++++++++++++++++------------
> 4 files changed, 222 insertions(+), 132 deletions(-)
[...]
> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
> + enum lru_list lru)
> +{
> + int zid;
> + struct mem_cgroup_per_node *mz_src, *mz_dst;
> +
> + mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
> + mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
> +
> + if (lru != LRU_UNEVICTABLE)
> + list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
> +
> + for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> + mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
> + mz_src->lru_zone_size[zid][lru] = 0;
> + }
> +}
I think this function should also update memcg and lruvec stats of
parent memcg? Or is it intentional?
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2025-05-20 11:27 ` Harry Yoo
@ 2025-05-22 2:31 ` Muchun Song
2025-05-23 1:24 ` Harry Yoo
0 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-05-22 2:31 UTC (permalink / raw)
To: Harry Yoo
Cc: Muchun Song, hannes, mhocko, roman.gushchin, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
> On May 20, 2025, at 19:27, Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Apr 15, 2025 at 10:45:31AM +0800, Muchun Song wrote:
>> Pagecache pages are charged at allocation time and hold a reference
>> to the original memory cgroup until reclaimed. Depending on memory
>> pressure, page sharing patterns between different cgroups and cgroup
>> creation/destruction rates, many dying memory cgroups can be pinned
>> by pagecache pages, reducing page reclaim efficiency and wasting
>> memory. Converting LRU folios and most other raw memory cgroup pins
>> to the object cgroup direction can fix this long-living problem.
>>
>> Finally, folio->memcg_data of LRU folios and kmem folios will always
>> point to an object cgroup pointer. The folio->memcg_data of slab
>> folios will point to an vector of object cgroups.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> ---
>> include/linux/memcontrol.h | 78 +++++--------
>> mm/huge_memory.c | 33 ++++++
>> mm/memcontrol-v1.c | 15 ++-
>> mm/memcontrol.c | 228 +++++++++++++++++++++++++------------
>> 4 files changed, 222 insertions(+), 132 deletions(-)
>
> [...]
>
>> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
>> + enum lru_list lru)
>> +{
>> + int zid;
>> + struct mem_cgroup_per_node *mz_src, *mz_dst;
>> +
>> + mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
>> + mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
>> +
>> + if (lru != LRU_UNEVICTABLE)
>> + list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
>> +
>> + for (zid = 0; zid < MAX_NR_ZONES; zid++) {
>> + mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
>> + mz_src->lru_zone_size[zid][lru] = 0;
>> + }
>> +}
>
> I think this function should also update memcg and lruvec stats of
> parent memcg? Or is it intentional?
Hi Harry,
No, there is no need, because the statistics are accounted hierarchically.
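(Roughly, and only as a conceptual illustration rather than the real kernel
code, hierarchical accounting behaves like the loop below, so the parent's
totals already cover the child's folios; the 'stat' counter is hypothetical:)

static void mod_memcg_stat_hierarchical(struct mem_cgroup *memcg, long nr)
{
	/* Every ancestor up to the root sees the same update. */
	for (; memcg; memcg = parent_mem_cgroup(memcg))
		atomic_long_add(nr, &memcg->stat);	/* hypothetical field */
}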
Thanks.
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2025-05-22 2:31 ` Muchun Song
@ 2025-05-23 1:24 ` Harry Yoo
0 siblings, 0 replies; 69+ messages in thread
From: Harry Yoo @ 2025-05-23 1:24 UTC (permalink / raw)
To: Muchun Song
Cc: Muchun Song, hannes, mhocko, roman.gushchin, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Thu, May 22, 2025 at 10:31:20AM +0800, Muchun Song wrote:
>
>
> > On May 20, 2025, at 19:27, Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Tue, Apr 15, 2025 at 10:45:31AM +0800, Muchun Song wrote:
> >> Pagecache pages are charged at allocation time and hold a reference
> >> to the original memory cgroup until reclaimed. Depending on memory
> >> pressure, page sharing patterns between different cgroups and cgroup
> >> creation/destruction rates, many dying memory cgroups can be pinned
> >> by pagecache pages, reducing page reclaim efficiency and wasting
> >> memory. Converting LRU folios and most other raw memory cgroup pins
> >> to the object cgroup direction can fix this long-living problem.
> >>
> >> Finally, folio->memcg_data of LRU folios and kmem folios will always
> >> point to an object cgroup pointer. The folio->memcg_data of slab
> >> folios will point to an vector of object cgroups.
> >>
> >> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> >> ---
> >> include/linux/memcontrol.h | 78 +++++--------
> >> mm/huge_memory.c | 33 ++++++
> >> mm/memcontrol-v1.c | 15 ++-
> >> mm/memcontrol.c | 228 +++++++++++++++++++++++++------------
> >> 4 files changed, 222 insertions(+), 132 deletions(-)
> >
> > [...]
> >
> >> +static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
> >> + enum lru_list lru)
> >> +{
> >> + int zid;
> >> + struct mem_cgroup_per_node *mz_src, *mz_dst;
> >> +
> >> + mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
> >> + mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
> >> +
> >> + if (lru != LRU_UNEVICTABLE)
> >> + list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
> >> +
> >> + for (zid = 0; zid < MAX_NR_ZONES; zid++) {
> >> + mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
> >> + mz_src->lru_zone_size[zid][lru] = 0;
> >> + }
> >> +}
> >
> > I think this function should also update memcg and lruvec stats of
> > parent memcg? Or is it intentional?
>
> Hi Harry,
>
> No. Do not need. Because the statistics are accounted hierarchically.
>
> Thanks.
Oh, you are absolutely right. I was missing that.
Thanks!
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread
* [PATCH RFC 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (26 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 27/28] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Muchun Song
@ 2025-04-15 2:45 ` Muchun Song
2025-04-15 2:53 ` [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (2 subsequent siblings)
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:45 UTC (permalink / raw)
To: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou
Cc: linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, Muchun Song
We must ensure the folio is deleted from or added to the correct lruvec list.
So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users. The VM_BUG_ON_FOLIO()
in move_folios_to_lru() can be removed as lruvec_add_folio() will perform
the necessary check.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/mm_inline.h | 6 ++++++
mm/vmscan.c | 1 -
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f9157a0c42a5..f36491c42ace 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -341,6 +341,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_add_folio(lruvec, folio, false))
return;
@@ -355,6 +357,8 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_add_folio(lruvec, folio, true))
return;
@@ -369,6 +373,8 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_del_folio(lruvec, folio, false))
return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index fbba14094c6d..a59268bf4112 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1952,7 +1952,6 @@ static unsigned int move_folios_to_lru(struct list_head *list)
continue;
}
- VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
nr_moved += nr_pages;
--
2.20.1
^ permalink raw reply related [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (27 preceding siblings ...)
2025-04-15 2:45 ` [PATCH RFC 28/28] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Muchun Song
@ 2025-04-15 2:53 ` Muchun Song
2025-04-15 6:19 ` Kairui Song
2025-05-23 1:23 ` Harry Yoo
30 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-04-15 2:53 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, akpm, david,
zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou, linux-kernel,
cgroups, linux-mm, hamzamahfooz, apais, yuzhao
> On Apr 15, 2025, at 10:45, Muchun Song <songmuchun@bytedance.com> wrote:
>
> This patchset is based on v6.15-rc2. It functions correctly only when
> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
> during rebasing onto the latest code. For more details and assistance, refer
> to the "Challenges" section. This is the reason for adding the RFC tag.
Sorry, I forgot to CC Yu Zhao. Now I've included him. I think he may
offer valuable input on this aspect.
Muchun,
Thanks.
>
> ## Introduction
>
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already been
> reached regarding this approach recently [1].
>
> ## Background
>
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the issue of dying memory
> cgroup. We have exerted greater efforts to tackle this problem by
> introducing the infrastructure of object cgroup [2].
>
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continues to hold a reference to the original memory cgroup
> until reclaimed.
>
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency
> of page reclamation.
>
> ## Fundamentals
>
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following four steps are taken to achieve these goals.
>
> 1. The first step to be taken is to identify all users of both functions
> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> stability and implement appropriate measures (such as holding a RCU read
> lock or temporarily obtaining a reference to the memory cgroup for a
> brief period) to prevent the release of the memory cgroup.
>
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
> how to ensure the binding stability from the user's perspective of
> folio_lruvec().
>
> struct lruvec *folio_lruvec_lock(struct folio *folio)
> {
> struct lruvec *lruvec;
>
> rcu_read_lock();
> retry:
> lruvec = folio_lruvec(folio);
> spin_lock(&lruvec->lru_lock);
> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> spin_unlock(&lruvec->lru_lock);
> goto retry;
> }
>
> return lruvec;
> }
>
> From the perspective of memory cgroup removal, the entire reparenting
> process (altering the binding relationship between folio and its memory
> cgroup and moving the LRU lists to its parental memory cgroup) should be
> carried out under both the lruvec lock of the memory cgroup being removed
> and the lruvec lock of its parent.
>
> 3. Thirdly, another lock that requires the same approach is the split-queue
> lock of THP.
>
> 4. Finally, transfer the LRU pages to the object cgroup without holding a
> reference to the original memory cgroup.
>
> ## Challenges
>
> In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
> LRU lists (i.e., two active lists for anonymous and file folios, and two
> inactive lists for anonymous and file folios). Due to the symmetry of the
> LRU lists, it is feasible to transfer the LRU lists from a memory cgroup
> to its parent memory cgroup during the reparenting process.
>
> In a MGLRU scenario, each lruvec of every memory cgroup comprises at least
> 2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS) generations.
>
> 1. The first question is how to move the LRU lists from a memory cgroup to
> its parent memory cgroup during the reparenting process. This is due to
> the fact that the quantity of LRU lists (aka generations) may differ
> between a child memory cgroup and its parent memory cgroup.
>
> 2. The second question is how to make the process of reparenting more
> efficient, since each folio charged to a memory cgroup stores its
> generation counter into its ->flags. And the generation counter may
> differ between a child memory cgroup and its parent memory cgroup because
> the values of ->min_seq and ->max_seq are not identical. Should those
> generation counters be updated correspondingly?
>
> I am uncertain about how to handle them appropriately as I am not an
> expert at MGLRU. I would appreciate it if you could offer some suggestions.
> Moreover, if you are willing to directly provide your patches, I would be
> glad to incorporate them into this patchset.
>
> ## Compositions
>
> Patches 1-8 involve code refactoring and cleanup with the aim of
> facilitating the transfer LRU folios to object cgroup infrastructures.
>
> Patches 9-10 aim to allocate the object cgroup for non-kmem scenarios,
> enabling the ability that LRU folios could be charged to it and aligning
> the behavior of object-cgroup-related APIs with that of the memory cgroup.
>
> Patches 11-19 aim to prevent memory cgroup returned by folio_memcg() from
> being released.
>
> Patches 20-23 aim to prevent lruvec returned by folio_lruvec() from being
> released.
>
> Patches 24-25 implement the core mechanism to guarantee binding stability
> between the folio and its corresponding memory cgroup while holding lruvec
> lock or split-queue lock of THP.
>
> Patches 26-27 are intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup.
>
> Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to LRU maintenance helpers to
> ensure correct folio operations in the future.
>
> ## Effect
>
> Finally, it can be observed that the quantity of dying memory cgroups will
> not experience a significant increase if the following test script is
> executed to reproduce the issue.
>
> ```bash
> #!/bin/bash
>
> # Create a temporary file 'temp' filled with zero bytes
> dd if=/dev/zero of=temp bs=4096 count=1
>
> # Display memory-cgroup info from /proc/cgroups
> cat /proc/cgroups | grep memory
>
> for i in {0..2000}
> do
> mkdir /sys/fs/cgroup/memory/test$i
> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
>
> # Append 'temp' file content to 'log'
> cat temp >> log
>
> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
>
> # Potentially create a dying memory cgroup
> rmdir /sys/fs/cgroup/memory/test$i
> done
>
> # Display memory-cgroup info after test
> cat /proc/cgroups | grep memory
>
> rm -f temp log
> ```
>
> ## References
>
> [1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> [2] https://lwn.net/Articles/895431/
> [3] https://github.com/systemd/systemd/pull/36827
>
> Muchun Song (28):
> mm: memcontrol: remove dead code of checking parent memory cgroup
> mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock
> holding
> mm: workingset: use folio_lruvec() in workingset_refault()
> mm: rename unlock_page_lruvec_irq and its variants
> mm: thp: replace folio_memcg() with folio_memcg_charged()
> mm: thp: introduce folio_split_queue_lock and its variants
> mm: thp: use folio_batch to handle THP splitting in
> deferred_split_scan()
> mm: vmscan: refactor move_folios_to_lru()
> mm: memcontrol: allocate object cgroup for non-kmem case
> mm: memcontrol: return root object cgroup for root memory cgroup
> mm: memcontrol: prevent memory cgroup release in
> get_mem_cgroup_from_folio()
> buffer: prevent memory cgroup release in folio_alloc_buffers()
> writeback: prevent memory cgroup release in writeback module
> mm: memcontrol: prevent memory cgroup release in
> count_memcg_folio_events()
> mm: page_io: prevent memory cgroup release in page_io module
> mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
> mm: mglru: prevent memory cgroup release in mglru
> mm: memcontrol: prevent memory cgroup release in
> mem_cgroup_swap_full()
> mm: workingset: prevent memory cgroup release in lru_gen_eviction()
> mm: workingset: prevent lruvec release in workingset_refault()
> mm: zswap: prevent lruvec release in zswap_folio_swapin()
> mm: swap: prevent lruvec release in swap module
> mm: workingset: prevent lruvec release in workingset_activation()
> mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
> mm: thp: prepare for reparenting LRU pages for split queue lock
> mm: memcontrol: introduce memcg_reparent_ops
> mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
> folios
> mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
>
> fs/buffer.c | 4 +-
> fs/fs-writeback.c | 22 +-
> include/linux/memcontrol.h | 190 ++++++------
> include/linux/mm_inline.h | 6 +
> include/trace/events/writeback.h | 3 +
> mm/compaction.c | 43 ++-
> mm/huge_memory.c | 218 +++++++++-----
> mm/memcontrol-v1.c | 15 +-
> mm/memcontrol.c | 476 +++++++++++++++++++------------
> mm/migrate.c | 2 +
> mm/mlock.c | 2 +-
> mm/page_io.c | 8 +-
> mm/percpu.c | 2 +-
> mm/shrinker.c | 6 +-
> mm/swap.c | 22 +-
> mm/vmscan.c | 73 ++---
> mm/workingset.c | 26 +-
> mm/zswap.c | 2 +
> 18 files changed, 696 insertions(+), 424 deletions(-)
>
> --
> 2.20.1
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (28 preceding siblings ...)
2025-04-15 2:53 ` [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
@ 2025-04-15 6:19 ` Kairui Song
2025-04-15 8:01 ` Muchun Song
2025-05-23 1:23 ` Harry Yoo
30 siblings, 1 reply; 69+ messages in thread
From: Kairui Song @ 2025-04-15 6:19 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:46 AM Muchun Song <songmuchun@bytedance.com> wrote:
>
> This patchset is based on v6.15-rc2. It functions correctly only when
> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
> during rebasing onto the latest code. For more details and assistance, refer
> to the "Challenges" section. This is the reason for adding the RFC tag.
>
> ## Introduction
>
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already been
> reached regarding this approach recently [1].
>
> ## Background
>
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the issue of dying memory
> cgroup. We have exerted greater efforts to tackle this problem by
> introducing the infrastructure of object cgroup [2].
>
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continues to hold a reference to the original memory cgroup
> until reclaimed.
>
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency
> of page reclamation.
>
> ## Fundamentals
>
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following four steps are taken to achieve these goals.
>
> 1. The first step to be taken is to identify all users of both functions
> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> stability and implement appropriate measures (such as holding a RCU read
> lock or temporarily obtaining a reference to the memory cgroup for a
> brief period) to prevent the release of the memory cgroup.
>
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
> how to ensure the binding stability from the user's perspective of
> folio_lruvec().
>
> struct lruvec *folio_lruvec_lock(struct folio *folio)
> {
> struct lruvec *lruvec;
>
> rcu_read_lock();
> retry:
> lruvec = folio_lruvec(folio);
> spin_lock(&lruvec->lru_lock);
> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> spin_unlock(&lruvec->lru_lock);
> goto retry;
> }
>
> return lruvec;
> }
>
> From the perspective of memory cgroup removal, the entire reparenting
> process (altering the binding relationship between folio and its memory
> cgroup and moving the LRU lists to its parental memory cgroup) should be
> carried out under both the lruvec lock of the memory cgroup being removed
> and the lruvec lock of its parent.
>
> 3. Thirdly, another lock that requires the same approach is the split-queue
> lock of THP.
>
> 4. Finally, transfer the LRU pages to the object cgroup without holding a
> reference to the original memory cgroup.
>
Hi, Muchun, thanks for the patch.
> ## Challenges
>
> In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
> LRU lists (i.e., two active lists for anonymous and file folios, and two
> inactive lists for anonymous and file folios). Due to the symmetry of the
> LRU lists, it is feasible to transfer the LRU lists from a memory cgroup
> to its parent memory cgroup during the reparenting process.
Symmetry of LRU lists doesn't mean symmetric 'hotness'; it's entirely
possible that a child's active LRU is colder and should be evicted
before the parent's inactive LRU (that might even be a common
scenario for certain workloads).
This only affects performance, not correctness, so it's not a big
problem.
So would it be easier to just assume a dying cgroup's folios are colder?
Simply moving them to the parent's LRU tail is OK. This would make the
logic applicable to both the active/inactive LRU and MGLRU.
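A minimal sketch of the splice-to-tail idea for the classic LRU case
(lru_lock handling, mem_cgroup_update_lru_size() accounting and the
MGLRU side are deliberately omitted, so treat it as an illustration of
the list manipulation rather than a working patch):
```c
/*
 * Sketch only: treat the dying child's folios as cold and splice each
 * child LRU list onto the tail of the parent's inactive list of the
 * same type. Locking and per-zone LRU size accounting are omitted.
 */
static void reparent_lru_lists(struct lruvec *child, struct lruvec *parent)
{
	enum lru_list lru;

	for_each_lru(lru) {
		/* Both the active and the inactive child list land on
		 * the parent's inactive list of the same type. */
		enum lru_list target = is_active_lru(lru) ?
				       lru - LRU_ACTIVE : lru;

		list_splice_tail_init(&child->lists[lru],
				      &parent->lists[target]);
	}
}
```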
>
> In a MGLRU scenario, each lruvec of every memory cgroup comprises at least
> 2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS) generations.
>
> 1. The first question is how to move the LRU lists from a memory cgroup to
> its parent memory cgroup during the reparenting process. This is due to
> the fact that the quantity of LRU lists (aka generations) may differ
> between a child memory cgroup and its parent memory cgroup.
>
> 2. The second question is how to make the process of reparenting more
> efficient, since each folio charged to a memory cgroup stores its
> generation counter into its ->flags. And the generation counter may
> differ between a child memory cgroup and its parent memory cgroup because
> the values of ->min_seq and ->max_seq are not identical. Should those
> generation counters be updated correspondingly?
I think you do have to iterate through the folios to set or clear
their generation flags if you want to put each folio in the right gen.
MGLRU does a similar thing in inc_min_seq(). MGLRU uses the gen flags
to defer the actual LRU movement of folios, which is a very important
optimization per my tests.
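Roughly, the per-folio walk being described would look like this;
folio_set_gen() is a hypothetical stand-in for the LRU_GEN_MASK bit
manipulation that folio_inc_gen()/inc_min_seq() actually do, and the
MGLRU field layout is only approximate:
```c
/*
 * Sketch only: fold one child generation into the parent lruvec,
 * rewriting each folio's generation (stored in folio->flags) so it is
 * valid against the parent's min_seq/max_seq. folio_set_gen() is a
 * hypothetical helper; lrugen page accounting is omitted.
 */
static void reparent_one_gen(struct lruvec *child, struct lruvec *parent,
			     unsigned long child_seq, unsigned long parent_seq,
			     int type, int zone)
{
	int cgen = lru_gen_from_seq(child_seq);
	int pgen = lru_gen_from_seq(parent_seq);
	struct folio *folio, *next;

	list_for_each_entry_safe(folio, next,
				 &child->lrugen.folios[cgen][type][zone], lru) {
		folio_set_gen(folio, pgen);	/* hypothetical helper */
		list_move_tail(&folio->lru,
			       &parent->lrugen.folios[pgen][type][zone]);
	}
}
```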
>
> I am uncertain about how to handle them appropriately as I am not an
> expert at MGLRU. I would appreciate it if you could offer some suggestions.
> Moreover, if you are willing to directly provide your patches, I would be
> glad to incorporate them into this patchset.
If we just follow the above idea (move them to the parent's tail), we
can keep each folio's tier info untouched here.
Mapped file folios will still be promoted upon eviction if their
access bits are set (rmap walk), and MGLRU's page table walker may
well promote them just fine.
For unmapped file folios, if we keep their tier info and add the
child's MGLRU tier PID counters back to the parent, the workingset
protection of MGLRU should still work just fine.
>
> ## Compositions
>
> Patches 1-8 involve code refactoring and cleanup with the aim of
> facilitating the transfer LRU folios to object cgroup infrastructures.
>
> Patches 9-10 aim to allocate the object cgroup for non-kmem scenarios,
> enabling the ability that LRU folios could be charged to it and aligning
> the behavior of object-cgroup-related APIs with that of the memory cgroup.
>
> Patches 11-19 aim to prevent memory cgroup returned by folio_memcg() from
> being released.
>
> Patches 20-23 aim to prevent lruvec returned by folio_lruvec() from being
> released.
>
> Patches 24-25 implement the core mechanism to guarantee binding stability
> between the folio and its corresponding memory cgroup while holding lruvec
> lock or split-queue lock of THP.
>
> Patches 26-27 are intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup.
>
> Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to LRU maintenance helpers to
> ensure correct folio operations in the future.
>
> ## Effect
>
> Finally, it can be observed that the quantity of dying memory cgroups will
> not experience a significant increase if the following test script is
> executed to reproduce the issue.
>
> ```bash
> #!/bin/bash
>
> # Create a temporary file 'temp' filled with zero bytes
> dd if=/dev/zero of=temp bs=4096 count=1
>
> # Display memory-cgroup info from /proc/cgroups
> cat /proc/cgroups | grep memory
>
> for i in {0..2000}
> do
> mkdir /sys/fs/cgroup/memory/test$i
> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
>
> # Append 'temp' file content to 'log'
> cat temp >> log
>
> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
>
> # Potentially create a dying memory cgroup
> rmdir /sys/fs/cgroup/memory/test$i
> done
>
> # Display memory-cgroup info after test
> cat /proc/cgroups | grep memory
>
> rm -f temp log
> ```
>
> ## References
>
> [1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> [2] https://lwn.net/Articles/895431/
> [3] https://github.com/systemd/systemd/pull/36827
How much overhead will this be? Objcg has some extra overhead, and we
now have an extra convention for retrieving the memcg of a folio, so
I'm not sure whether this will cause an observable slowdown.
I'm still wondering whether it would be more feasible to just migrate
(NOT the cgroup v1 style of migration; just set the folio's memcg to
the parent for a dying cgroup and update the memcg charge) and iterate
the folios on reparenting in a worker or something like that. There
are already things like the destruction workqueue and the offline
waitqueue. That way folios would still just point to a memcg, which
seems to avoid a lot of complexity.
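A rough sketch of that direction, with folio_reparent() and the
reparent_work member as purely hypothetical stand-ins for the
->memcg_data rewrite and charge transfer:
```c
/*
 * Sketch only: walk a dying memcg's LRU lists from a worker and rebind
 * every folio to the parent. folio_reparent() is hypothetical (it
 * would rewrite folio->memcg_data and move the charge). Folios that
 * are isolated from the LRU at this moment are not seen by the walk.
 */
static void memcg_reparent_workfn(struct work_struct *work)
{
	struct mem_cgroup *memcg = container_of(work, struct mem_cgroup,
						reparent_work); /* hypothetical member */
	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
	int nid;

	for_each_node(nid) {
		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
		struct folio *folio, *next;
		enum lru_list lru;

		spin_lock_irq(&lruvec->lru_lock);
		for_each_lru(lru) {
			list_for_each_entry_safe(folio, next,
						 &lruvec->lists[lru], lru)
				folio_reparent(folio, parent); /* hypothetical */
		}
		spin_unlock_irq(&lruvec->lru_lock);
	}
}
```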
>
> Muchun Song (28):
> mm: memcontrol: remove dead code of checking parent memory cgroup
> mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock
> holding
> mm: workingset: use folio_lruvec() in workingset_refault()
> mm: rename unlock_page_lruvec_irq and its variants
> mm: thp: replace folio_memcg() with folio_memcg_charged()
> mm: thp: introduce folio_split_queue_lock and its variants
> mm: thp: use folio_batch to handle THP splitting in
> deferred_split_scan()
> mm: vmscan: refactor move_folios_to_lru()
> mm: memcontrol: allocate object cgroup for non-kmem case
> mm: memcontrol: return root object cgroup for root memory cgroup
> mm: memcontrol: prevent memory cgroup release in
> get_mem_cgroup_from_folio()
> buffer: prevent memory cgroup release in folio_alloc_buffers()
> writeback: prevent memory cgroup release in writeback module
> mm: memcontrol: prevent memory cgroup release in
> count_memcg_folio_events()
> mm: page_io: prevent memory cgroup release in page_io module
> mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
> mm: mglru: prevent memory cgroup release in mglru
> mm: memcontrol: prevent memory cgroup release in
> mem_cgroup_swap_full()
> mm: workingset: prevent memory cgroup release in lru_gen_eviction()
> mm: workingset: prevent lruvec release in workingset_refault()
> mm: zswap: prevent lruvec release in zswap_folio_swapin()
> mm: swap: prevent lruvec release in swap module
> mm: workingset: prevent lruvec release in workingset_activation()
> mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
> mm: thp: prepare for reparenting LRU pages for split queue lock
> mm: memcontrol: introduce memcg_reparent_ops
> mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
> folios
> mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
>
> fs/buffer.c | 4 +-
> fs/fs-writeback.c | 22 +-
> include/linux/memcontrol.h | 190 ++++++------
> include/linux/mm_inline.h | 6 +
> include/trace/events/writeback.h | 3 +
> mm/compaction.c | 43 ++-
> mm/huge_memory.c | 218 +++++++++-----
> mm/memcontrol-v1.c | 15 +-
> mm/memcontrol.c | 476 +++++++++++++++++++------------
> mm/migrate.c | 2 +
> mm/mlock.c | 2 +-
> mm/page_io.c | 8 +-
> mm/percpu.c | 2 +-
> mm/shrinker.c | 6 +-
> mm/swap.c | 22 +-
> mm/vmscan.c | 73 ++---
> mm/workingset.c | 26 +-
> mm/zswap.c | 2 +
> 18 files changed, 696 insertions(+), 424 deletions(-)
>
> --
> 2.20.1
>
>
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-15 6:19 ` Kairui Song
@ 2025-04-15 8:01 ` Muchun Song
2025-04-17 18:22 ` Kairui Song
0 siblings, 1 reply; 69+ messages in thread
From: Muchun Song @ 2025-04-15 8:01 UTC (permalink / raw)
To: Kairui Song
Cc: Muchun Song, hannes, mhocko, roman.gushchin, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
> On Apr 15, 2025, at 14:19, Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Apr 15, 2025 at 10:46 AM Muchun Song <songmuchun@bytedance.com> wrote:
>>
>> This patchset is based on v6.15-rc2. It functions correctly only when
>> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
>> during rebasing onto the latest code. For more details and assistance, refer
>> to the "Challenges" section. This is the reason for adding the RFC tag.
>>
>> ## Introduction
>>
>> This patchset is intended to transfer the LRU pages to the object cgroup
>> without holding a reference to the original memory cgroup in order to
>> address the issue of the dying memory cgroup. A consensus has already been
>> reached regarding this approach recently [1].
>>
>> ## Background
>>
>> The issue of a dying memory cgroup refers to a situation where a memory
>> cgroup is no longer being used by users, but memory (the metadata
>> associated with memory cgroups) remains allocated to it. This situation
>> may potentially result in memory leaks or inefficiencies in memory
>> reclamation and has persisted as an issue for several years. Any memory
>> allocation that endures longer than the lifespan (from the users'
>> perspective) of a memory cgroup can lead to the issue of dying memory
>> cgroup. We have exerted greater efforts to tackle this problem by
>> introducing the infrastructure of object cgroup [2].
>>
>> Presently, numerous types of objects (slab objects, non-slab kernel
>> allocations, per-CPU objects) are charged to the object cgroup without
>> holding a reference to the original memory cgroup. The final allocations
>> for LRU pages (anonymous pages and file pages) are charged at allocation
>> time and continues to hold a reference to the original memory cgroup
>> until reclaimed.
>>
>> File pages are more complex than anonymous pages as they can be shared
>> among different memory cgroups and may persist beyond the lifespan of
>> the memory cgroup. The long-term pinning of file pages to memory cgroups
>> is a widespread issue that causes recurring problems in practical
>> scenarios [3]. File pages remain unreclaimed for extended periods.
>> Additionally, they are accessed by successive instances (second, third,
>> fourth, etc.) of the same job, which is restarted into a new cgroup each
>> time. As a result, unreclaimable dying memory cgroups accumulate,
>> leading to memory wastage and significantly reducing the efficiency
>> of page reclamation.
>>
>> ## Fundamentals
>>
>> A folio will no longer pin its corresponding memory cgroup. It is necessary
>> to ensure that the memory cgroup or the lruvec associated with the memory
>> cgroup is not released when a user obtains a pointer to the memory cgroup
>> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
>> to hold the RCU read lock or acquire a reference to the memory cgroup
>> associated with the folio to prevent its release if they are not concerned
>> about the binding stability between the folio and its corresponding memory
>> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
>> desire a stable binding between the folio and its corresponding memory
>> cgroup. An approach is needed to ensure the stability of the binding while
>> the lruvec lock is held, and to detect the situation of holding the
>> incorrect lruvec lock when there is a race condition during memory cgroup
>> reparenting. The following four steps are taken to achieve these goals.
>>
>> 1. The first step to be taken is to identify all users of both functions
>> (folio_memcg() and folio_lruvec()) who are not concerned about binding
>> stability and implement appropriate measures (such as holding a RCU read
>> lock or temporarily obtaining a reference to the memory cgroup for a
>> brief period) to prevent the release of the memory cgroup.
>>
>> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>> how to ensure the binding stability from the user's perspective of
>> folio_lruvec().
>>
>> struct lruvec *folio_lruvec_lock(struct folio *folio)
>> {
>> struct lruvec *lruvec;
>>
>> rcu_read_lock();
>> retry:
>> lruvec = folio_lruvec(folio);
>> spin_lock(&lruvec->lru_lock);
>> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>> spin_unlock(&lruvec->lru_lock);
>> goto retry;
>> }
>>
>> return lruvec;
>> }
>>
>> From the perspective of memory cgroup removal, the entire reparenting
>> process (altering the binding relationship between folio and its memory
>> cgroup and moving the LRU lists to its parental memory cgroup) should be
>> carried out under both the lruvec lock of the memory cgroup being removed
>> and the lruvec lock of its parent.
>>
>> 3. Thirdly, another lock that requires the same approach is the split-queue
>> lock of THP.
>>
>> 4. Finally, transfer the LRU pages to the object cgroup without holding a
>> reference to the original memory cgroup.
>>
>
> Hi, Muchun, thanks for the patch.
Thanks for your reply and attention.
>
>> ## Challenges
>>
>> In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
>> LRU lists (i.e., two active lists for anonymous and file folios, and two
>> inactive lists for anonymous and file folios). Due to the symmetry of the
>> LRU lists, it is feasible to transfer the LRU lists from a memory cgroup
>> to its parent memory cgroup during the reparenting process.
>
> Symmetry of LRU lists doesn't mean symmetry 'hotness', it's totally
> possible that a child's active LRU is colder and should be evicted
> first before the parent's inactive LRU (might even be a common
> scenario for certain workloads).
Yes.
> This only affects the performance not the correctness though, so not a
> big problem.
>
> So will it be easier to just assume dying cgroup's folios are colder?
> Simply move them to parent's LRU tail is OK. This will make the logic
> appliable for both active/inactive LRU and MGLRU.
I think you mean moving all of the child's LRU lists to the parent
memcg's inactive lists. That works well for your case. But sometimes,
due to shared page cache pages, some pages in the child's lists may be
accessed more frequently than those in the parent's. Still, it's okay
as they can be promoted quickly later. So I am fine with this change.
>
>>
>> In a MGLRU scenario, each lruvec of every memory cgroup comprises at least
>> 2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS) generations.
>>
>> 1. The first question is how to move the LRU lists from a memory cgroup to
>> its parent memory cgroup during the reparenting process. This is due to
>> the fact that the quantity of LRU lists (aka generations) may differ
>> between a child memory cgroup and its parent memory cgroup.
>>
>> 2. The second question is how to make the process of reparenting more
>> efficient, since each folio charged to a memory cgroup stores its
>> generation counter into its ->flags. And the generation counter may
>> differ between a child memory cgroup and its parent memory cgroup because
>> the values of ->min_seq and ->max_seq are not identical. Should those
>> generation counters be updated correspondingly?
>
> I think you do have to iterate through the folios to set or clear
> their generation flags if you want to put the folio in the right gen.
>
> MGLRU does similar thing in inc_min_seq. MGLRU uses the gen flags to
> defer the actual LRU movement of folios, that's a very important
> optimization per my test.
I noticed that, which is why I asked the second question. It's
inefficient when dealing with numerous pages related to a memory
cgroup.
>
>>
>> I am uncertain about how to handle them appropriately as I am not an
>> expert at MGLRU. I would appreciate it if you could offer some suggestions.
>> Moreover, if you are willing to directly provide your patches, I would be
>> glad to incorporate them into this patchset.
>
> If we just follow the above idea (move them to parent's tail), we can
> just keep the folio's tier info untouched here.
>
> For mapped file folios, they will still be promoted upon eviction if
> their access bit are set (rmap walk), and MGLRU's table walker might
> just promote them just fine.
>
> For unmapped file folios, if we just keep their tier info and add
> child's MGLRU tier PID counter back to the parent. Workingset
> protection of MGLRU should still work just fine.
>
>>
>> ## Compositions
>>
>> Patches 1-8 involve code refactoring and cleanup with the aim of
>> facilitating the transfer LRU folios to object cgroup infrastructures.
>>
>> Patches 9-10 aim to allocate the object cgroup for non-kmem scenarios,
>> enabling the ability that LRU folios could be charged to it and aligning
>> the behavior of object-cgroup-related APIs with that of the memory cgroup.
>>
>> Patches 11-19 aim to prevent memory cgroup returned by folio_memcg() from
>> being released.
>>
>> Patches 20-23 aim to prevent lruvec returned by folio_lruvec() from being
>> released.
>>
>> Patches 24-25 implement the core mechanism to guarantee binding stability
>> between the folio and its corresponding memory cgroup while holding lruvec
>> lock or split-queue lock of THP.
>>
>> Patches 26-27 are intended to transfer the LRU pages to the object cgroup
>> without holding a reference to the original memory cgroup in order to
>> address the issue of the dying memory cgroup.
>>
>> Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to LRU maintenance helpers to
>> ensure correct folio operations in the future.
>>
>> ## Effect
>>
>> Finally, it can be observed that the quantity of dying memory cgroups will
>> not experience a significant increase if the following test script is
>> executed to reproduce the issue.
>>
>> ```bash
>> #!/bin/bash
>>
>> # Create a temporary file 'temp' filled with zero bytes
>> dd if=/dev/zero of=temp bs=4096 count=1
>>
>> # Display memory-cgroup info from /proc/cgroups
>> cat /proc/cgroups | grep memory
>>
>> for i in {0..2000}
>> do
>> mkdir /sys/fs/cgroup/memory/test$i
>> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
>>
>> # Append 'temp' file content to 'log'
>> cat temp >> log
>>
>> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
>>
>> # Potentially create a dying memory cgroup
>> rmdir /sys/fs/cgroup/memory/test$i
>> done
>>
>> # Display memory-cgroup info after test
>> cat /proc/cgroups | grep memory
>>
>> rm -f temp log
>> ```
>>
>> ## References
>>
>> [1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
>> [2] https://lwn.net/Articles/895431/
>> [3] https://github.com/systemd/systemd/pull/36827
>
> How much overhead will it be? Objcj has some extra overhead, and we
> have some extra convention for retrieving memcg of a folio now, not
> sure if this will have an observable slow down.
I don't think there'll be an observable slowdown. I think the objcg
overhead matters more for slab objects, as slab allocations are more
performance-sensitive than user pages. If it's acceptable for slab
objects, it should be acceptable for user pages too.
>
> I'm still thinking if it be more feasible to just migrate (NOT that
> Cgroup V1 migrate, just set the folio's memcg to parent for dying
> cgroup and update the memcg charge) and iterate the folios on
> reparenting in a worker or something like that. There is already
> things like destruction workqueue and offline waitqueue. That way
> folios will still just point to a memcg, and seems would avoid a lot
> of complexity.
I didn't adopt this approach back then for two reasons:
1) It's inefficient to change `->memcg_data` to the parent while
iterating through all pages associated with a memory cgroup.
2) During the iteration, we might come across pages isolated by other
users. These pages aren't on any LRU list and would thus miss
being reparented to the parent memory cgroup.
Muchun,
Thanks.
>
>
>>
>> Muchun Song (28):
>> mm: memcontrol: remove dead code of checking parent memory cgroup
>> mm: memcontrol: use folio_memcg_charged() to avoid potential rcu lock
>> holding
>> mm: workingset: use folio_lruvec() in workingset_refault()
>> mm: rename unlock_page_lruvec_irq and its variants
>> mm: thp: replace folio_memcg() with folio_memcg_charged()
>> mm: thp: introduce folio_split_queue_lock and its variants
>> mm: thp: use folio_batch to handle THP splitting in
>> deferred_split_scan()
>> mm: vmscan: refactor move_folios_to_lru()
>> mm: memcontrol: allocate object cgroup for non-kmem case
>> mm: memcontrol: return root object cgroup for root memory cgroup
>> mm: memcontrol: prevent memory cgroup release in
>> get_mem_cgroup_from_folio()
>> buffer: prevent memory cgroup release in folio_alloc_buffers()
>> writeback: prevent memory cgroup release in writeback module
>> mm: memcontrol: prevent memory cgroup release in
>> count_memcg_folio_events()
>> mm: page_io: prevent memory cgroup release in page_io module
>> mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
>> mm: mglru: prevent memory cgroup release in mglru
>> mm: memcontrol: prevent memory cgroup release in
>> mem_cgroup_swap_full()
>> mm: workingset: prevent memory cgroup release in lru_gen_eviction()
>> mm: workingset: prevent lruvec release in workingset_refault()
>> mm: zswap: prevent lruvec release in zswap_folio_swapin()
>> mm: swap: prevent lruvec release in swap module
>> mm: workingset: prevent lruvec release in workingset_activation()
>> mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
>> mm: thp: prepare for reparenting LRU pages for split queue lock
>> mm: memcontrol: introduce memcg_reparent_ops
>> mm: memcontrol: eliminate the problem of dying memory cgroup for LRU
>> folios
>> mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
>>
>> fs/buffer.c | 4 +-
>> fs/fs-writeback.c | 22 +-
>> include/linux/memcontrol.h | 190 ++++++------
>> include/linux/mm_inline.h | 6 +
>> include/trace/events/writeback.h | 3 +
>> mm/compaction.c | 43 ++-
>> mm/huge_memory.c | 218 +++++++++-----
>> mm/memcontrol-v1.c | 15 +-
>> mm/memcontrol.c | 476 +++++++++++++++++++------------
>> mm/migrate.c | 2 +
>> mm/mlock.c | 2 +-
>> mm/page_io.c | 8 +-
>> mm/percpu.c | 2 +-
>> mm/shrinker.c | 6 +-
>> mm/swap.c | 22 +-
>> mm/vmscan.c | 73 ++---
>> mm/workingset.c | 26 +-
>> mm/zswap.c | 2 +
>> 18 files changed, 696 insertions(+), 424 deletions(-)
>>
>> --
>> 2.20.1
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-15 8:01 ` Muchun Song
@ 2025-04-17 18:22 ` Kairui Song
2025-04-17 19:04 ` Johannes Weiner
` (2 more replies)
0 siblings, 3 replies; 69+ messages in thread
From: Kairui Song @ 2025-04-17 18:22 UTC (permalink / raw)
To: Muchun Song
Cc: Muchun Song, hannes, mhocko, roman.gushchin, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On Tue, Apr 15, 2025 at 4:02 PM Muchun Song <muchun.song@linux.dev> wrote:
>
>
>
> > On Apr 15, 2025, at 14:19, Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, Apr 15, 2025 at 10:46 AM Muchun Song <songmuchun@bytedance.com> wrote:
> >>
> >> This patchset is based on v6.15-rc2. It functions correctly only when
> >> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
> >> during rebasing onto the latest code. For more details and assistance, refer
> >> to the "Challenges" section. This is the reason for adding the RFC tag.
> >>
> >> ## Introduction
> >>
> >> This patchset is intended to transfer the LRU pages to the object cgroup
> >> without holding a reference to the original memory cgroup in order to
> >> address the issue of the dying memory cgroup. A consensus has already been
> >> reached regarding this approach recently [1].
> >>
> >> ## Background
> >>
> >> The issue of a dying memory cgroup refers to a situation where a memory
> >> cgroup is no longer being used by users, but memory (the metadata
> >> associated with memory cgroups) remains allocated to it. This situation
> >> may potentially result in memory leaks or inefficiencies in memory
> >> reclamation and has persisted as an issue for several years. Any memory
> >> allocation that endures longer than the lifespan (from the users'
> >> perspective) of a memory cgroup can lead to the issue of dying memory
> >> cgroup. We have exerted greater efforts to tackle this problem by
> >> introducing the infrastructure of object cgroup [2].
> >>
> >> Presently, numerous types of objects (slab objects, non-slab kernel
> >> allocations, per-CPU objects) are charged to the object cgroup without
> >> holding a reference to the original memory cgroup. The final allocations
> >> for LRU pages (anonymous pages and file pages) are charged at allocation
> >> time and continues to hold a reference to the original memory cgroup
> >> until reclaimed.
> >>
> >> File pages are more complex than anonymous pages as they can be shared
> >> among different memory cgroups and may persist beyond the lifespan of
> >> the memory cgroup. The long-term pinning of file pages to memory cgroups
> >> is a widespread issue that causes recurring problems in practical
> >> scenarios [3]. File pages remain unreclaimed for extended periods.
> >> Additionally, they are accessed by successive instances (second, third,
> >> fourth, etc.) of the same job, which is restarted into a new cgroup each
> >> time. As a result, unreclaimable dying memory cgroups accumulate,
> >> leading to memory wastage and significantly reducing the efficiency
> >> of page reclamation.
> >>
> >> ## Fundamentals
> >>
> >> A folio will no longer pin its corresponding memory cgroup. It is necessary
> >> to ensure that the memory cgroup or the lruvec associated with the memory
> >> cgroup is not released when a user obtains a pointer to the memory cgroup
> >> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> >> to hold the RCU read lock or acquire a reference to the memory cgroup
> >> associated with the folio to prevent its release if they are not concerned
> >> about the binding stability between the folio and its corresponding memory
> >> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> >> desire a stable binding between the folio and its corresponding memory
> >> cgroup. An approach is needed to ensure the stability of the binding while
> >> the lruvec lock is held, and to detect the situation of holding the
> >> incorrect lruvec lock when there is a race condition during memory cgroup
> >> reparenting. The following four steps are taken to achieve these goals.
> >>
> >> 1. The first step to be taken is to identify all users of both functions
> >> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> >> stability and implement appropriate measures (such as holding a RCU read
> >> lock or temporarily obtaining a reference to the memory cgroup for a
> >> brief period) to prevent the release of the memory cgroup.
> >>
> >> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
> >> how to ensure the binding stability from the user's perspective of
> >> folio_lruvec().
> >>
> >> struct lruvec *folio_lruvec_lock(struct folio *folio)
> >> {
> >> struct lruvec *lruvec;
> >>
> >> rcu_read_lock();
> >> retry:
> >> lruvec = folio_lruvec(folio);
> >> spin_lock(&lruvec->lru_lock);
> >> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> >> spin_unlock(&lruvec->lru_lock);
> >> goto retry;
> >> }
> >>
> >> return lruvec;
> >> }
> >>
> >> From the perspective of memory cgroup removal, the entire reparenting
> >> process (altering the binding relationship between folio and its memory
> >> cgroup and moving the LRU lists to its parental memory cgroup) should be
> >> carried out under both the lruvec lock of the memory cgroup being removed
> >> and the lruvec lock of its parent.
> >>
> >> 3. Thirdly, another lock that requires the same approach is the split-queue
> >> lock of THP.
> >>
> >> 4. Finally, transfer the LRU pages to the object cgroup without holding a
> >> reference to the original memory cgroup.
> >>
> >
> > Hi, Muchun, thanks for the patch.
>
> Thanks for your reply and attention.
>
> >
> >> ## Challenges
> >>
> >> In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
> >> LRU lists (i.e., two active lists for anonymous and file folios, and two
> >> inactive lists for anonymous and file folios). Due to the symmetry of the
> >> LRU lists, it is feasible to transfer the LRU lists from a memory cgroup
> >> to its parent memory cgroup during the reparenting process.
> >
> > Symmetry of LRU lists doesn't mean symmetry 'hotness', it's totally
> > possible that a child's active LRU is colder and should be evicted
> > first before the parent's inactive LRU (might even be a common
> > scenario for certain workloads).
>
> Yes.
>
> > This only affects the performance not the correctness though, so not a
> > big problem.
> >
> > So will it be easier to just assume dying cgroup's folios are colder?
> > Simply move them to parent's LRU tail is OK. This will make the logic
> > appliable for both active/inactive LRU and MGLRU.
>
> I think you mean moving all child LRU list to the parent memcg's inactive
> list. It works well for your case. But sometimes, due to shared page cache
> pages, some pages in the child list may be accessed more frequently than
> those in the parent's. Still, it's okay as they can be promoted quickly
> later. So I am fine with this change.
>
> >
> >>
> >> In a MGLRU scenario, each lruvec of every memory cgroup comprises at least
> >> 2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS) generations.
> >>
> >> 1. The first question is how to move the LRU lists from a memory cgroup to
> >> its parent memory cgroup during the reparenting process. This is due to
> >> the fact that the quantity of LRU lists (aka generations) may differ
> >> between a child memory cgroup and its parent memory cgroup.
> >>
> >> 2. The second question is how to make the process of reparenting more
> >> efficient, since each folio charged to a memory cgroup stores its
> >> generation counter into its ->flags. And the generation counter may
> >> differ between a child memory cgroup and its parent memory cgroup because
> >> the values of ->min_seq and ->max_seq are not identical. Should those
> >> generation counters be updated correspondingly?
> >
> > I think you do have to iterate through the folios to set or clear
> > their generation flags if you want to put the folio in the right gen.
> >
> > MGLRU does similar thing in inc_min_seq. MGLRU uses the gen flags to
> > defer the actual LRU movement of folios, that's a very important
> > optimization per my test.
>
> I noticed that, which is why I asked the second question. It's
> inefficient when dealing with numerous pages related to a memory
> cgroup.
>
> >
> >>
> >> I am uncertain about how to handle them appropriately as I am not an
> >> expert at MGLRU. I would appreciate it if you could offer some suggestions.
> >> Moreover, if you are willing to directly provide your patches, I would be
> >> glad to incorporate them into this patchset.
> >
> > If we just follow the above idea (move them to parent's tail), we can
> > just keep the folio's tier info untouched here.
> >
> > For mapped file folios, they will still be promoted upon eviction if
> > their access bit are set (rmap walk), and MGLRU's table walker might
> > just promote them just fine.
> >
> > For unmapped file folios, if we just keep their tier info and add
> > child's MGLRU tier PID counter back to the parent. Workingset
> > protection of MGLRU should still work just fine.
> >
> >>
> >> ## Compositions
> >>
> >> Patches 1-8 involve code refactoring and cleanup with the aim of
> >> facilitating the transfer LRU folios to object cgroup infrastructures.
> >>
> >> Patches 9-10 aim to allocate the object cgroup for non-kmem scenarios,
> >> enabling the ability that LRU folios could be charged to it and aligning
> >> the behavior of object-cgroup-related APIs with that of the memory cgroup.
> >>
> >> Patches 11-19 aim to prevent memory cgroup returned by folio_memcg() from
> >> being released.
> >>
> >> Patches 20-23 aim to prevent lruvec returned by folio_lruvec() from being
> >> released.
> >>
> >> Patches 24-25 implement the core mechanism to guarantee binding stability
> >> between the folio and its corresponding memory cgroup while holding lruvec
> >> lock or split-queue lock of THP.
> >>
> >> Patches 26-27 are intended to transfer the LRU pages to the object cgroup
> >> without holding a reference to the original memory cgroup in order to
> >> address the issue of the dying memory cgroup.
> >>
> >> Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to LRU maintenance helpers to
> >> ensure correct folio operations in the future.
> >>
> >> ## Effect
> >>
> >> Finally, it can be observed that the quantity of dying memory cgroups will
> >> not experience a significant increase if the following test script is
> >> executed to reproduce the issue.
> >>
> >> ```bash
> >> #!/bin/bash
> >>
> >> # Create a temporary file 'temp' filled with zero bytes
> >> dd if=/dev/zero of=temp bs=4096 count=1
> >>
> >> # Display memory-cgroup info from /proc/cgroups
> >> cat /proc/cgroups | grep memory
> >>
> >> for i in {0..2000}
> >> do
> >> mkdir /sys/fs/cgroup/memory/test$i
> >> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
> >>
> >> # Append 'temp' file content to 'log'
> >> cat temp >> log
> >>
> >> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> >>
> >> # Potentially create a dying memory cgroup
> >> rmdir /sys/fs/cgroup/memory/test$i
> >> done
> >>
> >> # Display memory-cgroup info after test
> >> cat /proc/cgroups | grep memory
> >>
> >> rm -f temp log
> >> ```
> >>
> >> ## References
> >>
> >> [1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> >> [2] https://lwn.net/Articles/895431/
> >> [3] https://github.com/systemd/systemd/pull/36827
> >
> > How much overhead will it be? Objcj has some extra overhead, and we
> > have some extra convention for retrieving memcg of a folio now, not
> > sure if this will have an observable slow down.
>
> I don't think there'll be an observable slowdown. I think objcg is
> more effective for slab objects as they're more sensitive than user
> pages. If it's acceptable for slab objects, it should be acceptable
> for user pages too.
We currently have some workloads running with `nokmem` due to objcg
performance issues. I know there are efforts to improve this, but so
far it's still not painless. So I'm a bit worried about
this...
> >
> > I'm still thinking if it be more feasible to just migrate (NOT that
> > Cgroup V1 migrate, just set the folio's memcg to parent for dying
> > cgroup and update the memcg charge) and iterate the folios on
> > reparenting in a worker or something like that. There is already
> > things like destruction workqueue and offline waitqueue. That way
> > folios will still just point to a memcg, and seems would avoid a lot
> > of complexity.
>
> I didn't adopt this approach for two reasons then:
>
> 1) It's inefficient to change `->memcg_data` to the parent when
> iterating through all pages associated with a memory cgroup.
This is indeed a problem, but isn't reparenting a rather rare
operation? So a slow async worker might be just fine?
> 2) During iteration, we might come across pages isolated by other
> users. These pages aren't in any LRU list and will thus miss
> being reparented to the parent memory cgroup.
Hmm, such pages will have to be returned at some point; adding a
convention for the isolate/return paths seems cleaner than adding a
convention for every folio memcg retrieval?
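A tiny sketch of what such a return-path hook could look like
(folio_reparent() is hypothetical, and the offline check is only
illustrative):
```c
/*
 * Sketch only: if reparenting rewrites folio->memcg_data for folios
 * sitting on the LRU, folios that were isolated during that walk
 * could be fixed up when they come back. folio_reparent() is a
 * hypothetical helper.
 */
static void folio_return_to_lru(struct folio *folio)
{
	struct mem_cgroup *memcg = folio_memcg(folio);

	/* The memcg was offlined (and reparented) while isolated. */
	if (memcg && !mem_cgroup_online(memcg))
		folio_reparent(folio, parent_mem_cgroup(memcg));

	folio_putback_lru(folio);
}
```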
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-17 18:22 ` Kairui Song
@ 2025-04-17 19:04 ` Johannes Weiner
2025-06-27 8:50 ` Chen Ridong
2025-04-17 21:45 ` Roman Gushchin
2025-04-22 14:20 ` Yosry Ahmed
2 siblings, 1 reply; 69+ messages in thread
From: Johannes Weiner @ 2025-04-17 19:04 UTC (permalink / raw)
To: Kairui Song
Cc: Muchun Song, Muchun Song, mhocko, roman.gushchin, shakeel.butt,
akpm, david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
> On Tue, Apr 15, 2025 at 4:02 PM Muchun Song <muchun.song@linux.dev> wrote:
> We currently have some workloads running with `nokmem` due to objcg
> performance issues. I know there are efforts to improve them, but so
> far it's still not painless to have. So I'm a bit worried about
> this...
That's presumably more about the size and corresponding rate of slab
allocations. The objcg path has the same percpu cached charging and
uncharging, direct task pointer, etc. as the direct memcg path. I'm not
sure the additional objcg->memcg indirection in the slowpath would be
noticeable among the hierarchical page counter atomics...
> This is a problem indeed, but isn't reparenting a rather rare
> operation? So a slow async worker might be just fine?
That could be millions of pages that need updating. rmdir is no fast
path, but that's a lot of work compared to flipping objcg->memcg and
doing a list_splice().
We used to do this in the past, if you check the git history. That's
not a desirable direction to take again, certainly not without hard
data showing that objcg is an absolute no-go.
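For reference, the objcg flip-and-splice that already runs at memcg
offline has roughly the following shape (simplified from
memcg_reparent_objcgs() in mm/memcontrol.c, with lock and field names
quoted from memory, so treat the details as approximate); this is what
makes it so much cheaper than a per-page walk:
```c
/*
 * Simplified from memcg_reparent_objcgs(): point every objcg that
 * hangs off the dying memcg at the parent and splice the list over.
 * No per-folio or per-object work is needed.
 */
static void reparent_objcgs(struct mem_cgroup *memcg, struct mem_cgroup *parent)
{
	struct obj_cgroup *objcg, *iter;

	/* Detach the active objcg from the dying memcg. */
	objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

	spin_lock_irq(&objcg_lock);
	list_add(&objcg->list, &memcg->objcg_list);
	/* Redirect the active and all previously reparented objcgs. */
	list_for_each_entry(iter, &memcg->objcg_list, list)
		WRITE_ONCE(iter->memcg, parent);
	list_splice(&memcg->objcg_list, &parent->objcg_list);
	spin_unlock_irq(&objcg_lock);

	percpu_ref_kill(&objcg->refcnt);
}
```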
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-17 19:04 ` Johannes Weiner
@ 2025-06-27 8:50 ` Chen Ridong
0 siblings, 0 replies; 69+ messages in thread
From: Chen Ridong @ 2025-06-27 8:50 UTC (permalink / raw)
To: Johannes Weiner, Kairui Song
Cc: Muchun Song, Muchun Song, mhocko, roman.gushchin, shakeel.butt,
akpm, david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On 2025/4/18 3:04, Johannes Weiner wrote:
> On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
>> On Tue, Apr 15, 2025 at 4:02 PM Muchun Song <muchun.song@linux.dev> wrote:
>> We currently have some workloads running with `nokmem` due to objcg
>> performance issues. I know there are efforts to improve them, but so
>> far it's still not painless to have. So I'm a bit worried about
>> this...
>
> That's presumably more about the size and corresponding rate of slab
> allocations. The objcg path has the same percpu cached charging and
> uncharging, direct task pointer etc. as the direct memcg path. Not
> sure the additional objcg->memcg indirection in the slowpath would be
> noticable among hierarchical page counter atomics...
>
We have encountered the same memory accounting performance issue with
kmem in our environment running cgroup v1 on Linux kernel v6.6. We have
observed significant performance overhead in the following critical path:
  alloc_pages
    __alloc_pages
      __memcg_kmem_charge_page
        memcg_account_kmem
          page_counter_charge
Our profiling shows this call chain accounts for over 23%. This
bottleneck occurs because multiple Docker containers simultaneously
charge to their common parent's page_counter, creating contention on
the atomic operations.
Although cgroup v1 is being deprecated, many production systems still
rely on it. To mitigate this issue, I'm considering implementing a
per-CPU stock mechanism specifically for memcg_account_kmem (limited
to v1 usage). Would this approach be acceptable?
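A minimal sketch of the kind of per-CPU stock being considered, with
made-up names throughout (the existing memcg_stock_pcp / consume_stock()
code in mm/memcontrol.c would be the natural model; the usual
local_lock/IRQ handling, flushing of a previously cached memcg and
drain-on-offline logic are omitted):
```c
/*
 * Sketch only, hypothetical names: batch v1 kmem charges per CPU so
 * the shared parent page_counter atomics are hit once per KMEM_BATCH
 * pages instead of on every allocation.
 */
#define KMEM_BATCH 64U

struct kmem_stock_pcp {
	struct mem_cgroup *cached;	/* memcg the stock was charged to */
	unsigned int nr_pages;		/* pre-charged pages still available */
};
static DEFINE_PER_CPU(struct kmem_stock_pcp, kmem_stock);

static bool consume_kmem_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
{
	struct kmem_stock_pcp *stock = this_cpu_ptr(&kmem_stock);

	if (stock->cached != memcg || stock->nr_pages < nr_pages)
		return false;
	stock->nr_pages -= nr_pages;
	return true;
}

static void memcg_account_kmem_batched(struct mem_cgroup *memcg,
				       unsigned int nr_pages)
{
	if (consume_kmem_stock(memcg, nr_pages))
		return;

	/* Slow path: charge the request plus a batch for later use. */
	page_counter_charge(&memcg->kmem, nr_pages + KMEM_BATCH);
	this_cpu_write(kmem_stock.cached, memcg);
	this_cpu_write(kmem_stock.nr_pages, KMEM_BATCH);
}
```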
Best regards,
Ridong
>> This is a problem indeed, but isn't reparenting a rather rare
>> operation? So a slow async worker might be just fine?
>
> That could be millions of pages that need updating. rmdir is no fast
> path, but that's a lot of work compared to flipping objcg->memcg and
> doing a list_splice().
>
> We used to do this in the past, if you check the git history. That's
> not a desirable direction to take again, certainly not without hard
> data showing that objcg is an absolute no go.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-17 18:22 ` Kairui Song
2025-04-17 19:04 ` Johannes Weiner
@ 2025-04-17 21:45 ` Roman Gushchin
2025-04-28 3:43 ` Kairui Song
2025-04-22 14:20 ` Yosry Ahmed
2 siblings, 1 reply; 69+ messages in thread
From: Roman Gushchin @ 2025-04-17 21:45 UTC (permalink / raw)
To: Kairui Song
Cc: Muchun Song, Muchun Song, hannes, mhocko, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
> On Tue, Apr 15, 2025 at 4:02 PM Muchun Song <muchun.song@linux.dev> wrote:
> >
> >
> >
> > > On Apr 15, 2025, at 14:19, Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Tue, Apr 15, 2025 at 10:46 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > >>
> > >> This patchset is based on v6.15-rc2. It functions correctly only when
> > >> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
> > >> during rebasing onto the latest code. For more details and assistance, refer
> > >> to the "Challenges" section. This is the reason for adding the RFC tag.
> > >>
> > >> ## Introduction
> > >>
> > >> This patchset is intended to transfer the LRU pages to the object cgroup
> > >> without holding a reference to the original memory cgroup in order to
> > >> address the issue of the dying memory cgroup. A consensus has already been
> > >> reached regarding this approach recently [1].
> > >>
> > >> ## Background
> > >>
> > >> The issue of a dying memory cgroup refers to a situation where a memory
> > >> cgroup is no longer being used by users, but memory (the metadata
> > >> associated with memory cgroups) remains allocated to it. This situation
> > >> may potentially result in memory leaks or inefficiencies in memory
> > >> reclamation and has persisted as an issue for several years. Any memory
> > >> allocation that endures longer than the lifespan (from the users'
> > >> perspective) of a memory cgroup can lead to the issue of dying memory
> > >> cgroup. We have exerted greater efforts to tackle this problem by
> > >> introducing the infrastructure of object cgroup [2].
> > >>
> > >> Presently, numerous types of objects (slab objects, non-slab kernel
> > >> allocations, per-CPU objects) are charged to the object cgroup without
> > >> holding a reference to the original memory cgroup. The final allocations
> > >> for LRU pages (anonymous pages and file pages) are charged at allocation
> > >> time and continues to hold a reference to the original memory cgroup
> > >> until reclaimed.
> > >>
> > >> File pages are more complex than anonymous pages as they can be shared
> > >> among different memory cgroups and may persist beyond the lifespan of
> > >> the memory cgroup. The long-term pinning of file pages to memory cgroups
> > >> is a widespread issue that causes recurring problems in practical
> > >> scenarios [3]. File pages remain unreclaimed for extended periods.
> > >> Additionally, they are accessed by successive instances (second, third,
> > >> fourth, etc.) of the same job, which is restarted into a new cgroup each
> > >> time. As a result, unreclaimable dying memory cgroups accumulate,
> > >> leading to memory wastage and significantly reducing the efficiency
> > >> of page reclamation.
> > >>
> > >> ## Fundamentals
> > >>
> > >> A folio will no longer pin its corresponding memory cgroup. It is necessary
> > >> to ensure that the memory cgroup or the lruvec associated with the memory
> > >> cgroup is not released when a user obtains a pointer to the memory cgroup
> > >> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> > >> to hold the RCU read lock or acquire a reference to the memory cgroup
> > >> associated with the folio to prevent its release if they are not concerned
> > >> about the binding stability between the folio and its corresponding memory
> > >> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> > >> desire a stable binding between the folio and its corresponding memory
> > >> cgroup. An approach is needed to ensure the stability of the binding while
> > >> the lruvec lock is held, and to detect the situation of holding the
> > >> incorrect lruvec lock when there is a race condition during memory cgroup
> > >> reparenting. The following four steps are taken to achieve these goals.
> > >>
> > >> 1. The first step to be taken is to identify all users of both functions
> > >> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> > >> stability and implement appropriate measures (such as holding a RCU read
> > >> lock or temporarily obtaining a reference to the memory cgroup for a
> > >> brief period) to prevent the release of the memory cgroup.
> > >>
> > >> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
> > >> how to ensure the binding stability from the user's perspective of
> > >> folio_lruvec().
> > >>
> > >> struct lruvec *folio_lruvec_lock(struct folio *folio)
> > >> {
> > >> struct lruvec *lruvec;
> > >>
> > >> rcu_read_lock();
> > >> retry:
> > >> lruvec = folio_lruvec(folio);
> > >> spin_lock(&lruvec->lru_lock);
> > >> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> > >> spin_unlock(&lruvec->lru_lock);
> > >> goto retry;
> > >> }
> > >>
> > >> return lruvec;
> > >> }
> > >>
> > >> From the perspective of memory cgroup removal, the entire reparenting
> > >> process (altering the binding relationship between folio and its memory
> > >> cgroup and moving the LRU lists to its parental memory cgroup) should be
> > >> carried out under both the lruvec lock of the memory cgroup being removed
> > >> and the lruvec lock of its parent.
> > >>
> > >> 3. Thirdly, another lock that requires the same approach is the split-queue
> > >> lock of THP.
> > >>
> > >> 4. Finally, transfer the LRU pages to the object cgroup without holding a
> > >> reference to the original memory cgroup.
> > >>
> > >
> > > Hi, Muchun, thanks for the patch.
> >
> > Thanks for your reply and attention.
> >
> > >
> > >> ## Challenges
> > >>
> > >> In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
> > >> LRU lists (i.e., two active lists for anonymous and file folios, and two
> > >> inactive lists for anonymous and file folios). Due to the symmetry of the
> > >> LRU lists, it is feasible to transfer the LRU lists from a memory cgroup
> > >> to its parent memory cgroup during the reparenting process.
> > >
> > > Symmetry of LRU lists doesn't mean symmetry 'hotness', it's totally
> > > possible that a child's active LRU is colder and should be evicted
> > > first before the parent's inactive LRU (might even be a common
> > > scenario for certain workloads).
> >
> > Yes.
> >
> > > This only affects the performance not the correctness though, so not a
> > > big problem.
> > >
> > > So will it be easier to just assume dying cgroup's folios are colder?
> > > Simply move them to parent's LRU tail is OK. This will make the logic
> > > appliable for both active/inactive LRU and MGLRU.
> >
> > I think you mean moving all child LRU list to the parent memcg's inactive
> > list. It works well for your case. But sometimes, due to shared page cache
> > pages, some pages in the child list may be accessed more frequently than
> > those in the parent's. Still, it's okay as they can be promoted quickly
> > later. So I am fine with this change.
> >
> > >
> > >>
> > >> In a MGLRU scenario, each lruvec of every memory cgroup comprises at least
> > >> 2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS) generations.
> > >>
> > >> 1. The first question is how to move the LRU lists from a memory cgroup to
> > >> its parent memory cgroup during the reparenting process. This is due to
> > >> the fact that the quantity of LRU lists (aka generations) may differ
> > >> between a child memory cgroup and its parent memory cgroup.
> > >>
> > >> 2. The second question is how to make the process of reparenting more
> > >> efficient, since each folio charged to a memory cgroup stores its
> > >> generation counter into its ->flags. And the generation counter may
> > >> differ between a child memory cgroup and its parent memory cgroup because
> > >> the values of ->min_seq and ->max_seq are not identical. Should those
> > >> generation counters be updated correspondingly?
> > >
> > > I think you do have to iterate through the folios to set or clear
> > > their generation flags if you want to put the folio in the right gen.
> > >
> > > MGLRU does similar thing in inc_min_seq. MGLRU uses the gen flags to
> > > defer the actual LRU movement of folios, that's a very important
> > > optimization per my test.
> >
> > I noticed that, which is why I asked the second question. It's
> > inefficient when dealing with numerous pages related to a memory
> > cgroup.
> >
> > >
> > >>
> > >> I am uncertain about how to handle them appropriately as I am not an
> > >> expert at MGLRU. I would appreciate it if you could offer some suggestions.
> > >> Moreover, if you are willing to directly provide your patches, I would be
> > >> glad to incorporate them into this patchset.
> > >
> > > If we just follow the above idea (move them to parent's tail), we can
> > > just keep the folio's tier info untouched here.
> > >
> > > For mapped file folios, they will still be promoted upon eviction if
> > > their access bit are set (rmap walk), and MGLRU's table walker might
> > > just promote them just fine.
> > >
> > > For unmapped file folios, if we just keep their tier info and add
> > > child's MGLRU tier PID counter back to the parent. Workingset
> > > protection of MGLRU should still work just fine.
> > >
> > >>
> > >> ## Compositions
> > >>
> > >> Patches 1-8 involve code refactoring and cleanup with the aim of
> > >> facilitating the transfer LRU folios to object cgroup infrastructures.
> > >>
> > >> Patches 9-10 aim to allocate the object cgroup for non-kmem scenarios,
> > >> enabling the ability that LRU folios could be charged to it and aligning
> > >> the behavior of object-cgroup-related APIs with that of the memory cgroup.
> > >>
> > >> Patches 11-19 aim to prevent memory cgroup returned by folio_memcg() from
> > >> being released.
> > >>
> > >> Patches 20-23 aim to prevent lruvec returned by folio_lruvec() from being
> > >> released.
> > >>
> > >> Patches 24-25 implement the core mechanism to guarantee binding stability
> > >> between the folio and its corresponding memory cgroup while holding lruvec
> > >> lock or split-queue lock of THP.
> > >>
> > >> Patches 26-27 are intended to transfer the LRU pages to the object cgroup
> > >> without holding a reference to the original memory cgroup in order to
> > >> address the issue of the dying memory cgroup.
> > >>
> > >> Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to LRU maintenance helpers to
> > >> ensure correct folio operations in the future.
> > >>
> > >> ## Effect
> > >>
> > >> Finally, it can be observed that the quantity of dying memory cgroups will
> > >> not experience a significant increase if the following test script is
> > >> executed to reproduce the issue.
> > >>
> > >> ```bash
> > >> #!/bin/bash
> > >>
> > >> # Create a temporary file 'temp' filled with zero bytes
> > >> dd if=/dev/zero of=temp bs=4096 count=1
> > >>
> > >> # Display memory-cgroup info from /proc/cgroups
> > >> cat /proc/cgroups | grep memory
> > >>
> > >> for i in {0..2000}
> > >> do
> > >> mkdir /sys/fs/cgroup/memory/test$i
> > >> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
> > >>
> > >> # Append 'temp' file content to 'log'
> > >> cat temp >> log
> > >>
> > >> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> > >>
> > >> # Potentially create a dying memory cgroup
> > >> rmdir /sys/fs/cgroup/memory/test$i
> > >> done
> > >>
> > >> # Display memory-cgroup info after test
> > >> cat /proc/cgroups | grep memory
> > >>
> > >> rm -f temp log
> > >> ```
> > >>
> > >> ## References
> > >>
> > >> [1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> > >> [2] https://lwn.net/Articles/895431/
> > >> [3] https://github.com/systemd/systemd/pull/36827
> > >
> > > How much overhead will it be? Objcg has some extra overhead, and we
> > > have some extra convention for retrieving memcg of a folio now, not
> > > sure if this will have an observable slow down.
> >
> > I don't think there'll be an observable slowdown. I think objcg is
> > more effective for slab objects as they're more sensitive than user
> > pages. If it's acceptable for slab objects, it should be acceptable
> > for user pages too.
>
> We currently have some workloads running with `nokmem` due to objcg
> performance issues. I know there are efforts to improve them, but so
> far it's still not painless to have. So I'm a bit worried about
> this...
Do you mind sharing more details here?
Thanks!
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-17 21:45 ` Roman Gushchin
@ 2025-04-28 3:43 ` Kairui Song
2025-06-27 9:02 ` Chen Ridong
0 siblings, 1 reply; 69+ messages in thread
From: Kairui Song @ 2025-04-28 3:43 UTC (permalink / raw)
To: Roman Gushchin
Cc: Muchun Song, Muchun Song, hannes, mhocko, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On Fri, Apr 18, 2025 at 5:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
> >
> > We currently have some workloads running with `nokmem` due to objcg
> > performance issues. I know there are efforts to improve them, but so
> > far it's still not painless to have. So I'm a bit worried about
> > this...
>
> Do you mind sharing more details here?
>
> Thanks!
Hi,
Sorry for the late response, I was busy with another series and other work.
It's not hard to observe such a slowdown; for example, a simple redis
test can expose it:
Without nokmem:
redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
SET: 16393.44 requests per second, p50=0.055 msec
GET: 16956.34 requests per second, p50=0.055 msec
With nokmem:
redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
SET: 17263.70 requests per second, p50=0.055 msec
GET: 17410.23 requests per second, p50=0.055 msec
And I'm testing with the latest kernel:
uname -a
Linux localhost 6.15.0-rc2+ #1594 SMP PREEMPT_DYNAMIC Sun Apr 27
15:13:27 CST 2025 x86_64 GNU/Linux
This is just an example. For redis, it can be worked around by using
things like redis pipelining, but not all workloads can be adjusted
that flexibly.
And the slowdown could be amplified in some cases.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-28 3:43 ` Kairui Song
@ 2025-06-27 9:02 ` Chen Ridong
2025-06-27 18:54 ` Kairui Song
0 siblings, 1 reply; 69+ messages in thread
From: Chen Ridong @ 2025-06-27 9:02 UTC (permalink / raw)
To: Kairui Song, Roman Gushchin
Cc: Muchun Song, Muchun Song, hannes, mhocko, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On 2025/4/28 11:43, Kairui Song wrote:
> On Fri, Apr 18, 2025 at 5:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
>>>
>>> We currently have some workloads running with `nokmem` due to objcg
>>> performance issues. I know there are efforts to improve them, but so
>>> far it's still not painless to have. So I'm a bit worried about
>>> this...
>>
>> Do you mind sharing more details here?
>>
>> Thanks!
>
> Hi,
>
> Sorry for the late response, I was busy with another series and other works.
>
> It's not hard to observe such slow down, for example a simple redis
> test can expose it:
>
> Without nokmem:
> redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
> SET: 16393.44 requests per second, p50=0.055 msec
> GET: 16956.34 requests per second, p50=0.055 msec
>
> With nokmem:
> redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
> SET: 17263.70 requests per second, p50=0.055 msec
> GET: 17410.23 requests per second, p50=0.055 msec
>
> And I'm testing with latest kernel:
> uname -a
> Linux localhost 6.15.0-rc2+ #1594 SMP PREEMPT_DYNAMIC Sun Apr 27
> 15:13:27 CST 2025 x86_64 GNU/Linux
>
> This is just an example. For redis, it can be a workaround by using
> things like redis pipeline, but not all workloads can be adjusted
> that flexibly.
>
> And the slowdown could be amplified in some cases.
Hi Kairui,
We've also encountered this issue in our Redis scenario. May I confirm
whether your testing is based on cgroup v1 or v2?
In our environment using cgroup v1, we've identified memcg_account_kmem
as the critical performance bottleneck function - which, as you know, is
specific to the v1 implementation.
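For context, here is a simplified sketch of what that function roughly does
(written from memory, not quoted from any particular kernel release), which
shows why the extra cost only exists on v1: besides the vmstat update, the
legacy hierarchy also charges a separate kmem page_counter on every kmem
charge and uncharge.
static void memcg_account_kmem(struct mem_cgroup *memcg, int nr_pages)
{
	mod_memcg_state(memcg, MEMCG_KMEM, nr_pages);
	/* only the legacy (v1) hierarchy maintains a separate kmem counter */
	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
		if (nr_pages > 0)
			page_counter_charge(&memcg->kmem, nr_pages);
		else
			page_counter_uncharge(&memcg->kmem, -nr_pages);
	}
}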
Best regards,
Ridong
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-06-27 9:02 ` Chen Ridong
@ 2025-06-27 18:54 ` Kairui Song
2025-06-27 19:14 ` Shakeel Butt
0 siblings, 1 reply; 69+ messages in thread
From: Kairui Song @ 2025-06-27 18:54 UTC (permalink / raw)
To: Chen Ridong
Cc: Roman Gushchin, Muchun Song, Muchun Song, hannes, mhocko,
shakeel.butt, akpm, david, zhengqi.arch, yosry.ahmed, nphamcs,
chengming.zhou, linux-kernel, cgroups, linux-mm, hamzamahfooz,
apais, yuzhao
On Fri, Jun 27, 2025 at 5:02 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> On 2025/4/28 11:43, Kairui Song wrote:
> > On Fri, Apr 18, 2025 at 5:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >>
> >> On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
> >>>
> >>> We currently have some workloads running with `nokmem` due to objcg
> >>> performance issues. I know there are efforts to improve them, but so
> >>> far it's still not painless to have. So I'm a bit worried about
> >>> this...
> >>
> >> Do you mind sharing more details here?
> >>
> >> Thanks!
> >
> > Hi,
> >
> > Sorry for the late response, I was busy with another series and other works.
> >
> > It's not hard to observe such slow down, for example a simple redis
> > test can expose it:
> >
> > Without nokmem:
> > redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
> > SET: 16393.44 requests per second, p50=0.055 msec
> > GET: 16956.34 requests per second, p50=0.055 msec
> >
> > With nokmem:
> > redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
> > SET: 17263.70 requests per second, p50=0.055 msec
> > GET: 17410.23 requests per second, p50=0.055 msec
> >
> > And I'm testing with latest kernel:
> > uname -a
> > Linux localhost 6.15.0-rc2+ #1594 SMP PREEMPT_DYNAMIC Sun Apr 27
> > 15:13:27 CST 2025 x86_64 GNU/Linux
> >
> > This is just an example. For redis, it can be a workaround by using
> > things like redis pipeline, but not all workloads can be adjusted
> > that flexibly.
> >
> > And the slowdown could be amplified in some cases.
>
> Hi Kairui,
>
> We've also encountered this issue in our Redis scenario. May I confirm
> whether your testing is based on cgroup v1 or v2?
>
> In our environment using cgroup v1, we've identified memcg_account_kmem
> as the critical performance bottleneck function - which, as you know, is
> specific to the v1 implementation.
>
> Best regards,
> Ridong
Hi Ridong
I can confirm I was testing using cgroup v2, and I can still reproduce
it. The performance gap seems smaller with the latest upstream, but it
is still easily observable.
My previous observation is that the performance drop behaves
differently on different CPUs; my current test machine is an Intel
8255C. I'll do a more detailed performance analysis when I have time
to work on this. Thanks for the tips!
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-06-27 18:54 ` Kairui Song
@ 2025-06-27 19:14 ` Shakeel Butt
2025-06-28 9:21 ` Chen Ridong
0 siblings, 1 reply; 69+ messages in thread
From: Shakeel Butt @ 2025-06-27 19:14 UTC (permalink / raw)
To: Kairui Song
Cc: Chen Ridong, Roman Gushchin, Muchun Song, Muchun Song, hannes,
mhocko, akpm, david, zhengqi.arch, yosry.ahmed, nphamcs,
chengming.zhou, linux-kernel, cgroups, linux-mm, hamzamahfooz,
apais, yuzhao
On Sat, Jun 28, 2025 at 02:54:10AM +0800, Kairui Song wrote:
> On Fri, Jun 27, 2025 at 5:02 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> > On 2025/4/28 11:43, Kairui Song wrote:
> > > On Fri, Apr 18, 2025 at 5:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > >>
> > >> On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
> > >>>
> > >>> We currently have some workloads running with `nokmem` due to objcg
> > >>> performance issues. I know there are efforts to improve them, but so
> > >>> far it's still not painless to have. So I'm a bit worried about
> > >>> this...
> > >>
> > >> Do you mind sharing more details here?
> > >>
> > >> Thanks!
> > >
> > > Hi,
> > >
> > > Sorry for the late response, I was busy with another series and other works.
> > >
> > > It's not hard to observe such slow down, for example a simple redis
> > > test can expose it:
> > >
> > > Without nokmem:
> > > redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
> > > SET: 16393.44 requests per second, p50=0.055 msec
> > > GET: 16956.34 requests per second, p50=0.055 msec
> > >
> > > With nokmem:
> > > redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
> > > SET: 17263.70 requests per second, p50=0.055 msec
> > > GET: 17410.23 requests per second, p50=0.055 msec
> > >
> > > And I'm testing with latest kernel:
> > > uname -a
> > > Linux localhost 6.15.0-rc2+ #1594 SMP PREEMPT_DYNAMIC Sun Apr 27
> > > 15:13:27 CST 2025 x86_64 GNU/Linux
> > >
> > > This is just an example. For redis, it can be a workaround by using
> > > things like redis pipeline, but not all workloads can be adjusted
> > > that flexibly.
> > >
> > > And the slowdown could be amplified in some cases.
> >
> > Hi Kairui,
> >
> > We've also encountered this issue in our Redis scenario. May I confirm
> > whether your testing is based on cgroup v1 or v2?
> >
> > In our environment using cgroup v1, we've identified memcg_account_kmem
> > as the critical performance bottleneck function - which, as you know, is
> > specific to the v1 implementation.
> >
> > Best regards,
> > Ridong
>
> Hi Ridong
>
> I can confirm I was testing using Cgroup V2, and I can still reproduce
> it, it seems the performance gap is smaller with the latest upstream
> though, but still easily observable.
>
> My previous observation is that the performance drain behaves
> differently with different CPUs, my current test machine is an Intel
> 8255C. I'll do a more detailed performance analysis of this when I
> have time to work on this. Thanks for the tips!
Please try with the latest upstream kernel, i.e. 6.16, as the charging
code has changed a lot.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-06-27 19:14 ` Shakeel Butt
@ 2025-06-28 9:21 ` Chen Ridong
0 siblings, 0 replies; 69+ messages in thread
From: Chen Ridong @ 2025-06-28 9:21 UTC (permalink / raw)
To: Shakeel Butt, Kairui Song
Cc: Roman Gushchin, Muchun Song, Muchun Song, hannes, mhocko, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On 2025/6/28 3:14, Shakeel Butt wrote:
> On Sat, Jun 28, 2025 at 02:54:10AM +0800, Kairui Song wrote:
>> On Fri, Jun 27, 2025 at 5:02 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>>> On 2025/4/28 11:43, Kairui Song wrote:
>>>> On Fri, Apr 18, 2025 at 5:45 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>>>>
>>>>> On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
>>>>>>
>>>>>> We currently have some workloads running with `nokmem` due to objcg
>>>>>> performance issues. I know there are efforts to improve them, but so
>>>>>> far it's still not painless to have. So I'm a bit worried about
>>>>>> this...
>>>>>
>>>>> Do you mind sharing more details here?
>>>>>
>>>>> Thanks!
>>>>
>>>> Hi,
>>>>
>>>> Sorry for the late response, I was busy with another series and other works.
>>>>
>>>> It's not hard to observe such slow down, for example a simple redis
>>>> test can expose it:
>>>>
>>>> Without nokmem:
>>>> redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
>>>> SET: 16393.44 requests per second, p50=0.055 msec
>>>> GET: 16956.34 requests per second, p50=0.055 msec
>>>>
>>>> With nokmem:
>>>> redis-benchmark -h 127.0.0.1 -q -t set,get -n 80000 -c 1
>>>> SET: 17263.70 requests per second, p50=0.055 msec
>>>> GET: 17410.23 requests per second, p50=0.055 msec
>>>>
>>>> And I'm testing with latest kernel:
>>>> uname -a
>>>> Linux localhost 6.15.0-rc2+ #1594 SMP PREEMPT_DYNAMIC Sun Apr 27
>>>> 15:13:27 CST 2025 x86_64 GNU/Linux
>>>>
>>>> This is just an example. For redis, it can be a workaround by using
>>>> things like redis pipeline, but not all workloads can be adjusted
>>>> that flexibly.
>>>>
>>>> And the slowdown could be amplified in some cases.
>>>
>>> Hi Kairui,
>>>
>>> We've also encountered this issue in our Redis scenario. May I confirm
>>> whether your testing is based on cgroup v1 or v2?
>>>
>>> In our environment using cgroup v1, we've identified memcg_account_kmem
>>> as the critical performance bottleneck function - which, as you know, is
>>> specific to the v1 implementation.
>>>
>>> Best regards,
>>> Ridong
>>
>> Hi Ridong
>>
>> I can confirm I was testing using Cgroup V2, and I can still reproduce
>> it, it seems the performance gap is smaller with the latest upstream
>> though, but still easily observable.
>>
>> My previous observation is that the performance drain behaves
>> differently with different CPUs, my current test machine is an Intel
>> 8255C. I'll do a more detailed performance analysis of this when I
>> have time to work on this. Thanks for the tips!
>
> Please try with the latest upstream kernel i.e. 6.16 as the charging
> code has changed a lot.
Thanks, I will try.
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-17 18:22 ` Kairui Song
2025-04-17 19:04 ` Johannes Weiner
2025-04-17 21:45 ` Roman Gushchin
@ 2025-04-22 14:20 ` Yosry Ahmed
2 siblings, 0 replies; 69+ messages in thread
From: Yosry Ahmed @ 2025-04-22 14:20 UTC (permalink / raw)
To: Kairui Song
Cc: Muchun Song, Muchun Song, hannes, mhocko, roman.gushchin,
shakeel.butt, akpm, david, zhengqi.arch, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais, yuzhao
On Fri, Apr 18, 2025 at 02:22:12AM +0800, Kairui Song wrote:
> On Tue, Apr 15, 2025 at 4:02 PM Muchun Song <muchun.song@linux.dev> wrote:
> >
> >
> >
> > > On Apr 15, 2025, at 14:19, Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Tue, Apr 15, 2025 at 10:46 AM Muchun Song <songmuchun@bytedance.com> wrote:
> > >>
> > >> This patchset is based on v6.15-rc2. It functions correctly only when
> > >> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
> > >> during rebasing onto the latest code. For more details and assistance, refer
> > >> to the "Challenges" section. This is the reason for adding the RFC tag.
> > >>
> > >> ## Introduction
> > >>
> > >> This patchset is intended to transfer the LRU pages to the object cgroup
> > >> without holding a reference to the original memory cgroup in order to
> > >> address the issue of the dying memory cgroup. A consensus has already been
> > >> reached regarding this approach recently [1].
> > >>
> > >> ## Background
> > >>
> > >> The issue of a dying memory cgroup refers to a situation where a memory
> > >> cgroup is no longer being used by users, but memory (the metadata
> > >> associated with memory cgroups) remains allocated to it. This situation
> > >> may potentially result in memory leaks or inefficiencies in memory
> > >> reclamation and has persisted as an issue for several years. Any memory
> > >> allocation that endures longer than the lifespan (from the users'
> > >> perspective) of a memory cgroup can lead to the issue of dying memory
> > >> cgroup. We have exerted greater efforts to tackle this problem by
> > >> introducing the infrastructure of object cgroup [2].
> > >>
> > >> Presently, numerous types of objects (slab objects, non-slab kernel
> > >> allocations, per-CPU objects) are charged to the object cgroup without
> > >> holding a reference to the original memory cgroup. The final allocations
> > >> for LRU pages (anonymous pages and file pages) are charged at allocation
> > >> time and continues to hold a reference to the original memory cgroup
> > >> until reclaimed.
> > >>
> > >> File pages are more complex than anonymous pages as they can be shared
> > >> among different memory cgroups and may persist beyond the lifespan of
> > >> the memory cgroup. The long-term pinning of file pages to memory cgroups
> > >> is a widespread issue that causes recurring problems in practical
> > >> scenarios [3]. File pages remain unreclaimed for extended periods.
> > >> Additionally, they are accessed by successive instances (second, third,
> > >> fourth, etc.) of the same job, which is restarted into a new cgroup each
> > >> time. As a result, unreclaimable dying memory cgroups accumulate,
> > >> leading to memory wastage and significantly reducing the efficiency
> > >> of page reclamation.
> > >>
> > >> ## Fundamentals
> > >>
> > >> A folio will no longer pin its corresponding memory cgroup. It is necessary
> > >> to ensure that the memory cgroup or the lruvec associated with the memory
> > >> cgroup is not released when a user obtains a pointer to the memory cgroup
> > >> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> > >> to hold the RCU read lock or acquire a reference to the memory cgroup
> > >> associated with the folio to prevent its release if they are not concerned
> > >> about the binding stability between the folio and its corresponding memory
> > >> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> > >> desire a stable binding between the folio and its corresponding memory
> > >> cgroup. An approach is needed to ensure the stability of the binding while
> > >> the lruvec lock is held, and to detect the situation of holding the
> > >> incorrect lruvec lock when there is a race condition during memory cgroup
> > >> reparenting. The following four steps are taken to achieve these goals.
> > >>
> > >> 1. The first step to be taken is to identify all users of both functions
> > >> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> > >> stability and implement appropriate measures (such as holding a RCU read
> > >> lock or temporarily obtaining a reference to the memory cgroup for a
> > >> brief period) to prevent the release of the memory cgroup.
> > >>
> > >> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
> > >> how to ensure the binding stability from the user's perspective of
> > >> folio_lruvec().
> > >>
> > >> struct lruvec *folio_lruvec_lock(struct folio *folio)
> > >> {
> > >> struct lruvec *lruvec;
> > >>
> > >> rcu_read_lock();
> > >> retry:
> > >> lruvec = folio_lruvec(folio);
> > >> spin_lock(&lruvec->lru_lock);
> > >> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> > >> spin_unlock(&lruvec->lru_lock);
> > >> goto retry;
> > >> }
> > >>
> > >> return lruvec;
> > >> }
> > >>
> > >> From the perspective of memory cgroup removal, the entire reparenting
> > >> process (altering the binding relationship between folio and its memory
> > >> cgroup and moving the LRU lists to its parental memory cgroup) should be
> > >> carried out under both the lruvec lock of the memory cgroup being removed
> > >> and the lruvec lock of its parent.
> > >>
> > >> 3. Thirdly, another lock that requires the same approach is the split-queue
> > >> lock of THP.
> > >>
> > >> 4. Finally, transfer the LRU pages to the object cgroup without holding a
> > >> reference to the original memory cgroup.
> > >>
> > >
> > > Hi, Muchun, thanks for the patch.
> >
> > Thanks for your reply and attention.
> >
> > >
> > >> ## Challenges
> > >>
> > >> In a non-MGLRU scenario, each lruvec of every memory cgroup comprises four
> > >> LRU lists (i.e., two active lists for anonymous and file folios, and two
> > >> inactive lists for anonymous and file folios). Due to the symmetry of the
> > >> LRU lists, it is feasible to transfer the LRU lists from a memory cgroup
> > >> to its parent memory cgroup during the reparenting process.
> > >
> > > Symmetry of LRU lists doesn't mean symmetry 'hotness', it's totally
> > > possible that a child's active LRU is colder and should be evicted
> > > first before the parent's inactive LRU (might even be a common
> > > scenario for certain workloads).
> >
> > Yes.
> >
> > > This only affects the performance not the correctness though, so not a
> > > big problem.
> > >
> > > So will it be easier to just assume dying cgroup's folios are colder?
> > > Simply move them to parent's LRU tail is OK. This will make the logic
> > > appliable for both active/inactive LRU and MGLRU.
> >
> > I think you mean moving all child LRU list to the parent memcg's inactive
> > list. It works well for your case. But sometimes, due to shared page cache
> > pages, some pages in the child list may be accessed more frequently than
> > those in the parent's. Still, it's okay as they can be promoted quickly
> > later. So I am fine with this change.
> >
> > >
> > >>
> > >> In a MGLRU scenario, each lruvec of every memory cgroup comprises at least
> > >> 2 (MIN_NR_GENS) generations and at most 4 (MAX_NR_GENS) generations.
> > >>
> > >> 1. The first question is how to move the LRU lists from a memory cgroup to
> > >> its parent memory cgroup during the reparenting process. This is due to
> > >> the fact that the quantity of LRU lists (aka generations) may differ
> > >> between a child memory cgroup and its parent memory cgroup.
> > >>
> > >> 2. The second question is how to make the process of reparenting more
> > >> efficient, since each folio charged to a memory cgroup stores its
> > >> generation counter into its ->flags. And the generation counter may
> > >> differ between a child memory cgroup and its parent memory cgroup because
> > >> the values of ->min_seq and ->max_seq are not identical. Should those
> > >> generation counters be updated correspondingly?
> > >
> > > I think you do have to iterate through the folios to set or clear
> > > their generation flags if you want to put the folio in the right gen.
> > >
> > > MGLRU does similar thing in inc_min_seq. MGLRU uses the gen flags to
> > > defer the actual LRU movement of folios, that's a very important
> > > optimization per my test.
> >
> > I noticed that, which is why I asked the second question. It's
> > inefficient when dealing with numerous pages related to a memory
> > cgroup.
> >
> > >
> > >>
> > >> I am uncertain about how to handle them appropriately as I am not an
> > >> expert at MGLRU. I would appreciate it if you could offer some suggestions.
> > >> Moreover, if you are willing to directly provide your patches, I would be
> > >> glad to incorporate them into this patchset.
> > >
> > > If we just follow the above idea (move them to parent's tail), we can
> > > just keep the folio's tier info untouched here.
> > >
> > > For mapped file folios, they will still be promoted upon eviction if
> > > their access bit are set (rmap walk), and MGLRU's table walker might
> > > just promote them just fine.
> > >
> > > For unmapped file folios, if we just keep their tier info and add
> > > child's MGLRU tier PID counter back to the parent. Workingset
> > > protection of MGLRU should still work just fine.
> > >
> > >>
> > >> ## Compositions
> > >>
> > >> Patches 1-8 involve code refactoring and cleanup with the aim of
> > >> facilitating the transfer LRU folios to object cgroup infrastructures.
> > >>
> > >> Patches 9-10 aim to allocate the object cgroup for non-kmem scenarios,
> > >> enabling the ability that LRU folios could be charged to it and aligning
> > >> the behavior of object-cgroup-related APIs with that of the memory cgroup.
> > >>
> > >> Patches 11-19 aim to prevent memory cgroup returned by folio_memcg() from
> > >> being released.
> > >>
> > >> Patches 20-23 aim to prevent lruvec returned by folio_lruvec() from being
> > >> released.
> > >>
> > >> Patches 24-25 implement the core mechanism to guarantee binding stability
> > >> between the folio and its corresponding memory cgroup while holding lruvec
> > >> lock or split-queue lock of THP.
> > >>
> > >> Patches 26-27 are intended to transfer the LRU pages to the object cgroup
> > >> without holding a reference to the original memory cgroup in order to
> > >> address the issue of the dying memory cgroup.
> > >>
> > >> Patch 28 aims to add VM_WARN_ON_ONCE_FOLIO to LRU maintenance helpers to
> > >> ensure correct folio operations in the future.
> > >>
> > >> ## Effect
> > >>
> > >> Finally, it can be observed that the quantity of dying memory cgroups will
> > >> not experience a significant increase if the following test script is
> > >> executed to reproduce the issue.
> > >>
> > >> ```bash
> > >> #!/bin/bash
> > >>
> > >> # Create a temporary file 'temp' filled with zero bytes
> > >> dd if=/dev/zero of=temp bs=4096 count=1
> > >>
> > >> # Display memory-cgroup info from /proc/cgroups
> > >> cat /proc/cgroups | grep memory
> > >>
> > >> for i in {0..2000}
> > >> do
> > >> mkdir /sys/fs/cgroup/memory/test$i
> > >> echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
> > >>
> > >> # Append 'temp' file content to 'log'
> > >> cat temp >> log
> > >>
> > >> echo $$ > /sys/fs/cgroup/memory/cgroup.procs
> > >>
> > >> # Potentially create a dying memory cgroup
> > >> rmdir /sys/fs/cgroup/memory/test$i
> > >> done
> > >>
> > >> # Display memory-cgroup info after test
> > >> cat /proc/cgroups | grep memory
> > >>
> > >> rm -f temp log
> > >> ```
> > >>
> > >> ## References
> > >>
> > >> [1] https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
> > >> [2] https://lwn.net/Articles/895431/
> > >> [3] https://github.com/systemd/systemd/pull/36827
> > >
> > > How much overhead will it be? Objcg has some extra overhead, and we
> > > have some extra convention for retrieving memcg of a folio now, not
> > > sure if this will have an observable slow down.
> >
> > I don't think there'll be an observable slowdown. I think objcg is
> > more effective for slab objects as they're more sensitive than user
> > pages. If it's acceptable for slab objects, it should be acceptable
> > for user pages too.
It would be nice if we could get some numbers to make sure there are no
regressions in common workloads, especially those that trigger a lot of
calls to folio_memcg() and friends.
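For the common case where a caller only needs to peek at the memcg for a
short period, the new convention would presumably look something like the
sketch below (illustrative only, based on the cover letter's description
rather than the actual patches; the helper name is made up):
/* hypothetical example of the RCU-based folio_memcg() convention */
static void peek_folio_memcg(struct folio *folio)
{
	struct mem_cgroup *memcg;
	rcu_read_lock();
	memcg = folio_memcg(folio);
	/* ... short, non-sleeping use of memcg goes here ... */
	rcu_read_unlock();
}
The per-call cost is presumably just the RCU section plus one extra pointer
dereference through the objcg, so workloads dominated by such calls are the
ones worth measuring.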
>
> We currently have some workloads running with `nokmem` due to objcg
> performance issues. I know there are efforts to improve them, but so
> far it's still not painless to have. So I'm a bit worried about
> this...
>
> > >
> > > I'm still thinking if it be more feasible to just migrate (NOT that
> > > Cgroup V1 migrate, just set the folio's memcg to parent for dying
> > > cgroup and update the memcg charge) and iterate the folios on
> > > reparenting in a worker or something like that. There is already
> > > things like destruction workqueue and offline waitqueue. That way
> > > folios will still just point to a memcg, and seems would avoid a lot
> > > of complexity.
> >
> > I didn't adopt this approach for two reasons then:
> >
> > 1) It's inefficient to change `->memcg_data` to the parent when
> > iterating through all pages associated with a memory cgroup.
>
> This is a problem indeed, but isn't reparenting a rather rare
> operation? So a slow async worker might be just fine?
>
> > 2) During iteration, we might come across pages isolated by other
> > users. These pages aren't in any LRU list and will thus miss
> > being reparented to the parent memory cgroup.
>
> Hmm, such pages will have to be returned at some point, adding
> convention for isolate / return seems cleaner than adding convention
> for all folio memcg retrieving?
Apart from isolated folios, we may come across folios that are locked or
have their refs frozen by someone else. I assume we wouldn't want to
mess with those folios. Such nondeterministic behavior was the main
reason my recharging approach was turned down:
https://lore.kernel.org/lkml/20230720070825.992023-1-yosryahmed@google.com/
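To illustrate the nondeterminism: a recharge walk can only move folios it
manages to pin and lock at that instant, so whether a given folio actually
gets recharged depends on timing. A hypothetical helper (illustrative only,
not from the rejected series) would look like:
/* returns true if the folio could be pinned and locked for recharging */
static bool recharge_folio_prepare(struct folio *folio)
{
	if (!folio_try_get(folio))
		return false;	/* refcount frozen, e.g. by split/migration */
	if (!folio_trylock(folio)) {
		folio_put(folio);
		return false;	/* locked by someone else, skip it */
	}
	return true;
}
Anything skipped here keeps pinning the old memcg, which is where the
nondeterminism comes from.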
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-04-15 2:45 [PATCH RFC 00/28] Eliminate Dying Memory Cgroup Muchun Song
` (29 preceding siblings ...)
2025-04-15 6:19 ` Kairui Song
@ 2025-05-23 1:23 ` Harry Yoo
2025-05-23 2:39 ` Muchun Song
30 siblings, 1 reply; 69+ messages in thread
From: Harry Yoo @ 2025-05-23 1:23 UTC (permalink / raw)
To: Muchun Song
Cc: hannes, mhocko, roman.gushchin, shakeel.butt, muchun.song, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
On Tue, Apr 15, 2025 at 10:45:04AM +0800, Muchun Song wrote:
> This patchset is based on v6.15-rc2. It functions correctly only when
> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
> during rebasing onto the latest code. For more details and assistance, refer
> to the "Challenges" section. This is the reason for adding the RFC tag.
>
[...snip...]
> ## Fundamentals
>
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following four steps are taken to achieve these goals.
>
> 1. The first step to be taken is to identify all users of both functions
> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> stability and implement appropriate measures (such as holding a RCU read
> lock or temporarily obtaining a reference to the memory cgroup for a
> brief period) to prevent the release of the memory cgroup.
>
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
> how to ensure the binding stability from the user's perspective of
> folio_lruvec().
>
> struct lruvec *folio_lruvec_lock(struct folio *folio)
> {
> struct lruvec *lruvec;
>
> rcu_read_lock();
> retry:
> lruvec = folio_lruvec(folio);
> spin_lock(&lruvec->lru_lock);
> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> spin_unlock(&lruvec->lru_lock);
> goto retry;
> }
>
> return lruvec;
> }
Is it still required to hold the RCU read lock after binding stability
between the folio and the memcg has been established?
In the previous version of this series, folio_lruvec_lock() is implemented:
struct lruvec *folio_lruvec_lock(struct folio *folio)
{
struct lruvec *lruvec;
rcu_read_lock();
retry:
lruvec = folio_lruvec(folio);
spin_lock(&lruvec->lru_lock);
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
spin_unlock(&lruvec->lru_lock);
goto retry;
}
rcu_read_unlock();
return lruvec;
}
And then this version calls rcu_read_unlock() in lruvec_unlock(),
instead of in folio_lruvec_lock().
I wonder if this is because the memcg or objcg can be released without
rcu_read_lock(), or just to silence the warning in
folio_memcg()->obj_cgroup_memcg()->lockdep_assert_once(rcu_read_lock_is_held())?
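Presumably the pairing now looks something like this (my reading of the
description, not the actual patch):
static inline void lruvec_unlock(struct lruvec *lruvec)
{
	spin_unlock(&lruvec->lru_lock);
	rcu_read_unlock();
}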
> From the perspective of memory cgroup removal, the entire reparenting
> process (altering the binding relationship between folio and its memory
> cgroup and moving the LRU lists to its parental memory cgroup) should be
> carried out under both the lruvec lock of the memory cgroup being removed
> and the lruvec lock of its parent.
>
> 3. Thirdly, another lock that requires the same approach is the split-queue
> lock of THP.
>
> 4. Finally, transfer the LRU pages to the object cgroup without holding a
> reference to the original memory cgroup.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread
* Re: [PATCH RFC 00/28] Eliminate Dying Memory Cgroup
2025-05-23 1:23 ` Harry Yoo
@ 2025-05-23 2:39 ` Muchun Song
0 siblings, 0 replies; 69+ messages in thread
From: Muchun Song @ 2025-05-23 2:39 UTC (permalink / raw)
To: Harry Yoo
Cc: Muchun Song, hannes, mhocko, roman.gushchin, shakeel.butt, akpm,
david, zhengqi.arch, yosry.ahmed, nphamcs, chengming.zhou,
linux-kernel, cgroups, linux-mm, hamzamahfooz, apais
> On May 23, 2025, at 09:23, Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Apr 15, 2025 at 10:45:04AM +0800, Muchun Song wrote:
>> This patchset is based on v6.15-rc2. It functions correctly only when
>> CONFIG_LRU_GEN (Multi-Gen LRU) is disabled. Several issues were encountered
>> during rebasing onto the latest code. For more details and assistance, refer
>> to the "Challenges" section. This is the reason for adding the RFC tag.
>>
>
> [...snip...]
>
>> ## Fundamentals
>>
>> A folio will no longer pin its corresponding memory cgroup. It is necessary
>> to ensure that the memory cgroup or the lruvec associated with the memory
>> cgroup is not released when a user obtains a pointer to the memory cgroup
>> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
>> to hold the RCU read lock or acquire a reference to the memory cgroup
>> associated with the folio to prevent its release if they are not concerned
>> about the binding stability between the folio and its corresponding memory
>> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
>> desire a stable binding between the folio and its corresponding memory
>> cgroup. An approach is needed to ensure the stability of the binding while
>> the lruvec lock is held, and to detect the situation of holding the
>> incorrect lruvec lock when there is a race condition during memory cgroup
>> reparenting. The following four steps are taken to achieve these goals.
>>
>> 1. The first step to be taken is to identify all users of both functions
>> (folio_memcg() and folio_lruvec()) who are not concerned about binding
>> stability and implement appropriate measures (such as holding a RCU read
>> lock or temporarily obtaining a reference to the memory cgroup for a
>> brief period) to prevent the release of the memory cgroup.
>>
>> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>> how to ensure the binding stability from the user's perspective of
>> folio_lruvec().
>>
>> struct lruvec *folio_lruvec_lock(struct folio *folio)
>> {
>> struct lruvec *lruvec;
>>
>> rcu_read_lock();
>> retry:
>> lruvec = folio_lruvec(folio);
>> spin_lock(&lruvec->lru_lock);
>> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>> spin_unlock(&lruvec->lru_lock);
>> goto retry;
>> }
>>
>> return lruvec;
>> }
>
> Is it still required to hold the RCU read lock after binding stability
> between the folio and the memcg has been established?
No. The spin lock is enough. The reason is the lock assertion
introduced in commit:
02f4bbefcada ("mm: kmem: add lockdep assertion to obj_cgroup_memcg")
The user may unintentionally call obj_cgroup_memcg() while holding the
lruvec lock; if we do not hold the RCU read lock, obj_cgroup_memcg()
will complain about it.
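Roughly, the assertion looks like this (simplified; the exact condition is
in the commit above):
static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
{
	/*
	 * The caller must be in an RCU read-side section (or otherwise
	 * pin the memcg), because objcg->memcg can be switched to the
	 * parent during reparenting and then freed.
	 */
	lockdep_assert_once(rcu_read_lock_held());
	return READ_ONCE(objcg->memcg);
}
So keeping rcu_read_lock() held across the lruvec lock section avoids
spurious warnings from callers that look up the memcg while holding the
lock.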
>
> In the previous version of this series, folio_lruvec_lock() is implemented:
>
> struct lruvec *folio_lruvec_lock(struct folio *folio)
> {
> struct lruvec *lruvec;
>
> rcu_read_lock();
> retry:
> lruvec = folio_lruvec(folio);
> spin_lock(&lruvec->lru_lock);
>
> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> spin_unlock(&lruvec->lru_lock);
> goto retry;
> }
> rcu_read_unlock();
>
> return lruvec;
> }
>
> And then this version calls rcu_read_unlock() in lruvec_unlock(),
> instead of folio_lruvec_lock().
>
> I wonder if this is because the memcg or objcg can be released without
> rcu_read_lock(), or just to silence the warning in
> folio_memcg()->obj_cgroup_memcg()->lockdep_assert_once(rcu_read_lock_is_held())?
The latter is right.
Muchun,
Thanks.
>
>> From the perspective of memory cgroup removal, the entire reparenting
>> process (altering the binding relationship between folio and its memory
>> cgroup and moving the LRU lists to its parental memory cgroup) should be
>> carried out under both the lruvec lock of the memory cgroup being removed
>> and the lruvec lock of its parent.
>>
>> 3. Thirdly, another lock that requires the same approach is the split-queue
>> lock of THP.
>>
>> 4. Finally, transfer the LRU pages to the object cgroup without holding a
>> reference to the original memory cgroup.
>
> --
> Cheers,
> Harry / Hyeonggon
^ permalink raw reply [flat|nested] 69+ messages in thread