* [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
The non-hierarchical mode has been deprecated since commit bef8620cd8e0
("mm: memcg: deprecate the non-hierarchical mode"). As a result,
parent_mem_cgroup() will not return NULL except when passed the root
memcg, and the root memcg cannot be offlined. Hence, it is safe to remove
the check on the return value of parent_mem_cgroup(). Remove the
corresponding dead code.
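For reference, parent_mem_cgroup() is roughly the following thin wrapper
(a sketch, not a verbatim copy of the header), which only returns NULL for
the root memcg, whose css has no parent:

    static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
    {
            /* only the root memcg has css.parent == NULL */
            return mem_cgroup_from_css(memcg->css.parent);
    }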
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memcontrol.c | 5 -----
mm/shrinker.c | 6 +-----
2 files changed, 1 insertion(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 93f7c76f0ce96..d5257465c9d75 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3339,9 +3339,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
return;
parent = parent_mem_cgroup(memcg);
- if (!parent)
- parent = root_mem_cgroup;
-
memcg_reparent_list_lrus(memcg, parent);
/*
@@ -3632,8 +3629,6 @@ struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
break;
}
memcg = parent_mem_cgroup(memcg);
- if (!memcg)
- memcg = root_mem_cgroup;
}
return memcg;
}
diff --git a/mm/shrinker.c b/mm/shrinker.c
index 4a93fd433689a..e8e092a2f7f41 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -286,14 +286,10 @@ void reparent_shrinker_deferred(struct mem_cgroup *memcg)
{
int nid, index, offset;
long nr;
- struct mem_cgroup *parent;
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct shrinker_info *child_info, *parent_info;
struct shrinker_info_unit *child_unit, *parent_unit;
- parent = parent_mem_cgroup(memcg);
- if (!parent)
- parent = root_mem_cgroup;
-
/* Prevent from concurrent shrinker_info expand */
mutex_lock(&shrinker_mutex);
for_each_node(nid) {
--
2.20.1
* Re: [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
From: Harry Yoo @ 2025-11-07 1:40 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng
On Tue, Oct 28, 2025 at 09:58:14PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
>
> Since the no-hierarchy mode has been deprecated after the commit:
>
> commit bef8620cd8e0 ("mm: memcg: deprecate the non-hierarchical mode").
>
> As a result, parent_mem_cgroup() will not return NULL except when passing
> the root memcg, and the root memcg cannot be offline. Hence, it's safe to
> remove the check on the returned value of parent_mem_cgroup(). Remove the
> corresponding dead code.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
Use folio_lruvec() to simplify the code.
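For reference, folio_lruvec() is roughly the following helper (a sketch,
not a verbatim copy), which already combines the three calls removed
below:

    static inline struct lruvec *folio_lruvec(struct folio *folio)
    {
            return mem_cgroup_lruvec(folio_memcg(folio), folio_pgdat(folio));
    }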
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/workingset.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 68a76a91111f4..8cad8ee6dec6a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -534,8 +534,6 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
void workingset_refault(struct folio *folio, void *shadow)
{
bool file = folio_is_file_lru(folio);
- struct pglist_data *pgdat;
- struct mem_cgroup *memcg;
struct lruvec *lruvec;
bool workingset;
long nr;
@@ -557,10 +555,7 @@ void workingset_refault(struct folio *folio, void *shadow)
* locked to guarantee folio_memcg() stability throughout.
*/
nr = folio_nr_pages(folio);
- memcg = folio_memcg(folio);
- pgdat = folio_pgdat(folio);
- lruvec = mem_cgroup_lruvec(memcg, pgdat);
-
+ lruvec = folio_lruvec(folio);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
--
2.20.1
* Re: [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
From: Harry Yoo @ 2025-11-07 1:55 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng
On Tue, Oct 28, 2025 at 09:58:15PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
>
> Use folio_lruvec() to simplify the code.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
It is inappropriate to use folio_lruvec_lock() variants in conjunction
with unlock_page_lruvec() variants, as this mixes locking a folio with
unlocking a page. To rectify this, the functions
unlock_page_lruvec{_irq,_irqrestore} are renamed to
lruvec_unlock{_irq,_irqrestore}.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/memcontrol.h | 10 +++++-----
mm/compaction.c | 14 +++++++-------
mm/huge_memory.c | 2 +-
mm/mlock.c | 2 +-
mm/swap.c | 12 ++++++------
mm/vmscan.c | 4 ++--
6 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8d2e250535a8a..6185d8399a54e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1493,17 +1493,17 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
return mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec));
}
-static inline void unlock_page_lruvec(struct lruvec *lruvec)
+static inline void lruvec_unlock(struct lruvec *lruvec)
{
spin_unlock(&lruvec->lru_lock);
}
-static inline void unlock_page_lruvec_irq(struct lruvec *lruvec)
+static inline void lruvec_unlock_irq(struct lruvec *lruvec)
{
spin_unlock_irq(&lruvec->lru_lock);
}
-static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
+static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
unsigned long flags)
{
spin_unlock_irqrestore(&lruvec->lru_lock, flags);
@@ -1525,7 +1525,7 @@ static inline struct lruvec *folio_lruvec_relock_irq(struct folio *folio,
if (folio_matches_lruvec(folio, locked_lruvec))
return locked_lruvec;
- unlock_page_lruvec_irq(locked_lruvec);
+ lruvec_unlock_irq(locked_lruvec);
}
return folio_lruvec_lock_irq(folio);
@@ -1539,7 +1539,7 @@ static inline void folio_lruvec_relock_irqsave(struct folio *folio,
if (folio_matches_lruvec(folio, *lruvecp))
return;
- unlock_page_lruvec_irqrestore(*lruvecp, *flags);
+ lruvec_unlock_irqrestore(*lruvecp, *flags);
}
*lruvecp = folio_lruvec_lock_irqsave(folio, flags);
diff --git a/mm/compaction.c b/mm/compaction.c
index 8760d10bd0b32..4dce180f699b4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -913,7 +913,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (!(low_pfn % COMPACT_CLUSTER_MAX)) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -964,7 +964,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
}
/* for alloc_contig case */
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -1053,7 +1053,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (unlikely(page_has_movable_ops(page)) &&
!PageMovableOpsIsolated(page)) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
@@ -1158,7 +1158,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
/* If we already hold the lock, we can skip some rechecking */
if (lruvec != locked) {
if (locked)
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
locked = lruvec;
@@ -1226,7 +1226,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_fail_put:
/* Avoid potential deadlock in freeing page under lru_lock */
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
folio_put(folio);
@@ -1242,7 +1242,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
*/
if (nr_isolated) {
if (locked) {
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
locked = NULL;
}
putback_movable_pages(&cc->migratepages);
@@ -1274,7 +1274,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
isolate_abort:
if (locked)
- unlock_page_lruvec_irqrestore(locked, flags);
+ lruvec_unlock_irqrestore(locked, flags);
if (folio) {
folio_set_lru(folio);
folio_put(folio);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0a826b6e6aa7f..9d3594df6eedf 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4014,7 +4014,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
expected_refs = folio_expected_ref_count(folio) + 1;
folio_ref_unfreeze(folio, expected_refs);
- unlock_page_lruvec(lruvec);
+ lruvec_unlock(lruvec);
if (ci)
swap_cluster_unlock(ci);
diff --git a/mm/mlock.c b/mm/mlock.c
index bb0776f5ef7ca..5a81de8dd4875 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -205,7 +205,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch)
}
if (lruvec)
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
folios_put(fbatch);
}
diff --git a/mm/swap.c b/mm/swap.c
index 2260dcd2775e7..ec0c654e128dc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -91,7 +91,7 @@ static void page_cache_release(struct folio *folio)
__page_cache_release(folio, &lruvec, &flags);
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
}
void __folio_put(struct folio *folio)
@@ -175,7 +175,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
folios_put(fbatch);
}
@@ -349,7 +349,7 @@ void folio_activate(struct folio *folio)
lruvec = folio_lruvec_lock_irq(folio);
lru_activate(lruvec, folio);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
folio_set_lru(folio);
}
#endif
@@ -963,7 +963,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
if (folio_is_zone_device(folio)) {
if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
lruvec = NULL;
}
if (folio_ref_sub_and_test(folio, nr_refs))
@@ -977,7 +977,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
/* hugetlb has its own memcg */
if (folio_test_hugetlb(folio)) {
if (lruvec) {
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
lruvec = NULL;
}
free_huge_folio(folio);
@@ -991,7 +991,7 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
j++;
}
if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
+ lruvec_unlock_irqrestore(lruvec, flags);
if (!j) {
folio_batch_reinit(folios);
return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c922bad2b8fd4..3a1044ce30f1e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1829,7 +1829,7 @@ bool folio_isolate_lru(struct folio *folio)
folio_get(folio);
lruvec = folio_lruvec_lock_irq(folio);
lruvec_del_folio(lruvec, folio);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
ret = true;
}
@@ -7849,7 +7849,7 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
if (lruvec) {
__count_vm_events(UNEVICTABLE_PGRESCUED, pgrescued);
__count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
- unlock_page_lruvec_irq(lruvec);
+ lruvec_unlock_irq(lruvec);
} else if (pgscanned) {
count_vm_events(UNEVICTABLE_PGSCANNED, pgscanned);
}
--
2.20.1
* Re: [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
From: Harry Yoo @ 2025-11-07 2:03 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng
On Tue, Oct 28, 2025 at 09:58:16PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
>
> It is inappropriate to use folio_lruvec_lock() variants in conjunction
> with unlock_page_lruvec() variants, as this involves the inconsistent
> operation of locking a folio while unlocking a page. To rectify this, the
> functions unlock_page_lruvec{_irq, _irqrestore} are renamed to
> lruvec_unlock{_irq,_irqrestore}.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
--
Cheers,
Harry / Hyeonggon
* [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In a subsequent patch, we'll reparent the LRU folios. The folios that are
moved to the appropriate LRU list can undergo reparenting during the
move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
a lruvec lock. Instead, we should utilize the more general interface of
folio_lruvec_relock_irq() to obtain the correct lruvec lock.
This patch involves only code refactoring and doesn't introduce any
functional changes.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a1044ce30f1e..660cd40cfddd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1883,24 +1883,27 @@ static bool too_many_isolated(struct pglist_data *pgdat, int file,
/*
* move_folios_to_lru() moves folios from private @list to appropriate LRU list.
*
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate lruvec.
+ *
+ * Note: The caller must not hold any lruvec lock.
*/
-static unsigned int move_folios_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+static unsigned int move_folios_to_lru(struct list_head *list)
{
int nr_pages, nr_moved = 0;
+ struct lruvec *lruvec = NULL;
struct folio_batch free_folios;
folio_batch_init(&free_folios);
while (!list_empty(list)) {
struct folio *folio = lru_to_folio(list);
+ lruvec = folio_lruvec_relock_irq(folio, lruvec);
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
list_del(&folio->lru);
if (unlikely(!folio_evictable(folio))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
folio_putback_lru(folio);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
continue;
}
@@ -1922,19 +1925,15 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
folio_unqueue_deferred_split(folio);
if (folio_batch_add(&free_folios, folio) == 0) {
- spin_unlock_irq(&lruvec->lru_lock);
+ lruvec_unlock_irq(lruvec);
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
}
continue;
}
- /*
- * All pages were isolated from the same lruvec (and isolation
- * inhibits memcg migration).
- */
VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
@@ -1943,11 +1942,12 @@ static unsigned int move_folios_to_lru(struct lruvec *lruvec,
workingset_age_nonresident(lruvec, nr_pages);
}
+ if (lruvec)
+ lruvec_unlock_irq(lruvec);
+
if (free_folios.nr) {
- spin_unlock_irq(&lruvec->lru_lock);
mem_cgroup_uncharge_folios(&free_folios);
free_unref_folios(&free_folios);
- spin_lock_irq(&lruvec->lru_lock);
}
return nr_moved;
@@ -2016,9 +2016,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
lruvec_memcg(lruvec));
- spin_lock_irq(&lruvec->lru_lock);
- move_folios_to_lru(lruvec, &folio_list);
+ move_folios_to_lru(&folio_list);
+ spin_lock_irq(&lruvec->lru_lock);
__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
stat.nr_demoted);
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
@@ -2166,11 +2166,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move folios back to the lru list.
*/
- spin_lock_irq(&lruvec->lru_lock);
-
- nr_activate = move_folios_to_lru(lruvec, &l_active);
- nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
+ nr_activate = move_folios_to_lru(&l_active);
+ nr_deactivate = move_folios_to_lru(&l_inactive);
+ spin_lock_irq(&lruvec->lru_lock);
__count_vm_events(PGDEACTIVATE, nr_deactivate);
count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
@@ -4735,14 +4734,15 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
}
- spin_lock_irq(&lruvec->lru_lock);
-
- move_folios_to_lru(lruvec, &list);
+ move_folios_to_lru(&list);
+ local_irq_disable();
walk = current->reclaim_state->mm_walk;
if (walk && walk->batched) {
walk->lruvec = lruvec;
+ spin_lock(&lruvec->lru_lock);
reset_batch_size(walk);
+ spin_unlock(&lruvec->lru_lock);
}
__mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
@@ -4754,7 +4754,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
count_memcg_events(memcg, item, reclaimed);
__count_vm_events(PGSTEAL_ANON + type, reclaimed);
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();
list_splice_init(&clean, &list);
--
2.20.1
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Harry Yoo @ 2025-11-07 5:11 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Tue, Oct 28, 2025 at 09:58:17PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
>
> In a subsequent patch, we'll reparent the LRU folios. The folios that are
> moved to the appropriate LRU list can undergo reparenting during the
> move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
> a lruvec lock. Instead, we should utilize the more general interface of
> folio_lruvec_relock_irq() to obtain the correct lruvec lock.
>
> This patch involves only code refactoring and doesn't introduce any
> functional changes.
>
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
> 1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3a1044ce30f1e..660cd40cfddd4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2016,9 +2016,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
> lruvec_memcg(lruvec));
>
> - spin_lock_irq(&lruvec->lru_lock);
> - move_folios_to_lru(lruvec, &folio_list);
> + move_folios_to_lru(&folio_list);
>
> + spin_lock_irq(&lruvec->lru_lock);
> __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
> stat.nr_demoted);
Maybe I'm missing something or just confused for now, but let me ask...
How do we make sure the lruvec (and the mem_cgroup containing the
lruvec) did not disappear (due to offlining) after move_folios_to_lru()?
> __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> @@ -2166,11 +2166,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
> /*
> * Move folios back to the lru list.
> */
> - spin_lock_irq(&lruvec->lru_lock);
> -
> - nr_activate = move_folios_to_lru(lruvec, &l_active);
> - nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
> + nr_activate = move_folios_to_lru(&l_active);
> + nr_deactivate = move_folios_to_lru(&l_inactive);
>
> + spin_lock_irq(&lruvec->lru_lock);
> __count_vm_events(PGDEACTIVATE, nr_deactivate);
> count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
>
> @@ -4735,14 +4734,15 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
> }
>
> - spin_lock_irq(&lruvec->lru_lock);
> -
> - move_folios_to_lru(lruvec, &list);
> + move_folios_to_lru(&list);
>
> + local_irq_disable();
> walk = current->reclaim_state->mm_walk;
> if (walk && walk->batched) {
> walk->lruvec = lruvec;
> + spin_lock(&lruvec->lru_lock);
> reset_batch_size(walk);
> + spin_unlock(&lruvec->lru_lock);
> }
Cc'ing RT folks as they may not want to disable IRQs on PREEMPT_RT.
IIRC there has been some effort in MM to reduce the scope of IRQ-disabled
sections when the PREEMPT_RT config was added to the mainline.
spin_lock_irq() doesn't disable IRQs on PREEMPT_RT.
Also, this will break RT according to Documentation/locking/locktypes.rst:
> The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
> have a few implications. For example, on a non-PREEMPT_RT kernel
> the following code sequence works as expected:
>
> local_irq_disable();
> spin_lock(&lock);
>
> and is fully equivalent to:
>
> spin_lock_irq(&lock);
> Same applies to rwlock_t and the _irqsave() suffix variants.
>
> On PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
> a fully preemptible context. Instead, use spin_lock_irq() or
> spin_lock_irqsave() and their unlock counterparts.
>
> In cases where the interrupt disabling and locking must remain separate,
> PREEMPT_RT offers a local_lock mechanism. Acquiring the local_lock pins
> the task to a CPU, allowing things like per-CPU interrupt disabled locks
> to be acquired. However, this approach should be used only where absolutely
> necessary.
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Qi Zheng @ 2025-11-07 6:41 UTC (permalink / raw)
To: Harry Yoo
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
Hi Harry,
On 11/7/25 1:11 PM, Harry Yoo wrote:
> On Tue, Oct 28, 2025 at 09:58:17PM +0800, Qi Zheng wrote:
>> From: Muchun Song <songmuchun@bytedance.com>
>>
>> In a subsequent patch, we'll reparent the LRU folios. The folios that are
>> moved to the appropriate LRU list can undergo reparenting during the
>> move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
>> a lruvec lock. Instead, we should utilize the more general interface of
>> folio_lruvec_relock_irq() to obtain the correct lruvec lock.
>>
>> This patch involves only code refactoring and doesn't introduce any
>> functional changes.
>>
>> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
>> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
>> 1 file changed, 23 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 3a1044ce30f1e..660cd40cfddd4 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2016,9 +2016,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>> nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
>> lruvec_memcg(lruvec));
>>
>> - spin_lock_irq(&lruvec->lru_lock);
>> - move_folios_to_lru(lruvec, &folio_list);
>> + move_folios_to_lru(&folio_list);
>>
>> + spin_lock_irq(&lruvec->lru_lock);
>> __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
>> stat.nr_demoted);
>
> Maybe I'm missing something or just confused for now, but let me ask...
>
> How do we make sure the lruvec (and the mem_cgroup containing the
> lruvec) did not disappear (due to offlining) after move_folios_to_lru()?
We obtain the lruvec as follows:
memcg = mem_cgroup_iter(target_memcg, NULL, partial);
do {
struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
shrink_lruvec(lruvec, sc);
--> shrink_inactive_list
} while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial)));
mem_cgroup_iter() holds a reference to this memcg, so IIUC the memcg
will not disappear at this point.
>
>> __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
>> @@ -2166,11 +2166,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
>> /*
>> * Move folios back to the lru list.
>> */
>> - spin_lock_irq(&lruvec->lru_lock);
>> -
>> - nr_activate = move_folios_to_lru(lruvec, &l_active);
>> - nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
>> + nr_activate = move_folios_to_lru(&l_active);
>> + nr_deactivate = move_folios_to_lru(&l_inactive);
>>
>> + spin_lock_irq(&lruvec->lru_lock);
>> __count_vm_events(PGDEACTIVATE, nr_deactivate);
>> count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
>>
>> @@ -4735,14 +4734,15 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>> set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
>> }
>>
>> - spin_lock_irq(&lruvec->lru_lock);
>> -
>> - move_folios_to_lru(lruvec, &list);
>> + move_folios_to_lru(&list);
>>
>> + local_irq_disable();
>> walk = current->reclaim_state->mm_walk;
>> if (walk && walk->batched) {
>> walk->lruvec = lruvec;
>> + spin_lock(&lruvec->lru_lock);
>> reset_batch_size(walk);
>> + spin_unlock(&lruvec->lru_lock);
>> }
>
> Cc'ing RT folks as they may not want to disable IRQ on PREEMPT_RT.
>
> IIRC there has been some effort in MM to reduce the scope of
> IRQ-disabled section in MM when PREEMPT_RT config was added to the
> mainline. spin_lock_irq() doesn't disable IRQ on PREEMPT_RT.
Thanks for this information.
>
> Also, this will break RT according to Documentation/locking/locktypes.rst:
>> The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
>> have a few implications. For example, on a non-PREEMPT_RT kernel
>> the following code sequence works as expected:
>>
>> local_irq_disable();
>> spin_lock(&lock);
>>
>> and is fully equivalent to:
>>
>> spin_lock_irq(&lock);
>> Same applies to rwlock_t and the _irqsave() suffix variants.
>>
>> On PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
>> a fully preemptible context. Instead, use spin_lock_irq() or
>> spin_lock_irqsave() and their unlock counterparts.
>>
>> In cases where the interrupt disabling and locking must remain separate,
>> PREEMPT_RT offers a local_lock mechanism. Acquiring the local_lock pins
>> the task to a CPU, allowing things like per-CPU interrupt disabled locks
>> to be acquired. However, this approach should be used only where absolutely
>> necessary.
But how do we determine if it's necessary?
Thanks,
Qi
>
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Harry Yoo @ 2025-11-07 13:20 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Fri, Nov 07, 2025 at 02:41:13PM +0800, Qi Zheng wrote:
> Hi Harry,
>
> On 11/7/25 1:11 PM, Harry Yoo wrote:
> > On Tue, Oct 28, 2025 at 09:58:17PM +0800, Qi Zheng wrote:
> > > From: Muchun Song <songmuchun@bytedance.com>
> > >
> > > In a subsequent patch, we'll reparent the LRU folios. The folios that are
> > > moved to the appropriate LRU list can undergo reparenting during the
> > > move_folios_to_lru() process. Hence, it's incorrect for the caller to hold
> > > a lruvec lock. Instead, we should utilize the more general interface of
> > > folio_lruvec_relock_irq() to obtain the correct lruvec lock.
> > >
> > > This patch involves only code refactoring and doesn't introduce any
> > > functional changes.
> > >
> > > Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> > > Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> > > Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> > > ---
> > > mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
> > > 1 file changed, 23 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 3a1044ce30f1e..660cd40cfddd4 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2016,9 +2016,9 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > > nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
> > > lruvec_memcg(lruvec));
> > > - spin_lock_irq(&lruvec->lru_lock);
> > > - move_folios_to_lru(lruvec, &folio_list);
> > > + move_folios_to_lru(&folio_list);
> > > + spin_lock_irq(&lruvec->lru_lock);
> > > __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
> > > stat.nr_demoted);
> >
> > Maybe I'm missing something or just confused for now, but let me ask...
> >
> > How do we make sure the lruvec (and the mem_cgroup containing the
> > lruvec) did not disappear (due to offlining) after move_folios_to_lru()?
>
> We obtained lruvec through the following method:
>
> memcg = mem_cgroup_iter(target_memcg, NULL, partial);
> do {
> struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> shrink_lruvec(lruvec, sc);
> --> shrink_inactive_list
> } while ((memcg = mem_cgroup_iter(target_memcg, memcg, partial)));
>
> The mem_cgroup_iter() will hold the refcount of this memcg, so IIUC,
> the memcg will not disappear at this time.
Ah, right!
It can be offlined, but won't be released due to the refcount.
Thanks for the explanation.
> > > __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
> > > @@ -2166,11 +2166,10 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > > /*
> > > * Move folios back to the lru list.
> > > */
> > > - spin_lock_irq(&lruvec->lru_lock);
> > > -
> > > - nr_activate = move_folios_to_lru(lruvec, &l_active);
> > > - nr_deactivate = move_folios_to_lru(lruvec, &l_inactive);
> > > + nr_activate = move_folios_to_lru(&l_active);
> > > + nr_deactivate = move_folios_to_lru(&l_inactive);
> > > + spin_lock_irq(&lruvec->lru_lock);
> > > __count_vm_events(PGDEACTIVATE, nr_deactivate);
> > > count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
> > > @@ -4735,14 +4734,15 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > > set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
> > > }
> > > - spin_lock_irq(&lruvec->lru_lock);
> > > -
> > > - move_folios_to_lru(lruvec, &list);
> > > + move_folios_to_lru(&list);
> > > + local_irq_disable();
> > > walk = current->reclaim_state->mm_walk;
> > > if (walk && walk->batched) {
> > > walk->lruvec = lruvec;
> > > + spin_lock(&lruvec->lru_lock);
> > > reset_batch_size(walk);
> > > + spin_unlock(&lruvec->lru_lock);
> > > }
> >
> > Cc'ing RT folks as they may not want to disable IRQ on PREEMPT_RT.
> >
> > IIRC there has been some effort in MM to reduce the scope of
> > IRQ-disabled section in MM when PREEMPT_RT config was added to the
> > mainline. spin_lock_irq() doesn't disable IRQ on PREEMPT_RT.
>
> Thanks for this information.
>
> >
> > Also, this will break RT according to Documentation/locking/locktypes.rst:
> > > The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
> > > have a few implications. For example, on a non-PREEMPT_RT kernel
> > > the following code sequence works as expected:
> > >
> > > local_irq_disable();
> > > spin_lock(&lock);
> > >
> > > and is fully equivalent to:
> > >
> > > spin_lock_irq(&lock);
> > > Same applies to rwlock_t and the _irqsave() suffix variants.
> > >
> > > On PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
> > > a fully preemptible context. Instead, use spin_lock_irq() or
> > > spin_lock_irqsave() and their unlock counterparts.
> > >
> > > In cases where the interrupt disabling and locking must remain separate,
> > > PREEMPT_RT offers a local_lock mechanism. Acquiring the local_lock pins
> > > the task to a CPU, allowing things like per-CPU interrupt disabled locks
> > > to be acquired. However, this approach should be used only where absolutely
> > > necessary.
>
> But how do we determine if it's necessary?
Although it's mentioned in the locking documentation, I'm afraid that
local_lock is not the right interface to use here. Preemption will be
disabled anyway (on both PREEMPT_RT and !PREEMPT_RT) when the stats are
updated (in __mod_node_page_state()).
Here we just want to disable IRQ only on !PREEMPT_RT (to update
the stats safely).
Maybe we'll need some ifdeffery? I don't see a better way around...
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Shakeel Butt @ 2025-11-08 6:32 UTC (permalink / raw)
To: Harry Yoo
Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Fri, Nov 07, 2025 at 10:20:57PM +0900, Harry Yoo wrote:
>
> Although it's mentioned in the locking documentation, I'm afraid that
> local_lock is not the right interface to use here. Preemption will be
> disabled anyway (on both PREEMPT_RT and !PREEMPT_RT) when the stats are
> updated (in __mod_node_page_state()).
>
> Here we just want to disable IRQ only on !PREEMPT_RT (to update
> the stats safely).
I don't think there is a need to disable IRQs. There are three stats
update functions called in that hunk.
1) __mod_lruvec_state
2) __count_vm_events
3) count_memcg_events
count_memcg_events() can be called with IRQs enabled. __count_vm_events()
can be replaced with count_vm_events(). For __mod_lruvec_state(), the
__mod_node_page_state() inside needs preemption disabled.

The easy way would be to just disable/enable preemption instead of IRQs.
Otherwise, go with a more fine-grained approach, i.e. replace
__count_vm_events() with count_vm_events() and just disable preemption
across __mod_node_page_state().
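A minimal sketch of the easy way above, applied to the evict_folios()
tail quoted earlier; the preempt_disable()/preempt_enable() placement is
an assumption for illustration, not part of the posted patch:

    move_folios_to_lru(&list);

    walk = current->reclaim_state->mm_walk;
    if (walk && walk->batched) {
            walk->lruvec = lruvec;
            spin_lock_irq(&lruvec->lru_lock);
            reset_batch_size(walk);
            spin_unlock_irq(&lruvec->lru_lock);
    }

    /* preemption (not IRQs) disabled around the node counter update */
    preempt_disable();
    __mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
                       stat.nr_demoted);
    preempt_enable();

    count_memcg_events(memcg, item, reclaimed);      /* already IRQ-safe */
    count_vm_events(PGSTEAL_ANON + type, reclaimed); /* was __count_vm_events() */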
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Harry Yoo @ 2025-11-10 2:13 UTC (permalink / raw)
To: Shakeel Butt
Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Fri, Nov 07, 2025 at 10:32:52PM -0800, Shakeel Butt wrote:
> On Fri, Nov 07, 2025 at 10:20:57PM +0900, Harry Yoo wrote:
> >
> > Although it's mentioned in the locking documentation, I'm afraid that
> > local_lock is not the right interface to use here. Preemption will be
> > disabled anyway (on both PREEMPT_RT and !PREEMPT_RT) when the stats are
> > updated (in __mod_node_page_state()).
> >
> > Here we just want to disable IRQ only on !PREEMPT_RT (to update
> > the stats safely).
>
> I don't think there is a need to disable IRQs. There are three stats
> update functions called in that hunk.
>
> 1) __mod_lruvec_state
> 2) __count_vm_events
> 3) count_memcg_events
>
> count_memcg_events() can be called with IRQs. __count_vm_events can be
> replaced with count_vm_events.
Right.
> For __mod_lruvec_state, the
> __mod_node_page_state() inside needs preemption disabled.
The function __mod_node_page_state() disables preemption.
And there's a comment in __mod_zone_page_state():
> /*
> * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
> * atomicity is provided by IRQs being disabled -- either explicitly
> * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
> * CPU migrations and preemption potentially corrupts a counter so
> * disable preemption.
> */
> preempt_disable_nested();
We're relying on IRQs being disabled on !PREEMPT_RT.
Maybe we could make it safe against re-entrant IRQ handlers by using
read-modify-write operations?
--
Cheers,
Harry / Hyeonggon
> Easy way would be to just disable/enable preemption instead of IRQs.
> Otherwise go a bit more fine-grained approach i.e. replace
> __count_vm_events with count_vm_events and just disable preemption
> across __mod_node_page_state().
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Qi Zheng @ 2025-11-10 4:30 UTC (permalink / raw)
To: Harry Yoo, Shakeel Butt
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On 11/10/25 10:13 AM, Harry Yoo wrote:
> On Fri, Nov 07, 2025 at 10:32:52PM -0800, Shakeel Butt wrote:
>> On Fri, Nov 07, 2025 at 10:20:57PM +0900, Harry Yoo wrote:
>>>
>>> Although it's mentioned in the locking documentation, I'm afraid that
>>> local_lock is not the right interface to use here. Preemption will be
>>> disabled anyway (on both PREEMPT_RT and !PREEMPT_RT) when the stats are
>>> updated (in __mod_node_page_state()).
>>>
>>> Here we just want to disable IRQ only on !PREEMPT_RT (to update
>>> the stats safely).
>>
>> I don't think there is a need to disable IRQs. There are three stats
>> update functions called in that hunk.
>>
>> 1) __mod_lruvec_state
>> 2) __count_vm_events
>> 3) count_memcg_events
>>
>> count_memcg_events() can be called with IRQs. __count_vm_events can be
>> replaced with count_vm_events.
>
> Right.
>
>> For __mod_lruvec_state, the
>> __mod_node_page_state() inside needs preemption disabled.
>
> The function __mod_node_page_state() disables preemption.
> And there's a comment in __mod_zone_page_state():
>
>> /*
>> * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
>> * atomicity is provided by IRQs being disabled -- either explicitly
>> * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
>> * CPU migrations and preemption potentially corrupts a counter so
>> * disable preemption.
>> */
>> preempt_disable_nested();
>
> We're relying on IRQs being disabled on !PREEMPT_RT.
So it's possible for us to update vmstat within an interrupt context,
right?
There is also a comment above __mod_zone_page_state():
/*
* For use when we know that interrupts are disabled,
* or when we know that preemption is disabled and that
* particular counter cannot be updated from interrupt context.
*/
BTW, the comment inside __mod_node_page_state() should be:
/* See __mod_zone_page_state */
instead of
/* See __mod_node_page_state */
Will fix it.
>
> Maybe we could make it safe against re-entrant IRQ handlers by using
> read-modify-write operations?
Isn't it because of the RMW operation that we need to disable IRQs to
guarantee atomicity? Or have I misunderstood something?
>
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Harry Yoo @ 2025-11-10 5:43 UTC (permalink / raw)
To: Qi Zheng
Cc: Shakeel Butt, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
>
>
> On 11/10/25 10:13 AM, Harry Yoo wrote:
> > On Fri, Nov 07, 2025 at 10:32:52PM -0800, Shakeel Butt wrote:
> > > On Fri, Nov 07, 2025 at 10:20:57PM +0900, Harry Yoo wrote:
> > > >
> > > > Although it's mentioned in the locking documentation, I'm afraid that
> > > > local_lock is not the right interface to use here. Preemption will be
> > > > disabled anyway (on both PREEMPT_RT and !PREEMPT_RT) when the stats are
> > > > updated (in __mod_node_page_state()).
> > > >
> > > > Here we just want to disable IRQ only on !PREEMPT_RT (to update
> > > > the stats safely).
> > >
> > > I don't think there is a need to disable IRQs. There are three stats
> > > update functions called in that hunk.
> > >
> > > 1) __mod_lruvec_state
> > > 2) __count_vm_events
> > > 3) count_memcg_events
> > >
> > > count_memcg_events() can be called with IRQs. __count_vm_events can be
> > > replaced with count_vm_events.
> >
> > Right.
> >
> > > For __mod_lruvec_state, the
> > > __mod_node_page_state() inside needs preemption disabled.
> >
> > The function __mod_node_page_state() disables preemption.
> > And there's a comment in __mod_zone_page_state():
> >
> > > /*
> > > * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
> > > * atomicity is provided by IRQs being disabled -- either explicitly
> > > * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
> > > * CPU migrations and preemption potentially corrupts a counter so
> > > * disable preemption.
> > > */
> > > preempt_disable_nested();
> >
> > We're relying on IRQs being disabled on !PREEMPT_RT.
>
> So it's possible for us to update vmstat within an interrupt context,
> right?
Yes, for instance when freeing memory in an interrupt context we can
update vmstat and that's why we disable interrupts now.
> There is also a comment above __mod_zone_page_state():
>
> /*
> * For use when we know that interrupts are disabled,
> * or when we know that preemption is disabled and that
> * particular counter cannot be updated from interrupt context.
> */
Yeah we don't have to disable IRQs when we already know it's disabled.
> BTW, the comment inside __mod_node_page_state() should be:
>
> /* See __mod_zone_page_state */
>
> instead of
>
> /* See __mod_node_page_state */
>
> Will fix it.
Right :) Thanks!
> > Maybe we could make it safe against re-entrant IRQ handlers by using
> > read-modify-write operations?
>
> Isn't it because of the RMW operation that we need to use IRQ to
> guarantee atomicity? Or have I misunderstood something?
I meant using atomic operations instead of disabling IRQs, like, by
using this_cpu_add() or cmpxchg() instead.
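For illustration, the kind of update meant here; 'demo_stat' is a made-up
per-CPU counter, not an existing kernel symbol:

    static DEFINE_PER_CPU(long, demo_stat);     /* hypothetical counter */

    static inline void demo_stat_add(long delta)
    {
            /*
             * this_cpu_add() is an IRQ-safe RMW on the local CPU's copy,
             * so no explicit local_irq_disable() is needed around it.
             */
            this_cpu_add(demo_stat, delta);
    }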
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Qi Zheng @ 2025-11-10 6:11 UTC (permalink / raw)
To: Harry Yoo
Cc: Shakeel Butt, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On 11/10/25 1:43 PM, Harry Yoo wrote:
> On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
>>
>>
>> On 11/10/25 10:13 AM, Harry Yoo wrote:
>>> On Fri, Nov 07, 2025 at 10:32:52PM -0800, Shakeel Butt wrote:
>>>> On Fri, Nov 07, 2025 at 10:20:57PM +0900, Harry Yoo wrote:
>>>>>
>>>>> Although it's mentioned in the locking documentation, I'm afraid that
>>>>> local_lock is not the right interface to use here. Preemption will be
>>>>> disabled anyway (on both PREEMPT_RT and !PREEMPT_RT) when the stats are
>>>>> updated (in __mod_node_page_state()).
>>>>>
>>>>> Here we just want to disable IRQ only on !PREEMPT_RT (to update
>>>>> the stats safely).
>>>>
>>>> I don't think there is a need to disable IRQs. There are three stats
>>>> update functions called in that hunk.
>>>>
>>>> 1) __mod_lruvec_state
>>>> 2) __count_vm_events
>>>> 3) count_memcg_events
>>>>
>>>> count_memcg_events() can be called with IRQs. __count_vm_events can be
>>>> replaced with count_vm_events.
>>>
>>> Right.
>>>
>>>> For __mod_lruvec_state, the
>>>> __mod_node_page_state() inside needs preemption disabled.
>>>
>>> The function __mod_node_page_state() disables preemption.
>>> And there's a comment in __mod_zone_page_state():
>>>
>>>> /*
>>>> * Accurate vmstat updates require a RMW. On !PREEMPT_RT kernels,
>>>> * atomicity is provided by IRQs being disabled -- either explicitly
>>>> * or via local_lock_irq. On PREEMPT_RT, local_lock_irq only disables
>>>> * CPU migrations and preemption potentially corrupts a counter so
>>>> * disable preemption.
>>>> */
>>>> preempt_disable_nested();
>>>
>>> We're relying on IRQs being disabled on !PREEMPT_RT.
>>
>> So it's possible for us to update vmstat within an interrupt context,
>> right?
>
> Yes, for instance when freeing memory in an interrupt context we can
> update vmstat and that's why we disable interrupts now.
Got it.
>
>> There is also a comment above __mod_zone_page_state():
>>
>> /*
>> * For use when we know that interrupts are disabled,
>> * or when we know that preemption is disabled and that
>> * particular counter cannot be updated from interrupt context.
>> */
>
> Yeah we don't have to disable IRQs when we already know it's disabled.
>
>> BTW, the comment inside __mod_node_page_state() should be:
>>
>> /* See __mod_zone_page_state */
>>
>> instead of
>>
>> /* See __mod_node_page_state */
>>
>> Will fix it.
>
> Right :) Thanks!
>
>>> Maybe we could make it safe against re-entrant IRQ handlers by using
>>> read-modify-write operations?
>>
>> Isn't it because of the RMW operation that we need to use IRQ to
>> guarantee atomicity? Or have I misunderstood something?
>
> I meant using atomic operations instead of disabling IRQs, like, by
> using this_cpu_add() or cmpxchg() instead.
Got it. I will give it a try.
Thanks,
Qi
>
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Shakeel Butt @ 2025-11-10 16:47 UTC (permalink / raw)
To: Harry Yoo
Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Mon, Nov 10, 2025 at 02:43:21PM +0900, Harry Yoo wrote:
> On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
> > > Maybe we could make it safe against re-entrant IRQ handlers by using
> > > read-modify-write operations?
> >
> > Isn't it because of the RMW operation that we need to use IRQ to
> > guarantee atomicity? Or have I misunderstood something?
>
> I meant using atomic operations instead of disabling IRQs, like, by
> using this_cpu_add() or cmpxchg() instead.
We already have mod_node_page_state(), which is IRQ-safe and is optimized
to not disable IRQs on archs with HAVE_CMPXCHG_LOCAL, which includes x86
and arm64.

Let me send a patch to clean up the memcg code which uses
__mod_node_page_state().
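As an illustrative contrast, using the NR_ISOLATED_ANON update from the
hunk quoted earlier:

    /* caller must keep IRQs (or, on PREEMPT_RT, preemption) disabled */
    __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);

    /* IRQ-safe on its own; a this_cpu_cmpxchg() loop on HAVE_CMPXCHG_LOCAL archs */
    mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);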
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Harry Yoo @ 2025-11-11 0:42 UTC (permalink / raw)
To: Shakeel Butt
Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Mon, Nov 10, 2025 at 08:47:57AM -0800, Shakeel Butt wrote:
> On Mon, Nov 10, 2025 at 02:43:21PM +0900, Harry Yoo wrote:
> > On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
> > > > Maybe we could make it safe against re-entrant IRQ handlers by using
> > > > read-modify-write operations?
> > >
> > > Isn't it because of the RMW operation that we need to use IRQ to
> > > guarantee atomicity? Or have I misunderstood something?
> >
> > I meant using atomic operations instead of disabling IRQs, like, by
> > using this_cpu_add() or cmpxchg() instead.
>
> We already have mod_node_page_state() which is safe from IRQs and is
> optimized to not disable IRQs for archs with HAVE_CMPXCHG_LOCAL which
> includes x86 and arm64.
Nice observation, thanks!
> Let me send the patch to cleanup the memcg code which uses
> __mod_node_page_state.
I'll take a look.
--
Cheers,
Harry / Hyeonggon
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Qi Zheng @ 2025-11-11 3:04 UTC (permalink / raw)
To: Shakeel Butt, Harry Yoo
Cc: hannes, hughd, mhocko, roman.gushchin, muchun.song, david,
lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On 11/11/25 12:47 AM, Shakeel Butt wrote:
> On Mon, Nov 10, 2025 at 02:43:21PM +0900, Harry Yoo wrote:
>> On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
>>>> Maybe we could make it safe against re-entrant IRQ handlers by using
>>>> read-modify-write operations?
>>>
>>> Isn't it because of the RMW operation that we need to use IRQ to
>>> guarantee atomicity? Or have I misunderstood something?
>>
>> I meant using atomic operations instead of disabling IRQs, like, by
>> using this_cpu_add() or cmpxchg() instead.
>
> We already have mod_node_page_state() which is safe from IRQs and is
> optimized to not disable IRQs for archs with HAVE_CMPXCHG_LOCAL which
> includes x86 and arm64.
However, in the !CONFIG_HAVE_CMPXCHG_LOCAL case, mod_node_page_state()
still calls local_irq_save(). Is this feasible in the PREEMPT_RT kernel?
>
> Let me send the patch to cleanup the memcg code which uses
> __mod_node_page_state.
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
From: Harry Yoo @ 2025-11-11 3:16 UTC (permalink / raw)
To: Qi Zheng
Cc: Shakeel Butt, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Tue, Nov 11, 2025 at 11:04:09AM +0800, Qi Zheng wrote:
>
> On 11/11/25 12:47 AM, Shakeel Butt wrote:
> > On Mon, Nov 10, 2025 at 02:43:21PM +0900, Harry Yoo wrote:
> > > On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
> > > > > Maybe we could make it safe against re-entrant IRQ handlers by using
> > > > > read-modify-write operations?
> > > >
> > > > Isn't it because of the RMW operation that we need to use IRQ to
> > > > guarantee atomicity? Or have I misunderstood something?
> > >
> > > I meant using atomic operations instead of disabling IRQs, like, by
> > > using this_cpu_add() or cmpxchg() instead.
> >
> > We already have mod_node_page_state() which is safe from IRQs and is
> > optimized to not disable IRQs for archs with HAVE_CMPXCHG_LOCAL which
> > includes x86 and arm64.
>
> However, in the !CONFIG_HAVE_CMPXCHG_LOCAL case, mod_node_page_state()
> still calls local_irq_save(). Is this feasible in the PREEMPT_RT kernel?
Hmm I was going to say it's necessary, but AFAICT we don't allocate
or free memory in hardirq context on PREEMPT_RT (that's the policy)
and so I'd say it's not necessary to disable IRQs.
Sounds like we still want to disable IRQs only on !PREEMPT_RT on
such architectures?
Not sure how seriously PREEMPT_RT folks care about architectures
without HAVE_CMPXCHG_LOCAL. (riscv and loongarch have ARCH_SUPPORTS_RT
but don't have HAVE_CMPXCHG_LOCAL.)
If they do care, this can be done as a separate patch series because
we already call local_irq_{save,restore}() in many places in mm/vmstat.c
if the architecture doesn't have HAVE_CMPXCHG_LOCAL.
> > Let me send the patch to cleanup the memcg code which uses
> > __mod_node_page_state.
--
Cheers,
Harry / Hyeonggon
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
2025-11-11 3:16 ` Harry Yoo
@ 2025-11-11 3:23 ` Qi Zheng
2025-11-11 8:49 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-11-11 3:23 UTC (permalink / raw)
To: Harry Yoo
Cc: Shakeel Butt, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On 11/11/25 11:16 AM, Harry Yoo wrote:
> On Tue, Nov 11, 2025 at 11:04:09AM +0800, Qi Zheng wrote:
>>
>> On 11/11/25 12:47 AM, Shakeel Butt wrote:
>>> On Mon, Nov 10, 2025 at 02:43:21PM +0900, Harry Yoo wrote:
>>>> On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
>>>>>> Maybe we could make it safe against re-entrant IRQ handlers by using
>>>>>> read-modify-write operations?
>>>>>
>>>>> Isn't it because of the RMW operation that we need to use IRQ to
>>>>> guarantee atomicity? Or have I misunderstood something?
>>>>
>>>> I meant using atomic operations instead of disabling IRQs, like, by
>>>> using this_cpu_add() or cmpxchg() instead.
>>>
>>> We already have mod_node_page_state() which is safe from IRQs and is
>>> optimized to not disable IRQs for archs with HAVE_CMPXCHG_LOCAL which
>>> includes x86 and arm64.
>>
>> However, in the !CONFIG_HAVE_CMPXCHG_LOCAL case, mod_node_page_state()
>> still calls local_irq_save(). Is this feasible in the PREEMPT_RT kernel?
>
> Hmm I was going to say it's necessary, but AFAICT we don't allocate
> or free memory in hardirq context on PREEMPT_RT (that's the policy)
> and so I'd say it's not necessary to disable IRQs.
>
> Sounds like we still want to disable IRQs only on !PREEMPT_RT on
> such architectures?
>
> Not sure how seriously PREEMPT_RT folks care about architectures
> without HAVE_CMPXCHG_LOCAL. (riscv and loongarch have ARCH_SUPPORTS_RT
> but don't have HAVE_CMPXCHG_LOCAL.)
>
> If they do care, this can be done as a separate patch series because
> we already call local_irq_{save,restore}() in many places in mm/vmstat.c
> if the architecture doesn't have HAVE_CMPXCHG_LOCAL.
Got it. I will ignore it for now.
>
>>> Let me send the patch to cleanup the memcg code which uses
>>> __mod_node_page_state.
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
2025-11-11 3:16 ` Harry Yoo
2025-11-11 3:23 ` Qi Zheng
@ 2025-11-11 8:49 ` Sebastian Andrzej Siewior
2025-11-11 16:44 ` Shakeel Butt
1 sibling, 1 reply; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-11 8:49 UTC (permalink / raw)
To: Harry Yoo
Cc: Qi Zheng, Shakeel Butt, hannes, hughd, mhocko, roman.gushchin,
muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng, Clark Williams,
Steven Rostedt, linux-rt-devel
On 2025-11-11 12:16:43 [+0900], Harry Yoo wrote:
> > However, in the !CONFIG_HAVE_CMPXCHG_LOCAL case, mod_node_page_state()
> > still calls local_irq_save(). Is this feasible in the PREEMPT_RT kernel?
>
> Hmm I was going to say it's necessary, but AFAICT we don't allocate
> or free memory in hardirq context on PREEMPT_RT (that's the policy)
> and so I'd say it's not necessary to disable IRQs.
>
> Sounds like we still want to disable IRQs only on !PREEMPT_RT on
> such architectures?
>
> Not sure how seriously PREEMPT_RT folks care about architectures
> without HAVE_CMPXCHG_LOCAL. (riscv and loongarch have ARCH_SUPPORTS_RT
> but don't have HAVE_CMPXCHG_LOCAL.)
We take things seriously, and you shouldn't make assumptions based on
the implementation. Either the API can be used as such or it can't.
In the case of mod_node_page_state(), the non-IRQ-off version
(__mod_node_page_state()) uses preempt_disable_nested() to ensure
atomic updates on PREEMPT_RT without disabling interrupts.
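Roughly (a paraphrased sketch from memory, not the exact mm/vmstat.c
code, which also handles the per-CPU overflow thresholds):

void __mod_node_page_state(struct pglist_data *pgdat,
			   enum node_stat_item item, long delta)
{
	/*
	 * On PREEMPT_RT this disables preemption so that the plain
	 * per-CPU read-modify-write below stays atomic; on !PREEMPT_RT
	 * it only asserts that the caller already disabled preemption
	 * (or interrupts).
	 */
	preempt_disable_nested();

	/* ... plain per-CPU read-modify-write of the node counter ... */

	preempt_enable_nested();
}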
Sebastian
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
2025-11-11 8:49 ` Sebastian Andrzej Siewior
@ 2025-11-11 16:44 ` Shakeel Butt
0 siblings, 0 replies; 58+ messages in thread
From: Shakeel Butt @ 2025-11-11 16:44 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Harry Yoo, Qi Zheng, hannes, hughd, mhocko, roman.gushchin,
muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng, Clark Williams,
Steven Rostedt, linux-rt-devel
On Tue, Nov 11, 2025 at 09:49:00AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-11-11 12:16:43 [+0900], Harry Yoo wrote:
> > > However, in the !CONFIG_HAVE_CMPXCHG_LOCAL case, mod_node_page_state()
> > > still calls local_irq_save(). Is this feasible in the PREEMPT_RT kernel?
> >
> > Hmm I was going to say it's necessary, but AFAICT we don't allocate
> > or free memory in hardirq context on PREEMPT_RT (that's the policy)
> > and so I'd say it's not necessary to disable IRQs.
> >
> > Sounds like we still want to disable IRQs only on !PREEMPT_RT on
> > such architectures?
> >
> > Not sure how seriously PREEMPT_RT folks care about architectures
> > without HAVE_CMPXCHG_LOCAL. (riscv and loongarch have ARCH_SUPPORTS_RT
> > but don't have HAVE_CMPXCHG_LOCAL.)
>
> We take things seriously, and you shouldn't make assumptions based on
> the implementation. Either the API can be used as such or it can't.
> In the case of mod_node_page_state(), the non-IRQ-off version
> (__mod_node_page_state()) uses preempt_disable_nested() to ensure
> atomic updates on PREEMPT_RT without disabling interrupts.
>
Harry is talking about mod_node_page_state() on
!CONFIG_HAVE_CMPXCHG_LOCAL, which disables irqs.
void mod_node_page_state(struct pglist_data *pgdat, enum node_stat_item item,
			 long delta)
{
	unsigned long flags;

	local_irq_save(flags);
	__mod_node_page_state(pgdat, item, delta);
	local_irq_restore(flags);
}
Is PREEMPT_RT fine with this?
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
2025-11-11 3:04 ` Qi Zheng
2025-11-11 3:16 ` Harry Yoo
@ 2025-11-11 3:17 ` Shakeel Butt
2025-11-11 3:24 ` Qi Zheng
1 sibling, 1 reply; 58+ messages in thread
From: Shakeel Butt @ 2025-11-11 3:17 UTC (permalink / raw)
To: Qi Zheng
Cc: Harry Yoo, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On Tue, Nov 11, 2025 at 11:04:09AM +0800, Qi Zheng wrote:
>
> On 11/11/25 12:47 AM, Shakeel Butt wrote:
> > On Mon, Nov 10, 2025 at 02:43:21PM +0900, Harry Yoo wrote:
> > > On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
> > > > > Maybe we could make it safe against re-entrant IRQ handlers by using
> > > > > read-modify-write operations?
> > > >
> > > > Isn't it because of the RMW operation that we need to use IRQ to
> > > > guarantee atomicity? Or have I misunderstood something?
> > >
> > > I meant using atomic operations instead of disabling IRQs, like, by
> > > using this_cpu_add() or cmpxchg() instead.
> >
> > We already have mod_node_page_state() which is safe from IRQs and is
> > optimized to not disable IRQs for archs with HAVE_CMPXCHG_LOCAL which
> > includes x86 and arm64.
>
> However, in the !CONFIG_HAVE_CMPXCHG_LOCAL case, mod_node_page_state()
> still calls local_irq_save(). Is this feasible in the PREEMPT_RT kernel?
>
Yes, we can disable irqs on PREEMPT_RT, but it is usually frowned upon
and it is usually requested to do so only for a short window. However,
if someone running PREEMPT_RT on an arch without HAVE_CMPXCHG_LOCAL has
issues with mod_node_page_state(), they can solve it then. I don't
think we need to fix that now.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
2025-11-11 3:17 ` Shakeel Butt
@ 2025-11-11 3:24 ` Qi Zheng
0 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-11-11 3:24 UTC (permalink / raw)
To: Shakeel Butt
Cc: Harry Yoo, hannes, hughd, mhocko, roman.gushchin, muchun.song,
david, lorenzo.stoakes, ziy, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Muchun Song, Qi Zheng, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, linux-rt-devel
On 11/11/25 11:17 AM, Shakeel Butt wrote:
> On Tue, Nov 11, 2025 at 11:04:09AM +0800, Qi Zheng wrote:
>>
>> On 11/11/25 12:47 AM, Shakeel Butt wrote:
>>> On Mon, Nov 10, 2025 at 02:43:21PM +0900, Harry Yoo wrote:
>>>> On Mon, Nov 10, 2025 at 12:30:06PM +0800, Qi Zheng wrote:
>>>>>> Maybe we could make it safe against re-entrant IRQ handlers by using
>>>>>> read-modify-write operations?
>>>>>
>>>>> Isn't it because of the RMW operation that we need to use IRQ to
>>>>> guarantee atomicity? Or have I misunderstood something?
>>>>
>>>> I meant using atomic operations instead of disabling IRQs, like, by
>>>> using this_cpu_add() or cmpxchg() instead.
>>>
>>> We already have mod_node_page_state() which is safe from IRQs and is
>>> optimized to not disable IRQs for archs with HAVE_CMPXCHG_LOCAL which
>>> includes x86 and arm64.
>>
>> However, in the !CONFIG_HAVE_CMPXCHG_LOCAL case, mod_node_page_state()
>> still calls local_irq_save(). Is this feasible in the PREEMPT_RT kernel?
>>
>
> Yes, we can disable irqs on PREEMPT_RT, but it is usually frowned upon
> and it is usually requested to do so only for a short window. However, if
Got it.
> someone running PREEMPT_RT on an arch without HAVE_CMPXCHG_LOCAL has
> issues with mod_node_page_state(), they can solve it then. I don't
> think we need to fix that now.
OK.
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
2025-11-07 5:11 ` Harry Yoo
2025-11-07 6:41 ` Qi Zheng
@ 2025-11-07 7:18 ` Sebastian Andrzej Siewior
1 sibling, 0 replies; 58+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-11-07 7:18 UTC (permalink / raw)
To: Harry Yoo
Cc: Qi Zheng, hannes, hughd, mhocko, roman.gushchin, shakeel.butt,
muchun.song, david, lorenzo.stoakes, ziy, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm, linux-mm,
linux-kernel, cgroups, Muchun Song, Qi Zheng, Clark Williams,
Steven Rostedt, linux-rt-devel
On 2025-11-07 14:11:27 [+0900], Harry Yoo wrote:
> > @@ -4735,14 +4734,15 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS, BIT(PG_active));
> > }
> >
> > - spin_lock_irq(&lruvec->lru_lock);
> > -
> > - move_folios_to_lru(lruvec, &list);
> > + move_folios_to_lru(&list);
> >
> > + local_irq_disable();
> > walk = current->reclaim_state->mm_walk;
> > if (walk && walk->batched) {
> > walk->lruvec = lruvec;
> > + spin_lock(&lruvec->lru_lock);
> > reset_batch_size(walk);
> > + spin_unlock(&lruvec->lru_lock);
> > }
>
> Cc'ing RT folks as they may not want to disable IRQ on PREEMPT_RT.
Thank you, this is not going to work. The local_irq_disable() shouldn't
be used.
> IIRC there has been some effort in MM to reduce the scope of
> IRQ-disabled sections when the PREEMPT_RT config was added to the
> mainline. spin_lock_irq() doesn't disable IRQs on PREEMPT_RT.
Exactly.
Sebastian
^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (3 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
` (22 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
Pagecache pages are charged at allocation time and hold a reference
to the original memory cgroup until reclaimed. Depending on memory
pressure, page sharing patterns between different cgroups and cgroup
creation/destruction rates, many dying memory cgroups can be pinned
by pagecache pages, reducing page reclaim efficiency and wasting
memory. Converting LRU folios and most other raw memory cgroup pins
to object cgroups can fix this long-standing problem.
As a result, the objcg infrastructure is no longer solely applicable
to the kmem case. This patch extends the scope of the objcg
infrastructure beyond kmem, enabling LRU folios to reuse it for
folio charging.
It should be noted that LRU folios are not accounted at the root
level, yet folio->memcg_data still points to the root_mem_cgroup. Hence,
folio->memcg_data of an LRU folio always holds a valid pointer.
However, the root_mem_cgroup does not possess an object cgroup.
Therefore, we also allocate an object cgroup for the root_mem_cgroup.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memcontrol.c | 51 +++++++++++++++++++++++--------------------------
1 file changed, 24 insertions(+), 27 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5257465c9d75..2afd7f99ca101 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -204,10 +204,10 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
- struct mem_cgroup *parent)
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
+ struct mem_cgroup *parent = parent_mem_cgroup(memcg);
objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
@@ -3302,30 +3302,17 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
return val;
}
-static int memcg_online_kmem(struct mem_cgroup *memcg)
+static void memcg_online_kmem(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg;
-
if (mem_cgroup_kmem_disabled())
- return 0;
+ return;
if (unlikely(mem_cgroup_is_root(memcg)))
- return 0;
-
- objcg = obj_cgroup_alloc();
- if (!objcg)
- return -ENOMEM;
-
- objcg->memcg = memcg;
- rcu_assign_pointer(memcg->objcg, objcg);
- obj_cgroup_get(objcg);
- memcg->orig_objcg = objcg;
+ return;
static_branch_enable(&memcg_kmem_online_key);
memcg->kmemcg_id = memcg->id.id;
-
- return 0;
}
static void memcg_offline_kmem(struct mem_cgroup *memcg)
@@ -3340,12 +3327,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
parent = parent_mem_cgroup(memcg);
memcg_reparent_list_lrus(memcg, parent);
-
- /*
- * Objcg's reparenting must be after list_lru's, make sure list_lru
- * helpers won't use parent's list_lru until child is drained.
- */
- memcg_reparent_objcgs(memcg, parent);
}
#ifdef CONFIG_CGROUP_WRITEBACK
@@ -3862,9 +3843,9 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct obj_cgroup *objcg;
- if (memcg_online_kmem(memcg))
- goto remove_id;
+ memcg_online_kmem(memcg);
/*
* A memcg must be visible for expand_shrinker_info()
@@ -3874,6 +3855,15 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (alloc_shrinker_info(memcg))
goto offline_kmem;
+ objcg = obj_cgroup_alloc();
+ if (!objcg)
+ goto free_shrinker;
+
+ objcg->memcg = memcg;
+ rcu_assign_pointer(memcg->objcg, objcg);
+ obj_cgroup_get(objcg);
+ memcg->orig_objcg = objcg;
+
if (unlikely(mem_cgroup_is_root(memcg)) && !mem_cgroup_disabled())
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
FLUSH_TIME);
@@ -3896,9 +3886,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
xa_store(&mem_cgroup_ids, memcg->id.id, memcg, GFP_KERNEL);
return 0;
+free_shrinker:
+ free_shrinker_info(memcg);
offline_kmem:
memcg_offline_kmem(memcg);
-remove_id:
mem_cgroup_id_remove(memcg);
return -ENOMEM;
}
@@ -3916,6 +3907,12 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
memcg_offline_kmem(memcg);
reparent_deferred_split_queue(memcg);
+ /*
+ * The reparenting of objcg must be after the reparenting of the
+ * list_lru and deferred_split_queue above, which ensures that they will
+ * not mistakenly get the parent list_lru and deferred_split_queue.
+ */
+ memcg_reparent_objcgs(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
lru_gen_offline_memcg(memcg);
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (4 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Qi Zheng
` (21 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
Memory cgroup functions such as get_mem_cgroup_from_folio() and
get_mem_cgroup_from_mm() return a valid memory cgroup pointer,
even for the root memory cgroup. In contrast, the situation for
object cgroups has been different.
Previously, the root object cgroup couldn't be returned because
it didn't exist. Now that a valid root object cgroup exists, for
the sake of consistency, it's necessary to align the behavior of
object-cgroup-related operations with that of memory cgroup APIs.
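After this change, callers no longer need to special-case a NULL objcg.
Roughly (an illustrative sketch only, not part of the diff below):

	struct obj_cgroup *objcg;

	objcg = current_obj_cgroup();	/* always non-NULL, possibly root_obj_cgroup */
	obj_cgroup_get(objcg);		/* no-op for the root object cgroup */
	/* ... charge against and use objcg ... */
	obj_cgroup_put(objcg);		/* no-op for the root object cgroup */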
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/memcontrol.h | 29 +++++++++++++++++-------
mm/memcontrol.c | 45 ++++++++++++++++++++------------------
mm/percpu.c | 2 +-
3 files changed, 46 insertions(+), 30 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6185d8399a54e..9fdbd4970021d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -332,6 +332,7 @@ struct mem_cgroup {
#define MEMCG_CHARGE_BATCH 64U
extern struct mem_cgroup *root_mem_cgroup;
+extern struct obj_cgroup *root_obj_cgroup;
enum page_memcg_data_flags {
/* page->memcg_data is a pointer to an slabobj_ext vector */
@@ -549,6 +550,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return (memcg == root_mem_cgroup);
}
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+ return objcg == root_obj_cgroup;
+}
+
static inline bool mem_cgroup_disabled(void)
{
return !cgroup_subsys_enabled(memory_cgrp_subsys);
@@ -773,23 +779,26 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
static inline bool obj_cgroup_tryget(struct obj_cgroup *objcg)
{
+ if (obj_cgroup_is_root(objcg))
+ return true;
return percpu_ref_tryget(&objcg->refcnt);
}
-static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
+ unsigned long nr)
{
- percpu_ref_get(&objcg->refcnt);
+ if (!obj_cgroup_is_root(objcg))
+ percpu_ref_get_many(&objcg->refcnt, nr);
}
-static inline void obj_cgroup_get_many(struct obj_cgroup *objcg,
- unsigned long nr)
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
{
- percpu_ref_get_many(&objcg->refcnt, nr);
+ obj_cgroup_get_many(objcg, 1);
}
static inline void obj_cgroup_put(struct obj_cgroup *objcg)
{
- if (objcg)
+ if (objcg && !obj_cgroup_is_root(objcg))
percpu_ref_put(&objcg->refcnt);
}
@@ -1094,6 +1103,11 @@ static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
return true;
}
+static inline bool obj_cgroup_is_root(const struct obj_cgroup *objcg)
+{
+ return true;
+}
+
static inline bool mem_cgroup_disabled(void)
{
return true;
@@ -1710,8 +1724,7 @@ static inline struct obj_cgroup *get_obj_cgroup_from_current(void)
{
struct obj_cgroup *objcg = current_obj_cgroup();
- if (objcg)
- obj_cgroup_get(objcg);
+ obj_cgroup_get(objcg);
return objcg;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2afd7f99ca101..d484b632c790f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,6 +83,8 @@ EXPORT_SYMBOL(memory_cgrp_subsys);
struct mem_cgroup *root_mem_cgroup __read_mostly;
EXPORT_SYMBOL(root_mem_cgroup);
+struct obj_cgroup *root_obj_cgroup __read_mostly;
+
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);
@@ -2642,15 +2644,14 @@ struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg = NULL;
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ struct obj_cgroup *objcg = rcu_dereference(memcg->objcg);
- for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
- objcg = rcu_dereference(memcg->objcg);
if (likely(objcg && obj_cgroup_tryget(objcg)))
- break;
- objcg = NULL;
+ return objcg;
}
- return objcg;
+
+ return NULL;
}
static struct obj_cgroup *current_objcg_update(void)
@@ -2724,18 +2725,17 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
* Objcg reference is kept by the task, so it's safe
* to use the objcg by the current task.
*/
- return objcg;
+ return objcg ? : root_obj_cgroup;
}
memcg = this_cpu_read(int_active_memcg);
if (unlikely(memcg))
goto from_memcg;
- return NULL;
+ return root_obj_cgroup;
from_memcg:
- objcg = NULL;
- for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
/*
* Memcg pointer is protected by scope (see set_active_memcg())
* and is pinning the corresponding objcg, so objcg can't go
@@ -2744,10 +2744,10 @@ __always_inline struct obj_cgroup *current_obj_cgroup(void)
*/
objcg = rcu_dereference_check(memcg->objcg, 1);
if (likely(objcg))
- break;
+ return objcg;
}
- return objcg;
+ return root_obj_cgroup;
}
struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
@@ -2761,14 +2761,8 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
objcg = __folio_objcg(folio);
obj_cgroup_get(objcg);
} else {
- struct mem_cgroup *memcg;
-
rcu_read_lock();
- memcg = __folio_memcg(folio);
- if (memcg)
- objcg = __get_obj_cgroup_from_memcg(memcg);
- else
- objcg = NULL;
+ objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
rcu_read_unlock();
}
return objcg;
@@ -2871,7 +2865,7 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
int ret = 0;
objcg = current_obj_cgroup();
- if (objcg) {
+ if (!obj_cgroup_is_root(objcg)) {
ret = obj_cgroup_charge_pages(objcg, gfp, 1 << order);
if (!ret) {
obj_cgroup_get(objcg);
@@ -3172,7 +3166,7 @@ bool __memcg_slab_post_alloc_hook(struct kmem_cache *s, struct list_lru *lru,
* obj_cgroup_get() is used to get a permanent reference.
*/
objcg = current_obj_cgroup();
- if (!objcg)
+ if (obj_cgroup_is_root(objcg))
return true;
/*
@@ -3859,6 +3853,9 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
if (!objcg)
goto free_shrinker;
+ if (unlikely(mem_cgroup_is_root(memcg)))
+ root_obj_cgroup = objcg;
+
objcg->memcg = memcg;
rcu_assign_pointer(memcg->objcg, objcg);
obj_cgroup_get(objcg);
@@ -5479,6 +5476,9 @@ void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ if (obj_cgroup_is_root(objcg))
+ return;
+
VM_WARN_ON_ONCE(!(current->flags & PF_MEMALLOC));
/* PF_MEMALLOC context, charging must succeed */
@@ -5506,6 +5506,9 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ if (obj_cgroup_is_root(objcg))
+ return;
+
obj_cgroup_uncharge(objcg, size);
rcu_read_lock();
diff --git a/mm/percpu.c b/mm/percpu.c
index 81462ce5866e1..78bdffe1fcb57 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -1616,7 +1616,7 @@ static bool pcpu_memcg_pre_alloc_hook(size_t size, gfp_t gfp,
return true;
objcg = current_obj_cgroup();
- if (!objcg)
+ if (obj_cgroup_is_root(objcg))
return true;
if (obj_cgroup_charge(objcg, gfp, pcpu_obj_full_size(size)))
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (5 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
` (20 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in get_mem_cgroup_from_folio().
This serves as a preparatory measure for the reparenting of the
LRU pages.
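Concretely, the caller-side rule will roughly be one of the following
(an illustrative sketch only, using existing helpers):

	/* Either pin the memcg via RCU for the duration of its use ... */
	rcu_read_lock();
	memcg = folio_memcg(folio);
	/* ... memcg cannot be released while the RCU read lock is held ... */
	rcu_read_unlock();

	/* ... or take a reference that must be dropped afterwards. */
	memcg = get_mem_cgroup_from_folio(folio);
	/* ... use memcg ... */
	mem_cgroup_put(memcg);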
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memcontrol.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d484b632c790f..1da3ad77054d3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -973,14 +973,19 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
*/
struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
if (mem_cgroup_disabled())
return NULL;
+ if (!folio_memcg_charged(folio))
+ return root_mem_cgroup;
+
rcu_read_lock();
- if (!memcg || WARN_ON_ONCE(!css_tryget(&memcg->css)))
- memcg = root_mem_cgroup;
+retry:
+ memcg = folio_memcg(folio);
+ if (unlikely(!css_tryget(&memcg->css)))
+ goto retry;
rcu_read_unlock();
return memcg;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread* [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (6 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module Qi Zheng
` (19 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the function get_mem_cgroup_from_folio() is
employed to safeguard against the release of the memory cgroup.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
fs/buffer.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 6a8752f7bbedb..bc93d0b1d0c30 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -925,8 +925,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
long offset;
struct mem_cgroup *memcg, *old_memcg;
- /* The folio lock pins the memcg */
- memcg = folio_memcg(folio);
+ memcg = get_mem_cgroup_from_folio(folio);
old_memcg = set_active_memcg(memcg);
head = NULL;
@@ -947,6 +946,7 @@ struct buffer_head *folio_alloc_buffers(struct folio *folio, unsigned long size,
}
out:
set_active_memcg(old_memcg);
+ mem_cgroup_put(memcg);
return head;
/*
* In case anything failed, we just free everything we got.
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (7 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
` (18 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the function get_mem_cgroup_css_from_folio()
and the rcu read lock are employed to safeguard against the release
of the memory cgroup.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
fs/fs-writeback.c | 22 +++++++++++-----------
include/linux/memcontrol.h | 9 +++++++--
include/trace/events/writeback.h | 3 +++
mm/memcontrol.c | 14 ++++++++------
4 files changed, 29 insertions(+), 19 deletions(-)
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 2b35e80037fee..afd81fb8100cb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -269,15 +269,13 @@ void __inode_attach_wb(struct inode *inode, struct folio *folio)
if (inode_cgwb_enabled(inode)) {
struct cgroup_subsys_state *memcg_css;
- if (folio) {
- memcg_css = mem_cgroup_css_from_folio(folio);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- } else {
- /* must pin memcg_css, see wb_get_create() */
+ /* must pin memcg_css, see wb_get_create() */
+ if (folio)
+ memcg_css = get_mem_cgroup_css_from_folio(folio);
+ else
memcg_css = task_get_css(current, memory_cgrp_id);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- css_put(memcg_css);
- }
+ wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+ css_put(memcg_css);
}
if (!wb)
@@ -968,16 +966,16 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
if (!wbc->wb || wbc->no_cgroup_owner)
return;
- css = mem_cgroup_css_from_folio(folio);
+ css = get_mem_cgroup_css_from_folio(folio);
/* dead cgroups shouldn't contribute to inode ownership arbitration */
if (!(css->flags & CSS_ONLINE))
- return;
+ goto out;
id = css->id;
if (id == wbc->wb_id) {
wbc->wb_bytes += bytes;
- return;
+ goto out;
}
if (id == wbc->wb_lcand_id)
@@ -990,6 +988,8 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct folio *folio
wbc->wb_tcand_bytes += bytes;
else
wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
+out:
+ css_put(css);
}
EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9fdbd4970021d..174e52d8e7039 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -895,7 +895,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
return match;
}
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio);
ino_t page_cgroup_ino(struct page *page);
static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
@@ -1577,9 +1577,14 @@ static inline void mem_cgroup_track_foreign_dirty(struct folio *folio,
if (mem_cgroup_disabled())
return;
+ if (!folio_memcg_charged(folio))
+ return;
+
+ rcu_read_lock();
memcg = folio_memcg(folio);
- if (unlikely(memcg && &memcg->css != wb->memcg_css))
+ if (unlikely(&memcg->css != wb->memcg_css))
mem_cgroup_track_foreign_dirty_slowpath(folio, wb);
+ rcu_read_unlock();
}
void mem_cgroup_flush_foreign(struct bdi_writeback *wb);
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index c08aff044e807..3680828727f23 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -295,7 +295,10 @@ TRACE_EVENT(track_foreign_dirty,
__entry->ino = inode ? inode->i_ino : 0;
__entry->memcg_id = wb->memcg_css->id;
__entry->cgroup_ino = __trace_wb_assign_cgroup(wb);
+
+ rcu_read_lock();
__entry->page_cgroup_ino = cgroup_ino(folio_memcg(folio)->css.cgroup);
+ rcu_read_unlock();
),
TP_printk("bdi %s[%llu]: ino=%lu memcg_id=%u cgroup_ino=%lu page_cgroup_ino=%lu",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1da3ad77054d3..aa8945c4ee383 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -241,7 +241,7 @@ DEFINE_STATIC_KEY_FALSE(memcg_bpf_enabled_key);
EXPORT_SYMBOL(memcg_bpf_enabled_key);
/**
- * mem_cgroup_css_from_folio - css of the memcg associated with a folio
+ * get_mem_cgroup_css_from_folio - acquire a css of the memcg associated with a folio
* @folio: folio of interest
*
* If memcg is bound to the default hierarchy, css of the memcg associated
@@ -251,14 +251,16 @@ EXPORT_SYMBOL(memcg_bpf_enabled_key);
* If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
* is returned.
*/
-struct cgroup_subsys_state *mem_cgroup_css_from_folio(struct folio *folio)
+struct cgroup_subsys_state *get_mem_cgroup_css_from_folio(struct folio *folio)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
- if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
- memcg = root_mem_cgroup;
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return &root_mem_cgroup->css;
- return &memcg->css;
+ memcg = get_mem_cgroup_from_folio(folio);
+
+ return memcg ? &memcg->css : &root_mem_cgroup->css;
}
/**
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (8 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
` (17 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in count_memcg_folio_events().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/memcontrol.h | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 174e52d8e7039..ca8d4e09cbe7d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -984,10 +984,15 @@ void count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
static inline void count_memcg_folio_events(struct folio *folio,
enum vm_event_item idx, unsigned long nr)
{
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
- if (memcg)
- count_memcg_events(memcg, idx, nr);
+ if (!folio_memcg_charged(folio))
+ return;
+
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
+ count_memcg_events(memcg, idx, nr);
+ rcu_read_unlock();
}
static inline void count_memcg_events_mm(struct mm_struct *mm,
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (9 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
` (16 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in swap_writepage() and
bio_associate_blkg_from_page().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/page_io.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index 3c342db77ce38..ec7720762042c 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -276,10 +276,14 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
goto out_unlock;
}
+
+ rcu_read_lock();
if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
+ rcu_read_unlock();
folio_mark_dirty(folio);
return AOP_WRITEPAGE_ACTIVATE;
}
+ rcu_read_unlock();
__swap_writepage(folio, swap_plug);
return 0;
@@ -307,11 +311,11 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct folio *folio)
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg;
- memcg = folio_memcg(folio);
- if (!memcg)
+ if (!folio_memcg_charged(folio))
return;
rcu_read_lock();
+ memcg = folio_memcg(folio);
css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
bio_associate_blkg_from_css(bio, css);
rcu_read_unlock();
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (10 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
` (15 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in folio_migrate_mapping().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/migrate.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/migrate.c b/mm/migrate.c
index ceee354ef2152..cdab1b652c530 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -664,6 +664,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
struct lruvec *old_lruvec, *new_lruvec;
struct mem_cgroup *memcg;
+ rcu_read_lock();
memcg = folio_memcg(folio);
old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
@@ -691,6 +692,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
__mod_lruvec_state(new_lruvec, NR_FILE_DIRTY, nr);
__mod_zone_page_state(newzone, NR_ZONE_WRITE_PENDING, nr);
}
+ rcu_read_unlock();
}
local_irq_enable();
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (11 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
` (14 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mglru.
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/vmscan.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 660cd40cfddd4..676e6270e5b45 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3445,8 +3445,10 @@ static struct folio *get_pfn_folio(unsigned long pfn, struct mem_cgroup *memcg,
if (folio_nid(folio) != pgdat->node_id)
return NULL;
+ rcu_read_lock();
if (folio_memcg(folio) != memcg)
- return NULL;
+ folio = NULL;
+ rcu_read_unlock();
return folio;
}
@@ -4203,12 +4205,12 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
unsigned long addr = pvmw->address;
struct vm_area_struct *vma = pvmw->vma;
struct folio *folio = pfn_folio(pvmw->pfn);
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
struct pglist_data *pgdat = folio_pgdat(folio);
- struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
- struct lru_gen_mm_state *mm_state = get_mm_state(lruvec);
- DEFINE_MAX_SEQ(lruvec);
- int gen = lru_gen_from_seq(max_seq);
+ struct lruvec *lruvec;
+ struct lru_gen_mm_state *mm_state;
+ unsigned long max_seq;
+ int gen;
lockdep_assert_held(pvmw->ptl);
VM_WARN_ON_ONCE_FOLIO(folio_test_lru(folio), folio);
@@ -4243,6 +4245,13 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
}
}
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
+ lruvec = mem_cgroup_lruvec(memcg, pgdat);
+ max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+ gen = lru_gen_from_seq(max_seq);
+ mm_state = get_mm_state(lruvec);
+
arch_enter_lazy_mmu_mode();
pte -= (addr - start) / PAGE_SIZE;
@@ -4279,6 +4288,8 @@ bool lru_gen_look_around(struct page_vma_mapped_walk *pvmw)
arch_leave_lazy_mmu_mode();
+ rcu_read_unlock();
+
/* feedback from rmap walkers to page table walkers */
if (mm_state && suitable_to_scan(i, young))
update_bloom_filter(mm_state, max_seq, pvmw->pmd);
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (12 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
` (13 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in mem_cgroup_swap_full().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memcontrol.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index aa8945c4ee383..4b3c7d4f346b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5275,17 +5275,21 @@ bool mem_cgroup_swap_full(struct folio *folio)
if (do_memsw_account())
return false;
- memcg = folio_memcg(folio);
- if (!memcg)
+ if (!folio_memcg_charged(folio))
return false;
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
unsigned long usage = page_counter_read(&memcg->swap);
if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
- usage * 2 >= READ_ONCE(memcg->swap.max))
+ usage * 2 >= READ_ONCE(memcg->swap.max)) {
+ rcu_read_unlock();
return true;
+ }
}
+ rcu_read_unlock();
return false;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (13 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
` (12 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. To ensure safety, it will only be appropriate to
hold the rcu read lock or acquire a reference to the memory cgroup
returned by folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard
against the release of the memory cgroup in lru_gen_eviction().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/workingset.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 8cad8ee6dec6a..c4d21c15bad51 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -241,11 +241,14 @@ static void *lru_gen_eviction(struct folio *folio)
int refs = folio_lru_refs(folio);
bool workingset = folio_test_workingset(folio);
int tier = lru_tier_from_refs(refs, workingset);
- struct mem_cgroup *memcg = folio_memcg(folio);
+ struct mem_cgroup *memcg;
struct pglist_data *pgdat = folio_pgdat(folio);
+ unsigned short memcg_id;
BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+ rcu_read_lock();
+ memcg = folio_memcg(folio);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
min_seq = READ_ONCE(lrugen->min_seq[type]);
@@ -253,8 +256,10 @@ static void *lru_gen_eviction(struct folio *folio)
hist = lru_hist_from_seq(min_seq);
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
+ memcg_id = mem_cgroup_id(memcg);
+ rcu_read_unlock();
- return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+ return pack_shadow(memcg_id, pgdat, token, workingset);
}
/*
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (14 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
` (11 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
In the near future, a folio will no longer pin its corresponding memory
cgroup. To ensure safety, it will only be appropriate to hold the rcu read
lock or acquire a reference to the memory cgroup returned by
folio_memcg(), thereby preventing it from being released.
In the current patch, the rcu read lock is employed to safeguard against
the release of the memory cgroup in folio_split_queue_lock{_irqsave}().
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/huge_memory.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9d3594df6eedf..067259a9e0809 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1153,13 +1153,25 @@ split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg, unsigned long *flags
static struct deferred_split *folio_split_queue_lock(struct folio *folio)
{
- return split_queue_lock(folio_nid(folio), folio_memcg(folio));
+ struct deferred_split *queue;
+
+ rcu_read_lock();
+ queue = split_queue_lock(folio_nid(folio), folio_memcg(folio));
+ rcu_read_unlock();
+
+ return queue;
}
static struct deferred_split *
folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
{
- return split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+ struct deferred_split *queue;
+
+ rcu_read_lock();
+ queue = split_queue_lock_irqsave(folio_nid(folio), folio_memcg(folio), flags);
+ rcu_read_unlock();
+
+ return queue;
}
static inline void split_queue_unlock(struct deferred_split *queue)
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (15 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
` (10 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released without the rcu read lock or a reference to its memory
cgroup.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_refault().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/workingset.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index c4d21c15bad51..a69cc7bf9246d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -560,11 +560,12 @@ void workingset_refault(struct folio *folio, void *shadow)
* locked to guarantee folio_memcg() stability throughout.
*/
nr = folio_nr_pages(folio);
+ rcu_read_lock();
lruvec = folio_lruvec(folio);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset, true))
- return;
+ goto out;
folio_set_active(folio);
workingset_age_nonresident(lruvec, nr);
@@ -580,6 +581,8 @@ void workingset_refault(struct folio *folio, void *shadow)
lru_note_cost_refault(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
+out:
+ rcu_read_unlock();
}
/**
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (16 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module Qi Zheng
` (9 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Nhat Pham,
Chengming Zhou, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released unless the rcu read lock is held or a reference to its
memory cgroup is taken.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in zswap_folio_swapin().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/zswap.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958d..a341814468b95 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -664,8 +664,10 @@ void zswap_folio_swapin(struct folio *folio)
struct lruvec *lruvec;
if (folio) {
+ rcu_read_lock();
lruvec = folio_lruvec(folio);
atomic_long_inc(&lruvec->zswap_lruvec_state.nr_disk_swapins);
+ rcu_read_unlock();
}
}
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (17 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
` (8 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released unless the rcu read lock is held or a reference to its
memory cgroup is taken.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in lru_note_cost_refault() and
lru_activate().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/swap.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index ec0c654e128dc..0606795f3ccf3 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -412,18 +412,20 @@ static void lru_gen_inc_refs(struct folio *folio)
static bool lru_gen_clear_refs(struct folio *folio)
{
- struct lru_gen_folio *lrugen;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
+ unsigned long seq;
if (gen < 0)
return true;
set_mask_bits(&folio->flags.f, LRU_REFS_FLAGS | BIT(PG_workingset), 0);
- lrugen = &folio_lruvec(folio)->lrugen;
+ rcu_read_lock();
+ seq = READ_ONCE(folio_lruvec(folio)->lrugen.min_seq[type]);
+ rcu_read_unlock();
/* whether can do without shuffling under the LRU lock */
- return gen == lru_gen_from_seq(READ_ONCE(lrugen->min_seq[type]));
+ return gen == lru_gen_from_seq(seq);
}
#else /* !CONFIG_LRU_GEN */
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (18 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
` (7 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
In the near future, a folio will no longer pin its corresponding
memory cgroup. So an lruvec returned by folio_lruvec() could be
released unless the rcu read lock is held or a reference to its
memory cgroup is taken.
In the current patch, the rcu read lock is employed to safeguard
against the release of the lruvec in workingset_activation().
This serves as a preparatory measure for the reparenting of the
LRU pages.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/workingset.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index a69cc7bf9246d..d0eb3636dcd1d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -595,8 +595,11 @@ void workingset_activation(struct folio *folio)
* Filter non-memcg pages here, e.g. unmap can call
* mark_page_accessed() on VDSO pages.
*/
- if (mem_cgroup_disabled() || folio_memcg_charged(folio))
+ if (mem_cgroup_disabled() || folio_memcg_charged(folio)) {
+ rcu_read_lock();
workingset_age_nonresident(folio_lruvec(folio), folio_nr_pages(folio));
+ rcu_read_unlock();
+ }
}
/*
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (19 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-11-04 6:49 ` kernel test robot
2025-10-28 13:58 ` [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
` (6 subsequent siblings)
27 siblings, 1 reply; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
The following diagram illustrates how to ensure the safety of the folio
lruvec lock when LRU folios undergo reparenting.
In the folio_lruvec_lock(folio) function:
```
rcu_read_lock();
retry:
lruvec = folio_lruvec(folio);
/* There is a possibility of folio reparenting at this point. */
spin_lock(&lruvec->lru_lock);
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
/*
* The wrong lruvec lock was acquired, and a retry is required.
* This is because the folio resides on the parent memcg lruvec
* list.
*/
spin_unlock(&lruvec->lru_lock);
goto retry;
}
/* Reaching here indicates that folio_memcg() is stable. */
```
In the memcg_reparent_objcgs(memcg) function:
```
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);
/* Transfer folios from the lruvec list to the parent's. */
spin_unlock(&lruvec_parent->lru_lock);
spin_unlock(&lruvec->lru_lock);
```
After acquiring the lruvec lock, it is necessary to verify whether
the folio has been reparented. If reparenting has occurred, the new
lruvec lock must be reacquired. During the LRU folio reparenting
process, the lruvec lock will also be acquired (this will be
implemented in a subsequent patch). Therefore, folio_memcg() remains
unchanged while the lruvec lock is held.
Given that lruvec_memcg(lruvec) is always equal to folio_memcg(folio)
after the lruvec lock is acquired, the lruvec_memcg_debug() check is
redundant. Hence, it is removed.
This patch serves as a preparation for the reparenting of LRU folios.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/memcontrol.h | 23 ++++++-----------
mm/compaction.c | 29 ++++++++++++++++-----
mm/memcontrol.c | 53 +++++++++++++++++++-------------------
3 files changed, 58 insertions(+), 47 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ca8d4e09cbe7d..6f6b28f8f0f63 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -740,7 +740,11 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
* folio_lruvec - return lruvec for isolating/putting an LRU folio
* @folio: Pointer to the folio.
*
- * This function relies on folio->mem_cgroup being stable.
+ * The caller should hold the rcu read lock to protect the lruvec associated
+ * with the folio from being released. Note that this does not keep the
+ * binding between the folio and the returned lruvec stable: the folio may
+ * still be reparented to an ancestor (unlike folio_lruvec_lock(), which
+ * holds the LRU lock to prevent such a change).
*/
static inline struct lruvec *folio_lruvec(struct folio *folio)
{
@@ -763,15 +767,6 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
unsigned long *flags);
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio);
-#else
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-#endif
-
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1204,11 +1199,6 @@ static inline struct lruvec *folio_lruvec(struct folio *folio)
return &pgdat->__lruvec;
}
-static inline
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
-}
-
static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
return NULL;
@@ -1515,17 +1505,20 @@ static inline struct lruvec *parent_lruvec(struct lruvec *lruvec)
static inline void lruvec_unlock(struct lruvec *lruvec)
{
spin_unlock(&lruvec->lru_lock);
+ rcu_read_unlock();
}
static inline void lruvec_unlock_irq(struct lruvec *lruvec)
{
spin_unlock_irq(&lruvec->lru_lock);
+ rcu_read_unlock();
}
static inline void lruvec_unlock_irqrestore(struct lruvec *lruvec,
unsigned long flags)
{
spin_unlock_irqrestore(&lruvec->lru_lock, flags);
+ rcu_read_unlock();
}
/* Test requires a stable folio->memcg binding, see folio_memcg() */
diff --git a/mm/compaction.c b/mm/compaction.c
index 4dce180f699b4..0d2a0e6239eb4 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -518,6 +518,24 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
return true;
}
+static struct lruvec *
+compact_folio_lruvec_lock_irqsave(struct folio *folio, unsigned long *flags,
+ struct compact_control *cc)
+{
+ struct lruvec *lruvec;
+
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
+ compact_lock_irqsave(&lruvec->lru_lock, flags, cc);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
+
+ return lruvec;
+}
+
/*
* Compaction requires the taking of some coarse locks that are potentially
* very heavily contended. The lock should be periodically unlocked to avoid
@@ -839,7 +857,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
{
pg_data_t *pgdat = cc->zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
- struct lruvec *lruvec;
+ struct lruvec *lruvec = NULL;
unsigned long flags = 0;
struct lruvec *locked = NULL;
struct folio *folio = NULL;
@@ -1153,18 +1171,17 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!folio_test_clear_lru(folio))
goto isolate_fail_put;
- lruvec = folio_lruvec(folio);
+ if (locked)
+ lruvec = folio_lruvec(folio);
/* If we already hold the lock, we can skip some rechecking */
- if (lruvec != locked) {
+ if (lruvec != locked || !locked) {
if (locked)
lruvec_unlock_irqrestore(locked, flags);
- compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
+ lruvec = compact_folio_lruvec_lock_irqsave(folio, &flags, cc);
locked = lruvec;
- lruvec_memcg_debug(lruvec, folio);
-
/*
* Try get exclusive access under lock. If marked for
* skip, the scan is aborted unless the current context
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b3c7d4f346b5..7969dd93d858a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1184,23 +1184,6 @@ void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
}
}
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
-{
- struct mem_cgroup *memcg;
-
- if (mem_cgroup_disabled())
- return;
-
- memcg = folio_memcg(folio);
-
- if (!memcg)
- VM_BUG_ON_FOLIO(!mem_cgroup_is_root(lruvec_memcg(lruvec)), folio);
- else
- VM_BUG_ON_FOLIO(lruvec_memcg(lruvec) != memcg, folio);
-}
-#endif
-
/**
* folio_lruvec_lock - Lock the lruvec for a folio.
* @folio: Pointer to the folio.
@@ -1210,14 +1193,20 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
* - folio_test_lru false
* - folio frozen (refcount of 0)
*
- * Return: The lruvec this folio is on with its lock held.
+ * Return: The lruvec this folio is on with its lock held and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock(struct folio *folio)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock(&lruvec->lru_lock);
+ goto retry;
+ }
return lruvec;
}
@@ -1232,14 +1221,20 @@ struct lruvec *folio_lruvec_lock(struct folio *folio)
* - folio frozen (refcount of 0)
*
* Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock_irq(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ goto retry;
+ }
return lruvec;
}
@@ -1255,15 +1250,21 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
* - folio frozen (refcount of 0)
*
* Return: The lruvec this folio is on with its lock held and interrupts
- * disabled.
+ * disabled and rcu read lock held.
*/
struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
unsigned long *flags)
{
- struct lruvec *lruvec = folio_lruvec(folio);
+ struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
+ lruvec = folio_lruvec(folio);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
- lruvec_memcg_debug(lruvec, folio);
+ if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
return lruvec;
}
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2025-10-28 13:58 ` [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
@ 2025-11-04 6:49 ` kernel test robot
2025-11-04 8:59 ` Qi Zheng
0 siblings, 1 reply; 58+ messages in thread
From: kernel test robot @ 2025-11-04 6:49 UTC (permalink / raw)
To: Qi Zheng
Cc: oe-lkp, lkp, Qi Zheng, cgroups, linux-mm, hannes, hughd, mhocko,
roman.gushchin, shakeel.butt, muchun.song, david, lorenzo.stoakes,
ziy, harry.yoo, imran.f.khan, kamalesh.babulal, axelrasmussen,
yuanchu, weixugc, akpm, linux-kernel, Muchun Song, oliver.sang
Hello,
kernel test robot noticed "WARNING:bad_unlock_balance_detected" on:
commit: dd9e066d9677ca28748a63b16b33c858af75164b ("[PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock")
url: https://github.com/intel-lab-lkp/linux/commits/Qi-Zheng/mm-memcontrol-remove-dead-code-of-checking-parent-memory-cgroup/20251028-221021
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/d5d72d101212e9fb82727c941d581c68728c7f53.1761658311.git.zhengqi.arch@bytedance.com/
patch subject: [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
in testcase: boot
config: i386-randconfig-141-20251031
compiler: gcc-14
test machine: qemu-system-i386 -enable-kvm -cpu SandyBridge -smp 2 -m 4G
(please refer to attached dmesg/kmsg for entire log/backtrace)
+-----------------------------------------------------------------------------------+------------+------------+
| | c856dae1f8 | dd9e066d96 |
+-----------------------------------------------------------------------------------+------------+------------+
| WARNING:bad_unlock_balance_detected | 0 | 87 |
| is_trying_to_release_lock(rcu_read_lock)at | 0 | 87 |
| calltrace:rcu_read_unlock | 0 | 87 |
| WARNING:at_kernel/rcu/tree_plugin.h:#__rcu_read_unlock | 0 | 87 |
| EIP:__rcu_read_unlock | 0 | 87 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/filemap.c | 0 | 87 |
| WARNING:at_kernel/rcu/tree_plugin.h:#rcu_sched_clock_irq | 0 | 87 |
| EIP:rcu_sched_clock_irq | 0 | 87 |
| EIP:evm_inode_alloc_security | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_include/linux/percpu-rwsem.h | 0 | 84 |
| BUG:sleeping_function_called_from_invalid_context_at_include/linux/sched/mm.h | 0 | 86 |
| BUG:workqueue_leaked_atomic,lock_or_RCU:kworker##[#] | 0 | 84 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/slab_common.c | 0 | 23 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/workqueue.c | 0 | 60 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/rwsem.c | 0 | 73 |
| BUG:sleeping_function_called_from_invalid_context_at_lib/strncpy_from_user.c | 0 | 41 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/mmap.c | 0 | 16 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/locking/mutex.c | 0 | 70 |
| BUG:sleeping_function_called_from_invalid_context_at_include/linux/uaccess.h | 0 | 40 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/memory.c | 0 | 55 |
| EIP:kmem_cache_alloc_noprof | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/dcache.c | 0 | 69 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/file_table.c | 0 | 32 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/task_work.c | 0 | 25 |
| BUG:sleeping_function_called_from_invalid_context_at_arch/x86/entry/syscall_32.c | 0 | 46 |
| WARNING:at_kernel/rcu/tree_exp.h:#rcu_exp_handler | 0 | 58 |
| EIP:rcu_exp_handler | 0 | 58 |
| EIP:find_bug | 0 | 13 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/pidfs.c | 0 | 9 |
| EIP:_parse_integer_limit | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_lib/iov_iter.c | 0 | 64 |
| EIP:ep_poll | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/mmu_gather.c | 0 | 18 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/select.c | 0 | 12 |
| BUG:sleeping_function_called_from_invalid_context_at_net/core/sock.c | 0 | 15 |
| EIP:handle_bug | 0 | 25 |
| BUG:sleeping_function_called_from_invalid_context_at_include/linux/pagemap.h | 0 | 14 |
| EIP:inflate_fast | 0 | 20 |
| EIP:zlib_inflate_table | 0 | 2 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/shmem.c | 0 | 2 |
| EIP:unwind_get_return_address | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/vma.c | 0 | 11 |
| EIP:kmem_cache_alloc_lru_noprof | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/mprotect.c | 0 | 2 |
| EIP:console_emit_next_record | 0 | 10 |
| EIP:dump_stack_lvl | 0 | 31 |
| BUG:sleeping_function_called_from_invalid_context_at_include/linux/mmu_notifier.h | 0 | 13 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/vmalloc.c | 0 | 2 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/nsproxy.c | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/exec.c | 0 | 1 |
| EIP:preempt_count_sub | 0 | 1 |
| EIP:kunmap_local_indexed | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/gup.c | 0 | 2 |
| EIP:seqcount_lockdep_reader_access | 0 | 1 |
| EIP:_raw_spin_unlock_irqrestore | 0 | 5 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/exit.c | 0 | 2 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/rmap.c | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_mm/truncate.c | 0 | 1 |
| EIP:finish_task_switch | 0 | 2 |
| EIP:inode_init_once | 0 | 1 |
| EIP:check_lifetime | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/namei.c | 0 | 3 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/readdir.c | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/rcu/srcutree.c | 0 | 1 |
| EIP:memset_no_sanitize_memory | 0 | 1 |
| EIP:lookup_one_qstr_excl | 0 | 1 |
| EIP:__up_read | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_fs/file.c | 0 | 3 |
| EIP:ramfs_symlink | 0 | 1 |
| EIP:do_dentry_open | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/printk/printk.c | 0 | 1 |
| EIP:unwind_next_frame | 0 | 1 |
| EIP:handle_softirqs | 0 | 1 |
| EIP:bad_range | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_include/linux/freezer.h | 0 | 2 |
| EIP:security_file_open | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/fork.c | 0 | 2 |
| EIP:kernel_init_pages | 0 | 1 |
| EIP:filemap_add_folio | 0 | 1 |
| EIP:update_stack_state | 0 | 1 |
| EIP:folio_nr_pages | 0 | 1 |
| EIP:mntput | 0 | 1 |
| BUG:sleeping_function_called_from_invalid_context_at_kernel/sched/completion.c | 0 | 1 |
| EIP:__alloc_frozen_pages_noprof | 0 | 1 |
| EIP:unwind_get_return_address_ptr | 0 | 1 |
| EIP:copy_folio_from_iter_atomic | 0 | 1 |
| EIP:filp_flush | 0 | 1 |
| EIP:__put_user_4 | 0 | 2 |
| EIP:balance_dirty_pages_ratelimited_flags | 0 | 1 |
| EIP:zlib_updatewindow | 0 | 1 |
+-----------------------------------------------------------------------------------+------------+------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202511041421.784bbd5e-lkp@intel.com
[ 1.392214][ T45] WARNING: bad unlock balance detected!
[ 1.393285][ T45] 6.18.0-rc3-00251-gdd9e066d9677 #1 Not tainted
[ 1.394579][ T45] -------------------------------------
[ 1.395442][ T45] kworker/u9:1/45 is trying to release lock (rcu_read_lock) at:
[ 1.396977][ T45] rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
[ 1.398160][ T45] but there are no more locks to release!
[ 1.399337][ T45]
[ 1.399337][ T45] other info that might help us debug this:
[ 1.399707][ T45] 5 locks held by kworker/u9:1/45:
[ 1.399707][ T45] #0: c01ad6c4 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3238)
[ 1.399707][ T45] #1: c08e9f18 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3239)
[ 1.399707][ T45] #2: c02ed27c (sb_writers#2){.+.+}-{0:0}, at: file_start_write+0x1e/0x30
[ 1.399707][ T45] #3: c042e0ec (&sb->s_type->i_mutex_key){++++}-{4:4}, at: generic_file_write_iter (mm/filemap.c:4404)
[ 1.399707][ T45] #4: eaa771a0 (lock#3){+.+.}-{3:3}, at: local_lock_acquire (include/linux/local_lock_internal.h:40)
[ 1.399707][ T45]
[ 1.399707][ T45] stack backtrace:
[ 1.399707][ T45] CPU: 0 UID: 0 PID: 45 Comm: kworker/u9:1 Not tainted 6.18.0-rc3-00251-gdd9e066d9677 #1 PREEMPT(none)
[ 1.399707][ T45] Workqueue: async async_run_entry_fn
[ 1.399707][ T45] Call Trace:
[ 1.399707][ T45] dump_stack_lvl (lib/dump_stack.c:122)
[ 1.399707][ T45] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
[ 1.399707][ T45] dump_stack (lib/dump_stack.c:130)
[ 1.399707][ T45] print_unlock_imbalance_bug (kernel/locking/lockdep.c:5300 kernel/locking/lockdep.c:5272)
[ 1.399707][ T45] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
[ 1.399707][ T45] __lock_release+0x5e/0x150
[ 1.399707][ T45] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
[ 1.399707][ T45] lock_release (kernel/locking/lockdep.c:470 kernel/locking/lockdep.c:5891 kernel/locking/lockdep.c:5875)
[ 1.399707][ T45] ? lru_deactivate_file (mm/swap.c:119)
[ 1.399707][ T45] rcu_read_unlock (include/linux/rcupdate.h:899)
[ 1.399707][ T45] lruvec_unlock_irqrestore (include/linux/memcontrol.h:1522)
[ 1.399707][ T45] folio_batch_move_lru (include/linux/mm.h:1501 mm/swap.c:179)
[ 1.399707][ T45] __folio_batch_add_and_move (mm/swap.c:196 (discriminator 2))
[ 1.399707][ T45] ? lru_deactivate_file (mm/swap.c:119)
[ 1.399707][ T45] folio_add_lru (mm/swap.c:514)
[ 1.399707][ T45] filemap_add_folio (mm/filemap.c:996)
[ 1.399707][ T45] __filemap_get_folio (mm/filemap.c:2023)
[ 1.399707][ T45] simple_write_begin (fs/libfs.c:932 (discriminator 1))
[ 1.399707][ T45] generic_perform_write (mm/filemap.c:4263)
[ 1.399707][ T45] __generic_file_write_iter (mm/filemap.c:4380)
[ 1.399707][ T45] generic_file_write_iter (mm/filemap.c:4406)
[ 1.399707][ T45] __kernel_write_iter (fs/read_write.c:619)
[ 1.399707][ T45] __kernel_write (fs/read_write.c:640)
[ 1.399707][ T45] kernel_write (fs/read_write.c:660 fs/read_write.c:650)
[ 1.399707][ T45] xwrite+0x27/0x80
[ 1.399707][ T45] do_copy (init/initramfs.c:417 (discriminator 1))
[ 1.399707][ T45] write_buffer (init/initramfs.c:470 (discriminator 1))
[ 1.399707][ T45] flush_buffer (init/initramfs.c:482 (discriminator 1))
[ 1.399707][ T45] __gunzip+0x21d/0x2c0
[ 1.399707][ T45] ? bunzip2 (lib/decompress_inflate.c:39)
[ 1.399707][ T45] ? __gunzip+0x2c0/0x2c0
[ 1.399707][ T45] gunzip (lib/decompress_inflate.c:208)
[ 1.399707][ T45] ? write_buffer (init/initramfs.c:476)
[ 1.399707][ T45] ? initrd_load (init/initramfs.c:64)
[ 1.399707][ T45] unpack_to_rootfs (init/initramfs.c:553)
[ 1.399707][ T45] ? write_buffer (init/initramfs.c:476)
[ 1.399707][ T45] ? initrd_load (init/initramfs.c:64)
[ 1.399707][ T45] ? reserve_initrd_mem (init/initramfs.c:719)
[ 1.399707][ T45] do_populate_rootfs (init/initramfs.c:734)
[ 1.399707][ T45] async_run_entry_fn (kernel/async.c:136 (discriminator 1))
[ 1.399707][ T45] ? async_schedule_node (kernel/async.c:118)
[ 1.399707][ T45] process_one_work (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/jump_label.h:262 include/trace/events/workqueue.h:110 kernel/workqueue.c:3268)
[ 1.399707][ T45] process_scheduled_works (kernel/workqueue.c:3346)
[ 1.399707][ T45] worker_thread (include/linux/list.h:381 (discriminator 2) kernel/workqueue.c:952 (discriminator 2) kernel/workqueue.c:3428 (discriminator 2))
[ 1.399707][ T45] kthread (kernel/kthread.c:465)
[ 1.399707][ T45] ? process_scheduled_works (kernel/workqueue.c:3373)
[ 1.399707][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
[ 1.399707][ T45] ret_from_fork (arch/x86/kernel/process.c:164)
[ 1.399707][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
[ 1.399707][ T45] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
[ 1.399707][ T45] entry_INT80_32 (arch/x86/entry/entry_32.S:945)
[ 1.467118][ T32] Callback from call_rcu_tasks() invoked.
[ 1.468370][ T45] ------------[ cut here ]------------
[ 1.469508][ T45] WARNING: CPU: 0 PID: 45 at kernel/rcu/tree_plugin.h:443 __rcu_read_unlock (kernel/rcu/tree_plugin.h:443)
[ 1.471711][ T45] Modules linked in:
[ 1.472490][ T45] CPU: 0 UID: 0 PID: 45 Comm: kworker/u9:1 Not tainted 6.18.0-rc3-00251-gdd9e066d9677 #1 PREEMPT(none)
[ 1.474777][ T45] Workqueue: async async_run_entry_fn
[ 1.475823][ T45] EIP: __rcu_read_unlock (kernel/rcu/tree_plugin.h:443)
[ 1.476872][ T45] Code: 0c d0 56 c2 ff 8b a4 02 00 00 75 11 8b 83 a8 02 00 00 85 c0 74 07 89 d8 e8 7c fe ff ff 8b 83 a4 02 00 00 3d ff ff ff 3f 76 02 <0f> 0b 5b 5d 31 c0 c3 2e 8d b4 26 00 00 00 00 55 89 e5 57 56 89 c6
All code
========
0: 0c d0 or $0xd0,%al
2: 56 push %rsi
3: c2 ff 8b ret $0x8bff
6: a4 movsb %ds:(%rsi),%es:(%rdi)
7: 02 00 add (%rax),%al
9: 00 75 11 add %dh,0x11(%rbp)
c: 8b 83 a8 02 00 00 mov 0x2a8(%rbx),%eax
12: 85 c0 test %eax,%eax
14: 74 07 je 0x1d
16: 89 d8 mov %ebx,%eax
18: e8 7c fe ff ff call 0xfffffffffffffe99
1d: 8b 83 a4 02 00 00 mov 0x2a4(%rbx),%eax
23: 3d ff ff ff 3f cmp $0x3fffffff,%eax
28: 76 02 jbe 0x2c
2a:* 0f 0b ud2 <-- trapping instruction
2c: 5b pop %rbx
2d: 5d pop %rbp
2e: 31 c0 xor %eax,%eax
30: c3 ret
31: 2e 8d b4 26 00 00 00 cs lea 0x0(%rsi,%riz,1),%esi
38: 00
39: 55 push %rbp
3a: 89 e5 mov %esp,%ebp
3c: 57 push %rdi
3d: 56 push %rsi
3e: 89 c6 mov %eax,%esi
Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 5b pop %rbx
3: 5d pop %rbp
4: 31 c0 xor %eax,%eax
6: c3 ret
7: 2e 8d b4 26 00 00 00 cs lea 0x0(%rsi,%riz,1),%esi
e: 00
f: 55 push %rbp
10: 89 e5 mov %esp,%ebp
12: 57 push %rdi
13: 56 push %rsi
14: 89 c6 mov %eax,%esi
[ 1.480862][ T45] EAX: ffffffff EBX: c0bfd640 ECX: 00000000 EDX: 00000000
[ 1.482292][ T45] ESI: eaa771c0 EDI: c119e160 EBP: c08e9c0c ESP: c08e9c08
[ 1.483721][ T45] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
[ 1.485350][ T45] CR0: 80050033 CR2: ffbff000 CR3: 026c6000 CR4: 000006f0
[ 1.486725][ T45] Call Trace:
[ 1.487465][ T45] rcu_read_unlock (include/linux/rcupdate.h:900)
[ 1.488404][ T45] lruvec_unlock_irqrestore (include/linux/memcontrol.h:1522)
[ 1.489570][ T45] folio_batch_move_lru (include/linux/mm.h:1501 mm/swap.c:179)
[ 1.491100][ T45] __folio_batch_add_and_move (mm/swap.c:196 (discriminator 2))
[ 1.492318][ T45] ? lru_deactivate_file (mm/swap.c:119)
[ 1.493827][ T45] folio_add_lru (mm/swap.c:514)
[ 1.494817][ T45] filemap_add_folio (mm/filemap.c:996)
[ 1.495891][ T45] __filemap_get_folio (mm/filemap.c:2023)
[ 1.497735][ T45] simple_write_begin (fs/libfs.c:932 (discriminator 1))
[ 1.498813][ T45] generic_perform_write (mm/filemap.c:4263)
[ 1.500292][ T45] __generic_file_write_iter (mm/filemap.c:4380)
[ 1.501552][ T45] generic_file_write_iter (mm/filemap.c:4406)
[ 1.502790][ T45] __kernel_write_iter (fs/read_write.c:619)
[ 1.504090][ T45] __kernel_write (fs/read_write.c:640)
[ 1.505377][ T45] kernel_write (fs/read_write.c:660 fs/read_write.c:650)
[ 1.506306][ T45] xwrite+0x27/0x80
[ 1.507429][ T45] do_copy (init/initramfs.c:417 (discriminator 1))
[ 1.508361][ T45] write_buffer (init/initramfs.c:470 (discriminator 1))
[ 1.509300][ T45] flush_buffer (init/initramfs.c:482 (discriminator 1))
[ 1.510307][ T45] __gunzip+0x21d/0x2c0
[ 1.511424][ T45] ? bunzip2 (lib/decompress_inflate.c:39)
[ 1.512312][ T45] ? __gunzip+0x2c0/0x2c0
[ 1.513515][ T45] gunzip (lib/decompress_inflate.c:208)
[ 1.514359][ T45] ? write_buffer (init/initramfs.c:476)
[ 1.515304][ T45] ? initrd_load (init/initramfs.c:64)
[ 1.516217][ T45] unpack_to_rootfs (init/initramfs.c:553)
[ 1.517220][ T45] ? write_buffer (init/initramfs.c:476)
[ 1.518175][ T45] ? initrd_load (init/initramfs.c:64)
[ 1.519172][ T45] ? reserve_initrd_mem (init/initramfs.c:719)
[ 1.520375][ T45] do_populate_rootfs (init/initramfs.c:734)
[ 1.521401][ T45] async_run_entry_fn (kernel/async.c:136 (discriminator 1))
[ 1.522378][ T45] ? async_schedule_node (kernel/async.c:118)
[ 1.523492][ T45] process_one_work (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/jump_label.h:262 include/trace/events/workqueue.h:110 kernel/workqueue.c:3268)
[ 1.526446][ T45] process_scheduled_works (kernel/workqueue.c:3346)
[ 1.527578][ T45] worker_thread (include/linux/list.h:381 (discriminator 2) kernel/workqueue.c:952 (discriminator 2) kernel/workqueue.c:3428 (discriminator 2))
[ 1.528573][ T45] kthread (kernel/kthread.c:465)
[ 1.529483][ T45] ? process_scheduled_works (kernel/workqueue.c:3373)
[ 1.530780][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
[ 1.531779][ T45] ret_from_fork (arch/x86/kernel/process.c:164)
[ 1.532664][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
[ 1.533736][ T45] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
[ 1.534846][ T45] entry_INT80_32 (arch/x86/entry/entry_32.S:945)
[ 1.536182][ T45] irq event stamp: 2161
[ 1.536994][ T45] hardirqs last enabled at (2161): _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
[ 1.538763][ T45] hardirqs last disabled at (2160): _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:109 kernel/locking/spinlock.c:162)
[ 1.540674][ T45] softirqs last enabled at (1956): handle_softirqs (kernel/softirq.c:469 (discriminator 2) kernel/softirq.c:650 (discriminator 2))
[ 1.542454][ T45] softirqs last disabled at (1949): __do_softirq (kernel/softirq.c:657)
[ 1.544105][ T45] ---[ end trace 0000000000000000 ]---
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20251104/202511041421.784bbd5e-lkp@intel.com
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
2025-11-04 6:49 ` kernel test robot
@ 2025-11-04 8:59 ` Qi Zheng
0 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-11-04 8:59 UTC (permalink / raw)
To: kernel test robot
Cc: oe-lkp, lkp, cgroups, linux-mm, hannes, hughd, mhocko,
roman.gushchin, shakeel.butt, muchun.song, david, lorenzo.stoakes,
ziy, harry.yoo, imran.f.khan, kamalesh.babulal, axelrasmussen,
yuanchu, weixugc, akpm, linux-kernel, Muchun Song
On 11/4/25 2:49 PM, kernel test robot wrote:
>
[...]
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202511041421.784bbd5e-lkp@intel.com
>
>
> [ 1.392214][ T45] WARNING: bad unlock balance detected!
> [ 1.393285][ T45] 6.18.0-rc3-00251-gdd9e066d9677 #1 Not tainted
> [ 1.394579][ T45] -------------------------------------
> [ 1.395442][ T45] kworker/u9:1/45 is trying to release lock (rcu_read_lock) at:
> [ 1.396977][ T45] rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
> [ 1.398160][ T45] but there are no more locks to release!
> [ 1.399337][ T45]
> [ 1.399337][ T45] other info that might help us debug this:
> [ 1.399707][ T45] 5 locks held by kworker/u9:1/45:
> [ 1.399707][ T45] #0: c01ad6c4 ((wq_completion)async){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3238)
> [ 1.399707][ T45] #1: c08e9f18 ((work_completion)(&entry->work)){+.+.}-{0:0}, at: process_one_work (kernel/workqueue.c:3239)
> [ 1.399707][ T45] #2: c02ed27c (sb_writers#2){.+.+}-{0:0}, at: file_start_write+0x1e/0x30
> [ 1.399707][ T45] #3: c042e0ec (&sb->s_type->i_mutex_key){++++}-{4:4}, at: generic_file_write_iter (mm/filemap.c:4404)
> [ 1.399707][ T45] #4: eaa771a0 (lock#3){+.+.}-{3:3}, at: local_lock_acquire (include/linux/local_lock_internal.h:40)
> [ 1.399707][ T45]
> [ 1.399707][ T45] stack backtrace:
> [ 1.399707][ T45] CPU: 0 UID: 0 PID: 45 Comm: kworker/u9:1 Not tainted 6.18.0-rc3-00251-gdd9e066d9677 #1 PREEMPT(none)
> [ 1.399707][ T45] Workqueue: async async_run_entry_fn
> [ 1.399707][ T45] Call Trace:
> [ 1.399707][ T45] dump_stack_lvl (lib/dump_stack.c:122)
> [ 1.399707][ T45] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
> [ 1.399707][ T45] dump_stack (lib/dump_stack.c:130)
> [ 1.399707][ T45] print_unlock_imbalance_bug (kernel/locking/lockdep.c:5300 kernel/locking/lockdep.c:5272)
> [ 1.399707][ T45] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
> [ 1.399707][ T45] __lock_release+0x5e/0x150
> [ 1.399707][ T45] ? rcu_read_unlock (include/linux/rcupdate.h:341 include/linux/rcupdate.h:897)
> [ 1.399707][ T45] lock_release (kernel/locking/lockdep.c:470 kernel/locking/lockdep.c:5891 kernel/locking/lockdep.c:5875)
> [ 1.399707][ T45] ? lru_deactivate_file (mm/swap.c:119)
> [ 1.399707][ T45] rcu_read_unlock (include/linux/rcupdate.h:899)
> [ 1.399707][ T45] lruvec_unlock_irqrestore (include/linux/memcontrol.h:1522)
> [ 1.399707][ T45] folio_batch_move_lru (include/linux/mm.h:1501 mm/swap.c:179)
> [ 1.399707][ T45] __folio_batch_add_and_move (mm/swap.c:196 (discriminator 2))
> [ 1.399707][ T45] ? lru_deactivate_file (mm/swap.c:119)
> [ 1.399707][ T45] folio_add_lru (mm/swap.c:514)
> [ 1.399707][ T45] filemap_add_folio (mm/filemap.c:996)
> [ 1.399707][ T45] __filemap_get_folio (mm/filemap.c:2023)
> [ 1.399707][ T45] simple_write_begin (fs/libfs.c:932 (discriminator 1))
> [ 1.399707][ T45] generic_perform_write (mm/filemap.c:4263)
> [ 1.399707][ T45] __generic_file_write_iter (mm/filemap.c:4380)
> [ 1.399707][ T45] generic_file_write_iter (mm/filemap.c:4406)
> [ 1.399707][ T45] __kernel_write_iter (fs/read_write.c:619)
> [ 1.399707][ T45] __kernel_write (fs/read_write.c:640)
> [ 1.399707][ T45] kernel_write (fs/read_write.c:660 fs/read_write.c:650)
> [ 1.399707][ T45] xwrite+0x27/0x80
> [ 1.399707][ T45] do_copy (init/initramfs.c:417 (discriminator 1))
> [ 1.399707][ T45] write_buffer (init/initramfs.c:470 (discriminator 1))
> [ 1.399707][ T45] flush_buffer (init/initramfs.c:482 (discriminator 1))
> [ 1.399707][ T45] __gunzip+0x21d/0x2c0
> [ 1.399707][ T45] ? bunzip2 (lib/decompress_inflate.c:39)
> [ 1.399707][ T45] ? __gunzip+0x2c0/0x2c0
> [ 1.399707][ T45] gunzip (lib/decompress_inflate.c:208)
> [ 1.399707][ T45] ? write_buffer (init/initramfs.c:476)
> [ 1.399707][ T45] ? initrd_load (init/initramfs.c:64)
> [ 1.399707][ T45] unpack_to_rootfs (init/initramfs.c:553)
> [ 1.399707][ T45] ? write_buffer (init/initramfs.c:476)
> [ 1.399707][ T45] ? initrd_load (init/initramfs.c:64)
> [ 1.399707][ T45] ? reserve_initrd_mem (init/initramfs.c:719)
> [ 1.399707][ T45] do_populate_rootfs (init/initramfs.c:734)
> [ 1.399707][ T45] async_run_entry_fn (kernel/async.c:136 (discriminator 1))
> [ 1.399707][ T45] ? async_schedule_node (kernel/async.c:118)
> [ 1.399707][ T45] process_one_work (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/jump_label.h:262 include/trace/events/workqueue.h:110 kernel/workqueue.c:3268)
> [ 1.399707][ T45] process_scheduled_works (kernel/workqueue.c:3346)
> [ 1.399707][ T45] worker_thread (include/linux/list.h:381 (discriminator 2) kernel/workqueue.c:952 (discriminator 2) kernel/workqueue.c:3428 (discriminator 2))
> [ 1.399707][ T45] kthread (kernel/kthread.c:465)
> [ 1.399707][ T45] ? process_scheduled_works (kernel/workqueue.c:3373)
> [ 1.399707][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
> [ 1.399707][ T45] ret_from_fork (arch/x86/kernel/process.c:164)
> [ 1.399707][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
> [ 1.399707][ T45] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
> [ 1.399707][ T45] entry_INT80_32 (arch/x86/entry/entry_32.S:945)
> [ 1.467118][ T32] Callback from call_rcu_tasks() invoked.
> [ 1.468370][ T45] ------------[ cut here ]------------
> [ 1.469508][ T45] WARNING: CPU: 0 PID: 45 at kernel/rcu/tree_plugin.h:443 __rcu_read_unlock (kernel/rcu/tree_plugin.h:443)
> [ 1.471711][ T45] Modules linked in:
> [ 1.472490][ T45] CPU: 0 UID: 0 PID: 45 Comm: kworker/u9:1 Not tainted 6.18.0-rc3-00251-gdd9e066d9677 #1 PREEMPT(none)
> [ 1.474777][ T45] Workqueue: async async_run_entry_fn
> [ 1.475823][ T45] EIP: __rcu_read_unlock (kernel/rcu/tree_plugin.h:443)
> [ 1.476872][ T45] Code: 0c d0 56 c2 ff 8b a4 02 00 00 75 11 8b 83 a8 02 00 00 85 c0 74 07 89 d8 e8 7c fe ff ff 8b 83 a4 02 00 00 3d ff ff ff 3f 76 02 <0f> 0b 5b 5d 31 c0 c3 2e 8d b4 26 00 00 00 00 55 89 e5 57 56 89 c6
> All code
> ========
> 0: 0c d0 or $0xd0,%al
> 2: 56 push %rsi
> 3: c2 ff 8b ret $0x8bff
> 6: a4 movsb %ds:(%rsi),%es:(%rdi)
> 7: 02 00 add (%rax),%al
> 9: 00 75 11 add %dh,0x11(%rbp)
> c: 8b 83 a8 02 00 00 mov 0x2a8(%rbx),%eax
> 12: 85 c0 test %eax,%eax
> 14: 74 07 je 0x1d
> 16: 89 d8 mov %ebx,%eax
> 18: e8 7c fe ff ff call 0xfffffffffffffe99
> 1d: 8b 83 a4 02 00 00 mov 0x2a4(%rbx),%eax
> 23: 3d ff ff ff 3f cmp $0x3fffffff,%eax
> 28: 76 02 jbe 0x2c
> 2a:* 0f 0b ud2 <-- trapping instruction
> 2c: 5b pop %rbx
> 2d: 5d pop %rbp
> 2e: 31 c0 xor %eax,%eax
> 30: c3 ret
> 31: 2e 8d b4 26 00 00 00 cs lea 0x0(%rsi,%riz,1),%esi
> 38: 00
> 39: 55 push %rbp
> 3a: 89 e5 mov %esp,%ebp
> 3c: 57 push %rdi
> 3d: 56 push %rsi
> 3e: 89 c6 mov %eax,%esi
>
> Code starting with the faulting instruction
> ===========================================
> 0: 0f 0b ud2
> 2: 5b pop %rbx
> 3: 5d pop %rbp
> 4: 31 c0 xor %eax,%eax
> 6: c3 ret
> 7: 2e 8d b4 26 00 00 00 cs lea 0x0(%rsi,%riz,1),%esi
> e: 00
> f: 55 push %rbp
> 10: 89 e5 mov %esp,%ebp
> 12: 57 push %rdi
> 13: 56 push %rsi
> 14: 89 c6 mov %eax,%esi
> [ 1.480862][ T45] EAX: ffffffff EBX: c0bfd640 ECX: 00000000 EDX: 00000000
> [ 1.482292][ T45] ESI: eaa771c0 EDI: c119e160 EBP: c08e9c0c ESP: c08e9c08
> [ 1.483721][ T45] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010286
> [ 1.485350][ T45] CR0: 80050033 CR2: ffbff000 CR3: 026c6000 CR4: 000006f0
> [ 1.486725][ T45] Call Trace:
> [ 1.487465][ T45] rcu_read_unlock (include/linux/rcupdate.h:900)
> [ 1.488404][ T45] lruvec_unlock_irqrestore (include/linux/memcontrol.h:1522)
> [ 1.489570][ T45] folio_batch_move_lru (include/linux/mm.h:1501 mm/swap.c:179)
> [ 1.491100][ T45] __folio_batch_add_and_move (mm/swap.c:196 (discriminator 2))
> [ 1.492318][ T45] ? lru_deactivate_file (mm/swap.c:119)
> [ 1.493827][ T45] folio_add_lru (mm/swap.c:514)
> [ 1.494817][ T45] filemap_add_folio (mm/filemap.c:996)
> [ 1.495891][ T45] __filemap_get_folio (mm/filemap.c:2023)
> [ 1.497735][ T45] simple_write_begin (fs/libfs.c:932 (discriminator 1))
> [ 1.498813][ T45] generic_perform_write (mm/filemap.c:4263)
> [ 1.500292][ T45] __generic_file_write_iter (mm/filemap.c:4380)
> [ 1.501552][ T45] generic_file_write_iter (mm/filemap.c:4406)
> [ 1.502790][ T45] __kernel_write_iter (fs/read_write.c:619)
> [ 1.504090][ T45] __kernel_write (fs/read_write.c:640)
> [ 1.505377][ T45] kernel_write (fs/read_write.c:660 fs/read_write.c:650)
> [ 1.506306][ T45] xwrite+0x27/0x80
> [ 1.507429][ T45] do_copy (init/initramfs.c:417 (discriminator 1))
> [ 1.508361][ T45] write_buffer (init/initramfs.c:470 (discriminator 1))
> [ 1.509300][ T45] flush_buffer (init/initramfs.c:482 (discriminator 1))
> [ 1.510307][ T45] __gunzip+0x21d/0x2c0
> [ 1.511424][ T45] ? bunzip2 (lib/decompress_inflate.c:39)
> [ 1.512312][ T45] ? __gunzip+0x2c0/0x2c0
> [ 1.513515][ T45] gunzip (lib/decompress_inflate.c:208)
> [ 1.514359][ T45] ? write_buffer (init/initramfs.c:476)
> [ 1.515304][ T45] ? initrd_load (init/initramfs.c:64)
> [ 1.516217][ T45] unpack_to_rootfs (init/initramfs.c:553)
> [ 1.517220][ T45] ? write_buffer (init/initramfs.c:476)
> [ 1.518175][ T45] ? initrd_load (init/initramfs.c:64)
> [ 1.519172][ T45] ? reserve_initrd_mem (init/initramfs.c:719)
> [ 1.520375][ T45] do_populate_rootfs (init/initramfs.c:734)
> [ 1.521401][ T45] async_run_entry_fn (kernel/async.c:136 (discriminator 1))
> [ 1.522378][ T45] ? async_schedule_node (kernel/async.c:118)
> [ 1.523492][ T45] process_one_work (arch/x86/include/asm/atomic.h:23 include/linux/atomic/atomic-arch-fallback.h:457 include/linux/jump_label.h:262 include/trace/events/workqueue.h:110 kernel/workqueue.c:3268)
> [ 1.526446][ T45] process_scheduled_works (kernel/workqueue.c:3346)
> [ 1.527578][ T45] worker_thread (include/linux/list.h:381 (discriminator 2) kernel/workqueue.c:952 (discriminator 2) kernel/workqueue.c:3428 (discriminator 2))
> [ 1.528573][ T45] kthread (kernel/kthread.c:465)
> [ 1.529483][ T45] ? process_scheduled_works (kernel/workqueue.c:3373)
> [ 1.530780][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
> [ 1.531779][ T45] ret_from_fork (arch/x86/kernel/process.c:164)
> [ 1.532664][ T45] ? kthread_is_per_cpu (kernel/kthread.c:412)
> [ 1.533736][ T45] ret_from_fork_asm (arch/x86/entry/entry_32.S:737)
> [ 1.534846][ T45] entry_INT80_32 (arch/x86/entry/entry_32.S:945)
> [ 1.536182][ T45] irq event stamp: 2161
> [ 1.536994][ T45] hardirqs last enabled at (2161): _raw_spin_unlock_irqrestore (arch/x86/include/asm/irqflags.h:26 arch/x86/include/asm/irqflags.h:109 arch/x86/include/asm/irqflags.h:151 include/linux/spinlock_api_smp.h:151 kernel/locking/spinlock.c:194)
> [ 1.538763][ T45] hardirqs last disabled at (2160): _raw_spin_lock_irqsave (include/linux/spinlock_api_smp.h:109 kernel/locking/spinlock.c:162)
> [ 1.540674][ T45] softirqs last enabled at (1956): handle_softirqs (kernel/softirq.c:469 (discriminator 2) kernel/softirq.c:650 (discriminator 2))
> [ 1.542454][ T45] softirqs last disabled at (1949): __do_softirq (kernel/softirq.c:657)
> [ 1.544105][ T45] ---[ end trace 0000000000000000 ]---
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20251104/202511041421.784bbd5e-lkp@intel.com
In this config file, CONFIG_MEMCG is not set:

# CONFIG_MEMCG is not set

In this case, the !CONFIG_MEMCG variants of folio_lruvec_lock*() were not
modified to take the rcu read lock, so lruvec_unlock*() releases an rcu
read lock that was never acquired. Will fix it in the next version.
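For reference, a minimal sketch of what balancing the !CONFIG_MEMCG stubs
could look like (an illustration of the idea only, not the actual v2
change; the exact stub layout in the header is an assumption here):

```
/*
 * Illustrative only: with !CONFIG_MEMCG, the lock helpers must also take
 * the rcu read lock so that lruvec_unlock*() can unconditionally drop it.
 */
static inline struct lruvec *folio_lruvec_lock(struct folio *folio)
{
	struct lruvec *lruvec = folio_lruvec(folio);

	rcu_read_lock();
	spin_lock(&lruvec->lru_lock);

	return lruvec;
}

static inline struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
							unsigned long *flags)
{
	struct lruvec *lruvec = folio_lruvec(folio);

	rcu_read_lock();
	spin_lock_irqsave(&lruvec->lru_lock, *flags);

	return lruvec;
}
```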
Thanks,
Qi
>
>
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (20 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
` (5 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
To resolve the dying memcg issue, we need to reparent the LRU folios of a
child memcg to its parent memcg. For the traditional LRU, each lruvec of
every memcg comprises four LRU lists. Due to the symmetry of these lists,
it is feasible to transfer them from a memcg to its parent memcg during
the reparenting process.
This commit implements the specific function, which will be used during
the reparenting process, as sketched below.
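As a rough illustration of the intended usage (an assumption about a later
patch in this series, not code contained in this commit), the reparenting
path would call the new helper with the lru locks of both memcgs held:

```
/*
 * Hypothetical caller sketch: with the lru locks of both src and dst held
 * (see the folio_lruvec_lock() retry scheme in patch 21), splicing the
 * symmetric LRU lists cannot race with folio isolation or putback.
 */
static void example_reparent_traditional_lru(struct mem_cgroup *src,
					     struct mem_cgroup *dst)
{
	/* Moves every LRU list of every node from the child to the parent. */
	lru_reparent_memcg(src, dst);
}
```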
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/mmzone.h | 4 ++++
mm/vmscan.c | 39 +++++++++++++++++++++++++++++++++++++++
2 files changed, 43 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4398e027f450e..0d8776e5b6747 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -366,6 +366,10 @@ enum lruvec_flags {
LRUVEC_NODE_CONGESTED,
};
+#ifdef CONFIG_MEMCG
+void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst);
+#endif /* CONFIG_MEMCG */
+
#endif /* !__GENERATING_BOUNDS_H */
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 676e6270e5b45..7aa8e1472d10d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2649,6 +2649,45 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
lruvec_memcg(lruvec));
}
+
+#ifdef CONFIG_MEMCG
+static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
+ enum lru_list lru)
+{
+ int zid;
+ struct mem_cgroup_per_node *mz_src, *mz_dst;
+
+ mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
+ mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
+
+ if (lru != LRU_UNEVICTABLE)
+ list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
+ mz_src->lru_zone_size[zid][lru] = 0;
+ }
+}
+
+void lru_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ int nid;
+
+ for_each_node(nid) {
+ enum lru_list lru;
+ struct lruvec *src_lruvec, *dst_lruvec;
+
+ src_lruvec = mem_cgroup_lruvec(src, NODE_DATA(nid));
+ dst_lruvec = mem_cgroup_lruvec(dst, NODE_DATA(nid));
+ dst_lruvec->anon_cost += src_lruvec->anon_cost;
+ dst_lruvec->file_cost += src_lruvec->file_cost;
+
+ for_each_lru(lru)
+ lruvec_reparent_lru(src_lruvec, dst_lruvec, lru);
+ }
+}
+#endif
+
#ifdef CONFIG_LRU_GEN
#ifdef CONFIG_LRU_GEN_ENABLED
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (21 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
` (4 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
Similar to traditional LRU folios, in order to solve the dying memcg
problem, we also need to reparent MGLRU folios to the parent memcg when a
memcg is offlined.
However, there are the following challenges:
1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, and the
   number of generations in the parent and child memcg may differ, so we
   cannot simply transfer MGLRU folios from the child memcg to the parent
   memcg as we did for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
   traverse these folios while holding the lru lock, otherwise it may
   cause a softlockup.
3. In walk_update_folio(), the gen of a folio and the corresponding lru
   size may be updated, but the folio is not immediately moved to the
   corresponding lru list. Therefore, an LRU list may contain folios of
   different generations.
4. In lru_gen_del_folio(), the generation to which a folio belongs is
   derived from the generation information in folio->flags, and the
   corresponding LRU size is updated accordingly. Therefore, we need to
   keep the lru sizes correct during reparenting, otherwise they would be
   updated incorrectly in lru_gen_del_folio().
Finally, this patch chooses a compromise: during reparenting, splice each
lru list in the child memcg onto the lru list of the same generation in
the parent memcg. In order to guarantee that the parent memcg has a
matching generation for each child generation, the number of generations
in the parent memcg is first raised to MAX_NR_GENS before reparenting.
Of course, the same generation has different meanings in the parent and
child memcg, so this blurs the hot and cold information of the folios.
Other than that, this method is simple enough, keeps the lru sizes
correct, and avoids some concurrency issues (such as
lru_gen_del_folio()).
To prepare for the above work, this commit implements the specific
functions, which will be used during reparenting.
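As a rough illustration (an assumption about how a later patch in the
series wires these helpers together, not code contained in this commit),
the reparenting path could look like:

```
/*
 * Hypothetical caller sketch. max_lru_gen_memcg() calls cond_resched() and
 * therefore cannot run under the lru locks; the recheck under the locks
 * catches the case where a parent generation was consumed again in the
 * meantime.
 */
static void example_reparent_mglru(struct mem_cgroup *src, struct mem_cgroup *dst)
{
retry:
	/* Raise the parent to MAX_NR_GENS so every child gen has a target. */
	max_lru_gen_memcg(dst);

	/* ... take the lru locks of both src and dst here ... */

	if (!recheck_lru_gen_max_memcg(dst)) {
		/* ... drop the locks ... */
		goto retry;
	}

	/* Splice each child generation onto the same generation of the parent. */
	lru_gen_reparent_memcg(src, dst);

	/* ... drop the locks ... */
}
```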
Suggested-by: Harry Yoo <harry.yoo@oracle.com>
Suggested-by: Imran Khan <imran.f.khan@oracle.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/mmzone.h | 16 ++++++++
mm/vmscan.c | 86 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 102 insertions(+)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0d8776e5b6747..0a71bf015d12b 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -628,6 +628,9 @@ void lru_gen_online_memcg(struct mem_cgroup *memcg);
void lru_gen_offline_memcg(struct mem_cgroup *memcg);
void lru_gen_release_memcg(struct mem_cgroup *memcg);
void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid);
+void max_lru_gen_memcg(struct mem_cgroup *memcg);
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg);
+void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst);
#else /* !CONFIG_LRU_GEN */
@@ -668,6 +671,19 @@ static inline void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
{
}
+static inline void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+}
+
+static inline bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+ return true;
+}
+
+static inline void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+}
+
#endif /* CONFIG_LRU_GEN */
struct lruvec {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7aa8e1472d10d..3ee7fb96b8aeb 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4468,6 +4468,92 @@ void lru_gen_soft_reclaim(struct mem_cgroup *memcg, int nid)
lru_gen_rotate_memcg(lruvec, MEMCG_LRU_HEAD);
}
+bool recheck_lru_gen_max_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+ int type;
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ if (get_nr_gens(lruvec, type) != MAX_NR_GENS)
+ return false;
+ }
+ }
+
+ return true;
+}
+
+/*
+ * We need to ensure that the folios of child memcg can be reparented to the
+ * same gen of the parent memcg, so the gens of the parent memcg needed be
+ * incremented to the MAX_NR_GENS before reparenting.
+ */
+void max_lru_gen_memcg(struct mem_cgroup *memcg)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *lruvec = get_lruvec(memcg, nid);
+ int type;
+
+ for (type = 0; type < ANON_AND_FILE; type++) {
+ while (get_nr_gens(lruvec, type) < MAX_NR_GENS) {
+ DEFINE_MAX_SEQ(lruvec);
+
+ inc_max_seq(lruvec, max_seq, mem_cgroup_swappiness(memcg));
+ cond_resched();
+ }
+ }
+ }
+}
+
+static void __lru_gen_reparent_memcg(struct lruvec *src_lruvec, struct lruvec *dst_lruvec,
+ int zone, int type)
+{
+ struct lru_gen_folio *src_lrugen, *dst_lrugen;
+ enum lru_list lru = type * LRU_INACTIVE_FILE;
+ int i;
+
+ src_lrugen = &src_lruvec->lrugen;
+ dst_lrugen = &dst_lruvec->lrugen;
+
+ for (i = 0; i < get_nr_gens(src_lruvec, type); i++) {
+ int gen = lru_gen_from_seq(src_lrugen->max_seq - i);
+ int nr_pages = src_lrugen->nr_pages[gen][type][zone];
+ int src_lru_active = lru_gen_is_active(src_lruvec, gen) ? LRU_ACTIVE : 0;
+ int dst_lru_active = lru_gen_is_active(dst_lruvec, gen) ? LRU_ACTIVE : 0;
+
+ list_splice_tail_init(&src_lrugen->folios[gen][type][zone],
+ &dst_lrugen->folios[gen][type][zone]);
+
+ WRITE_ONCE(src_lrugen->nr_pages[gen][type][zone], 0);
+ WRITE_ONCE(dst_lrugen->nr_pages[gen][type][zone],
+ dst_lrugen->nr_pages[gen][type][zone] + nr_pages);
+
+ __update_lru_size(src_lruvec, lru + src_lru_active, zone, -nr_pages);
+ __update_lru_size(dst_lruvec, lru + dst_lru_active, zone, nr_pages);
+ }
+}
+
+void lru_gen_reparent_memcg(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ int nid;
+
+ for_each_node(nid) {
+ struct lruvec *src_lruvec, *dst_lruvec;
+ int type, zone;
+
+ src_lruvec = get_lruvec(src, nid);
+ dst_lruvec = get_lruvec(dst, nid);
+
+ for (zone = 0; zone < MAX_NR_ZONES; zone++)
+ for (type = 0; type < ANON_AND_FILE; type++)
+ __lru_gen_reparent_memcg(src_lruvec, dst_lruvec, zone, type);
+ }
+}
+
#endif /* CONFIG_MEMCG */
/******************************************************************************
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs()
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (22 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
` (3 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Qi Zheng
From: Qi Zheng <zhengqi.arch@bytedance.com>
Refactor memcg_reparent_objcgs() to facilitate the subsequent reparenting of
LRU folios here.
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
mm/memcontrol.c | 37 +++++++++++++++++++++++++++----------
mm/vmscan.c | 1 -
2 files changed, 27 insertions(+), 11 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7969dd93d858a..ee98c9e8fdcea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -206,24 +206,41 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
+static void __memcg_reparent_objcgs(struct mem_cgroup *src,
+ struct mem_cgroup *dst)
{
struct obj_cgroup *objcg, *iter;
- struct mem_cgroup *parent = parent_mem_cgroup(memcg);
-
- objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
-
- spin_lock_irq(&objcg_lock);
+ objcg = rcu_replace_pointer(src->objcg, NULL, true);
/* 1) Ready to reparent active objcg. */
- list_add(&objcg->list, &memcg->objcg_list);
+ list_add(&objcg->list, &src->objcg_list);
/* 2) Reparent active objcg and already reparented objcgs to parent. */
- list_for_each_entry(iter, &memcg->objcg_list, list)
- WRITE_ONCE(iter->memcg, parent);
+ list_for_each_entry(iter, &src->objcg_list, list)
+ WRITE_ONCE(iter->memcg, dst);
/* 3) Move already reparented objcgs to the parent's list */
- list_splice(&memcg->objcg_list, &parent->objcg_list);
+ list_splice(&src->objcg_list, &dst->objcg_list);
+}
+
+static void reparent_locks(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
+ spin_lock_irq(&objcg_lock);
+}
+static void reparent_unlocks(struct mem_cgroup *src, struct mem_cgroup *dst)
+{
spin_unlock_irq(&objcg_lock);
+}
+
+static void memcg_reparent_objcgs(struct mem_cgroup *src)
+{
+ struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
+ struct mem_cgroup *dst = parent_mem_cgroup(src);
+
+ reparent_locks(src, dst);
+
+ __memcg_reparent_objcgs(src, dst);
+
+ reparent_unlocks(src, dst);
percpu_ref_kill(&objcg->refcnt);
}
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ee7fb96b8aeb..82c4cc6edbca5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2649,7 +2649,6 @@ static bool can_age_anon_pages(struct lruvec *lruvec,
lruvec_memcg(lruvec));
}
-
#ifdef CONFIG_MEMCG
static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
enum lru_list lru)
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread* [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (23 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs() Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 13:58 ` [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
` (2 subsequent siblings)
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
Pagecache pages are charged at allocation time and hold a reference
to the original memory cgroup until reclaimed. Depending on memory
pressure, page sharing patterns between different cgroups and cgroup
creation/destruction rates, many dying memory cgroups can be pinned
by pagecache pages, reducing page reclaim efficiency and wasting
memory. Converting LRU folios and most other raw memory cgroup pins
to object cgroup pointers can fix this long-standing problem.
Finally, folio->memcg_data of LRU folios and kmem folios will always
contain an object cgroup pointer. The folio->memcg_data of slab
folios will point to a vector of object cgroups.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/memcontrol.h | 77 +++++----------
mm/memcontrol-v1.c | 15 +--
mm/memcontrol.c | 189 +++++++++++++++++++++++--------------
3 files changed, 150 insertions(+), 131 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6f6b28f8f0f63..f87aa43d8e54a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -369,9 +369,6 @@ enum objext_flags {
#define OBJEXTS_FLAGS_MASK (__NR_OBJEXTS_FLAGS - 1)
#ifdef CONFIG_MEMCG
-
-static inline bool folio_memcg_kmem(struct folio *folio);
-
/*
* After the initialization objcg->memcg is always pointing at
* a valid memcg, but can be atomically swapped to the parent memcg.
@@ -385,43 +382,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
}
/*
- * __folio_memcg - Get the memory cgroup associated with a non-kmem folio
- * @folio: Pointer to the folio.
- *
- * Returns a pointer to the memory cgroup associated with the folio,
- * or NULL. This function assumes that the folio is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * kmem folios.
- */
-static inline struct mem_cgroup *__folio_memcg(struct folio *folio)
-{
- unsigned long memcg_data = folio->memcg_data;
-
- VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
- VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_KMEM, folio);
-
- return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
-}
-
-/*
- * __folio_objcg - get the object cgroup associated with a kmem folio.
+ * folio_objcg - get the object cgroup associated with a folio.
* @folio: Pointer to the folio.
*
* Returns a pointer to the object cgroup associated with the folio,
* or NULL. This function assumes that the folio is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of folios, e.g. slab folios or ex-slab folios or
- * LRU folios.
+ * proper object cgroup pointer.
*/
-static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
+static inline struct obj_cgroup *folio_objcg(struct folio *folio)
{
unsigned long memcg_data = folio->memcg_data;
VM_BUG_ON_FOLIO(folio_test_slab(folio), folio);
VM_BUG_ON_FOLIO(memcg_data & MEMCG_DATA_OBJEXTS, folio);
- VM_BUG_ON_FOLIO(!(memcg_data & MEMCG_DATA_KMEM), folio);
return (struct obj_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
}
@@ -435,21 +408,30 @@ static inline struct obj_cgroup *__folio_objcg(struct folio *folio)
* proper memory cgroup pointer. It's not safe to call this function
* against some type of folios, e.g. slab folios or ex-slab folios.
*
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * For a folio any of the following ensures folio and objcg binding stability:
*
* - the folio lock
* - LRU isolation
* - exclusive reference
*
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * Based on the stable binding of folio and objcg, for a folio any of the
+ * following ensures folio and memcg binding stability:
+ *
+ * - cgroup_mutex
+ * - the lruvec lock
+ *
+ * If the caller only wants to ensure that the page counters of the memcg are
+ * updated correctly, ensuring the binding stability of the folio and objcg
+ * is sufficient.
+ *
+ * Note: The caller should hold an rcu read lock or cgroup_mutex to protect
+ * the memcg associated with a folio from being released.
*/
static inline struct mem_cgroup *folio_memcg(struct folio *folio)
{
- if (folio_memcg_kmem(folio))
- return obj_cgroup_memcg(__folio_objcg(folio));
- return __folio_memcg(folio);
+ struct obj_cgroup *objcg = folio_objcg(folio);
+
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
/*
@@ -473,15 +455,10 @@ static inline bool folio_memcg_charged(struct folio *folio)
* has an associated memory cgroup pointer or an object cgroups vector or
* an object cgroup.
*
- * For a non-kmem folio any of the following ensures folio and memcg binding
- * stability:
+ * For the rules on binding a page to its objcg or memcg, see folio_memcg().
*
- * - the folio lock
- * - LRU isolation
- * - exclusive reference
- *
- * For a kmem folio a caller should hold an rcu read lock to protect memcg
- * associated with a kmem folio from being released.
+ * A caller should hold an rcu read lock to protect memcg associated with a
+ * page from being released.
*/
static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
{
@@ -490,18 +467,14 @@ static inline struct mem_cgroup *folio_memcg_check(struct folio *folio)
* for slabs, READ_ONCE() should be used here.
*/
unsigned long memcg_data = READ_ONCE(folio->memcg_data);
+ struct obj_cgroup *objcg;
if (memcg_data & MEMCG_DATA_OBJEXTS)
return NULL;
- if (memcg_data & MEMCG_DATA_KMEM) {
- struct obj_cgroup *objcg;
-
- objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
- return obj_cgroup_memcg(objcg);
- }
+ objcg = (void *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
- return (struct mem_cgroup *)(memcg_data & ~OBJEXTS_FLAGS_MASK);
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
static inline struct mem_cgroup *page_memcg_check(struct page *page)
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff7426..23c07df2063c8 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -591,6 +591,7 @@ void memcg1_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
void memcg1_swapout(struct folio *folio, swp_entry_t entry)
{
struct mem_cgroup *memcg, *swap_memcg;
+ struct obj_cgroup *objcg;
unsigned int nr_entries;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
@@ -602,12 +603,13 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
if (!do_memsw_account())
return;
- memcg = folio_memcg(folio);
-
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
+ objcg = folio_objcg(folio);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+ if (!objcg)
return;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
/*
* In case the memcg owning these pages has been offlined and doesn't
* have an ID allocated to it anymore, charge the closest online
@@ -625,7 +627,7 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
folio_unqueue_deferred_split(folio);
folio->memcg_data = 0;
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
page_counter_uncharge(&memcg->memory, nr_entries);
if (memcg != swap_memcg) {
@@ -646,7 +648,8 @@ void memcg1_swapout(struct folio *folio, swp_entry_t entry)
preempt_enable_nested();
memcg1_check_events(memcg, folio_nid(folio));
- css_put(&memcg->css);
+ rcu_read_unlock();
+ obj_cgroup_put(objcg);
}
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ee98c9e8fdcea..759197e19c50b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -223,22 +223,55 @@ static void __memcg_reparent_objcgs(struct mem_cgroup *src,
static void reparent_locks(struct mem_cgroup *src, struct mem_cgroup *dst)
{
+ int nid, nest = 0;
+
spin_lock_irq(&objcg_lock);
+ for_each_node(nid) {
+ spin_lock_nested(&mem_cgroup_lruvec(src,
+ NODE_DATA(nid))->lru_lock, nest++);
+ spin_lock_nested(&mem_cgroup_lruvec(dst,
+ NODE_DATA(nid))->lru_lock, nest++);
+ }
}
static void reparent_unlocks(struct mem_cgroup *src, struct mem_cgroup *dst)
{
+ int nid;
+
+ for_each_node(nid) {
+ spin_unlock(&mem_cgroup_lruvec(dst, NODE_DATA(nid))->lru_lock);
+ spin_unlock(&mem_cgroup_lruvec(src, NODE_DATA(nid))->lru_lock);
+ }
spin_unlock_irq(&objcg_lock);
}
+static void memcg_reparent_lru_folios(struct mem_cgroup *src,
+ struct mem_cgroup *dst)
+{
+ if (lru_gen_enabled())
+ lru_gen_reparent_memcg(src, dst);
+ else
+ lru_reparent_memcg(src, dst);
+}
+
static void memcg_reparent_objcgs(struct mem_cgroup *src)
{
struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
struct mem_cgroup *dst = parent_mem_cgroup(src);
+retry:
+ if (lru_gen_enabled())
+ max_lru_gen_memcg(dst);
+
reparent_locks(src, dst);
+ if (lru_gen_enabled() && !recheck_lru_gen_max_memcg(dst)) {
+ reparent_unlocks(src, dst);
+ cond_resched();
+ goto retry;
+ }
__memcg_reparent_objcgs(src, dst);
+ memcg_reparent_lru_folios(src, dst);
reparent_unlocks(src, dst);
@@ -989,6 +1022,8 @@ struct mem_cgroup *get_mem_cgroup_from_current(void)
/**
* get_mem_cgroup_from_folio - Obtain a reference on a given folio's memcg.
* @folio: folio from which memcg should be extracted.
+ *
+ * For the rules on binding a folio to its objcg or memcg, see folio_memcg().
*/
struct mem_cgroup *get_mem_cgroup_from_folio(struct folio *folio)
{
@@ -2557,17 +2592,17 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return try_charge_memcg(memcg, gfp_mask, nr_pages);
}
-static void commit_charge(struct folio *folio, struct mem_cgroup *memcg)
+static void commit_charge(struct folio *folio, struct obj_cgroup *objcg)
{
VM_BUG_ON_FOLIO(folio_memcg_charged(folio), folio);
/*
- * Any of the following ensures page's memcg stability:
+ * Any of the following ensures folio's objcg stability:
*
* - the page lock
* - LRU isolation
* - exclusive reference
*/
- folio->memcg_data = (unsigned long)memcg;
+ folio->memcg_data = (unsigned long)objcg;
}
#ifdef CONFIG_MEMCG_NMI_SAFETY_REQUIRES_ATOMIC
@@ -2679,6 +2714,17 @@ static struct obj_cgroup *__get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
return NULL;
}
+static inline struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+ struct obj_cgroup *objcg;
+
+ rcu_read_lock();
+ objcg = __get_obj_cgroup_from_memcg(memcg);
+ rcu_read_unlock();
+
+ return objcg;
+}
+
static struct obj_cgroup *current_objcg_update(void)
{
struct mem_cgroup *memcg;
@@ -2779,17 +2825,10 @@ struct obj_cgroup *get_obj_cgroup_from_folio(struct folio *folio)
{
struct obj_cgroup *objcg;
- if (!memcg_kmem_online())
- return NULL;
-
- if (folio_memcg_kmem(folio)) {
- objcg = __folio_objcg(folio);
+ objcg = folio_objcg(folio);
+ if (objcg)
obj_cgroup_get(objcg);
- } else {
- rcu_read_lock();
- objcg = __get_obj_cgroup_from_memcg(__folio_memcg(folio));
- rcu_read_unlock();
- }
+
return objcg;
}
@@ -3296,7 +3335,7 @@ void folio_split_memcg_refs(struct folio *folio, unsigned old_order,
return;
new_refs = (1 << (old_order - new_order)) - 1;
- css_get_many(&__folio_memcg(folio)->css, new_refs);
+ obj_cgroup_get_many(folio_objcg(folio), new_refs);
}
unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
@@ -4745,16 +4784,20 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
gfp_t gfp)
{
- int ret;
-
- ret = try_charge(memcg, gfp, folio_nr_pages(folio));
- if (ret)
- goto out;
+ int ret = 0;
+ struct obj_cgroup *objcg;
- css_get(&memcg->css);
- commit_charge(folio, memcg);
+ objcg = get_obj_cgroup_from_memcg(memcg);
+ /* Do not account at the root objcg level. */
+ if (!obj_cgroup_is_root(objcg))
+ ret = try_charge(memcg, gfp, folio_nr_pages(folio));
+ if (ret) {
+ obj_cgroup_put(objcg);
+ return ret;
+ }
+ commit_charge(folio, objcg);
memcg1_commit_charge(folio, memcg);
-out:
+
return ret;
}
@@ -4840,7 +4883,7 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
}
struct uncharge_gather {
- struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
unsigned long nr_memory;
unsigned long pgpgout;
unsigned long nr_kmem;
@@ -4854,58 +4897,52 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)
static void uncharge_batch(const struct uncharge_gather *ug)
{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(ug->objcg);
if (ug->nr_memory) {
- memcg_uncharge(ug->memcg, ug->nr_memory);
+ memcg_uncharge(memcg, ug->nr_memory);
if (ug->nr_kmem) {
- mod_memcg_state(ug->memcg, MEMCG_KMEM, -ug->nr_kmem);
- memcg1_account_kmem(ug->memcg, -ug->nr_kmem);
+ mod_memcg_state(memcg, MEMCG_KMEM, -ug->nr_kmem);
+ memcg1_account_kmem(memcg, -ug->nr_kmem);
}
- memcg1_oom_recover(ug->memcg);
+ memcg1_oom_recover(memcg);
}
- memcg1_uncharge_batch(ug->memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+ memcg1_uncharge_batch(memcg, ug->pgpgout, ug->nr_memory, ug->nid);
+ rcu_read_unlock();
/* drop reference from uncharge_folio */
- css_put(&ug->memcg->css);
+ obj_cgroup_put(ug->objcg);
}
static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
{
long nr_pages;
- struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
/*
* Nobody should be changing or seriously looking at
- * folio memcg or objcg at this point, we have fully
- * exclusive access to the folio.
+ * folio objcg at this point, we have fully exclusive
+ * access to the folio.
*/
- if (folio_memcg_kmem(folio)) {
- objcg = __folio_objcg(folio);
- /*
- * This get matches the put at the end of the function and
- * kmem pages do not hold memcg references anymore.
- */
- memcg = get_mem_cgroup_from_objcg(objcg);
- } else {
- memcg = __folio_memcg(folio);
- }
-
- if (!memcg)
+ objcg = folio_objcg(folio);
+ if (!objcg)
return;
- if (ug->memcg != memcg) {
- if (ug->memcg) {
+ if (ug->objcg != objcg) {
+ if (ug->objcg) {
uncharge_batch(ug);
uncharge_gather_clear(ug);
}
- ug->memcg = memcg;
+ ug->objcg = objcg;
ug->nid = folio_nid(folio);
- /* pairs with css_put in uncharge_batch */
- css_get(&memcg->css);
+ /* pairs with obj_cgroup_put in uncharge_batch */
+ obj_cgroup_get(objcg);
}
nr_pages = folio_nr_pages(folio);
@@ -4913,20 +4950,17 @@ static void uncharge_folio(struct folio *folio, struct uncharge_gather *ug)
if (folio_memcg_kmem(folio)) {
ug->nr_memory += nr_pages;
ug->nr_kmem += nr_pages;
-
- folio->memcg_data = 0;
- obj_cgroup_put(objcg);
} else {
/* LRU pages aren't accounted at the root level */
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
ug->nr_memory += nr_pages;
ug->pgpgout++;
WARN_ON_ONCE(folio_unqueue_deferred_split(folio));
- folio->memcg_data = 0;
}
- css_put(&memcg->css);
+ folio->memcg_data = 0;
+ obj_cgroup_put(objcg);
}
void __mem_cgroup_uncharge(struct folio *folio)
@@ -4950,7 +4984,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
uncharge_gather_clear(&ug);
for (i = 0; i < folios->nr; i++)
uncharge_folio(folios->folios[i], &ug);
- if (ug.memcg)
+ if (ug.objcg)
uncharge_batch(&ug);
}
@@ -4967,6 +5001,7 @@ void __mem_cgroup_uncharge_folios(struct folio_batch *folios)
void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
{
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
long nr_pages = folio_nr_pages(new);
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
@@ -4981,21 +5016,24 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
if (folio_memcg_charged(new))
return;
- memcg = folio_memcg(old);
- VM_WARN_ON_ONCE_FOLIO(!memcg, old);
- if (!memcg)
+ objcg = folio_objcg(old);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, old);
+ if (!objcg)
return;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
/* Force-charge the new page. The old one will be freed soon */
- if (!mem_cgroup_is_root(memcg)) {
+ if (!obj_cgroup_is_root(objcg)) {
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
}
- css_get(&memcg->css);
- commit_charge(new, memcg);
+ obj_cgroup_get(objcg);
+ commit_charge(new, objcg);
memcg1_commit_charge(new, memcg);
+ rcu_read_unlock();
}
/**
@@ -5011,7 +5049,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
*/
void mem_cgroup_migrate(struct folio *old, struct folio *new)
{
- struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
VM_BUG_ON_FOLIO(!folio_test_locked(old), old);
VM_BUG_ON_FOLIO(!folio_test_locked(new), new);
@@ -5022,18 +5060,18 @@ void mem_cgroup_migrate(struct folio *old, struct folio *new)
if (mem_cgroup_disabled())
return;
- memcg = folio_memcg(old);
+ objcg = folio_objcg(old);
/*
- * Note that it is normal to see !memcg for a hugetlb folio.
+ * Note that it is normal to see !objcg for a hugetlb folio.
* For e.g., it could have been allocated when memory_hugetlb_accounting
* was not selected.
*/
- VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !memcg, old);
- if (!memcg)
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_hugetlb(old) && !objcg, old);
+ if (!objcg)
return;
- /* Transfer the charge and the css ref */
- commit_charge(new, memcg);
+ /* Transfer the charge and the objcg ref */
+ commit_charge(new, objcg);
/* Warning should never happen, so don't worry about refcount non-0 */
WARN_ON_ONCE(folio_unqueue_deferred_split(old));
@@ -5208,22 +5246,27 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
unsigned int nr_pages = folio_nr_pages(folio);
struct page_counter *counter;
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
if (do_memsw_account())
return 0;
- memcg = folio_memcg(folio);
-
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
+ objcg = folio_objcg(folio);
+ VM_WARN_ON_ONCE_FOLIO(!objcg, folio);
+ if (!objcg)
return 0;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
if (!entry.val) {
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
+ rcu_read_unlock();
return 0;
}
memcg = mem_cgroup_id_get_online(memcg);
+ /* memcg is pinned by the memcg ID. */
+ rcu_read_unlock();
if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (24 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios Qi Zheng
@ 2025-10-28 13:58 ` Qi Zheng
2025-10-28 20:58 ` [syzbot ci] Re: Eliminate Dying Memory Cgroup syzbot ci
2025-10-29 7:53 ` [PATCH v1 00/26] " Michal Hocko
27 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-28 13:58 UTC (permalink / raw)
To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
david, lorenzo.stoakes, ziy, harry.yoo, imran.f.khan,
kamalesh.babulal, axelrasmussen, yuanchu, weixugc, akpm
Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng
From: Muchun Song <songmuchun@bytedance.com>
We must ensure the folio is deleted from or added to the correct lruvec
list. So, add VM_WARN_ON_ONCE_FOLIO() to catch invalid users. The
VM_BUG_ON_FOLIO() in move_folios_to_lru() can be removed as
lruvec_add_folio() will perform the necessary check.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
include/linux/mm_inline.h | 6 ++++++
mm/vmscan.c | 1 -
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f6a2b2d200162..dfed0523e0c43 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -342,6 +342,8 @@ void lruvec_add_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_add_folio(lruvec, folio, false))
return;
@@ -356,6 +358,8 @@ void lruvec_add_folio_tail(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_add_folio(lruvec, folio, true))
return;
@@ -370,6 +374,8 @@ void lruvec_del_folio(struct lruvec *lruvec, struct folio *folio)
{
enum lru_list lru = folio_lru_list(folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
+
if (lru_gen_del_folio(lruvec, folio, false))
return;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 82c4cc6edbca5..5f22ec438c018 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1934,7 +1934,6 @@ static unsigned int move_folios_to_lru(struct list_head *list)
continue;
}
- VM_BUG_ON_FOLIO(!folio_matches_lruvec(folio, lruvec), folio);
lruvec_add_folio(lruvec, folio);
nr_pages = folio_nr_pages(folio);
nr_moved += nr_pages;
--
2.20.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [syzbot ci] Re: Eliminate Dying Memory Cgroup
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (25 preceding siblings ...)
2025-10-28 13:58 ` [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers Qi Zheng
@ 2025-10-28 20:58 ` syzbot ci
2025-10-29 0:22 ` Harry Yoo
2025-10-29 7:53 ` [PATCH v1 00/26] " Michal Hocko
27 siblings, 1 reply; 58+ messages in thread
From: syzbot ci @ 2025-10-28 20:58 UTC (permalink / raw)
To: akpm, axelrasmussen, cgroups, chengming.zhou, david, hannes,
harry.yoo, hughd, imran.f.khan, kamalesh.babulal, linux-kernel,
linux-mm, lorenzo.stoakes, mhocko, muchun.song, nphamcs, qi.zheng,
roman.gushchin, shakeel.butt, songmuchun, weixugc, yuanchu,
zhengqi.arch, ziy
Cc: syzbot, syzkaller-bugs
syzbot ci has tested the following series
[v1] Eliminate Dying Memory Cgroup
https://lore.kernel.org/all/cover.1761658310.git.zhengqi.arch@bytedance.com
* [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
* [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
* [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
* [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
* [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case
* [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup
* [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
* [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers()
* [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module
* [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
* [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module
* [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
* [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru
* [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
* [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
* [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
* [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault()
* [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin()
* [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module
* [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation()
* [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
* [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios
* [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios
* [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs()
* [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
* [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
and found the following issue:
WARNING in folio_memcg
Full report is available here:
https://ci.syzbot.org/series/0d48a77a-fb4f-485d-9fd6-086afd6fb650
***
WARNING in folio_memcg
tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: b227c04932039bccc21a0a89cd6df50fa57e4716
arch: amd64
compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
config: https://ci.syzbot.org/builds/503d7034-ae99-44d1-8fb2-62e7ef5e1c7c/config
C repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/c_repro
syz repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/syz_repro
exFAT-fs (loop0): failed to load upcase table (idx : 0x00010000, chksum : 0xe5674ec2, utbl_chksum : 0xe619d30d)
exFAT-fs (loop0): failed to load alloc-bitmap
exFAT-fs (loop0): failed to recognize exfat type
------------[ cut here ]------------
WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434
Modules linked in:
CPU: 1 UID: 0 PID: 5965 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
RIP: 0010:folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434
Code: 48 c1 e8 03 42 80 3c 20 00 74 08 48 89 df e8 5f c8 06 00 48 8b 03 5b 41 5c 41 5e 41 5f 5d e9 cf 89 2a 09 cc e8 a9 bb a0 ff 90 <0f> 0b 90 eb ca 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c ef fe ff ff
RSP: 0018:ffffc90003ec66b0 EFLAGS: 00010293
RAX: ffffffff821f4b57 RBX: ffff888108b31480 RCX: ffff88816be91d00
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffff88816be91d00 R09: 0000000000000002
R10: 00000000fffffff0 R11: 0000000000000000 R12: dffffc0000000000
R13: 00000000ffffffe4 R14: ffffea0006d5f840 R15: ffffea0006d5f870
FS: 000055555db87500(0000) GS:ffff8882a9f35000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2ee63fff CR3: 000000010c308000 CR4: 00000000000006f0
Call Trace:
<TASK>
zswap_compress mm/zswap.c:900 [inline]
zswap_store_page mm/zswap.c:1430 [inline]
zswap_store+0xfa2/0x1f80 mm/zswap.c:1541
swap_writeout+0x6e8/0xf20 mm/page_io.c:275
writeout mm/vmscan.c:651 [inline]
pageout mm/vmscan.c:699 [inline]
shrink_folio_list+0x34ec/0x4c40 mm/vmscan.c:1418
reclaim_folio_list+0xeb/0x500 mm/vmscan.c:2196
reclaim_pages+0x454/0x520 mm/vmscan.c:2233
madvise_cold_or_pageout_pte_range+0x1974/0x1d00 mm/madvise.c:565
walk_pmd_range mm/pagewalk.c:130 [inline]
walk_pud_range mm/pagewalk.c:224 [inline]
walk_p4d_range mm/pagewalk.c:262 [inline]
walk_pgd_range+0xfe9/0x1d40 mm/pagewalk.c:303
__walk_page_range+0x14c/0x710 mm/pagewalk.c:410
walk_page_range_vma+0x393/0x440 mm/pagewalk.c:717
madvise_pageout_page_range mm/madvise.c:624 [inline]
madvise_pageout mm/madvise.c:649 [inline]
madvise_vma_behavior+0x311f/0x3a10 mm/madvise.c:1352
madvise_walk_vmas+0x51c/0xa30 mm/madvise.c:1669
madvise_do_behavior+0x38e/0x550 mm/madvise.c:1885
do_madvise+0x1bc/0x270 mm/madvise.c:1978
__do_sys_madvise mm/madvise.c:1987 [inline]
__se_sys_madvise mm/madvise.c:1985 [inline]
__x64_sys_madvise+0xa7/0xc0 mm/madvise.c:1985
do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fccac38efc9
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 a8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffd9cc58708 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007fccac5e5fa0 RCX: 00007fccac38efc9
RDX: 0000000000000015 RSI: 7fffffffffffffff RDI: 0000200000000000
RBP: 00007fccac411f91 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fccac5e5fa0 R14: 00007fccac5e5fa0 R15: 0000000000000003
</TASK>
***
If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
Tested-by: syzbot@syzkaller.appspotmail.com
---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [syzbot ci] Re: Eliminate Dying Memory Cgroup
2025-10-28 20:58 ` [syzbot ci] Re: Eliminate Dying Memory Cgroup syzbot ci
@ 2025-10-29 0:22 ` Harry Yoo
2025-10-29 0:25 ` syzbot ci
2025-10-29 3:12 ` Qi Zheng
0 siblings, 2 replies; 58+ messages in thread
From: Harry Yoo @ 2025-10-29 0:22 UTC (permalink / raw)
To: syzbot ci
Cc: akpm, axelrasmussen, cgroups, chengming.zhou, david, hannes,
hughd, imran.f.khan, kamalesh.babulal, linux-kernel, linux-mm,
lorenzo.stoakes, mhocko, muchun.song, nphamcs, qi.zheng,
roman.gushchin, shakeel.butt, songmuchun, weixugc, yuanchu,
zhengqi.arch, ziy, syzbot, syzkaller-bugs
On Tue, Oct 28, 2025 at 01:58:33PM -0700, syzbot ci wrote:
> syzbot ci has tested the following series
>
> [v1] Eliminate Dying Memory Cgroup
> https://lore.kernel.org/all/cover.1761658310.git.zhengqi.arch@bytedance.com
> * [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
> * [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
> * [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
> * [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
> * [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case
> * [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup
> * [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
> * [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers()
> * [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module
> * [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
> * [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module
> * [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
> * [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru
> * [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
> * [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
> * [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
> * [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault()
> * [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin()
> * [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module
> * [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation()
> * [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
> * [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios
> * [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios
> * [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs()
> * [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
> * [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
>
> and found the following issue:
> WARNING in folio_memcg
>
> Full report is available here:
> https://ci.syzbot.org/series/0d48a77a-fb4f-485d-9fd6-086afd6fb650
>
> ***
>
> WARNING in folio_memcg
>
> tree: mm-new
> URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
> base: b227c04932039bccc21a0a89cd6df50fa57e4716
> arch: amd64
> compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
> config: https://ci.syzbot.org/builds/503d7034-ae99-44d1-8fb2-62e7ef5e1c7c/config
> C repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/c_repro
> syz repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/syz_repro
>
> exFAT-fs (loop0): failed to load upcase table (idx : 0x00010000, chksum : 0xe5674ec2, utbl_chksum : 0xe619d30d)
> exFAT-fs (loop0): failed to load alloc-bitmap
> exFAT-fs (loop0): failed to recognize exfat type
> ------------[ cut here ]------------
> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434
This is understandable as the code snippet was added fairly recently
and is easy to miss during rebasing.
#syz test
diff --git a/mm/zswap.c b/mm/zswap.c
index a341814468b9..738d914e5354 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -896,11 +896,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
* to the active LRU list in the case.
*/
if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
+ rcu_read_lock();
if (!mem_cgroup_zswap_writeback_enabled(
folio_memcg(page_folio(page)))) {
+ rcu_read_unlock();
comp_ret = comp_ret ? comp_ret : -EINVAL;
goto unlock;
}
+ rcu_read_unlock();
comp_ret = 0;
dlen = PAGE_SIZE;
dst = kmap_local_page(page);
^ permalink raw reply related [flat|nested] 58+ messages in thread* Re: Re: [syzbot ci] Re: Eliminate Dying Memory Cgroup
2025-10-29 0:22 ` Harry Yoo
@ 2025-10-29 0:25 ` syzbot ci
2025-10-29 3:12 ` Qi Zheng
1 sibling, 0 replies; 58+ messages in thread
From: syzbot ci @ 2025-10-29 0:25 UTC (permalink / raw)
To: harry.yoo
Cc: akpm, axelrasmussen, cgroups, chengming.zhou, david, hannes,
harry.yoo, hughd, imran.f.khan, kamalesh.babulal, linux-kernel,
linux-mm, lorenzo.stoakes, mhocko, muchun.song, nphamcs, qi.zheng,
roman.gushchin, shakeel.butt, songmuchun, syzbot, syzkaller-bugs,
weixugc, yuanchu, zhengqi.arch, ziy
Unknown command
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [syzbot ci] Re: Eliminate Dying Memory Cgroup
2025-10-29 0:22 ` Harry Yoo
2025-10-29 0:25 ` syzbot ci
@ 2025-10-29 3:12 ` Qi Zheng
1 sibling, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-10-29 3:12 UTC (permalink / raw)
To: Harry Yoo, syzbot ci
Cc: akpm, axelrasmussen, cgroups, chengming.zhou, david, hannes,
hughd, imran.f.khan, kamalesh.babulal, linux-kernel, linux-mm,
lorenzo.stoakes, mhocko, muchun.song, nphamcs, roman.gushchin,
shakeel.butt, songmuchun, weixugc, yuanchu, zhengqi.arch, ziy,
syzbot, syzkaller-bugs
Hi Harry,
On 10/29/25 8:22 AM, Harry Yoo wrote:
> On Tue, Oct 28, 2025 at 01:58:33PM -0700, syzbot ci wrote:
>> syzbot ci has tested the following series
>>
>> [v1] Eliminate Dying Memory Cgroup
>> https://lore.kernel.org/all/cover.1761658310.git.zhengqi.arch@bytedance.com
>> * [PATCH v1 01/26] mm: memcontrol: remove dead code of checking parent memory cgroup
>> * [PATCH v1 02/26] mm: workingset: use folio_lruvec() in workingset_refault()
>> * [PATCH v1 03/26] mm: rename unlock_page_lruvec_irq and its variants
>> * [PATCH v1 04/26] mm: vmscan: refactor move_folios_to_lru()
>> * [PATCH v1 05/26] mm: memcontrol: allocate object cgroup for non-kmem case
>> * [PATCH v1 06/26] mm: memcontrol: return root object cgroup for root memory cgroup
>> * [PATCH v1 07/26] mm: memcontrol: prevent memory cgroup release in get_mem_cgroup_from_folio()
>> * [PATCH v1 08/26] buffer: prevent memory cgroup release in folio_alloc_buffers()
>> * [PATCH v1 09/26] writeback: prevent memory cgroup release in writeback module
>> * [PATCH v1 10/26] mm: memcontrol: prevent memory cgroup release in count_memcg_folio_events()
>> * [PATCH v1 11/26] mm: page_io: prevent memory cgroup release in page_io module
>> * [PATCH v1 12/26] mm: migrate: prevent memory cgroup release in folio_migrate_mapping()
>> * [PATCH v1 13/26] mm: mglru: prevent memory cgroup release in mglru
>> * [PATCH v1 14/26] mm: memcontrol: prevent memory cgroup release in mem_cgroup_swap_full()
>> * [PATCH v1 15/26] mm: workingset: prevent memory cgroup release in lru_gen_eviction()
>> * [PATCH v1 16/26] mm: thp: prevent memory cgroup release in folio_split_queue_lock{_irqsave}()
>> * [PATCH v1 17/26] mm: workingset: prevent lruvec release in workingset_refault()
>> * [PATCH v1 18/26] mm: zswap: prevent lruvec release in zswap_folio_swapin()
>> * [PATCH v1 19/26] mm: swap: prevent lruvec release in swap module
>> * [PATCH v1 20/26] mm: workingset: prevent lruvec release in workingset_activation()
>> * [PATCH v1 21/26] mm: memcontrol: prepare for reparenting LRU pages for lruvec lock
>> * [PATCH v1 22/26] mm: vmscan: prepare for reparenting traditional LRU folios
>> * [PATCH v1 23/26] mm: vmscan: prepare for reparenting MGLRU folios
>> * [PATCH v1 24/26] mm: memcontrol: refactor memcg_reparent_objcgs()
>> * [PATCH v1 25/26] mm: memcontrol: eliminate the problem of dying memory cgroup for LRU folios
>> * [PATCH v1 26/26] mm: lru: add VM_WARN_ON_ONCE_FOLIO to lru maintenance helpers
>>
>> and found the following issue:
>> WARNING in folio_memcg
>>
>> Full report is available here:
>> https://ci.syzbot.org/series/0d48a77a-fb4f-485d-9fd6-086afd6fb650
>>
>> ***
>>
>> WARNING in folio_memcg
>>
>> tree: mm-new
>> URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
>> base: b227c04932039bccc21a0a89cd6df50fa57e4716
>> arch: amd64
>> compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8
>> config: https://ci.syzbot.org/builds/503d7034-ae99-44d1-8fb2-62e7ef5e1c7c/config
>> C repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/c_repro
>> syz repro: https://ci.syzbot.org/findings/880c374a-1b49-436e-9be2-63d5e2c6b6ab/syz_repro
>>
>> exFAT-fs (loop0): failed to load upcase table (idx : 0x00010000, chksum : 0xe5674ec2, utbl_chksum : 0xe619d30d)
>> exFAT-fs (loop0): failed to load alloc-bitmap
>> exFAT-fs (loop0): failed to recognize exfat type
>> ------------[ cut here ]------------
>> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 obj_cgroup_memcg include/linux/memcontrol.h:380 [inline]
>> WARNING: CPU: 1 PID: 5965 at ./include/linux/memcontrol.h:380 folio_memcg+0x148/0x1c0 include/linux/memcontrol.h:434
>
> This is understandable as the code snippet was added fairly recently
> and is easy to miss during rebasing.
My mistake, I should recheck it.
>
> #syz test
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index a341814468b9..738d914e5354 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -896,11 +896,14 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> * to the active LRU list in the case.
> */
> if (comp_ret || !dlen || dlen >= PAGE_SIZE) {
> + rcu_read_lock();
> if (!mem_cgroup_zswap_writeback_enabled(
> folio_memcg(page_folio(page)))) {
> + rcu_read_unlock();
> comp_ret = comp_ret ? comp_ret : -EINVAL;
> goto unlock;
> }
> + rcu_read_unlock();
> comp_ret = 0;
> dlen = PAGE_SIZE;
> dst = kmap_local_page(page);
LGTM, will do in the next version.
Thanks!
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
2025-10-28 13:58 [PATCH v1 00/26] Eliminate Dying Memory Cgroup Qi Zheng
` (26 preceding siblings ...)
2025-10-28 20:58 ` [syzbot ci] Re: Eliminate Dying Memory Cgroup syzbot ci
@ 2025-10-29 7:53 ` Michal Hocko
2025-10-29 8:05 ` Qi Zheng
27 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2025-10-29 7:53 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, roman.gushchin, shakeel.butt, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups, Qi Zheng
On Tue 28-10-25 21:58:13, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> Hi all,
>
> This series aims to eliminate the problem of dying memory cgroup. It completes
> the adaptation to the MGLRU scenarios based on the Muchun Song's patchset[1].
A high level summary and main design decisions should be described in the
cover letter.
Thanks!
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
2025-10-29 7:53 ` [PATCH v1 00/26] " Michal Hocko
@ 2025-10-29 8:05 ` Qi Zheng
2025-10-31 10:35 ` Michal Hocko
0 siblings, 1 reply; 58+ messages in thread
From: Qi Zheng @ 2025-10-29 8:05 UTC (permalink / raw)
To: Michal Hocko
Cc: hannes, hughd, roman.gushchin, shakeel.butt, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups
Hi Michal,
On 10/29/25 3:53 PM, Michal Hocko wrote:
> On Tue 28-10-25 21:58:13, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> Hi all,
>>
>> This series aims to eliminate the problem of dying memory cgroup. It completes
>> the adaptation to the MGLRU scenarios based on the Muchun Song's patchset[1].
>
> A high level summary and main design decisions should be described in the
> cover letter.
Got it. Will add it in the next version.
I've pasted the contents of Muchun Song's cover letter below:
```
## Introduction
This patchset is intended to transfer the LRU pages to the object cgroup
without holding a reference to the original memory cgroup in order to
address the issue of the dying memory cgroup. A consensus has already been
reached regarding this approach recently [1].
## Background
The issue of a dying memory cgroup refers to a situation where a memory
cgroup is no longer being used by users, but memory (the metadata
associated with memory cgroups) remains allocated to it. This situation
may potentially result in memory leaks or inefficiencies in memory
reclamation and has persisted as an issue for several years. Any memory
allocation that endures longer than the lifespan (from the users'
perspective) of a memory cgroup can lead to the issue of dying memory
cgroup. We have exerted greater efforts to tackle this problem by
introducing the infrastructure of object cgroup [2].
Presently, numerous types of objects (slab objects, non-slab kernel
allocations, per-CPU objects) are charged to the object cgroup without
holding a reference to the original memory cgroup. The final allocations
for LRU pages (anonymous pages and file pages) are charged at allocation
time and continues to hold a reference to the original memory cgroup
until reclaimed.
File pages are more complex than anonymous pages as they can be shared
among different memory cgroups and may persist beyond the lifespan of
the memory cgroup. The long-term pinning of file pages to memory cgroups
is a widespread issue that causes recurring problems in practical
scenarios [3]. File pages remain unreclaimed for extended periods.
Additionally, they are accessed by successive instances (second, third,
fourth, etc.) of the same job, which is restarted into a new cgroup each
time. As a result, unreclaimable dying memory cgroups accumulate,
leading to memory wastage and significantly reducing the efficiency
of page reclamation.
## Fundamentals
A folio will no longer pin its corresponding memory cgroup. It is necessary
to ensure that the memory cgroup or the lruvec associated with the memory
cgroup is not released when a user obtains a pointer to the memory cgroup
or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
to hold the RCU read lock or acquire a reference to the memory cgroup
associated with the folio to prevent its release if they are not concerned
about the binding stability between the folio and its corresponding memory
cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
desire a stable binding between the folio and its corresponding memory
cgroup. An approach is needed to ensure the stability of the binding while
the lruvec lock is held, and to detect the situation of holding the
incorrect lruvec lock when there is a race condition during memory cgroup
reparenting. The following four steps are taken to achieve these goals.
1. The first step to be taken is to identify all users of both functions
(folio_memcg() and folio_lruvec()) who are not concerned about binding
stability and implement appropriate measures (such as holding a RCU read
lock or temporarily obtaining a reference to the memory cgroup for a
brief period) to prevent the release of the memory cgroup.
2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
how to ensure the binding stability from the user's perspective of
folio_lruvec().
struct lruvec *folio_lruvec_lock(struct folio *folio)
{
        struct lruvec *lruvec;

        rcu_read_lock();
retry:
        lruvec = folio_lruvec(folio);
        spin_lock(&lruvec->lru_lock);
        if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
                spin_unlock(&lruvec->lru_lock);
                goto retry;
        }

        return lruvec;
}
From the perspective of memory cgroup removal, the entire reparenting
process (altering the binding relationship between folio and its memory
cgroup and moving the LRU lists to its parental memory cgroup) should be
carried out under both the lruvec lock of the memory cgroup being
removed
and the lruvec lock of its parent.
3. Thirdly, another lock that requires the same approach is the split-queue
lock of THP.
4. Finally, transfer the LRU pages to the object cgroup without holding a
reference to the original memory cgroup.
```
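To make the reparenting step in 2) atomic with respect to folio_lruvec_lock(),
the series takes the objcg lock together with the per-node lruvec locks of
both the child and the parent around the whole operation. A condensed sketch
of what patches 24/25 end up doing (the lockdep nesting annotations and the
MGLRU retry loop are omitted here):

static void memcg_reparent_objcgs(struct mem_cgroup *src)
{
        struct obj_cgroup *objcg = rcu_dereference_protected(src->objcg, true);
        struct mem_cgroup *dst = parent_mem_cgroup(src);
        int nid;

        /* Take the objcg lock plus both per-node lruvec locks. */
        spin_lock_irq(&objcg_lock);
        for_each_node(nid) {
                spin_lock(&mem_cgroup_lruvec(src, NODE_DATA(nid))->lru_lock);
                spin_lock(&mem_cgroup_lruvec(dst, NODE_DATA(nid))->lru_lock);
        }

        /* Redirect objcg->memcg and splice the objcg list to the parent. */
        __memcg_reparent_objcgs(src, dst);
        /* Splice the LRU lists to the parent's lruvecs. */
        memcg_reparent_lru_folios(src, dst);

        for_each_node(nid) {
                spin_unlock(&mem_cgroup_lruvec(dst, NODE_DATA(nid))->lru_lock);
                spin_unlock(&mem_cgroup_lruvec(src, NODE_DATA(nid))->lru_lock);
        }
        spin_unlock_irq(&objcg_lock);

        percpu_ref_kill(&objcg->refcnt);
}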
And the details of the adaptation are below:
```
Similar to traditional LRU folios, in order to solve the dying memcg
problem, we also need to reparent MGLRU folios to the parent memcg when
the memcg is offlined.
However, there are the following challenges:
1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, and the
number of generations of the parent and child memcg may be different,
so we cannot simply transfer MGLRU folios in the child memcg to the
parent memcg as we did for traditional LRU folios.
2. The generation information is stored in folio->flags, but we cannot
traverse these folios while holding the lru lock, otherwise it may
cause a softlockup.
3. In walk_update_folio(), the gen of the folio and the corresponding lru size
may be updated, but the folio is not immediately moved to the
corresponding lru list. Therefore, there may be folios of different
generations on an LRU list.
4. In lru_gen_del_folio(), the generation to which the folio belongs is
found based on the generation information in folio->flags, and the
corresponding LRU size will be updated. Therefore, we need to update
the lru size correctly during reparenting, otherwise the lru size may
be updated incorrectly in lru_gen_del_folio().
Finally, this patch chose a compromise method, which is to splice the lru
list in the child memcg to the lru list of the same generation in the
parent memcg during reparenting. And in order to ensure that the parent
memcg has the same generation, we need to increase the generations in the
parent memcg to the MAX_NR_GENS before reparenting.
Of course, the same generation has different meanings in the parent and
child memcg, which will cause confusion in the hot and cold information of
folios. But other than that, this method is simple enough, the lru size
is correct, and there is no need to consider some concurrency issues (such
as lru_gen_del_folio()).
```
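For reference, the per-generation splice in patch 23 then boils down to
roughly the following (the helper name below is only for illustration, the
real one is __lru_gen_reparent_memcg(), and the __update_lru_size()
accounting is left out):

static void reparent_one_gen_list(struct lruvec *src, struct lruvec *dst,
                                  int zone, int type)
{
        struct lru_gen_folio *s = &src->lrugen;
        struct lru_gen_folio *d = &dst->lrugen;
        int i;

        for (i = 0; i < get_nr_gens(src, type); i++) {
                /* dst was raised to MAX_NR_GENS, so this slot is valid there too. */
                int gen = lru_gen_from_seq(s->max_seq - i);
                long nr_pages = s->nr_pages[gen][type][zone];

                /* Move the whole list without touching folio->flags of each folio. */
                list_splice_tail_init(&s->folios[gen][type][zone],
                                      &d->folios[gen][type][zone]);

                WRITE_ONCE(s->nr_pages[gen][type][zone], 0);
                WRITE_ONCE(d->nr_pages[gen][type][zone],
                           d->nr_pages[gen][type][zone] + nr_pages);
        }
}

Splicing whole lists this way keeps the per-generation nr_pages (and thus the
lru sizes later seen by lru_gen_del_folio()) consistent without having to walk
the folios under the lru lock.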
Thanks,
Qi
>
> Thanks!
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
2025-10-29 8:05 ` Qi Zheng
@ 2025-10-31 10:35 ` Michal Hocko
2025-11-03 3:33 ` Qi Zheng
0 siblings, 1 reply; 58+ messages in thread
From: Michal Hocko @ 2025-10-31 10:35 UTC (permalink / raw)
To: Qi Zheng
Cc: hannes, hughd, roman.gushchin, shakeel.butt, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups
On Wed 29-10-25 16:05:16, Qi Zheng wrote:
> Hi Michal,
>
> On 10/29/25 3:53 PM, Michal Hocko wrote:
> > On Tue 28-10-25 21:58:13, Qi Zheng wrote:
> > > From: Qi Zheng <zhengqi.arch@bytedance.com>
> > >
> > > Hi all,
> > >
> > > This series aims to eliminate the problem of dying memory cgroup. It completes
> > > the adaptation to the MGLRU scenarios based on the Muchun Song's patchset[1].
> >
> > A high level summary and main design decisions should be described in the
> > cover letter.
>
> Got it. Will add it in the next version.
>
> I've pasted the contents of Muchun Song's cover letter below:
>
> ```
> ## Introduction
>
> This patchset is intended to transfer the LRU pages to the object cgroup
> without holding a reference to the original memory cgroup in order to
> address the issue of the dying memory cgroup. A consensus has already been
> reached regarding this approach recently [1].
Could you add those referenced links as well please?
> ## Background
>
> The issue of a dying memory cgroup refers to a situation where a memory
> cgroup is no longer being used by users, but memory (the metadata
> associated with memory cgroups) remains allocated to it. This situation
> may potentially result in memory leaks or inefficiencies in memory
> reclamation and has persisted as an issue for several years. Any memory
> allocation that endures longer than the lifespan (from the users'
> perspective) of a memory cgroup can lead to the issue of dying memory
> cgroup. We have exerted greater efforts to tackle this problem by
> introducing the infrastructure of object cgroup [2].
>
> Presently, numerous types of objects (slab objects, non-slab kernel
> allocations, per-CPU objects) are charged to the object cgroup without
> holding a reference to the original memory cgroup. The final allocations
> for LRU pages (anonymous pages and file pages) are charged at allocation
> time and continues to hold a reference to the original memory cgroup
> until reclaimed.
>
> File pages are more complex than anonymous pages as they can be shared
> among different memory cgroups and may persist beyond the lifespan of
> the memory cgroup. The long-term pinning of file pages to memory cgroups
> is a widespread issue that causes recurring problems in practical
> scenarios [3]. File pages remain unreclaimed for extended periods.
> Additionally, they are accessed by successive instances (second, third,
> fourth, etc.) of the same job, which is restarted into a new cgroup each
> time. As a result, unreclaimable dying memory cgroups accumulate,
> leading to memory wastage and significantly reducing the efficiency
> of page reclamation.
Very useful introduction to the problem. Thanks!
> ## Fundamentals
>
> A folio will no longer pin its corresponding memory cgroup. It is necessary
> to ensure that the memory cgroup or the lruvec associated with the memory
> cgroup is not released when a user obtains a pointer to the memory cgroup
> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
> to hold the RCU read lock or acquire a reference to the memory cgroup
> associated with the folio to prevent its release if they are not concerned
> about the binding stability between the folio and its corresponding memory
> cgroup. However, some users of folio_lruvec() (i.e., the lruvec lock)
> desire a stable binding between the folio and its corresponding memory
> cgroup. An approach is needed to ensure the stability of the binding while
> the lruvec lock is held, and to detect the situation of holding the
> incorrect lruvec lock when there is a race condition during memory cgroup
> reparenting. The following four steps are taken to achieve these goals.
>
> 1. The first step to be taken is to identify all users of both functions
> (folio_memcg() and folio_lruvec()) who are not concerned about binding
> stability and implement appropriate measures (such as holding a RCU read
> lock or temporarily obtaining a reference to the memory cgroup for a
> brief period) to prevent the release of the memory cgroup.
>
> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
> how to ensure the binding stability from the user's perspective of
> folio_lruvec().
>
> struct lruvec *folio_lruvec_lock(struct folio *folio)
> {
> struct lruvec *lruvec;
>
> rcu_read_lock();
> retry:
> lruvec = folio_lruvec(folio);
> spin_lock(&lruvec->lru_lock);
> if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
> spin_unlock(&lruvec->lru_lock);
> goto retry;
> }
>
> return lruvec;
> }
>
> From the perspective of memory cgroup removal, the entire reparenting
> process (altering the binding relationship between folio and its memory
> cgroup and moving the LRU lists to its parental memory cgroup) should be
> carried out under both the lruvec lock of the memory cgroup being removed
> and the lruvec lock of its parent.
>
> 3. Thirdly, another lock that requires the same approach is the split-queue
> lock of THP.
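>
> Presumably that lock would get the same retry pattern; a sketch, mirroring
> folio_lruvec_lock() above (folio_split_queue() is a made-up lookup helper,
> this is not the series' actual code):
>
> static struct deferred_split *folio_split_queue_lock(struct folio *folio)
> {
>         struct deferred_split *queue;
>
>         rcu_read_lock();
> retry:
>         queue = folio_split_queue(folio);
>         spin_lock(&queue->split_queue_lock);
>         if (unlikely(folio_split_queue(folio) != queue)) {
>                 /* the folio was reparented after the lookup, try again */
>                 spin_unlock(&queue->split_queue_lock);
>                 goto retry;
>         }
>
>         return queue;
> }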
>
> 4. Finally, transfer the LRU pages to the object cgroup without holding a
> reference to the original memory cgroup.
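>
> Conceptually, folio->memcg_data then refers to an obj_cgroup instead of a
> mem_cgroup, and folio_memcg() becomes a lookup through it, roughly like the
> simplified sketch below (the real accessor also has to handle the flag bits
> stored in memcg_data):
>
> static inline struct mem_cgroup *folio_memcg_sketch(struct folio *folio)
> {
>         /* must be called under rcu_read_lock() or with the memcg pinned */
>         struct obj_cgroup *objcg = (struct obj_cgroup *)folio->memcg_data;
>
>         return objcg ? obj_cgroup_memcg(objcg) : NULL;
> }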
> ```
>
> And the details of the adaptation are below:
>
> ```
> Similar to traditional LRU folios, in order to solve the dying memcg
> problem, we also need to reparent MGLRU folios to the parent memcg when
> the memcg is offlined.
>
> However, there are the following challenges:
>
> 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, and the
> number of generations of the parent and child memcg may differ, so we
> cannot simply transfer MGLRU folios from the child memcg to the parent
> memcg as we did for traditional LRU folios.
> 2. The generation information is stored in folio->flags, but we cannot
> traverse these folios while holding the lru lock, otherwise it may
> cause a softlockup.
> 3. In walk_update_folio(), the gen of folio and corresponding lru size
> may be updated, but the folio is not immediately moved to the
> corresponding lru list. Therefore, there may be folios of different
> generations on an LRU list.
> 4. In lru_gen_del_folio(), the generation to which the folio belongs is
> found based on the generation information in folio->flags, and the
> corresponding LRU size will be updated. Therefore, we need to update
> the lru size correctly during reparenting, otherwise the lru size may
> be updated incorrectly in lru_gen_del_folio().
>
> Finally, this patch chose a compromise method, which is to splice each lru
> list in the child memcg onto the lru list of the same generation in the
> parent memcg during reparenting. And in order to ensure that the parent
> memcg has the same generations, we need to increase the number of
> generations in the parent memcg to MAX_NR_GENS before reparenting.
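>
> Roughly, the per-generation splice could look like the sketch below (field
> names follow struct lru_gen_folio; locking and the lru size / counter
> updates are left out):
>
> static void lru_gen_reparent_example(struct lruvec *child, struct lruvec *parent)
> {
>         int gen, type, zone;
>
>         /* the parent is assumed to already have MAX_NR_GENS generations */
>         for (gen = 0; gen < MAX_NR_GENS; gen++)
>                 for (type = 0; type < ANON_AND_FILE; type++)
>                         for (zone = 0; zone < MAX_NR_ZONES; zone++)
>                                 list_splice_tail_init(
>                                         &child->lrugen.folios[gen][type][zone],
>                                         &parent->lrugen.folios[gen][type][zone]);
> }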
>
> Of course, the same generation has different meanings in the parent and
> child memcg, so this will blur the hot and cold information of the spliced
> folios. But other than that, this method is simple enough, the lru size
> stays correct, and there is no need to worry about certain concurrency
> issues (such as racing with lru_gen_del_folio()).
> ```
Thank you, this is very useful.
A high level overview of how the patch series (of this size) is structured
would be appreciated as well.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 58+ messages in thread

* Re: [PATCH v1 00/26] Eliminate Dying Memory Cgroup
2025-10-31 10:35 ` Michal Hocko
@ 2025-11-03 3:33 ` Qi Zheng
0 siblings, 0 replies; 58+ messages in thread
From: Qi Zheng @ 2025-11-03 3:33 UTC (permalink / raw)
To: Michal Hocko
Cc: hannes, hughd, roman.gushchin, shakeel.butt, muchun.song, david,
lorenzo.stoakes, ziy, harry.yoo, imran.f.khan, kamalesh.babulal,
axelrasmussen, yuanchu, weixugc, akpm, linux-mm, linux-kernel,
cgroups
Hi Michal,
On 10/31/25 6:35 PM, Michal Hocko wrote:
> On Wed 29-10-25 16:05:16, Qi Zheng wrote:
>> Hi Michal,
>>
>> On 10/29/25 3:53 PM, Michal Hocko wrote:
>>> On Tue 28-10-25 21:58:13, Qi Zheng wrote:
>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>
>>>> Hi all,
>>>>
>>>> This series aims to eliminate the problem of dying memory cgroup. It completes
>>>> the adaptation to the MGLRU scenarios based on Muchun Song's patchset[1].
>>>
>>> A high level summary and main design decisions should be described in the
>>> cover letter.
>>
>> Got it. Will add it in the next version.
>>
>> I've pasted the contents of Muchun Song's cover letter below:
>>
>> ```
>> ## Introduction
>>
>> This patchset is intended to transfer the LRU pages to the object cgroup
>> without holding a reference to the original memory cgroup in order to
>> address the issue of the dying memory cgroup. A consensus has already been
>> reached regarding this approach recently [1].
>
> Could you add those referenced links as well please?
Oh, I missed that.
[1]. https://lore.kernel.org/linux-mm/Z6OkXXYDorPrBvEQ@hm-sls2/
>
>> ## Background
>>
>> The issue of a dying memory cgroup refers to a situation where a memory
>> cgroup is no longer being used by users, but memory (the metadata
>> associated with memory cgroups) remains allocated to it. This situation
>> may potentially result in memory leaks or inefficiencies in memory
>> reclamation and has persisted as an issue for several years. Any memory
>> allocation that endures longer than the lifespan (from the users'
>> perspective) of a memory cgroup can lead to the issue of dying memory
>> cgroup. We have exerted greater efforts to tackle this problem by
>> introducing the infrastructure of object cgroup [2].
[2]. https://lwn.net/Articles/895431/
>>
>> Presently, numerous types of objects (slab objects, non-slab kernel
>> allocations, per-CPU objects) are charged to the object cgroup without
>> holding a reference to the original memory cgroup. The final allocations
>> for LRU pages (anonymous pages and file pages) are charged at allocation
>> time and continue to hold a reference to the original memory cgroup
>> until they are reclaimed.
>>
>> File pages are more complex than anonymous pages as they can be shared
>> among different memory cgroups and may persist beyond the lifespan of
>> the memory cgroup. The long-term pinning of file pages to memory cgroups
>> is a widespread issue that causes recurring problems in practical
>> scenarios [3]. File pages remain unreclaimed for extended periods.
[3]. https://github.com/systemd/systemd/pull/36827
>> Additionally, they are accessed by successive instances (second, third,
>> fourth, etc.) of the same job, which is restarted into a new cgroup each
>> time. As a result, unreclaimable dying memory cgroups accumulate,
>> leading to memory wastage and significantly reducing the efficiency
>> of page reclamation.
>
> Very useful introduction to the problem. Thanks!
>
>> ## Fundamentals
>>
>> A folio will no longer pin its corresponding memory cgroup. It is necessary
>> to ensure that the memory cgroup or the lruvec associated with the memory
>> cgroup is not released when a user obtains a pointer to the memory cgroup
>> or lruvec returned by folio_memcg() or folio_lruvec(). Users are required
>> to hold the RCU read lock or acquire a reference to the memory cgroup
>> associated with the folio to prevent its release if they are not concerned
>> about the binding stability between the folio and its corresponding memory
>> cgroup. However, some users of folio_lruvec() (i.e., holders of the lruvec lock)
>> desire a stable binding between the folio and its corresponding memory
>> cgroup. An approach is needed to ensure the stability of the binding while
>> the lruvec lock is held, and to detect the situation of holding the
>> incorrect lruvec lock when there is a race condition during memory cgroup
>> reparenting. The following four steps are taken to achieve these goals.
>>
>> 1. The first step to be taken is to identify all users of both functions
>> (folio_memcg() and folio_lruvec()) who are not concerned about binding
>> stability and implement appropriate measures (such as holding an RCU read
>> lock or temporarily obtaining a reference to the memory cgroup for a
>> brief period) to prevent the release of the memory cgroup.
>>
>> 2. Secondly, the following refactoring of folio_lruvec_lock() demonstrates
>> how binding stability is ensured for users of folio_lruvec().
>>
>> struct lruvec *folio_lruvec_lock(struct folio *folio)
>> {
>>         struct lruvec *lruvec;
>>
>>         rcu_read_lock();
>> retry:
>>         lruvec = folio_lruvec(folio);
>>         spin_lock(&lruvec->lru_lock);
>>         if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
>>                 spin_unlock(&lruvec->lru_lock);
>>                 goto retry;
>>         }
>>
>>         return lruvec;
>> }
>>
>> From the perspective of memory cgroup removal, the entire reparenting
>> process (altering the binding relationship between folio and its memory
>> cgroup and moving the LRU lists to its parental memory cgroup) should be
>> carried out under both the lruvec lock of the memory cgroup being removed
>> and the lruvec lock of its parent.
>>
>> 3. Thirdly, another lock that requires the same approach is the split-queue
>> lock of THP.
>>
>> 4. Finally, transfer the LRU pages to the object cgroup without holding a
>> reference to the original memory cgroup.
>> ```
>>
>> And the details of the adaptation are below:
>>
>> ```
>> Similar to traditional LRU folios, in order to solve the dying memcg
>> problem, we also need to reparent MGLRU folios to the parent memcg when
>> the memcg is offlined.
>>
>> However, there are the following challenges:
>>
>> 1. Each lruvec has between MIN_NR_GENS and MAX_NR_GENS generations, and the
>> number of generations of the parent and child memcg may differ, so we
>> cannot simply transfer MGLRU folios from the child memcg to the parent
>> memcg as we did for traditional LRU folios.
>> 2. The generation information is stored in folio->flags, but we cannot
>> traverse these folios while holding the lru lock, otherwise it may
>> cause a softlockup.
>> 3. In walk_update_folio(), the gen of folio and corresponding lru size
>> may be updated, but the folio is not immediately moved to the
>> corresponding lru list. Therefore, there may be folios of different
>> generations on an LRU list.
>> 4. In lru_gen_del_folio(), the generation to which the folio belongs is
>> found based on the generation information in folio->flags, and the
>> corresponding LRU size will be updated. Therefore, we need to update
>> the lru size correctly during reparenting, otherwise the lru size may
>> be updated incorrectly in lru_gen_del_folio().
>>
>> Finally, this patch chose a compromise method, which is to splice each lru
>> list in the child memcg onto the lru list of the same generation in the
>> parent memcg during reparenting. And in order to ensure that the parent
>> memcg has the same generations, we need to increase the number of
>> generations in the parent memcg to MAX_NR_GENS before reparenting.
>>
>> Of course, the same generation has different meanings in the parent and
>> child memcg, so this will blur the hot and cold information of the spliced
>> folios. But other than that, this method is simple enough, the lru size
>> stays correct, and there is no need to worry about certain concurrency
>> issues (such as racing with lru_gen_del_folio()).
>> ```
>
> Thank you, this is very useful.
>
> A high level overview of how the patch series (of this size) is structured
> would be appreciated as well.
OK. Will add this to the cover letter in the next version.
Thanks,
Qi
^ permalink raw reply [flat|nested] 58+ messages in thread