[PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting

Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
@ 2026-06-25 15:15 Qi Zheng
  2026-06-25 18:41 ` Johannes Weiner
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Qi Zheng @ 2026-06-25 15:15 UTC (permalink / raw)
  To: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, hannes, harry, muchun.song, peiyang_he, mhocko,
	roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable

From: Qi Zheng <zhengqi.arch@bytedance.com>

The mglru page table walker batches per-generation size deltas in
walk->nr_pages while walking page tables without holding the lruvec lock.
The reset_batch_size() later folds those deltas into walk->lruvec under
the lruvec lock.

The page table walker can run concurrently with the memcg reparenting path
as follows:

CPU0                           CPU1
====                           ====

walk_mm
--> walk_page_range
    --> update_batch_size
        --> walk->nr_pages += delta

                              mem_cgroup_css_offline
                              --> memcg_reparent_objcgs
                                  --> lock lruvec
                                      lru_gen_reparent_memcg
                                      --> reparent child folios to parent
                                      unlock lruvec

    lock lruvec
    reset_batch_size
    --> child lrugen->nr_pages += delta

This will trigger the following warning in lru_gen_exit_memcg():

	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
				   sizeof(lruvec->lrugen.nr_pages)));

And the user-visible impact of underestimated nr_pages in MGLRU was
premature OOMs because MGLRU does not try to reclaim memory when nr_pages
reaches zero, but there are still more pages.

To fix it, make reset_batch_size() check CSS_DYING under RCU before
flushing the pending batch. A non-dying memcg keeps the original lruvec
stable against RCU-delayed offlining; a dying memcg redirects the deltas
to the first non-dying ancestor.

Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
Cc: <stable@vger.kernel.org>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Changes in v3:
 - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
   (suggested by Harry)
 - update the commit message (suggested by Harry)
 - temporarily drop the previous Reviewed-by tags
   (since the sync method has changed)
 - rebase onto the next-20260624

Changes in v2:
 - update the commit message (pointed by Barry)
 - collect Reviewed-by

 mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 38 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 35c3bb15ae96..1ec8c23c72b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
 	walk->nr_pages[new_gen][type][zone] += delta;
 }
 
+#ifdef CONFIG_MEMCG
+static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
+{
+	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	rcu_read_lock();
+
+	/*
+	 * The memcg can be NULL when the memory controller is disabled.
+	 * Otherwise, the caller keeps the memcg owning @lruvec alive.
+	 */
+	if (!memcg || !css_is_dying(&memcg->css))
+		goto lock;
+
+	do {
+		memcg = parent_mem_cgroup(memcg);
+	} while (memcg && css_is_dying(&memcg->css));
+	lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+lock:
+	spin_lock_irq(&lruvec->lru_lock);
+
+	return lruvec;
+}
+#else
+static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
+{
+	lruvec_lock_irq(lruvec);
+
+	return lruvec;
+}
+#endif
+
 static void reset_batch_size(struct lru_gen_mm_walk *walk)
 {
 	int gen, type, zone;
-	struct lruvec *lruvec = walk->lruvec;
+	struct lruvec *lruvec = lock_batch_lruvec(walk->lruvec);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 
 	walk->batched = 0;
@@ -3285,6 +3319,8 @@ static void reset_batch_size(struct lru_gen_mm_walk *walk)
 			lru += LRU_ACTIVE;
 		__update_lru_size(lruvec, lru, zone, delta);
 	}
+
+	lruvec_unlock_irq(lruvec);
 }
 
 static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args)
@@ -3779,11 +3815,8 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
 			mmap_read_unlock(mm);
 		}
 
-		if (walk->batched) {
-			lruvec_lock_irq(lruvec);
+		if (walk->batched)
 			reset_batch_size(walk);
-			lruvec_unlock_irq(lruvec);
-		}
 
 		cond_resched();
 	} while (err == -EAGAIN);
@@ -4867,9 +4900,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	walk = current->reclaim_state->mm_walk;
 	if (walk && walk->batched) {
 		walk->lruvec = lruvec;
-		lruvec_lock_irq(lruvec);
 		reset_batch_size(walk);
-		lruvec_unlock_irq(lruvec);
 	}
 
 	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
-- 
2.54.0



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
@ 2026-06-25 18:41 ` Johannes Weiner
  2026-06-26  2:27   ` Qi Zheng
  2026-06-25 20:22 ` Shakeel Butt
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Johannes Weiner @ 2026-06-25 18:41 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
	ljs, linux-mm, linux-kernel, Qi Zheng, stable

On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
> 
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
> 
> CPU0                           CPU1
> ====                           ====
> 
> walk_mm
> --> walk_page_range
>     --> update_batch_size
>         --> walk->nr_pages += delta
> 
>                               mem_cgroup_css_offline
>                               --> memcg_reparent_objcgs
>                                   --> lock lruvec
>                                       lru_gen_reparent_memcg
>                                       --> reparent child folios to parent
>                                       unlock lruvec
> 
>     lock lruvec
>     reset_batch_size
>     --> child lrugen->nr_pages += delta
> 
> This will trigger the following warning in lru_gen_exit_memcg():
> 
> 	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> 				   sizeof(lruvec->lrugen.nr_pages)));
> 
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
> 
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
> 
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Changes in v3:
>  - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>    (suggested by Harry)
>  - update the commit message (suggested by Harry)
>  - temporarily drop the previous Reviewed-by tags
>    (since the sync method has changed)
>  - rebase onto the next-20260624
> 
> Changes in v2:
>  - update the commit message (pointed by Barry)
>  - collect Reviewed-by
> 
>  mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 38 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35c3bb15ae96..1ec8c23c72b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>  	walk->nr_pages[new_gen][type][zone] += delta;
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> +	rcu_read_lock();

Where is this unlocked?

> +	/*
> +	 * The memcg can be NULL when the memory controller is disabled.
> +	 * Otherwise, the caller keeps the memcg owning @lruvec alive.
> +	 */
> +	if (!memcg || !css_is_dying(&memcg->css))
> +		goto lock;
> +
> +	do {
> +		memcg = parent_mem_cgroup(memcg);
> +	} while (memcg && css_is_dying(&memcg->css));
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);

	while (unlikely(memcg && css_is_dying(&memcg->css))) {
		memcg = parent_mem_cgroup(memcg);
		lruvec = mem_cgroup_lruvec(memcg, pgdat);
	}


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-25 18:41 ` Johannes Weiner
@ 2026-06-26  2:27   ` Qi Zheng
  2026-06-26  4:43     ` Harry Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26  2:27 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
	ljs, linux-mm, linux-kernel, Qi Zheng, stable

Hi Johannes,

On 6/26/26 2:41 AM, Johannes Weiner wrote:
> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> The mglru page table walker batches per-generation size deltas in
>> walk->nr_pages while walking page tables without holding the lruvec lock.
>> The reset_batch_size() later folds those deltas into walk->lruvec under
>> the lruvec lock.
>>
>> The page table walker can run concurrently with the memcg reparenting path
>> as follows:
>>
>> CPU0                           CPU1
>> ====                           ====
>>
>> walk_mm
>> --> walk_page_range
>>      --> update_batch_size
>>          --> walk->nr_pages += delta
>>
>>                                mem_cgroup_css_offline
>>                                --> memcg_reparent_objcgs
>>                                    --> lock lruvec
>>                                        lru_gen_reparent_memcg
>>                                        --> reparent child folios to parent
>>                                        unlock lruvec
>>
>>      lock lruvec
>>      reset_batch_size
>>      --> child lrugen->nr_pages += delta
>>
>> This will trigger the following warning in lru_gen_exit_memcg():
>>
>> 	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>> 				   sizeof(lruvec->lrugen.nr_pages)));
>>
>> And the user-visible impact of underestimated nr_pages in MGLRU was
>> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
>> reaches zero, but there are still more pages.
>>
>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>> to the first non-dying ancestor.
>>
>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> Changes in v3:
>>   - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>>     (suggested by Harry)
>>   - update the commit message (suggested by Harry)
>>   - temporarily drop the previous Reviewed-by tags
>>     (since the sync method has changed)
>>   - rebase onto the next-20260624
>>
>> Changes in v2:
>>   - update the commit message (pointed by Barry)
>>   - collect Reviewed-by
>>
>>   mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 38 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 35c3bb15ae96..1ec8c23c72b9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>>   	walk->nr_pages[new_gen][type][zone] += delta;
>>   }
>>   
>> +#ifdef CONFIG_MEMCG
>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>> +{
>> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +
>> +	rcu_read_lock();
> 
> Where is this unlocked?

The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.

> 
>> +	/*
>> +	 * The memcg can be NULL when the memory controller is disabled.
>> +	 * Otherwise, the caller keeps the memcg owning @lruvec alive.
>> +	 */
>> +	if (!memcg || !css_is_dying(&memcg->css))
>> +		goto lock;
>> +
>> +	do {
>> +		memcg = parent_mem_cgroup(memcg);
>> +	} while (memcg && css_is_dying(&memcg->css));
>> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> 
> 	while (unlikely(memcg && css_is_dying(&memcg->css))) {
> 		memcg = parent_mem_cgroup(memcg);
> 		lruvec = mem_cgroup_lruvec(memcg, pgdat);

There is no need to acquire the lruvec before finding the first
non-dying memcg.

Thanks,
Qi

> 	}



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  2:27   ` Qi Zheng
@ 2026-06-26  4:43     ` Harry Yoo
  2026-06-26  4:48       ` Qi Zheng
  0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-06-26  4:43 UTC (permalink / raw)
  To: Qi Zheng, Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable


[-- Attachment #1.1: Type: text/plain, Size: 5175 bytes --]



On 6/26/26 11:27 AM, Qi Zheng wrote:
> Hi Johannes,
> 
> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>
>>> The mglru page table walker batches per-generation size deltas in
>>> walk->nr_pages while walking page tables without holding the lruvec
>>> lock.
>>> The reset_batch_size() later folds those deltas into walk->lruvec under
>>> the lruvec lock.
>>>
>>> The page table walker can run concurrently with the memcg reparenting
>>> path
>>> as follows:
>>>
>>> CPU0                           CPU1
>>> ====                           ====
>>>
>>> walk_mm
>>> --> walk_page_range
>>>      --> update_batch_size
>>>          --> walk->nr_pages += delta
>>>
>>>                                mem_cgroup_css_offline
>>>                                --> memcg_reparent_objcgs
>>>                                    --> lock lruvec
>>>                                        lru_gen_reparent_memcg
>>>                                        --> reparent child folios to
>>> parent
>>>                                        unlock lruvec
>>>
>>>      lock lruvec
>>>      reset_batch_size
>>>      --> child lrugen->nr_pages += delta
>>>
>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>
>>>     VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>                    sizeof(lruvec->lrugen.nr_pages)));
>>>
>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>> premature OOMs because MGLRU does not try to reclaim memory when
>>> nr_pages
>>> reaches zero, but there are still more pages.
>>>
>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>>> to the first non-dying ancestor.
>>>
>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>>> Cc: <stable@vger.kernel.org>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> Changes in v3:
>>>   - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>> RCU lock
>>>     (suggested by Harry)
>>>   - update the commit message (suggested by Harry)
>>>   - temporarily drop the previous Reviewed-by tags
>>>     (since the sync method has changed)
>>>   - rebase onto the next-20260624
>>>
>>> Changes in v2:
>>>   - update the commit message (pointed by Barry)
>>>   - collect Reviewed-by
>>>
>>>   mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>   1 file changed, 38 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>       walk->nr_pages[new_gen][type][zone] += delta;
>>>   }
>>>   +#ifdef CONFIG_MEMCG
>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>> +{
>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>> +
>>> +    rcu_read_lock();
>>
>> Where is this unlocked?
> 
> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
> 
>>
>>> +    /*
>>> +     * The memcg can be NULL when the memory controller is disabled.
>>> +     * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>> +     */
>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>> +        goto lock;
>>> +
>>> +    do {
>>> +        memcg = parent_mem_cgroup(memcg);
>>> +    } while (memcg && css_is_dying(&memcg->css));
>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>>     while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>         memcg = parent_mem_cgroup(memcg);
>>         lruvec = mem_cgroup_lruvec(memcg, pgdat);
> 
> There is no need to acquire the lruvec before finding the first
> non-dying memcg.

struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);

rcu_read_lock()

while (unlikely(memcg_is_dying(memcg)))
        memcg = parent_mem_cgroup(memcg);

lruvec = mem_cgroup_lruvec(memcg, pgdat);
spin_lock_irq(&lruvec->lru_lock);

return lruvec;

should work?

if the memory controller is disabled, it's equivalent to:

rcu_read_lock();
spin_lock_irq(&lruvec->lru_lock);
return lruvec;

-- 
Cheers,
Harry / Hyeonggon


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  4:43     ` Harry Yoo
@ 2026-06-26  4:48       ` Qi Zheng
  2026-06-26  4:59         ` Harry Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26  4:48 UTC (permalink / raw)
  To: Harry Yoo, Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable



On 6/26/26 12:43 PM, Harry Yoo wrote:
> 
> 
> On 6/26/26 11:27 AM, Qi Zheng wrote:
>> Hi Johannes,
>>
>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>
>>>> The mglru page table walker batches per-generation size deltas in
>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>> lock.
>>>> The reset_batch_size() later folds those deltas into walk->lruvec under
>>>> the lruvec lock.
>>>>
>>>> The page table walker can run concurrently with the memcg reparenting
>>>> path
>>>> as follows:
>>>>
>>>> CPU0                           CPU1
>>>> ====                           ====
>>>>
>>>> walk_mm
>>>> --> walk_page_range
>>>>       --> update_batch_size
>>>>           --> walk->nr_pages += delta
>>>>
>>>>                                 mem_cgroup_css_offline
>>>>                                 --> memcg_reparent_objcgs
>>>>                                     --> lock lruvec
>>>>                                         lru_gen_reparent_memcg
>>>>                                         --> reparent child folios to
>>>> parent
>>>>                                         unlock lruvec
>>>>
>>>>       lock lruvec
>>>>       reset_batch_size
>>>>       --> child lrugen->nr_pages += delta
>>>>
>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>
>>>>      VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>                     sizeof(lruvec->lrugen.nr_pages)));
>>>>
>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>> nr_pages
>>>> reaches zero, but there are still more pages.
>>>>
>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>>>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>>>> to the first non-dying ancestor.
>>>>
>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>>>> Cc: <stable@vger.kernel.org>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> Changes in v3:
>>>>    - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>>> RCU lock
>>>>      (suggested by Harry)
>>>>    - update the commit message (suggested by Harry)
>>>>    - temporarily drop the previous Reviewed-by tags
>>>>      (since the sync method has changed)
>>>>    - rebase onto the next-20260624
>>>>
>>>> Changes in v2:
>>>>    - update the commit message (pointed by Barry)
>>>>    - collect Reviewed-by
>>>>
>>>>    mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>    1 file changed, 38 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>        walk->nr_pages[new_gen][type][zone] += delta;
>>>>    }
>>>>    +#ifdef CONFIG_MEMCG
>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>> +{
>>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>> +
>>>> +    rcu_read_lock();
>>>
>>> Where is this unlocked?
>>
>> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>>
>>>
>>>> +    /*
>>>> +     * The memcg can be NULL when the memory controller is disabled.
>>>> +     * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>> +     */
>>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>>> +        goto lock;
>>>> +
>>>> +    do {
>>>> +        memcg = parent_mem_cgroup(memcg);
>>>> +    } while (memcg && css_is_dying(&memcg->css));
>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>>      while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>          memcg = parent_mem_cgroup(memcg);
>>>          lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> There is no need to acquire the lruvec before finding the first
>> non-dying memcg.
> 
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> 
> rcu_read_lock()
> 
> while (unlikely(memcg_is_dying(memcg)))
>          memcg = parent_mem_cgroup(memcg);
> 
> lruvec = mem_cgroup_lruvec(memcg, pgdat);

If the first memcg is already non-dying, there's no need to re-acquire
the lruvec. ;)

Thanks,
Qi

> spin_lock_irq(&lruvec->lru_lock);
> 
> return lruvec;
> 
> should work?
> 
> if the memory controller is disabled, it's equivalent to:
> 
> rcu_read_lock();
> spin_lock_irq(&lruvec->lru_lock);
> return lruvec;
> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  4:48       ` Qi Zheng
@ 2026-06-26  4:59         ` Harry Yoo
  2026-06-26  6:24           ` Qi Zheng
  0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-06-26  4:59 UTC (permalink / raw)
  To: Qi Zheng, Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable


[-- Attachment #1.1: Type: text/plain, Size: 5253 bytes --]



On 6/26/26 1:48 PM, Qi Zheng wrote:
> 
> 
> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>
>>
>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>> Hi Johannes,
>>>
>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>
>>>>> The mglru page table walker batches per-generation size deltas in
>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>> lock.
>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>> under
>>>>> the lruvec lock.
>>>>>
>>>>> The page table walker can run concurrently with the memcg reparenting
>>>>> path
>>>>> as follows:
>>>>>
>>>>> CPU0                           CPU1
>>>>> ====                           ====
>>>>>
>>>>> walk_mm
>>>>> --> walk_page_range
>>>>>       --> update_batch_size
>>>>>           --> walk->nr_pages += delta
>>>>>
>>>>>                                 mem_cgroup_css_offline
>>>>>                                 --> memcg_reparent_objcgs
>>>>>                                     --> lock lruvec
>>>>>                                         lru_gen_reparent_memcg
>>>>>                                         --> reparent child folios to
>>>>> parent
>>>>>                                         unlock lruvec
>>>>>
>>>>>       lock lruvec
>>>>>       reset_batch_size
>>>>>       --> child lrugen->nr_pages += delta
>>>>>
>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>
>>>>>      VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>                     sizeof(lruvec->lrugen.nr_pages)));
>>>>>
>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>> nr_pages
>>>>> reaches zero, but there are still more pages.
>>>>>
>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>> lruvec
>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>> deltas
>>>>> to the first non-dying ancestor.
>>>>>
>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>> folios")
>>>>> Cc: <stable@vger.kernel.org>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>> Changes in v3:
>>>>>    - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>>>> RCU lock
>>>>>      (suggested by Harry)
>>>>>    - update the commit message (suggested by Harry)
>>>>>    - temporarily drop the previous Reviewed-by tags
>>>>>      (since the sync method has changed)
>>>>>    - rebase onto the next-20260624
>>>>>
>>>>> Changes in v2:
>>>>>    - update the commit message (pointed by Barry)
>>>>>    - collect Reviewed-by
>>>>>
>>>>>    mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>    1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>        walk->nr_pages[new_gen][type][zone] += delta;
>>>>>    }
>>>>>    +#ifdef CONFIG_MEMCG
>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>> +{
>>>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>> +
>>>>> +    rcu_read_lock();
>>>>
>>>> Where is this unlocked?
>>>
>>> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>>>
>>>>
>>>>> +    /*
>>>>> +     * The memcg can be NULL when the memory controller is disabled.
>>>>> +     * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>> +     */
>>>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>>>> +        goto lock;
>>>>> +
>>>>> +    do {
>>>>> +        memcg = parent_mem_cgroup(memcg);
>>>>> +    } while (memcg && css_is_dying(&memcg->css));
>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>>      while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>          memcg = parent_mem_cgroup(memcg);
>>>>          lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>> There is no need to acquire the lruvec before finding the first
>>> non-dying memcg.
>>
>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>> rcu_read_lock()
>>
>> while (unlikely(memcg_is_dying(memcg)))
>>          memcg = parent_mem_cgroup(memcg);
>>
>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
> 
> If the first memcg is already non-dying, there's no need to re-acquire
> the lruvec. ;)

Oh, right :)

Hmm but I still think Johannes' suggestion makes the code cleaner.
Observing a dying cgroup should be rare anyway, it's worth focusing
more on readability?

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  4:59         ` Harry Yoo
@ 2026-06-26  6:24           ` Qi Zheng
  2026-06-26  6:48             ` Harry Yoo
  0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26  6:24 UTC (permalink / raw)
  To: Harry Yoo, Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable



On 6/26/26 12:59 PM, Harry Yoo wrote:
> 
> 
> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>
>>
>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>
>>>
>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>> Hi Johannes,
>>>>
>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>
>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>> lock.
>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>> under
>>>>>> the lruvec lock.
>>>>>>
>>>>>> The page table walker can run concurrently with the memcg reparenting
>>>>>> path
>>>>>> as follows:
>>>>>>
>>>>>> CPU0                           CPU1
>>>>>> ====                           ====
>>>>>>
>>>>>> walk_mm
>>>>>> --> walk_page_range
>>>>>>        --> update_batch_size
>>>>>>            --> walk->nr_pages += delta
>>>>>>
>>>>>>                                  mem_cgroup_css_offline
>>>>>>                                  --> memcg_reparent_objcgs
>>>>>>                                      --> lock lruvec
>>>>>>                                          lru_gen_reparent_memcg
>>>>>>                                          --> reparent child folios to
>>>>>> parent
>>>>>>                                          unlock lruvec
>>>>>>
>>>>>>        lock lruvec
>>>>>>        reset_batch_size
>>>>>>        --> child lrugen->nr_pages += delta
>>>>>>
>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>
>>>>>>       VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>>                      sizeof(lruvec->lrugen.nr_pages)));
>>>>>>
>>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>>> nr_pages
>>>>>> reaches zero, but there are still more pages.
>>>>>>
>>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>>> lruvec
>>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>>> deltas
>>>>>> to the first non-dying ancestor.
>>>>>>
>>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>>> folios")
>>>>>> Cc: <stable@vger.kernel.org>
>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>> ---
>>>>>> Changes in v3:
>>>>>>     - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>>>>> RCU lock
>>>>>>       (suggested by Harry)
>>>>>>     - update the commit message (suggested by Harry)
>>>>>>     - temporarily drop the previous Reviewed-by tags
>>>>>>       (since the sync method has changed)
>>>>>>     - rebase onto the next-20260624
>>>>>>
>>>>>> Changes in v2:
>>>>>>     - update the commit message (pointed by Barry)
>>>>>>     - collect Reviewed-by
>>>>>>
>>>>>>     mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>>     1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>         walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>     }
>>>>>>     +#ifdef CONFIG_MEMCG
>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>> +{
>>>>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>> +
>>>>>> +    rcu_read_lock();
>>>>>
>>>>> Where is this unlocked?
>>>>
>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>>>>
>>>>>
>>>>>> +    /*
>>>>>> +     * The memcg can be NULL when the memory controller is disabled.
>>>>>> +     * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>> +     */
>>>>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>>>>> +        goto lock;
>>>>>> +
>>>>>> +    do {
>>>>>> +        memcg = parent_mem_cgroup(memcg);
>>>>>> +    } while (memcg && css_is_dying(&memcg->css));
>>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>
>>>>>       while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>           memcg = parent_mem_cgroup(memcg);
>>>>>           lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>> There is no need to acquire the lruvec before finding the first
>>>> non-dying memcg.
>>>
>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>
>>> rcu_read_lock()
>>>
>>> while (unlikely(memcg_is_dying(memcg)))
>>>           memcg = parent_mem_cgroup(memcg);
>>>
>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> If the first memcg is already non-dying, there's no need to re-acquire
>> the lruvec. ;)
> 
> Oh, right :)
> 
> Hmm but I still think Johannes' suggestion makes the code cleaner.

I don't have a strong preference on which of the two coding styles is
more readable. BTW, is there any kernel documentation I could refer to
for this?

> Observing a dying cgroup should be rare anyway, it's worth focusing
> more on readability?

While it's rare to encounter consecutive dying memcgs, it can still
happen, right?

Thanks,
Qi

> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  6:24           ` Qi Zheng
@ 2026-06-26  6:48             ` Harry Yoo
  2026-06-26  7:04               ` Qi Zheng
  0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-06-26  6:48 UTC (permalink / raw)
  To: Qi Zheng, Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable


[-- Attachment #1.1: Type: text/plain, Size: 6770 bytes --]



On 6/26/26 3:24 PM, Qi Zheng wrote:
> 
> 
> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>
>>
>> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>>
>>>
>>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>>
>>>>
>>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>>> Hi Johannes,
>>>>>
>>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>
>>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>>> lock.
>>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>>> under
>>>>>>> the lruvec lock.
>>>>>>>
>>>>>>> The page table walker can run concurrently with the memcg
>>>>>>> reparenting
>>>>>>> path
>>>>>>> as follows:
>>>>>>>
>>>>>>> CPU0                           CPU1
>>>>>>> ====                           ====
>>>>>>>
>>>>>>> walk_mm
>>>>>>> --> walk_page_range
>>>>>>>        --> update_batch_size
>>>>>>>            --> walk->nr_pages += delta
>>>>>>>
>>>>>>>                                  mem_cgroup_css_offline
>>>>>>>                                  --> memcg_reparent_objcgs
>>>>>>>                                      --> lock lruvec
>>>>>>>                                          lru_gen_reparent_memcg
>>>>>>>                                          --> reparent child
>>>>>>> folios to
>>>>>>> parent
>>>>>>>                                          unlock lruvec
>>>>>>>
>>>>>>>        lock lruvec
>>>>>>>        reset_batch_size
>>>>>>>        --> child lrugen->nr_pages += delta
>>>>>>>
>>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>>
>>>>>>>       VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>>>                      sizeof(lruvec->lrugen.nr_pages)));
>>>>>>>
>>>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>>>> nr_pages
>>>>>>> reaches zero, but there are still more pages.
>>>>>>>
>>>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>>>> lruvec
>>>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>>>> deltas
>>>>>>> to the first non-dying ancestor.
>>>>>>>
>>>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>>>> folios")
>>>>>>> Cc: <stable@vger.kernel.org>
>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>> ---
>>>>>>> Changes in v3:
>>>>>>>     - re-implement lock_batch_lruvec() by checking CSS_DYING
>>>>>>> under the
>>>>>>> RCU lock
>>>>>>>       (suggested by Harry)
>>>>>>>     - update the commit message (suggested by Harry)
>>>>>>>     - temporarily drop the previous Reviewed-by tags
>>>>>>>       (since the sync method has changed)
>>>>>>>     - rebase onto the next-20260624
>>>>>>>
>>>>>>> Changes in v2:
>>>>>>>     - update the commit message (pointed by Barry)
>>>>>>>     - collect Reviewed-by
>>>>>>>
>>>>>>>     mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>>>     1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>>         walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>>     }
>>>>>>>     +#ifdef CONFIG_MEMCG
>>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>>> +{
>>>>>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>> +
>>>>>>> +    rcu_read_lock();
>>>>>>
>>>>>> Where is this unlocked?
>>>>>
>>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the
>>>>> unlocking.
>>>>>
>>>>>>
>>>>>>> +    /*
>>>>>>> +     * The memcg can be NULL when the memory controller is
>>>>>>> disabled.
>>>>>>> +     * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>>> +     */
>>>>>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>>>>>> +        goto lock;
>>>>>>> +
>>>>>>> +    do {
>>>>>>> +        memcg = parent_mem_cgroup(memcg);
>>>>>>> +    } while (memcg && css_is_dying(&memcg->css));
>>>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>
>>>>>>       while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>>           memcg = parent_mem_cgroup(memcg);
>>>>>>           lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>
>>>>> There is no need to acquire the lruvec before finding the first
>>>>> non-dying memcg.
>>>>
>>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>
>>>> rcu_read_lock()
>>>>
>>>> while (unlikely(memcg_is_dying(memcg)))
>>>>           memcg = parent_mem_cgroup(memcg);
>>>>
>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>> If the first memcg is already non-dying, there's no need to re-acquire
>>> the lruvec. ;)
>>
>> Oh, right :)
>>
>> Hmm but I still think Johannes' suggestion makes the code cleaner.
> 
> I don't have a strong preference on which of the two coding styles is
> more readable. BTW, is there any kernel documentation I could refer to
> for this?

I don't think there's a coding style guide that specifically
mentions this. Just thought it's cleaner because it merges if (...) goto
lock; and do-while into a single while loop.

>> Observing a dying cgroup should be rare anyway, it's worth focusing
>> more on readability?
> 
> While it's rare to encounter consecutive dying memcgs, it can still
> happen, right?

But is worth saving a few instruction in a basic block that is
unlikely() to be executed?

I'm not a memcg maintainer myself, though. just my 2 cents.

-- 
Cheers,
Harry / Hyeonggon


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  6:48             ` Harry Yoo
@ 2026-06-26  7:04               ` Qi Zheng
  2026-06-26  7:09                 ` Harry Yoo
  2026-06-26  9:39                 ` Johannes Weiner
  0 siblings, 2 replies; 18+ messages in thread
From: Qi Zheng @ 2026-06-26  7:04 UTC (permalink / raw)
  To: Harry Yoo, Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable



On 6/26/26 2:48 PM, Harry Yoo wrote:
> 
> 
> On 6/26/26 3:24 PM, Qi Zheng wrote:
>>
>>
>> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>>
>>>
>>> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>>>
>>>>
>>>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>>>
>>>>>
>>>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>>>> Hi Johannes,
>>>>>>
>>>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>>
>>>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>>>> lock.
>>>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>>>> under
>>>>>>>> the lruvec lock.
>>>>>>>>
>>>>>>>> The page table walker can run concurrently with the memcg
>>>>>>>> reparenting
>>>>>>>> path
>>>>>>>> as follows:
>>>>>>>>
>>>>>>>> CPU0                           CPU1
>>>>>>>> ====                           ====
>>>>>>>>
>>>>>>>> walk_mm
>>>>>>>> --> walk_page_range
>>>>>>>>         --> update_batch_size
>>>>>>>>             --> walk->nr_pages += delta
>>>>>>>>
>>>>>>>>                                   mem_cgroup_css_offline
>>>>>>>>                                   --> memcg_reparent_objcgs
>>>>>>>>                                       --> lock lruvec
>>>>>>>>                                           lru_gen_reparent_memcg
>>>>>>>>                                           --> reparent child
>>>>>>>> folios to
>>>>>>>> parent
>>>>>>>>                                           unlock lruvec
>>>>>>>>
>>>>>>>>         lock lruvec
>>>>>>>>         reset_batch_size
>>>>>>>>         --> child lrugen->nr_pages += delta
>>>>>>>>
>>>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>>>
>>>>>>>>        VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>>>>                       sizeof(lruvec->lrugen.nr_pages)));
>>>>>>>>
>>>>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>>>>> nr_pages
>>>>>>>> reaches zero, but there are still more pages.
>>>>>>>>
>>>>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>>>>> lruvec
>>>>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>>>>> deltas
>>>>>>>> to the first non-dying ancestor.
>>>>>>>>
>>>>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>>>>> folios")
>>>>>>>> Cc: <stable@vger.kernel.org>
>>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>> ---
>>>>>>>> Changes in v3:
>>>>>>>>      - re-implement lock_batch_lruvec() by checking CSS_DYING
>>>>>>>> under the
>>>>>>>> RCU lock
>>>>>>>>        (suggested by Harry)
>>>>>>>>      - update the commit message (suggested by Harry)
>>>>>>>>      - temporarily drop the previous Reviewed-by tags
>>>>>>>>        (since the sync method has changed)
>>>>>>>>      - rebase onto the next-20260624
>>>>>>>>
>>>>>>>> Changes in v2:
>>>>>>>>      - update the commit message (pointed by Barry)
>>>>>>>>      - collect Reviewed-by
>>>>>>>>
>>>>>>>>      mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>>>>      1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>>>          walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>>>      }
>>>>>>>>      +#ifdef CONFIG_MEMCG
>>>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>>>> +{
>>>>>>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>>> +
>>>>>>>> +    rcu_read_lock();
>>>>>>>
>>>>>>> Where is this unlocked?
>>>>>>
>>>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the
>>>>>> unlocking.
>>>>>>
>>>>>>>
>>>>>>>> +    /*
>>>>>>>> +     * The memcg can be NULL when the memory controller is
>>>>>>>> disabled.
>>>>>>>> +     * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>>>> +     */
>>>>>>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>>>>>>> +        goto lock;
>>>>>>>> +
>>>>>>>> +    do {
>>>>>>>> +        memcg = parent_mem_cgroup(memcg);
>>>>>>>> +    } while (memcg && css_is_dying(&memcg->css));
>>>>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>>
>>>>>>>        while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>>>            memcg = parent_mem_cgroup(memcg);
>>>>>>>            lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>
>>>>>> There is no need to acquire the lruvec before finding the first
>>>>>> non-dying memcg.
>>>>>
>>>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>
>>>>> rcu_read_lock()
>>>>>
>>>>> while (unlikely(memcg_is_dying(memcg)))
>>>>>            memcg = parent_mem_cgroup(memcg);
>>>>>
>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>> If the first memcg is already non-dying, there's no need to re-acquire
>>>> the lruvec. ;)
>>>
>>> Oh, right :)
>>>
>>> Hmm but I still think Johannes' suggestion makes the code cleaner.
>>
>> I don't have a strong preference on which of the two coding styles is
>> more readable. BTW, is there any kernel documentation I could refer to
>> for this?
> 
> I don't think there's a coding style guide that specifically
> mentions this. Just thought it's cleaner because it merges if (...) goto
> lock; and do-while into a single while loop.
> 
>>> Observing a dying cgroup should be rare anyway, it's worth focusing
>>> more on readability?
>>
>> While it's rare to encounter consecutive dying memcgs, it can still
>> happen, right?
> 
> But is worth saving a few instruction in a basic block that is
> unlikely() to be executed?

I don't have a strong opinion here. Hi Johannes, I'll leave the decision
up to you. If necessary, I can send out the v4.

> 
> I'm not a memcg maintainer myself, though. just my 2 cents.

I'd like to express my gratitude for your reviews, and especially
for your invaluable help with the earlier dying memcg work!

Thanks,
Qi

> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  7:04               ` Qi Zheng
@ 2026-06-26  7:09                 ` Harry Yoo
  2026-06-26  9:39                 ` Johannes Weiner
  1 sibling, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2026-06-26  7:09 UTC (permalink / raw)
  To: Qi Zheng, Johannes Weiner
  Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable


[-- Attachment #1.1: Type: text/plain, Size: 3711 bytes --]



On 6/26/26 4:04 PM, Qi Zheng wrote:
> On 6/26/26 2:48 PM, Harry Yoo wrote:
>> On 6/26/26 3:24 PM, Qi Zheng wrote:
>>> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>>> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>>>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>>>>          walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>>>>      }
>>>>>>>>>      +#ifdef CONFIG_MEMCG
>>>>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>>>>> +{
>>>>>>>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>>>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>>>> +
>>>>>>>>> +    rcu_read_lock();
>>>>>>>>
>>>>>>>> Where is this unlocked?
>>>>>>>
>>>>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the
>>>>>>> unlocking.
>>>>>>>
>>>>>>>>> +    /*
>>>>>>>>> +     * The memcg can be NULL when the memory controller is
>>>>>>>>> disabled.
>>>>>>>>> +     * Otherwise, the caller keeps the memcg owning @lruvec
>>>>>>>>> alive.
>>>>>>>>> +     */
>>>>>>>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>>>>>>>> +        goto lock;
>>>>>>>>> +
>>>>>>>>> +    do {
>>>>>>>>> +        memcg = parent_mem_cgroup(memcg);
>>>>>>>>> +    } while (memcg && css_is_dying(&memcg->css));
>>>>>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>>>
>>>>>>>>        while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>>>>            memcg = parent_mem_cgroup(memcg);
>>>>>>>>            lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>>
>>>>>>> There is no need to acquire the lruvec before finding the first
>>>>>>> non-dying memcg.
>>>>>>
>>>>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>
>>>>>> rcu_read_lock()
>>>>>>
>>>>>> while (unlikely(memcg_is_dying(memcg)))
>>>>>>            memcg = parent_mem_cgroup(memcg);
>>>>>>
>>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>
>>>>> If the first memcg is already non-dying, there's no need to re-acquire
>>>>> the lruvec. ;)
>>>>
>>>> Oh, right :)
>>>>
>>>> Hmm but I still think Johannes' suggestion makes the code cleaner.
>>>
>>> I don't have a strong preference on which of the two coding styles is
>>> more readable. BTW, is there any kernel documentation I could refer to
>>> for this?
>>
>> I don't think there's a coding style guide that specifically
>> mentions this. Just thought it's cleaner because it merges if (...) goto
>> lock; and do-while into a single while loop.
>>
>>>> Observing a dying cgroup should be rare anyway, it's worth focusing
>>>> more on readability?
>>>
>>> While it's rare to encounter consecutive dying memcgs, it can still
>>> happen, right?
>>
>> But is worth saving a few instruction in a basic block that is
>> unlikely() to be executed?
> 
> I don't have a strong opinion here. Hi Johannes, I'll leave the decision
> up to you. If necessary, I can send out the v4.
> 
>>
>> I'm not a memcg maintainer myself, though. just my 2 cents.
> 
> I'd like to express my gratitude for your reviews, and especially
> for your invaluable help with the earlier dying memcg work!

No problem, likewise to you and Muchun for invaluable work ;)

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  7:04               ` Qi Zheng
  2026-06-26  7:09                 ` Harry Yoo
@ 2026-06-26  9:39                 ` Johannes Weiner
  2026-06-26 11:21                   ` Qi Zheng
  1 sibling, 1 reply; 18+ messages in thread
From: Johannes Weiner @ 2026-06-26  9:39 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Harry Yoo, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, muchun.song, peiyang_he, mhocko,
	roman.gushchin, ljs, linux-mm, linux-kernel, Qi Zheng, stable

On Fri, Jun 26, 2026 at 03:04:17PM +0800, Qi Zheng wrote:
> On 6/26/26 2:48 PM, Harry Yoo wrote:
> > On 6/26/26 3:24 PM, Qi Zheng wrote:
> >> On 6/26/26 12:59 PM, Harry Yoo wrote:
> >>> Observing a dying cgroup should be rare anyway, it's worth focusing
> >>> more on readability?
> >>
> >> While it's rare to encounter consecutive dying memcgs, it can still
> >> happen, right?
> > 
> > But is worth saving a few instruction in a basic block that is
> > unlikely() to be executed?
> 
> I don't have a strong opinion here. Hi Johannes, I'll leave the decision
> up to you. If necessary, I can send out the v4.

Yes, I was thinking what Harry actually bothered to spell out ;)

The race is rare, multiple levels even rarer, and even *then*
mem_cgroup_lruvec() is a quick inline.

This way you have one block to handle that one rare race
condition. One place to put the comment. No labels, no goto.

Simplicity wins :)


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  9:39                 ` Johannes Weiner
@ 2026-06-26 11:21                   ` Qi Zheng
  2026-06-26 17:08                     ` Shakeel Butt
  0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 11:21 UTC (permalink / raw)
  To: Johannes Weiner, shakeel.butt
  Cc: Harry Yoo, akpm, david, kasong, baohua, axelrasmussen, yuanchu,
	weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
	linux-mm, linux-kernel, Qi Zheng, stable



On 6/26/26 5:39 PM, Johannes Weiner wrote:
> On Fri, Jun 26, 2026 at 03:04:17PM +0800, Qi Zheng wrote:
>> On 6/26/26 2:48 PM, Harry Yoo wrote:
>>> On 6/26/26 3:24 PM, Qi Zheng wrote:
>>>> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>>>> Observing a dying cgroup should be rare anyway, it's worth focusing
>>>>> more on readability?
>>>>
>>>> While it's rare to encounter consecutive dying memcgs, it can still
>>>> happen, right?
>>>
>>> But is worth saving a few instruction in a basic block that is
>>> unlikely() to be executed?
>>
>> I don't have a strong opinion here. Hi Johannes, I'll leave the decision
>> up to you. If necessary, I can send out the v4.
> 
> Yes, I was thinking what Harry actually bothered to spell out ;)
> 
> The race is rare, multiple levels even rarer, and even *then*
> mem_cgroup_lruvec() is a quick inline.
> 
> This way you have one block to handle that one rare race
> condition. One place to put the comment. No labels, no goto.
> 
> Simplicity wins :)

Okay, I will update it as you suggested and send out the v4.

Hi Shakeel, do we really need to move lock_batch_lruvec() to
memcontrol.h? It's currently only used by reset_batch_size().

Thanks,
Qi



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26 11:21                   ` Qi Zheng
@ 2026-06-26 17:08                     ` Shakeel Butt
  0 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2026-06-26 17:08 UTC (permalink / raw)
  To: Qi Zheng
  Cc: Johannes Weiner, Harry Yoo, akpm, david, kasong, baohua,
	axelrasmussen, yuanchu, weixugc, muchun.song, peiyang_he, mhocko,
	roman.gushchin, ljs, linux-mm, linux-kernel, Qi Zheng, stable

On Fri, Jun 26, 2026 at 07:21:28PM +0800, Qi Zheng wrote:
> 
> 
> On 6/26/26 5:39 PM, Johannes Weiner wrote:
> > On Fri, Jun 26, 2026 at 03:04:17PM +0800, Qi Zheng wrote:
> > > On 6/26/26 2:48 PM, Harry Yoo wrote:
> > > > On 6/26/26 3:24 PM, Qi Zheng wrote:
> > > > > On 6/26/26 12:59 PM, Harry Yoo wrote:
> > > > > > Observing a dying cgroup should be rare anyway, it's worth focusing
> > > > > > more on readability?
> > > > > 
> > > > > While it's rare to encounter consecutive dying memcgs, it can still
> > > > > happen, right?
> > > > 
> > > > But is worth saving a few instruction in a basic block that is
> > > > unlikely() to be executed?
> > > 
> > > I don't have a strong opinion here. Hi Johannes, I'll leave the decision
> > > up to you. If necessary, I can send out the v4.
> > 
> > Yes, I was thinking what Harry actually bothered to spell out ;)
> > 
> > The race is rare, multiple levels even rarer, and even *then*
> > mem_cgroup_lruvec() is a quick inline.
> > 
> > This way you have one block to handle that one rare race
> > condition. One place to put the comment. No labels, no goto.
> > 
> > Simplicity wins :)
> 
> Okay, I will update it as you suggested and send out the v4.
> 
> Hi Shakeel, do we really need to move lock_batch_lruvec() to
> memcontrol.h? It's currently only used by reset_batch_size().

This function is very specific to memcg therefore I asked it move to
memcontrol.h and not to keep in vmscan.c but we can always do the cleanup later,
so proceed however you want.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
  2026-06-25 18:41 ` Johannes Weiner
@ 2026-06-25 20:22 ` Shakeel Butt
  2026-06-26  2:39   ` Qi Zheng
  2026-06-26  4:29 ` Peiyang He
  2026-06-26  7:15 ` Harry Yoo
  3 siblings, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2026-06-25 20:22 UTC (permalink / raw)
  To: Qi Zheng
  Cc: akpm, david, kasong, baohua, axelrasmussen, yuanchu, weixugc,
	hannes, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
	ljs, linux-mm, linux-kernel, Qi Zheng, stable

On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
> 
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
> 
> CPU0                           CPU1
> ====                           ====
> 
> walk_mm
> --> walk_page_range
>     --> update_batch_size
>         --> walk->nr_pages += delta
> 
>                               mem_cgroup_css_offline
>                               --> memcg_reparent_objcgs
>                                   --> lock lruvec
>                                       lru_gen_reparent_memcg
>                                       --> reparent child folios to parent
>                                       unlock lruvec
> 
>     lock lruvec
>     reset_batch_size
>     --> child lrugen->nr_pages += delta
> 
> This will trigger the following warning in lru_gen_exit_memcg():
> 
> 	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> 				   sizeof(lruvec->lrugen.nr_pages)));
> 
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
> 
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
> 
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Changes in v3:
>  - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>    (suggested by Harry)
>  - update the commit message (suggested by Harry)
>  - temporarily drop the previous Reviewed-by tags
>    (since the sync method has changed)
>  - rebase onto the next-20260624
> 
> Changes in v2:
>  - update the commit message (pointed by Barry)
>  - collect Reviewed-by
> 
>  mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 38 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35c3bb15ae96..1ec8c23c72b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>  	walk->nr_pages[new_gen][type][zone] += delta;
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)

This is memcg specific function, move this function next to similar functions
like lruvec_lock_irq. Also put irq in the name.

BTW have you checked other places where lruvec_lock_irq is used and if similar
kind of situation can happen?



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-25 20:22 ` Shakeel Butt
@ 2026-06-26  2:39   ` Qi Zheng
  0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2026-06-26  2:39 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: akpm, david, kasong, baohua, axelrasmussen, yuanchu, weixugc,
	hannes, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
	ljs, linux-mm, linux-kernel, Qi Zheng, stable

Hi Shakeel,

On 6/26/26 4:22 AM, Shakeel Butt wrote:
> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> The mglru page table walker batches per-generation size deltas in
>> walk->nr_pages while walking page tables without holding the lruvec lock.
>> The reset_batch_size() later folds those deltas into walk->lruvec under
>> the lruvec lock.
>>
>> The page table walker can run concurrently with the memcg reparenting path
>> as follows:
>>
>> CPU0                           CPU1
>> ====                           ====
>>
>> walk_mm
>> --> walk_page_range
>>      --> update_batch_size
>>          --> walk->nr_pages += delta
>>
>>                                mem_cgroup_css_offline
>>                                --> memcg_reparent_objcgs
>>                                    --> lock lruvec
>>                                        lru_gen_reparent_memcg
>>                                        --> reparent child folios to parent
>>                                        unlock lruvec
>>
>>      lock lruvec
>>      reset_batch_size
>>      --> child lrugen->nr_pages += delta
>>
>> This will trigger the following warning in lru_gen_exit_memcg():
>>
>> 	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>> 				   sizeof(lruvec->lrugen.nr_pages)));
>>
>> And the user-visible impact of underestimated nr_pages in MGLRU was
>> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
>> reaches zero, but there are still more pages.
>>
>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>> to the first non-dying ancestor.
>>
>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> Changes in v3:
>>   - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>>     (suggested by Harry)
>>   - update the commit message (suggested by Harry)
>>   - temporarily drop the previous Reviewed-by tags
>>     (since the sync method has changed)
>>   - rebase onto the next-20260624
>>
>> Changes in v2:
>>   - update the commit message (pointed by Barry)
>>   - collect Reviewed-by
>>
>>   mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>   1 file changed, 38 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 35c3bb15ae96..1ec8c23c72b9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>>   	walk->nr_pages[new_gen][type][zone] += delta;
>>   }
>>   
>> +#ifdef CONFIG_MEMCG
>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> 
> This is memcg specific function, move this function next to similar functions
> like lruvec_lock_irq. Also put irq in the name.

Currently, the lock_batch_lruvec() is only used by reset_batch_size().
Are you intend to make it a common function for use in other places?

Perhaps we could defer making it generic until we actually have a second
user. Since this is just a fix, keeping it self-contained within
vmscan.c might be more compact for now. ;)

> 
> BTW have you checked other places where lruvec_lock_irq is used and if similar
> kind of situation can happen?

I just checked and found no such callers.

Thanks,
Qi

> 



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
  2026-06-25 18:41 ` Johannes Weiner
  2026-06-25 20:22 ` Shakeel Butt
@ 2026-06-26  4:29 ` Peiyang He
  2026-06-26  4:50   ` Qi Zheng
  2026-06-26  7:15 ` Harry Yoo
  3 siblings, 1 reply; 18+ messages in thread
From: Peiyang He @ 2026-06-26  4:29 UTC (permalink / raw)
  To: Qi Zheng, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, hannes, harry, muchun.song,
	mhocko, roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable

Hi Qi,

I have applied this patch and tested it against the PoC, the warning is not observed anymore in the kernel log.
Thanks for your nice work!

Best,
Peiyang

On 2026/6/25 23:15, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
> 
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
> 
> CPU0                           CPU1
> ====                           ====
> 
> walk_mm
> --> walk_page_range
>     --> update_batch_size
>         --> walk->nr_pages += delta
> 
>                               mem_cgroup_css_offline
>                               --> memcg_reparent_objcgs
>                                   --> lock lruvec
>                                       lru_gen_reparent_memcg
>                                       --> reparent child folios to parent
>                                       unlock lruvec
> 
>     lock lruvec
>     reset_batch_size
>     --> child lrugen->nr_pages += delta
> 
> This will trigger the following warning in lru_gen_exit_memcg():
> 
> 	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> 				   sizeof(lruvec->lrugen.nr_pages)));
> 
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
> 
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
> 
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Changes in v3:
>  - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>    (suggested by Harry)
>  - update the commit message (suggested by Harry)
>  - temporarily drop the previous Reviewed-by tags
>    (since the sync method has changed)
>  - rebase onto the next-20260624
> 
> Changes in v2:
>  - update the commit message (pointed by Barry)
>  - collect Reviewed-by
> 
>  mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 38 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35c3bb15ae96..1ec8c23c72b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>  	walk->nr_pages[new_gen][type][zone] += delta;
>  }
>  
> +#ifdef CONFIG_MEMCG
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * The memcg can be NULL when the memory controller is disabled.
> +	 * Otherwise, the caller keeps the memcg owning @lruvec alive.
> +	 */
> +	if (!memcg || !css_is_dying(&memcg->css))
> +		goto lock;
> +
> +	do {
> +		memcg = parent_mem_cgroup(memcg);
> +	} while (memcg && css_is_dying(&memcg->css));
> +	lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
> +lock:
> +	spin_lock_irq(&lruvec->lru_lock);
> +
> +	return lruvec;
> +}
> +#else
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> +	lruvec_lock_irq(lruvec);
> +
> +	return lruvec;
> +}
> +#endif
> +
>  static void reset_batch_size(struct lru_gen_mm_walk *walk)
>  {
>  	int gen, type, zone;
> -	struct lruvec *lruvec = walk->lruvec;
> +	struct lruvec *lruvec = lock_batch_lruvec(walk->lruvec);
>  	struct lru_gen_folio *lrugen = &lruvec->lrugen;
>  
>  	walk->batched = 0;
> @@ -3285,6 +3319,8 @@ static void reset_batch_size(struct lru_gen_mm_walk *walk)
>  			lru += LRU_ACTIVE;
>  		__update_lru_size(lruvec, lru, zone, delta);
>  	}
> +
> +	lruvec_unlock_irq(lruvec);
>  }
>  
>  static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args)
> @@ -3779,11 +3815,8 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
>  			mmap_read_unlock(mm);
>  		}
>  
> -		if (walk->batched) {
> -			lruvec_lock_irq(lruvec);
> +		if (walk->batched)
>  			reset_batch_size(walk);
> -			lruvec_unlock_irq(lruvec);
> -		}
>  
>  		cond_resched();
>  	} while (err == -EAGAIN);
> @@ -4867,9 +4900,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	walk = current->reclaim_state->mm_walk;
>  	if (walk && walk->batched) {
>  		walk->lruvec = lruvec;
> -		lruvec_lock_irq(lruvec);
>  		reset_batch_size(walk);
> -		lruvec_unlock_irq(lruvec);
>  	}
>  
>  	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-26  4:29 ` Peiyang He
@ 2026-06-26  4:50   ` Qi Zheng
  0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2026-06-26  4:50 UTC (permalink / raw)
  To: Peiyang He, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, hannes, harry, muchun.song,
	mhocko, roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable

Hi Peiyang,

On 6/26/26 12:29 PM, Peiyang He wrote:
> Hi Qi,
> 
> I have applied this patch and tested it against the PoC, the warning is not observed anymore in the kernel log.
> Thanks for your nice work!

Thanks a ton for testing this!

> 
> Best,
> Peiyang
> 




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
  2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
                   ` (2 preceding siblings ...)
  2026-06-26  4:29 ` Peiyang He
@ 2026-06-26  7:15 ` Harry Yoo
  3 siblings, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2026-06-26  7:15 UTC (permalink / raw)
  To: Qi Zheng, akpm, david, kasong, shakeel.butt, baohua,
	axelrasmussen, yuanchu, weixugc, hannes, muchun.song, peiyang_he,
	mhocko, roman.gushchin, ljs
  Cc: linux-mm, linux-kernel, Qi Zheng, stable


[-- Attachment #1.1: Type: text/plain, Size: 2243 bytes --]



On 6/26/26 12:15 AM, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
> 
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
> 
> CPU0                           CPU1
> ====                           ====
> 
> walk_mm
> --> walk_page_range
>     --> update_batch_size
>         --> walk->nr_pages += delta
> 
>                               mem_cgroup_css_offline
>                               --> memcg_reparent_objcgs
>                                   --> lock lruvec
>                                       lru_gen_reparent_memcg
>                                       --> reparent child folios to parent
>                                       unlock lruvec
> 
>     lock lruvec
>     reset_batch_size
>     --> child lrugen->nr_pages += delta
> 
> This will trigger the following warning in lru_gen_exit_memcg():
> 
> 	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> 				   sizeof(lruvec->lrugen.nr_pages)));
> 
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
> 
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
> 
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---

besides some nits mentioned elsewhere in the thread,
the approach looks good to me, so:

Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>

-- 
Cheers,
Harry / Hyeonggon

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2026-06-26 17:08 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
2026-06-25 18:41 ` Johannes Weiner
2026-06-26  2:27   ` Qi Zheng
2026-06-26  4:43     ` Harry Yoo
2026-06-26  4:48       ` Qi Zheng
2026-06-26  4:59         ` Harry Yoo
2026-06-26  6:24           ` Qi Zheng
2026-06-26  6:48             ` Harry Yoo
2026-06-26  7:04               ` Qi Zheng
2026-06-26  7:09                 ` Harry Yoo
2026-06-26  9:39                 ` Johannes Weiner
2026-06-26 11:21                   ` Qi Zheng
2026-06-26 17:08                     ` Shakeel Butt
2026-06-25 20:22 ` Shakeel Butt
2026-06-26  2:39   ` Qi Zheng
2026-06-26  4:29 ` Peiyang He
2026-06-26  4:50   ` Qi Zheng
2026-06-26  7:15 ` Harry Yoo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox