* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
@ 2026-06-25 18:41 ` Johannes Weiner
2026-06-26 2:27 ` Qi Zheng
2026-06-25 20:22 ` Shakeel Butt
` (2 subsequent siblings)
3 siblings, 1 reply; 18+ messages in thread
From: Johannes Weiner @ 2026-06-25 18:41 UTC (permalink / raw)
To: Qi Zheng
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
ljs, linux-mm, linux-kernel, Qi Zheng, stable
On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
>
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
>
> CPU0 CPU1
> ==== ====
>
> walk_mm
> --> walk_page_range
> --> update_batch_size
> --> walk->nr_pages += delta
>
> mem_cgroup_css_offline
> --> memcg_reparent_objcgs
> --> lock lruvec
> lru_gen_reparent_memcg
> --> reparent child folios to parent
> unlock lruvec
>
> lock lruvec
> reset_batch_size
> --> child lrugen->nr_pages += delta
>
> This will trigger the following warning in lru_gen_exit_memcg():
>
> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> sizeof(lruvec->lrugen.nr_pages)));
>
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
>
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
>
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Changes in v3:
> - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
> (suggested by Harry)
> - update the commit message (suggested by Harry)
> - temporarily drop the previous Reviewed-by tags
> (since the sync method has changed)
> - rebase onto the next-20260624
>
> Changes in v2:
> - update the commit message (pointed by Barry)
> - collect Reviewed-by
>
> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 38 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35c3bb15ae96..1ec8c23c72b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
> walk->nr_pages[new_gen][type][zone] += delta;
> }
>
> +#ifdef CONFIG_MEMCG
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> + rcu_read_lock();
Where is this unlocked?
> + /*
> + * The memcg can be NULL when the memory controller is disabled.
> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
> + */
> + if (!memcg || !css_is_dying(&memcg->css))
> + goto lock;
> +
> + do {
> + memcg = parent_mem_cgroup(memcg);
> + } while (memcg && css_is_dying(&memcg->css));
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
while (unlikely(memcg && css_is_dying(&memcg->css))) {
memcg = parent_mem_cgroup(memcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
}
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-25 18:41 ` Johannes Weiner
@ 2026-06-26 2:27 ` Qi Zheng
2026-06-26 4:43 ` Harry Yoo
0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 2:27 UTC (permalink / raw)
To: Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
ljs, linux-mm, linux-kernel, Qi Zheng, stable
Hi Johannes,
On 6/26/26 2:41 AM, Johannes Weiner wrote:
> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> The mglru page table walker batches per-generation size deltas in
>> walk->nr_pages while walking page tables without holding the lruvec lock.
>> The reset_batch_size() later folds those deltas into walk->lruvec under
>> the lruvec lock.
>>
>> The page table walker can run concurrently with the memcg reparenting path
>> as follows:
>>
>> CPU0 CPU1
>> ==== ====
>>
>> walk_mm
>> --> walk_page_range
>> --> update_batch_size
>> --> walk->nr_pages += delta
>>
>> mem_cgroup_css_offline
>> --> memcg_reparent_objcgs
>> --> lock lruvec
>> lru_gen_reparent_memcg
>> --> reparent child folios to parent
>> unlock lruvec
>>
>> lock lruvec
>> reset_batch_size
>> --> child lrugen->nr_pages += delta
>>
>> This will trigger the following warning in lru_gen_exit_memcg():
>>
>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>> sizeof(lruvec->lrugen.nr_pages)));
>>
>> And the user-visible impact of underestimated nr_pages in MGLRU was
>> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
>> reaches zero, but there are still more pages.
>>
>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>> to the first non-dying ancestor.
>>
>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> Changes in v3:
>> - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>> (suggested by Harry)
>> - update the commit message (suggested by Harry)
>> - temporarily drop the previous Reviewed-by tags
>> (since the sync method has changed)
>> - rebase onto the next-20260624
>>
>> Changes in v2:
>> - update the commit message (pointed by Barry)
>> - collect Reviewed-by
>>
>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 35c3bb15ae96..1ec8c23c72b9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>> walk->nr_pages[new_gen][type][zone] += delta;
>> }
>>
>> +#ifdef CONFIG_MEMCG
>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>> +{
>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>> +
>> + rcu_read_lock();
>
> Where is this unlocked?
The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>
>> + /*
>> + * The memcg can be NULL when the memory controller is disabled.
>> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
>> + */
>> + if (!memcg || !css_is_dying(&memcg->css))
>> + goto lock;
>> +
>> + do {
>> + memcg = parent_mem_cgroup(memcg);
>> + } while (memcg && css_is_dying(&memcg->css));
>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> while (unlikely(memcg && css_is_dying(&memcg->css))) {
> memcg = parent_mem_cgroup(memcg);
> lruvec = mem_cgroup_lruvec(memcg, pgdat);
There is no need to acquire the lruvec before finding the first
non-dying memcg.
Thanks,
Qi
> }
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 2:27 ` Qi Zheng
@ 2026-06-26 4:43 ` Harry Yoo
2026-06-26 4:48 ` Qi Zheng
0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-06-26 4:43 UTC (permalink / raw)
To: Qi Zheng, Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
[-- Attachment #1.1: Type: text/plain, Size: 5175 bytes --]
On 6/26/26 11:27 AM, Qi Zheng wrote:
> Hi Johannes,
>
> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>
>>> The mglru page table walker batches per-generation size deltas in
>>> walk->nr_pages while walking page tables without holding the lruvec
>>> lock.
>>> The reset_batch_size() later folds those deltas into walk->lruvec under
>>> the lruvec lock.
>>>
>>> The page table walker can run concurrently with the memcg reparenting
>>> path
>>> as follows:
>>>
>>> CPU0 CPU1
>>> ==== ====
>>>
>>> walk_mm
>>> --> walk_page_range
>>> --> update_batch_size
>>> --> walk->nr_pages += delta
>>>
>>> mem_cgroup_css_offline
>>> --> memcg_reparent_objcgs
>>> --> lock lruvec
>>> lru_gen_reparent_memcg
>>> --> reparent child folios to
>>> parent
>>> unlock lruvec
>>>
>>> lock lruvec
>>> reset_batch_size
>>> --> child lrugen->nr_pages += delta
>>>
>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>
>>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>> sizeof(lruvec->lrugen.nr_pages)));
>>>
>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>> premature OOMs because MGLRU does not try to reclaim memory when
>>> nr_pages
>>> reaches zero, but there are still more pages.
>>>
>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>>> to the first non-dying ancestor.
>>>
>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>>> Cc: <stable@vger.kernel.org>
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> Changes in v3:
>>> - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>> RCU lock
>>> (suggested by Harry)
>>> - update the commit message (suggested by Harry)
>>> - temporarily drop the previous Reviewed-by tags
>>> (since the sync method has changed)
>>> - rebase onto the next-20260624
>>>
>>> Changes in v2:
>>> - update the commit message (pointed by Barry)
>>> - collect Reviewed-by
>>>
>>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>> lru_gen_mm_walk *walk, struct folio *folio,
>>> walk->nr_pages[new_gen][type][zone] += delta;
>>> }
>>> +#ifdef CONFIG_MEMCG
>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>> +{
>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>> +
>>> + rcu_read_lock();
>>
>> Where is this unlocked?
>
> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>
>>
>>> + /*
>>> + * The memcg can be NULL when the memory controller is disabled.
>>> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>> + */
>>> + if (!memcg || !css_is_dying(&memcg->css))
>>> + goto lock;
>>> +
>>> + do {
>>> + memcg = parent_mem_cgroup(memcg);
>>> + } while (memcg && css_is_dying(&memcg->css));
>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> while (unlikely(memcg && css_is_dying(&memcg->css))) {
>> memcg = parent_mem_cgroup(memcg);
>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> There is no need to acquire the lruvec before finding the first
> non-dying memcg.
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
rcu_read_lock()
while (unlikely(memcg_is_dying(memcg)))
memcg = parent_mem_cgroup(memcg);
lruvec = mem_cgroup_lruvec(memcg, pgdat);
spin_lock_irq(&lruvec->lru_lock);
return lruvec;
should work?
if the memory controller is disabled, it's equivalent to:
rcu_read_lock();
spin_lock_irq(&lruvec->lru_lock);
return lruvec;
--
Cheers,
Harry / Hyeonggon
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 4:43 ` Harry Yoo
@ 2026-06-26 4:48 ` Qi Zheng
2026-06-26 4:59 ` Harry Yoo
0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 4:48 UTC (permalink / raw)
To: Harry Yoo, Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
On 6/26/26 12:43 PM, Harry Yoo wrote:
>
>
> On 6/26/26 11:27 AM, Qi Zheng wrote:
>> Hi Johannes,
>>
>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>
>>>> The mglru page table walker batches per-generation size deltas in
>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>> lock.
>>>> The reset_batch_size() later folds those deltas into walk->lruvec under
>>>> the lruvec lock.
>>>>
>>>> The page table walker can run concurrently with the memcg reparenting
>>>> path
>>>> as follows:
>>>>
>>>> CPU0 CPU1
>>>> ==== ====
>>>>
>>>> walk_mm
>>>> --> walk_page_range
>>>> --> update_batch_size
>>>> --> walk->nr_pages += delta
>>>>
>>>> mem_cgroup_css_offline
>>>> --> memcg_reparent_objcgs
>>>> --> lock lruvec
>>>> lru_gen_reparent_memcg
>>>> --> reparent child folios to
>>>> parent
>>>> unlock lruvec
>>>>
>>>> lock lruvec
>>>> reset_batch_size
>>>> --> child lrugen->nr_pages += delta
>>>>
>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>
>>>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>> sizeof(lruvec->lrugen.nr_pages)));
>>>>
>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>> nr_pages
>>>> reaches zero, but there are still more pages.
>>>>
>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>>>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>>>> to the first non-dying ancestor.
>>>>
>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>>>> Cc: <stable@vger.kernel.org>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> Changes in v3:
>>>> - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>>> RCU lock
>>>> (suggested by Harry)
>>>> - update the commit message (suggested by Harry)
>>>> - temporarily drop the previous Reviewed-by tags
>>>> (since the sync method has changed)
>>>> - rebase onto the next-20260624
>>>>
>>>> Changes in v2:
>>>> - update the commit message (pointed by Barry)
>>>> - collect Reviewed-by
>>>>
>>>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>> walk->nr_pages[new_gen][type][zone] += delta;
>>>> }
>>>> +#ifdef CONFIG_MEMCG
>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>> +{
>>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>> +
>>>> + rcu_read_lock();
>>>
>>> Where is this unlocked?
>>
>> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>>
>>>
>>>> + /*
>>>> + * The memcg can be NULL when the memory controller is disabled.
>>>> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>> + */
>>>> + if (!memcg || !css_is_dying(&memcg->css))
>>>> + goto lock;
>>>> +
>>>> + do {
>>>> + memcg = parent_mem_cgroup(memcg);
>>>> + } while (memcg && css_is_dying(&memcg->css));
>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>> while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>> memcg = parent_mem_cgroup(memcg);
>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> There is no need to acquire the lruvec before finding the first
>> non-dying memcg.
>
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> rcu_read_lock()
>
> while (unlikely(memcg_is_dying(memcg)))
> memcg = parent_mem_cgroup(memcg);
>
> lruvec = mem_cgroup_lruvec(memcg, pgdat);
If the first memcg is already non-dying, there's no need to re-acquire
the lruvec. ;)
Thanks,
Qi
> spin_lock_irq(&lruvec->lru_lock);
>
> return lruvec;
>
> should work?
>
> if the memory controller is disabled, it's equivalent to:
>
> rcu_read_lock();
> spin_lock_irq(&lruvec->lru_lock);
> return lruvec;
>
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 4:48 ` Qi Zheng
@ 2026-06-26 4:59 ` Harry Yoo
2026-06-26 6:24 ` Qi Zheng
0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-06-26 4:59 UTC (permalink / raw)
To: Qi Zheng, Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
[-- Attachment #1.1: Type: text/plain, Size: 5253 bytes --]
On 6/26/26 1:48 PM, Qi Zheng wrote:
>
>
> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>
>>
>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>> Hi Johannes,
>>>
>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>
>>>>> The mglru page table walker batches per-generation size deltas in
>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>> lock.
>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>> under
>>>>> the lruvec lock.
>>>>>
>>>>> The page table walker can run concurrently with the memcg reparenting
>>>>> path
>>>>> as follows:
>>>>>
>>>>> CPU0 CPU1
>>>>> ==== ====
>>>>>
>>>>> walk_mm
>>>>> --> walk_page_range
>>>>> --> update_batch_size
>>>>> --> walk->nr_pages += delta
>>>>>
>>>>> mem_cgroup_css_offline
>>>>> --> memcg_reparent_objcgs
>>>>> --> lock lruvec
>>>>> lru_gen_reparent_memcg
>>>>> --> reparent child folios to
>>>>> parent
>>>>> unlock lruvec
>>>>>
>>>>> lock lruvec
>>>>> reset_batch_size
>>>>> --> child lrugen->nr_pages += delta
>>>>>
>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>
>>>>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>> sizeof(lruvec->lrugen.nr_pages)));
>>>>>
>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>> nr_pages
>>>>> reaches zero, but there are still more pages.
>>>>>
>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>> lruvec
>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>> deltas
>>>>> to the first non-dying ancestor.
>>>>>
>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>> folios")
>>>>> Cc: <stable@vger.kernel.org>
>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>> ---
>>>>> Changes in v3:
>>>>> - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>>>> RCU lock
>>>>> (suggested by Harry)
>>>>> - update the commit message (suggested by Harry)
>>>>> - temporarily drop the previous Reviewed-by tags
>>>>> (since the sync method has changed)
>>>>> - rebase onto the next-20260624
>>>>>
>>>>> Changes in v2:
>>>>> - update the commit message (pointed by Barry)
>>>>> - collect Reviewed-by
>>>>>
>>>>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>> walk->nr_pages[new_gen][type][zone] += delta;
>>>>> }
>>>>> +#ifdef CONFIG_MEMCG
>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>> +{
>>>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>> +
>>>>> + rcu_read_lock();
>>>>
>>>> Where is this unlocked?
>>>
>>> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>>>
>>>>
>>>>> + /*
>>>>> + * The memcg can be NULL when the memory controller is disabled.
>>>>> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>> + */
>>>>> + if (!memcg || !css_is_dying(&memcg->css))
>>>>> + goto lock;
>>>>> +
>>>>> + do {
>>>>> + memcg = parent_mem_cgroup(memcg);
>>>>> + } while (memcg && css_is_dying(&memcg->css));
>>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>> while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>> memcg = parent_mem_cgroup(memcg);
>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>> There is no need to acquire the lruvec before finding the first
>>> non-dying memcg.
>>
>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>> rcu_read_lock()
>>
>> while (unlikely(memcg_is_dying(memcg)))
>> memcg = parent_mem_cgroup(memcg);
>>
>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>
> If the first memcg is already non-dying, there's no need to re-acquire
> the lruvec. ;)
Oh, right :)
Hmm but I still think Johannes' suggestion makes the code cleaner.
Observing a dying cgroup should be rare anyway, it's worth focusing
more on readability?
--
Cheers,
Harry / Hyeonggon
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 4:59 ` Harry Yoo
@ 2026-06-26 6:24 ` Qi Zheng
2026-06-26 6:48 ` Harry Yoo
0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 6:24 UTC (permalink / raw)
To: Harry Yoo, Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
On 6/26/26 12:59 PM, Harry Yoo wrote:
>
>
> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>
>>
>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>
>>>
>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>> Hi Johannes,
>>>>
>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>
>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>> lock.
>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>> under
>>>>>> the lruvec lock.
>>>>>>
>>>>>> The page table walker can run concurrently with the memcg reparenting
>>>>>> path
>>>>>> as follows:
>>>>>>
>>>>>> CPU0 CPU1
>>>>>> ==== ====
>>>>>>
>>>>>> walk_mm
>>>>>> --> walk_page_range
>>>>>> --> update_batch_size
>>>>>> --> walk->nr_pages += delta
>>>>>>
>>>>>> mem_cgroup_css_offline
>>>>>> --> memcg_reparent_objcgs
>>>>>> --> lock lruvec
>>>>>> lru_gen_reparent_memcg
>>>>>> --> reparent child folios to
>>>>>> parent
>>>>>> unlock lruvec
>>>>>>
>>>>>> lock lruvec
>>>>>> reset_batch_size
>>>>>> --> child lrugen->nr_pages += delta
>>>>>>
>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>
>>>>>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>> sizeof(lruvec->lrugen.nr_pages)));
>>>>>>
>>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>>> nr_pages
>>>>>> reaches zero, but there are still more pages.
>>>>>>
>>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>>> lruvec
>>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>>> deltas
>>>>>> to the first non-dying ancestor.
>>>>>>
>>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>>> folios")
>>>>>> Cc: <stable@vger.kernel.org>
>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>> ---
>>>>>> Changes in v3:
>>>>>> - re-implement lock_batch_lruvec() by checking CSS_DYING under the
>>>>>> RCU lock
>>>>>> (suggested by Harry)
>>>>>> - update the commit message (suggested by Harry)
>>>>>> - temporarily drop the previous Reviewed-by tags
>>>>>> (since the sync method has changed)
>>>>>> - rebase onto the next-20260624
>>>>>>
>>>>>> Changes in v2:
>>>>>> - update the commit message (pointed by Barry)
>>>>>> - collect Reviewed-by
>>>>>>
>>>>>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>> --- a/mm/vmscan.c
>>>>>> +++ b/mm/vmscan.c
>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>> walk->nr_pages[new_gen][type][zone] += delta;
>>>>>> }
>>>>>> +#ifdef CONFIG_MEMCG
>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>> +{
>>>>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>> +
>>>>>> + rcu_read_lock();
>>>>>
>>>>> Where is this unlocked?
>>>>
>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the unlocking.
>>>>
>>>>>
>>>>>> + /*
>>>>>> + * The memcg can be NULL when the memory controller is disabled.
>>>>>> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>> + */
>>>>>> + if (!memcg || !css_is_dying(&memcg->css))
>>>>>> + goto lock;
>>>>>> +
>>>>>> + do {
>>>>>> + memcg = parent_mem_cgroup(memcg);
>>>>>> + } while (memcg && css_is_dying(&memcg->css));
>>>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>
>>>>> while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>> memcg = parent_mem_cgroup(memcg);
>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>> There is no need to acquire the lruvec before finding the first
>>>> non-dying memcg.
>>>
>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>
>>> rcu_read_lock()
>>>
>>> while (unlikely(memcg_is_dying(memcg)))
>>> memcg = parent_mem_cgroup(memcg);
>>>
>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>
>> If the first memcg is already non-dying, there's no need to re-acquire
>> the lruvec. ;)
>
> Oh, right :)
>
> Hmm but I still think Johannes' suggestion makes the code cleaner.
I don't have a strong preference on which of the two coding styles is
more readable. BTW, is there any kernel documentation I could refer to
for this?
> Observing a dying cgroup should be rare anyway, it's worth focusing
> more on readability?
While it's rare to encounter consecutive dying memcgs, it can still
happen, right?
Thanks,
Qi
>
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 6:24 ` Qi Zheng
@ 2026-06-26 6:48 ` Harry Yoo
2026-06-26 7:04 ` Qi Zheng
0 siblings, 1 reply; 18+ messages in thread
From: Harry Yoo @ 2026-06-26 6:48 UTC (permalink / raw)
To: Qi Zheng, Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
[-- Attachment #1.1: Type: text/plain, Size: 6770 bytes --]
On 6/26/26 3:24 PM, Qi Zheng wrote:
>
>
> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>
>>
>> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>>
>>>
>>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>>
>>>>
>>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>>> Hi Johannes,
>>>>>
>>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>
>>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>>> lock.
>>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>>> under
>>>>>>> the lruvec lock.
>>>>>>>
>>>>>>> The page table walker can run concurrently with the memcg
>>>>>>> reparenting
>>>>>>> path
>>>>>>> as follows:
>>>>>>>
>>>>>>> CPU0 CPU1
>>>>>>> ==== ====
>>>>>>>
>>>>>>> walk_mm
>>>>>>> --> walk_page_range
>>>>>>> --> update_batch_size
>>>>>>> --> walk->nr_pages += delta
>>>>>>>
>>>>>>> mem_cgroup_css_offline
>>>>>>> --> memcg_reparent_objcgs
>>>>>>> --> lock lruvec
>>>>>>> lru_gen_reparent_memcg
>>>>>>> --> reparent child
>>>>>>> folios to
>>>>>>> parent
>>>>>>> unlock lruvec
>>>>>>>
>>>>>>> lock lruvec
>>>>>>> reset_batch_size
>>>>>>> --> child lrugen->nr_pages += delta
>>>>>>>
>>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>>
>>>>>>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>>> sizeof(lruvec->lrugen.nr_pages)));
>>>>>>>
>>>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>>>> nr_pages
>>>>>>> reaches zero, but there are still more pages.
>>>>>>>
>>>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>>>> lruvec
>>>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>>>> deltas
>>>>>>> to the first non-dying ancestor.
>>>>>>>
>>>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>>>> folios")
>>>>>>> Cc: <stable@vger.kernel.org>
>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>> ---
>>>>>>> Changes in v3:
>>>>>>> - re-implement lock_batch_lruvec() by checking CSS_DYING
>>>>>>> under the
>>>>>>> RCU lock
>>>>>>> (suggested by Harry)
>>>>>>> - update the commit message (suggested by Harry)
>>>>>>> - temporarily drop the previous Reviewed-by tags
>>>>>>> (since the sync method has changed)
>>>>>>> - rebase onto the next-20260624
>>>>>>>
>>>>>>> Changes in v2:
>>>>>>> - update the commit message (pointed by Barry)
>>>>>>> - collect Reviewed-by
>>>>>>>
>>>>>>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>>>
>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>>> --- a/mm/vmscan.c
>>>>>>> +++ b/mm/vmscan.c
>>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>> walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>> }
>>>>>>> +#ifdef CONFIG_MEMCG
>>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>>> +{
>>>>>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>> +
>>>>>>> + rcu_read_lock();
>>>>>>
>>>>>> Where is this unlocked?
>>>>>
>>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the
>>>>> unlocking.
>>>>>
>>>>>>
>>>>>>> + /*
>>>>>>> + * The memcg can be NULL when the memory controller is
>>>>>>> disabled.
>>>>>>> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>>> + */
>>>>>>> + if (!memcg || !css_is_dying(&memcg->css))
>>>>>>> + goto lock;
>>>>>>> +
>>>>>>> + do {
>>>>>>> + memcg = parent_mem_cgroup(memcg);
>>>>>>> + } while (memcg && css_is_dying(&memcg->css));
>>>>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>
>>>>>> while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>> memcg = parent_mem_cgroup(memcg);
>>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>
>>>>> There is no need to acquire the lruvec before finding the first
>>>>> non-dying memcg.
>>>>
>>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>
>>>> rcu_read_lock()
>>>>
>>>> while (unlikely(memcg_is_dying(memcg)))
>>>> memcg = parent_mem_cgroup(memcg);
>>>>
>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>
>>> If the first memcg is already non-dying, there's no need to re-acquire
>>> the lruvec. ;)
>>
>> Oh, right :)
>>
>> Hmm but I still think Johannes' suggestion makes the code cleaner.
>
> I don't have a strong preference on which of the two coding styles is
> more readable. BTW, is there any kernel documentation I could refer to
> for this?
I don't think there's a coding style guide that specifically
mentions this. Just thought it's cleaner because it merges if (...) goto
lock; and do-while into a single while loop.
>> Observing a dying cgroup should be rare anyway, it's worth focusing
>> more on readability?
>
> While it's rare to encounter consecutive dying memcgs, it can still
> happen, right?
But is worth saving a few instruction in a basic block that is
unlikely() to be executed?
I'm not a memcg maintainer myself, though. just my 2 cents.
--
Cheers,
Harry / Hyeonggon
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 6:48 ` Harry Yoo
@ 2026-06-26 7:04 ` Qi Zheng
2026-06-26 7:09 ` Harry Yoo
2026-06-26 9:39 ` Johannes Weiner
0 siblings, 2 replies; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 7:04 UTC (permalink / raw)
To: Harry Yoo, Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
On 6/26/26 2:48 PM, Harry Yoo wrote:
>
>
> On 6/26/26 3:24 PM, Qi Zheng wrote:
>>
>>
>> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>>
>>>
>>> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>>>
>>>>
>>>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>>>
>>>>>
>>>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>>>> Hi Johannes,
>>>>>>
>>>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>>
>>>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>>>> lock.
>>>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>>>> under
>>>>>>>> the lruvec lock.
>>>>>>>>
>>>>>>>> The page table walker can run concurrently with the memcg
>>>>>>>> reparenting
>>>>>>>> path
>>>>>>>> as follows:
>>>>>>>>
>>>>>>>> CPU0 CPU1
>>>>>>>> ==== ====
>>>>>>>>
>>>>>>>> walk_mm
>>>>>>>> --> walk_page_range
>>>>>>>> --> update_batch_size
>>>>>>>> --> walk->nr_pages += delta
>>>>>>>>
>>>>>>>> mem_cgroup_css_offline
>>>>>>>> --> memcg_reparent_objcgs
>>>>>>>> --> lock lruvec
>>>>>>>> lru_gen_reparent_memcg
>>>>>>>> --> reparent child
>>>>>>>> folios to
>>>>>>>> parent
>>>>>>>> unlock lruvec
>>>>>>>>
>>>>>>>> lock lruvec
>>>>>>>> reset_batch_size
>>>>>>>> --> child lrugen->nr_pages += delta
>>>>>>>>
>>>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>>>
>>>>>>>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>>>> sizeof(lruvec->lrugen.nr_pages)));
>>>>>>>>
>>>>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>>>>> nr_pages
>>>>>>>> reaches zero, but there are still more pages.
>>>>>>>>
>>>>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>>>>> lruvec
>>>>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>>>>> deltas
>>>>>>>> to the first non-dying ancestor.
>>>>>>>>
>>>>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>>>>> folios")
>>>>>>>> Cc: <stable@vger.kernel.org>
>>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>> ---
>>>>>>>> Changes in v3:
>>>>>>>> - re-implement lock_batch_lruvec() by checking CSS_DYING
>>>>>>>> under the
>>>>>>>> RCU lock
>>>>>>>> (suggested by Harry)
>>>>>>>> - update the commit message (suggested by Harry)
>>>>>>>> - temporarily drop the previous Reviewed-by tags
>>>>>>>> (since the sync method has changed)
>>>>>>>> - rebase onto the next-20260624
>>>>>>>>
>>>>>>>> Changes in v2:
>>>>>>>> - update the commit message (pointed by Barry)
>>>>>>>> - collect Reviewed-by
>>>>>>>>
>>>>>>>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>>>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>>> walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>>> }
>>>>>>>> +#ifdef CONFIG_MEMCG
>>>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>>>> +{
>>>>>>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>>> +
>>>>>>>> + rcu_read_lock();
>>>>>>>
>>>>>>> Where is this unlocked?
>>>>>>
>>>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the
>>>>>> unlocking.
>>>>>>
>>>>>>>
>>>>>>>> + /*
>>>>>>>> + * The memcg can be NULL when the memory controller is
>>>>>>>> disabled.
>>>>>>>> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>>>> + */
>>>>>>>> + if (!memcg || !css_is_dying(&memcg->css))
>>>>>>>> + goto lock;
>>>>>>>> +
>>>>>>>> + do {
>>>>>>>> + memcg = parent_mem_cgroup(memcg);
>>>>>>>> + } while (memcg && css_is_dying(&memcg->css));
>>>>>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>>
>>>>>>> while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>>> memcg = parent_mem_cgroup(memcg);
>>>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>
>>>>>> There is no need to acquire the lruvec before finding the first
>>>>>> non-dying memcg.
>>>>>
>>>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>
>>>>> rcu_read_lock()
>>>>>
>>>>> while (unlikely(memcg_is_dying(memcg)))
>>>>> memcg = parent_mem_cgroup(memcg);
>>>>>
>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>> If the first memcg is already non-dying, there's no need to re-acquire
>>>> the lruvec. ;)
>>>
>>> Oh, right :)
>>>
>>> Hmm but I still think Johannes' suggestion makes the code cleaner.
>>
>> I don't have a strong preference on which of the two coding styles is
>> more readable. BTW, is there any kernel documentation I could refer to
>> for this?
>
> I don't think there's a coding style guide that specifically
> mentions this. Just thought it's cleaner because it merges if (...) goto
> lock; and do-while into a single while loop.
>
>>> Observing a dying cgroup should be rare anyway, it's worth focusing
>>> more on readability?
>>
>> While it's rare to encounter consecutive dying memcgs, it can still
>> happen, right?
>
> But is worth saving a few instruction in a basic block that is
> unlikely() to be executed?
I don't have a strong opinion here. Hi Johannes, I'll leave the decision
up to you. If necessary, I can send out the v4.
>
> I'm not a memcg maintainer myself, though. just my 2 cents.
I'd like to express my gratitude for your reviews, and especially
for your invaluable help with the earlier dying memcg work!
Thanks,
Qi
>
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 7:04 ` Qi Zheng
@ 2026-06-26 7:09 ` Harry Yoo
2026-06-26 9:39 ` Johannes Weiner
1 sibling, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2026-06-26 7:09 UTC (permalink / raw)
To: Qi Zheng, Johannes Weiner
Cc: akpm, david, kasong, shakeel.butt, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
[-- Attachment #1.1: Type: text/plain, Size: 3711 bytes --]
On 6/26/26 4:04 PM, Qi Zheng wrote:
> On 6/26/26 2:48 PM, Harry Yoo wrote:
>> On 6/26/26 3:24 PM, Qi Zheng wrote:
>>> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>>> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>>>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>>>>> --- a/mm/vmscan.c
>>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>>>> walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>>>> }
>>>>>>>>> +#ifdef CONFIG_MEMCG
>>>>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>>>>> +{
>>>>>>>>> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>>>>> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>>>> +
>>>>>>>>> + rcu_read_lock();
>>>>>>>>
>>>>>>>> Where is this unlocked?
>>>>>>>
>>>>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the
>>>>>>> unlocking.
>>>>>>>
>>>>>>>>> + /*
>>>>>>>>> + * The memcg can be NULL when the memory controller is
>>>>>>>>> disabled.
>>>>>>>>> + * Otherwise, the caller keeps the memcg owning @lruvec
>>>>>>>>> alive.
>>>>>>>>> + */
>>>>>>>>> + if (!memcg || !css_is_dying(&memcg->css))
>>>>>>>>> + goto lock;
>>>>>>>>> +
>>>>>>>>> + do {
>>>>>>>>> + memcg = parent_mem_cgroup(memcg);
>>>>>>>>> + } while (memcg && css_is_dying(&memcg->css));
>>>>>>>>> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>>>
>>>>>>>> while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>>>> memcg = parent_mem_cgroup(memcg);
>>>>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>>
>>>>>>> There is no need to acquire the lruvec before finding the first
>>>>>>> non-dying memcg.
>>>>>>
>>>>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>
>>>>>> rcu_read_lock()
>>>>>>
>>>>>> while (unlikely(memcg_is_dying(memcg)))
>>>>>> memcg = parent_mem_cgroup(memcg);
>>>>>>
>>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>
>>>>> If the first memcg is already non-dying, there's no need to re-acquire
>>>>> the lruvec. ;)
>>>>
>>>> Oh, right :)
>>>>
>>>> Hmm but I still think Johannes' suggestion makes the code cleaner.
>>>
>>> I don't have a strong preference on which of the two coding styles is
>>> more readable. BTW, is there any kernel documentation I could refer to
>>> for this?
>>
>> I don't think there's a coding style guide that specifically
>> mentions this. Just thought it's cleaner because it merges if (...) goto
>> lock; and do-while into a single while loop.
>>
>>>> Observing a dying cgroup should be rare anyway, it's worth focusing
>>>> more on readability?
>>>
>>> While it's rare to encounter consecutive dying memcgs, it can still
>>> happen, right?
>>
>> But is worth saving a few instruction in a basic block that is
>> unlikely() to be executed?
>
> I don't have a strong opinion here. Hi Johannes, I'll leave the decision
> up to you. If necessary, I can send out the v4.
>
>>
>> I'm not a memcg maintainer myself, though. just my 2 cents.
>
> I'd like to express my gratitude for your reviews, and especially
> for your invaluable help with the earlier dying memcg work!
No problem, likewise to you and Muchun for invaluable work ;)
--
Cheers,
Harry / Hyeonggon
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 7:04 ` Qi Zheng
2026-06-26 7:09 ` Harry Yoo
@ 2026-06-26 9:39 ` Johannes Weiner
2026-06-26 11:21 ` Qi Zheng
1 sibling, 1 reply; 18+ messages in thread
From: Johannes Weiner @ 2026-06-26 9:39 UTC (permalink / raw)
To: Qi Zheng
Cc: Harry Yoo, akpm, david, kasong, shakeel.butt, baohua,
axelrasmussen, yuanchu, weixugc, muchun.song, peiyang_he, mhocko,
roman.gushchin, ljs, linux-mm, linux-kernel, Qi Zheng, stable
On Fri, Jun 26, 2026 at 03:04:17PM +0800, Qi Zheng wrote:
> On 6/26/26 2:48 PM, Harry Yoo wrote:
> > On 6/26/26 3:24 PM, Qi Zheng wrote:
> >> On 6/26/26 12:59 PM, Harry Yoo wrote:
> >>> Observing a dying cgroup should be rare anyway, it's worth focusing
> >>> more on readability?
> >>
> >> While it's rare to encounter consecutive dying memcgs, it can still
> >> happen, right?
> >
> > But is worth saving a few instruction in a basic block that is
> > unlikely() to be executed?
>
> I don't have a strong opinion here. Hi Johannes, I'll leave the decision
> up to you. If necessary, I can send out the v4.
Yes, I was thinking what Harry actually bothered to spell out ;)
The race is rare, multiple levels even rarer, and even *then*
mem_cgroup_lruvec() is a quick inline.
This way you have one block to handle that one rare race
condition. One place to put the comment. No labels, no goto.
Simplicity wins :)
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 9:39 ` Johannes Weiner
@ 2026-06-26 11:21 ` Qi Zheng
2026-06-26 17:08 ` Shakeel Butt
0 siblings, 1 reply; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 11:21 UTC (permalink / raw)
To: Johannes Weiner, shakeel.butt
Cc: Harry Yoo, akpm, david, kasong, baohua, axelrasmussen, yuanchu,
weixugc, muchun.song, peiyang_he, mhocko, roman.gushchin, ljs,
linux-mm, linux-kernel, Qi Zheng, stable
On 6/26/26 5:39 PM, Johannes Weiner wrote:
> On Fri, Jun 26, 2026 at 03:04:17PM +0800, Qi Zheng wrote:
>> On 6/26/26 2:48 PM, Harry Yoo wrote:
>>> On 6/26/26 3:24 PM, Qi Zheng wrote:
>>>> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>>>> Observing a dying cgroup should be rare anyway, it's worth focusing
>>>>> more on readability?
>>>>
>>>> While it's rare to encounter consecutive dying memcgs, it can still
>>>> happen, right?
>>>
>>> But is worth saving a few instruction in a basic block that is
>>> unlikely() to be executed?
>>
>> I don't have a strong opinion here. Hi Johannes, I'll leave the decision
>> up to you. If necessary, I can send out the v4.
>
> Yes, I was thinking what Harry actually bothered to spell out ;)
>
> The race is rare, multiple levels even rarer, and even *then*
> mem_cgroup_lruvec() is a quick inline.
>
> This way you have one block to handle that one rare race
> condition. One place to put the comment. No labels, no goto.
>
> Simplicity wins :)
Okay, I will update it as you suggested and send out the v4.
Hi Shakeel, do we really need to move lock_batch_lruvec() to
memcontrol.h? It's currently only used by reset_batch_size().
Thanks,
Qi
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 11:21 ` Qi Zheng
@ 2026-06-26 17:08 ` Shakeel Butt
0 siblings, 0 replies; 18+ messages in thread
From: Shakeel Butt @ 2026-06-26 17:08 UTC (permalink / raw)
To: Qi Zheng
Cc: Johannes Weiner, Harry Yoo, akpm, david, kasong, baohua,
axelrasmussen, yuanchu, weixugc, muchun.song, peiyang_he, mhocko,
roman.gushchin, ljs, linux-mm, linux-kernel, Qi Zheng, stable
On Fri, Jun 26, 2026 at 07:21:28PM +0800, Qi Zheng wrote:
>
>
> On 6/26/26 5:39 PM, Johannes Weiner wrote:
> > On Fri, Jun 26, 2026 at 03:04:17PM +0800, Qi Zheng wrote:
> > > On 6/26/26 2:48 PM, Harry Yoo wrote:
> > > > On 6/26/26 3:24 PM, Qi Zheng wrote:
> > > > > On 6/26/26 12:59 PM, Harry Yoo wrote:
> > > > > > Observing a dying cgroup should be rare anyway, it's worth focusing
> > > > > > more on readability?
> > > > >
> > > > > While it's rare to encounter consecutive dying memcgs, it can still
> > > > > happen, right?
> > > >
> > > > But is worth saving a few instruction in a basic block that is
> > > > unlikely() to be executed?
> > >
> > > I don't have a strong opinion here. Hi Johannes, I'll leave the decision
> > > up to you. If necessary, I can send out the v4.
> >
> > Yes, I was thinking what Harry actually bothered to spell out ;)
> >
> > The race is rare, multiple levels even rarer, and even *then*
> > mem_cgroup_lruvec() is a quick inline.
> >
> > This way you have one block to handle that one rare race
> > condition. One place to put the comment. No labels, no goto.
> >
> > Simplicity wins :)
>
> Okay, I will update it as you suggested and send out the v4.
>
> Hi Shakeel, do we really need to move lock_batch_lruvec() to
> memcontrol.h? It's currently only used by reset_batch_size().
This function is very specific to memcg therefore I asked it move to
memcontrol.h and not to keep in vmscan.c but we can always do the cleanup later,
so proceed however you want.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
2026-06-25 18:41 ` Johannes Weiner
@ 2026-06-25 20:22 ` Shakeel Butt
2026-06-26 2:39 ` Qi Zheng
2026-06-26 4:29 ` Peiyang He
2026-06-26 7:15 ` Harry Yoo
3 siblings, 1 reply; 18+ messages in thread
From: Shakeel Butt @ 2026-06-25 20:22 UTC (permalink / raw)
To: Qi Zheng
Cc: akpm, david, kasong, baohua, axelrasmussen, yuanchu, weixugc,
hannes, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
ljs, linux-mm, linux-kernel, Qi Zheng, stable
On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
>
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
>
> CPU0 CPU1
> ==== ====
>
> walk_mm
> --> walk_page_range
> --> update_batch_size
> --> walk->nr_pages += delta
>
> mem_cgroup_css_offline
> --> memcg_reparent_objcgs
> --> lock lruvec
> lru_gen_reparent_memcg
> --> reparent child folios to parent
> unlock lruvec
>
> lock lruvec
> reset_batch_size
> --> child lrugen->nr_pages += delta
>
> This will trigger the following warning in lru_gen_exit_memcg():
>
> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> sizeof(lruvec->lrugen.nr_pages)));
>
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
>
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
>
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Changes in v3:
> - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
> (suggested by Harry)
> - update the commit message (suggested by Harry)
> - temporarily drop the previous Reviewed-by tags
> (since the sync method has changed)
> - rebase onto the next-20260624
>
> Changes in v2:
> - update the commit message (pointed by Barry)
> - collect Reviewed-by
>
> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 38 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35c3bb15ae96..1ec8c23c72b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
> walk->nr_pages[new_gen][type][zone] += delta;
> }
>
> +#ifdef CONFIG_MEMCG
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
This is memcg specific function, move this function next to similar functions
like lruvec_lock_irq. Also put irq in the name.
BTW have you checked other places where lruvec_lock_irq is used and if similar
kind of situation can happen?
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-25 20:22 ` Shakeel Butt
@ 2026-06-26 2:39 ` Qi Zheng
0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 2:39 UTC (permalink / raw)
To: Shakeel Butt
Cc: akpm, david, kasong, baohua, axelrasmussen, yuanchu, weixugc,
hannes, harry, muchun.song, peiyang_he, mhocko, roman.gushchin,
ljs, linux-mm, linux-kernel, Qi Zheng, stable
Hi Shakeel,
On 6/26/26 4:22 AM, Shakeel Butt wrote:
> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> The mglru page table walker batches per-generation size deltas in
>> walk->nr_pages while walking page tables without holding the lruvec lock.
>> The reset_batch_size() later folds those deltas into walk->lruvec under
>> the lruvec lock.
>>
>> The page table walker can run concurrently with the memcg reparenting path
>> as follows:
>>
>> CPU0 CPU1
>> ==== ====
>>
>> walk_mm
>> --> walk_page_range
>> --> update_batch_size
>> --> walk->nr_pages += delta
>>
>> mem_cgroup_css_offline
>> --> memcg_reparent_objcgs
>> --> lock lruvec
>> lru_gen_reparent_memcg
>> --> reparent child folios to parent
>> unlock lruvec
>>
>> lock lruvec
>> reset_batch_size
>> --> child lrugen->nr_pages += delta
>>
>> This will trigger the following warning in lru_gen_exit_memcg():
>>
>> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>> sizeof(lruvec->lrugen.nr_pages)));
>>
>> And the user-visible impact of underestimated nr_pages in MGLRU was
>> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
>> reaches zero, but there are still more pages.
>>
>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>> flushing the pending batch. A non-dying memcg keeps the original lruvec
>> stable against RCU-delayed offlining; a dying memcg redirects the deltas
>> to the first non-dying ancestor.
>>
>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> Changes in v3:
>> - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>> (suggested by Harry)
>> - update the commit message (suggested by Harry)
>> - temporarily drop the previous Reviewed-by tags
>> (since the sync method has changed)
>> - rebase onto the next-20260624
>>
>> Changes in v2:
>> - update the commit message (pointed by Barry)
>> - collect Reviewed-by
>>
>> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>> 1 file changed, 38 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 35c3bb15ae96..1ec8c23c72b9 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>> walk->nr_pages[new_gen][type][zone] += delta;
>> }
>>
>> +#ifdef CONFIG_MEMCG
>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>
> This is memcg specific function, move this function next to similar functions
> like lruvec_lock_irq. Also put irq in the name.
Currently, the lock_batch_lruvec() is only used by reset_batch_size().
Are you intend to make it a common function for use in other places?
Perhaps we could defer making it generic until we actually have a second
user. Since this is just a fix, keeping it self-contained within
vmscan.c might be more compact for now. ;)
>
> BTW have you checked other places where lruvec_lock_irq is used and if similar
> kind of situation can happen?
I just checked and found no such callers.
Thanks,
Qi
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
2026-06-25 18:41 ` Johannes Weiner
2026-06-25 20:22 ` Shakeel Butt
@ 2026-06-26 4:29 ` Peiyang He
2026-06-26 4:50 ` Qi Zheng
2026-06-26 7:15 ` Harry Yoo
3 siblings, 1 reply; 18+ messages in thread
From: Peiyang He @ 2026-06-26 4:29 UTC (permalink / raw)
To: Qi Zheng, akpm, david, kasong, shakeel.butt, baohua,
axelrasmussen, yuanchu, weixugc, hannes, harry, muchun.song,
mhocko, roman.gushchin, ljs
Cc: linux-mm, linux-kernel, Qi Zheng, stable
Hi Qi,
I have applied this patch and tested it against the PoC, the warning is not observed anymore in the kernel log.
Thanks for your nice work!
Best,
Peiyang
On 2026/6/25 23:15, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
>
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
>
> CPU0 CPU1
> ==== ====
>
> walk_mm
> --> walk_page_range
> --> update_batch_size
> --> walk->nr_pages += delta
>
> mem_cgroup_css_offline
> --> memcg_reparent_objcgs
> --> lock lruvec
> lru_gen_reparent_memcg
> --> reparent child folios to parent
> unlock lruvec
>
> lock lruvec
> reset_batch_size
> --> child lrugen->nr_pages += delta
>
> This will trigger the following warning in lru_gen_exit_memcg():
>
> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> sizeof(lruvec->lrugen.nr_pages)));
>
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
>
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
>
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> Changes in v3:
> - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
> (suggested by Harry)
> - update the commit message (suggested by Harry)
> - temporarily drop the previous Reviewed-by tags
> (since the sync method has changed)
> - rebase onto the next-20260624
>
> Changes in v2:
> - update the commit message (pointed by Barry)
> - collect Reviewed-by
>
> mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
> 1 file changed, 38 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35c3bb15ae96..1ec8c23c72b9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
> walk->nr_pages[new_gen][type][zone] += delta;
> }
>
> +#ifdef CONFIG_MEMCG
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> + struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> + rcu_read_lock();
> +
> + /*
> + * The memcg can be NULL when the memory controller is disabled.
> + * Otherwise, the caller keeps the memcg owning @lruvec alive.
> + */
> + if (!memcg || !css_is_dying(&memcg->css))
> + goto lock;
> +
> + do {
> + memcg = parent_mem_cgroup(memcg);
> + } while (memcg && css_is_dying(&memcg->css));
> + lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +
> +lock:
> + spin_lock_irq(&lruvec->lru_lock);
> +
> + return lruvec;
> +}
> +#else
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> + lruvec_lock_irq(lruvec);
> +
> + return lruvec;
> +}
> +#endif
> +
> static void reset_batch_size(struct lru_gen_mm_walk *walk)
> {
> int gen, type, zone;
> - struct lruvec *lruvec = walk->lruvec;
> + struct lruvec *lruvec = lock_batch_lruvec(walk->lruvec);
> struct lru_gen_folio *lrugen = &lruvec->lrugen;
>
> walk->batched = 0;
> @@ -3285,6 +3319,8 @@ static void reset_batch_size(struct lru_gen_mm_walk *walk)
> lru += LRU_ACTIVE;
> __update_lru_size(lruvec, lru, zone, delta);
> }
> +
> + lruvec_unlock_irq(lruvec);
> }
>
> static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args)
> @@ -3779,11 +3815,8 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
> mmap_read_unlock(mm);
> }
>
> - if (walk->batched) {
> - lruvec_lock_irq(lruvec);
> + if (walk->batched)
> reset_batch_size(walk);
> - lruvec_unlock_irq(lruvec);
> - }
>
> cond_resched();
> } while (err == -EAGAIN);
> @@ -4867,9 +4900,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> walk = current->reclaim_state->mm_walk;
> if (walk && walk->batched) {
> walk->lruvec = lruvec;
> - lruvec_lock_irq(lruvec);
> reset_batch_size(walk);
> - lruvec_unlock_irq(lruvec);
> }
>
> mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
^ permalink raw reply [flat|nested] 18+ messages in thread* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-26 4:29 ` Peiyang He
@ 2026-06-26 4:50 ` Qi Zheng
0 siblings, 0 replies; 18+ messages in thread
From: Qi Zheng @ 2026-06-26 4:50 UTC (permalink / raw)
To: Peiyang He, akpm, david, kasong, shakeel.butt, baohua,
axelrasmussen, yuanchu, weixugc, hannes, harry, muchun.song,
mhocko, roman.gushchin, ljs
Cc: linux-mm, linux-kernel, Qi Zheng, stable
Hi Peiyang,
On 6/26/26 12:29 PM, Peiyang He wrote:
> Hi Qi,
>
> I have applied this patch and tested it against the PoC, the warning is not observed anymore in the kernel log.
> Thanks for your nice work!
Thanks a ton for testing this!
>
> Best,
> Peiyang
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
` (2 preceding siblings ...)
2026-06-26 4:29 ` Peiyang He
@ 2026-06-26 7:15 ` Harry Yoo
3 siblings, 0 replies; 18+ messages in thread
From: Harry Yoo @ 2026-06-26 7:15 UTC (permalink / raw)
To: Qi Zheng, akpm, david, kasong, shakeel.butt, baohua,
axelrasmussen, yuanchu, weixugc, hannes, muchun.song, peiyang_he,
mhocko, roman.gushchin, ljs
Cc: linux-mm, linux-kernel, Qi Zheng, stable
[-- Attachment #1.1: Type: text/plain, Size: 2243 bytes --]
On 6/26/26 12:15 AM, Qi Zheng wrote:
> From: Qi Zheng <zhengqi.arch@bytedance.com>
>
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
>
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
>
> CPU0 CPU1
> ==== ====
>
> walk_mm
> --> walk_page_range
> --> update_batch_size
> --> walk->nr_pages += delta
>
> mem_cgroup_css_offline
> --> memcg_reparent_objcgs
> --> lock lruvec
> lru_gen_reparent_memcg
> --> reparent child folios to parent
> unlock lruvec
>
> lock lruvec
> reset_batch_size
> --> child lrugen->nr_pages += delta
>
> This will trigger the following warning in lru_gen_exit_memcg():
>
> VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
> sizeof(lruvec->lrugen.nr_pages)));
>
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
>
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
>
> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
besides some nits mentioned elsewhere in the thread,
the approach looks good to me, so:
Reviewed-by: Harry Yoo (Oracle) <harry@kernel.org>
--
Cheers,
Harry / Hyeonggon
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread