Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting

From: Qi Zheng <qi.zheng@linux.dev>
To: Harry Yoo <harry@kernel.org>,
	akpm@linux-foundation.org, david@kernel.org, kasong@tencent.com,
	shakeel.butt@linux.dev, baohua@kernel.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	hannes@cmpxchg.org, muchun.song@linux.dev,
	peiyang_he@smail.nju.edu.cn, mhocko@kernel.org,
	roman.gushchin@linux.dev, ljs@kernel.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
Date: Tue, 23 Jun 2026 15:16:16 +0800
Message-ID: <d97128c0-7d89-4b5c-b891-84f9af702fee@linux.dev>
In-Reply-To: <e74b0808-3bcc-414d-a037-41e479210cc0@kernel.org>

Hi Harry,

On 6/23/26 2:17 PM, Harry Yoo wrote:
> 
> 
> On 6/23/26 11:42 AM, Qi Zheng wrote:
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> The mglru page table walker batches per-generation size deltas in
>> walk->nr_pages while walking page tables without holding the lruvec lock.
>> The reset_batch_size() later folds those deltas into walk->lruvec under
>> the lruvec lock.
> 
> Ouch.
> 
> IIRC the user-visible impact of underestimated nr_pages in MGLRU
> was premature OOMs because MGLRU does not try to reclaim memory when
> nr_pages reaches zero, but there are still more pages.
> 
> Perhaps worth mentioning in the changelog?

Maybe this should be placed before the "To fix it..." paragraph of the
changelog.

> 
>> The page table walker can run concurrently with the memcg reparenting path
>> as follows:
>>
>> CPU0                           CPU1
>> ====                           ====
>>
>> walk_mm
>> --> walk_page_range
>>      --> update_batch_size
>>          --> walk->nr_pages += delta
>>
>>                                mem_cgroup_css_offline
>>                                --> memcg_reparent_objcgs
>>                                    --> lock lruvec
>>                                        lru_gen_reparent_memcg
>>                                        --> reparent child folios to parent
>>                                        unlock lruvec
>>
>>      lock lruvec
>>      reset_batch_size
>>      --> child lrugen->nr_pages += delta
> 
> The problem here is that, while grabbing a reference to memcg
> (via mem_cgroup_iter(), for example) makes sure that the memcg is not
> freed, it does not prevent offlining happening, and reset_batch_size()
> doesn't check whether the lruvec has been reparented, or the lruvec
> is going to be reparented.
> 
>> This will trigger the following warning in lru_gen_exit_memcg():
>>
>> 	VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>> 				   sizeof(lruvec->lrugen.nr_pages)));
>>
>> To fix it, add lrugen->reparented to remember the new owner of a
>> reparented lruvec, and make reset_batch_size() charge pending deltas to
>> that owner.
> 
> Could you please explain why it is unavoidable to introduce the new
> field and why checking whether the cgroup is dying (and charging deltas
> to non-dying parent) doesn't work?

Peiyang tried this [1], but it doesn't work because ss->css_offline() is
called before the CSS_ONLINE flag is cleared. I also considered using
mem_cgroup_tryget_online(), but that only prevents the memcg from being
freed. It doesn't prevent the offlining.

So in the end, I chose the approach used in this patch. Simply adding
a new field to mglru to track its reparenting status seems to be the
most straightforward and effective approach.
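
To make the idea concrete, here is a rough userspace model of the
approach. It is only a sketch, not the patch itself: the struct and
function names with a _model suffix, MODEL_NR_GENS, reparent_lruvec() and
lruvec_is_drained() are made up for illustration, the lruvec locking is
left out entirely, and following a chain of owners is an assumption on my
side. Only the "reparented" field name and reset_batch_size() /
update_batch_size() come from the changelog above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_GENS 4	/* hypothetical; not the kernel's MAX_NR_GENS */

/* Userspace stand-in for the per-lruvec MGLRU state. */
struct lruvec_model {
	long nr_pages[MODEL_NR_GENS];
	struct lruvec_model *reparented;	/* new owner, NULL if not reparented */
};

/* Stand-in for the walker: deltas are batched here without any lock. */
struct walk_model {
	struct lruvec_model *lruvec;
	long nr_pages[MODEL_NR_GENS];
};

/* Record that 'delta' pages moved from old_gen to new_gen. */
static void update_batch_size(struct walk_model *walk, int old_gen,
			      int new_gen, long delta)
{
	walk->nr_pages[old_gen] -= delta;
	walk->nr_pages[new_gen] += delta;
}

/* Move all counts to the parent and remember the new owner. */
static void reparent_lruvec(struct lruvec_model *child,
			    struct lruvec_model *parent)
{
	for (int gen = 0; gen < MODEL_NR_GENS; gen++) {
		parent->nr_pages[gen] += child->nr_pages[gen];
		child->nr_pages[gen] = 0;
	}
	child->reparented = parent;
}

/* Fold pending deltas into whichever lruvec owns the folios now. */
static void reset_batch_size(struct walk_model *walk)
{
	struct lruvec_model *owner = walk->lruvec;

	assert(owner != NULL);
	/* Assumption: follow the chain in case the parent was reparented too. */
	while (owner->reparented)
		owner = owner->reparented;

	for (int gen = 0; gen < MODEL_NR_GENS; gen++) {
		owner->nr_pages[gen] += walk->nr_pages[gen];
		walk->nr_pages[gen] = 0;
	}
}

/* Mirrors the intent of the memchr_inv() check in lru_gen_exit_memcg(). */
static bool lruvec_is_drained(const struct lruvec_model *lruvec)
{
	for (int gen = 0; gen < MODEL_NR_GENS; gen++)
		if (lruvec->nr_pages[gen])
			return false;
	return true;
}
```

Replaying the race from the diagram in this model (update_batch_size() on
the child, then reparent_lruvec(), then reset_batch_size()) leaves the
child's nr_pages all zero and lands the pending deltas on the parent.
Without the reparented pointer, the same sequence would leave non-zero
counts on the child, which is what the VM_WARN_ON_ONCE() catches.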

Thanks,
Qi

[1] https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn

> 
>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
>> Cc: <stable@vger.kernel.org>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> Reviewed-by: Barry Song <baohua@kernel.org>
>> ---
>
