All of lore.kernel.org
 help / color / mirror / Atom feed
From: Qi Zheng <qi.zheng@linux.dev>
To: Harry Yoo <harry@kernel.org>, Johannes Weiner <hannes@cmpxchg.org>
Cc: akpm@linux-foundation.org, david@kernel.org, kasong@tencent.com,
	shakeel.butt@linux.dev, baohua@kernel.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	muchun.song@linux.dev, peiyang_he@smail.nju.edu.cn,
	mhocko@kernel.org, roman.gushchin@linux.dev, ljs@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting
Date: Fri, 26 Jun 2026 15:04:17 +0800	[thread overview]
Message-ID: <c0e366ec-ee5d-42d9-ba33-7c630660e8af@linux.dev> (raw)
In-Reply-To: <5a0c6597-6b96-4781-a71b-fd1298b2b7bb@kernel.org>



On 6/26/26 2:48 PM, Harry Yoo wrote:
> 
> 
> On 6/26/26 3:24 PM, Qi Zheng wrote:
>>
>>
>> On 6/26/26 12:59 PM, Harry Yoo wrote:
>>>
>>>
>>> On 6/26/26 1:48 PM, Qi Zheng wrote:
>>>>
>>>>
>>>> On 6/26/26 12:43 PM, Harry Yoo wrote:
>>>>>
>>>>>
>>>>> On 6/26/26 11:27 AM, Qi Zheng wrote:
>>>>>> Hi Johannes,
>>>>>>
>>>>>> On 6/26/26 2:41 AM, Johannes Weiner wrote:
>>>>>>> On Thu, Jun 25, 2026 at 11:15:54PM +0800, Qi Zheng wrote:
>>>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>>
>>>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>>>> lock.
>>>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>>>> under
>>>>>>>> the lruvec lock.
>>>>>>>>
>>>>>>>> The page table walker can run concurrently with the memcg
>>>>>>>> reparenting
>>>>>>>> path
>>>>>>>> as follows:
>>>>>>>>
>>>>>>>> CPU0                           CPU1
>>>>>>>> ====                           ====
>>>>>>>>
>>>>>>>> walk_mm
>>>>>>>> --> walk_page_range
>>>>>>>>         --> update_batch_size
>>>>>>>>             --> walk->nr_pages += delta
>>>>>>>>
>>>>>>>>                                   mem_cgroup_css_offline
>>>>>>>>                                   --> memcg_reparent_objcgs
>>>>>>>>                                       --> lock lruvec
>>>>>>>>                                           lru_gen_reparent_memcg
>>>>>>>>                                           --> reparent child
>>>>>>>> folios to
>>>>>>>> parent
>>>>>>>>                                           unlock lruvec
>>>>>>>>
>>>>>>>>         lock lruvec
>>>>>>>>         reset_batch_size
>>>>>>>>         --> child lrugen->nr_pages += delta
>>>>>>>>
>>>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>>>
>>>>>>>>        VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>>>>                       sizeof(lruvec->lrugen.nr_pages)));
>>>>>>>>
>>>>>>>> And the user-visible impact of underestimated nr_pages in MGLRU was
>>>>>>>> premature OOMs because MGLRU does not try to reclaim memory when
>>>>>>>> nr_pages
>>>>>>>> reaches zero, but there are still more pages.
>>>>>>>>
>>>>>>>> To fix it, make reset_batch_size() check CSS_DYING under RCU before
>>>>>>>> flushing the pending batch. A non-dying memcg keeps the original
>>>>>>>> lruvec
>>>>>>>> stable against RCU-delayed offlining; a dying memcg redirects the
>>>>>>>> deltas
>>>>>>>> to the first non-dying ancestor.
>>>>>>>>
>>>>>>>> Reported-by: Peiyang He <peiyang_he@smail.nju.edu.cn>
>>>>>>>> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-
>>>>>>>> efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
>>>>>>>> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU
>>>>>>>> folios")
>>>>>>>> Cc: <stable@vger.kernel.org>
>>>>>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>>> ---
>>>>>>>> Changes in v3:
>>>>>>>>      - re-implement lock_batch_lruvec() by checking CSS_DYING
>>>>>>>> under the
>>>>>>>> RCU lock
>>>>>>>>        (suggested by Harry)
>>>>>>>>      - update the commit message (suggested by Harry)
>>>>>>>>      - temporarily drop the previous Reviewed-by tags
>>>>>>>>        (since the sync method has changed)
>>>>>>>>      - rebase onto the next-20260624
>>>>>>>>
>>>>>>>> Changes in v2:
>>>>>>>>      - update the commit message (pointed by Barry)
>>>>>>>>      - collect Reviewed-by
>>>>>>>>
>>>>>>>>      mm/vmscan.c | 45 ++++++++++++++++++++++++++++++++++++++-------
>>>>>>>>      1 file changed, 38 insertions(+), 7 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>>>>> index 35c3bb15ae96..1ec8c23c72b9 100644
>>>>>>>> --- a/mm/vmscan.c
>>>>>>>> +++ b/mm/vmscan.c
>>>>>>>> @@ -3262,10 +3262,44 @@ static void update_batch_size(struct
>>>>>>>> lru_gen_mm_walk *walk, struct folio *folio,
>>>>>>>>          walk->nr_pages[new_gen][type][zone] += delta;
>>>>>>>>      }
>>>>>>>>      +#ifdef CONFIG_MEMCG
>>>>>>>> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>>>>>>>> +{
>>>>>>>> +    struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>>>> +    struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>>>> +
>>>>>>>> +    rcu_read_lock();
>>>>>>>
>>>>>>> Where is this unlocked?
>>>>>>
>>>>>> The lruvec_unlock_irq() in reset_batch_size() will handle the
>>>>>> unlocking.
>>>>>>
>>>>>>>
>>>>>>>> +    /*
>>>>>>>> +     * The memcg can be NULL when the memory controller is
>>>>>>>> disabled.
>>>>>>>> +     * Otherwise, the caller keeps the memcg owning @lruvec alive.
>>>>>>>> +     */
>>>>>>>> +    if (!memcg || !css_is_dying(&memcg->css))
>>>>>>>> +        goto lock;
>>>>>>>> +
>>>>>>>> +    do {
>>>>>>>> +        memcg = parent_mem_cgroup(memcg);
>>>>>>>> +    } while (memcg && css_is_dying(&memcg->css));
>>>>>>>> +    lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>>
>>>>>>>        while (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>>>>>            memcg = parent_mem_cgroup(memcg);
>>>>>>>            lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>>>
>>>>>> There is no need to acquire the lruvec before finding the first
>>>>>> non-dying memcg.
>>>>>
>>>>> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>
>>>>> rcu_read_lock()
>>>>>
>>>>> while (unlikely(memcg_is_dying(memcg)))
>>>>>            memcg = parent_mem_cgroup(memcg);
>>>>>
>>>>> lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>>>
>>>> If the first memcg is already non-dying, there's no need to re-acquire
>>>> the lruvec. ;)
>>>
>>> Oh, right :)
>>>
>>> Hmm but I still think Johannes' suggestion makes the code cleaner.
>>
>> I don't have a strong preference on which of the two coding styles is
>> more readable. BTW, is there any kernel documentation I could refer to
>> for this?
> 
> I don't think there's a coding style guide that specifically
> mentions this. Just thought it's cleaner because it merges if (...) goto
> lock; and do-while into a single while loop.
> 
>>> Observing a dying cgroup should be rare anyway, it's worth focusing
>>> more on readability?
>>
>> While it's rare to encounter consecutive dying memcgs, it can still
>> happen, right?
> 
> But is worth saving a few instruction in a basic block that is
> unlikely() to be executed?

I don't have a strong opinion here. Hi Johannes, I'll leave the decision
up to you. If necessary, I can send out the v4.

> 
> I'm not a memcg maintainer myself, though. just my 2 cents.

I'd like to express my gratitude for your reviews, and especially
for your invaluable help with the earlier dying memcg work!

Thanks,
Qi

> 


  reply	other threads:[~2026-06-26  7:05 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-25 15:15 [PATCH v3] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
2026-06-25 18:41 ` Johannes Weiner
2026-06-26  2:27   ` Qi Zheng
2026-06-26  4:43     ` Harry Yoo
2026-06-26  4:48       ` Qi Zheng
2026-06-26  4:59         ` Harry Yoo
2026-06-26  6:24           ` Qi Zheng
2026-06-26  6:48             ` Harry Yoo
2026-06-26  7:04               ` Qi Zheng [this message]
2026-06-26  7:09                 ` Harry Yoo
2026-06-26  9:39                 ` Johannes Weiner
2026-06-26 11:21                   ` Qi Zheng
2026-06-26 17:08                     ` Shakeel Butt
2026-06-25 20:22 ` Shakeel Butt
2026-06-26  2:39   ` Qi Zheng
2026-06-26  4:29 ` Peiyang He
2026-06-26  4:50   ` Qi Zheng
2026-06-26  7:15 ` Harry Yoo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c0e366ec-ee5d-42d9-ba33-7c630660e8af@linux.dev \
    --to=qi.zheng@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=baohua@kernel.org \
    --cc=david@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=harry@kernel.org \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=peiyang_he@smail.nju.edu.cn \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=weixugc@google.com \
    --cc=yuanchu@google.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.