Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting


From: Qi Zheng <qi.zheng@linux.dev>
To: Harry Yoo <harry@kernel.org>,
	akpm@linux-foundation.org, david@kernel.org, kasong@tencent.com,
	shakeel.butt@linux.dev, baohua@kernel.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	hannes@cmpxchg.org, muchun.song@linux.dev,
	peiyang_he@smail.nju.edu.cn, mhocko@kernel.org,
	roman.gushchin@linux.dev, ljs@kernel.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
Date: Tue, 23 Jun 2026 17:14:17 +0800
Message-ID: <7946da94-dc1d-4cf2-986e-466c378665b6@linux.dev>
In-Reply-To: <8a76aefd-629c-41f3-b365-aefd4cc1411e@kernel.org>

Hi Harry,

On 6/23/26 4:18 PM, Harry Yoo wrote:
> 
> 
> On 6/23/26 4:16 PM, Qi Zheng wrote:
>> Hi Harry,
> 
> Hi Qi!
> 
>> On 6/23/26 2:17 PM, Harry Yoo wrote:
>>> On 6/23/26 11:42 AM, Qi Zheng wrote:
>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>
>>>> The mglru page table walker batches per-generation size deltas in
>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>> lock. reset_batch_size() later folds those deltas into walk->lruvec
>>>> under the lruvec lock.
>>>
>>> Ouch.
>>>
>>> IIRC the user-visible impact of underestimated nr_pages in MGLRU
>>> was premature OOMs because MGLRU does not try to reclaim memory when
>>> nr_pages reaches zero, but there are still more pages.
>>>
>>> Perhaps worth mentioning in the changelog?
>>
>> Maybe this should be placed before "To fix it...".
> 
> Thanks!
> 
>>>> The page table walker can run concurrently with the memcg reparenting
>>>> path as follows:
>>>>
>>>> CPU0                           CPU1
>>>> ====                           ====
>>>>
>>>> walk_mm
>>>> --> walk_page_range
>>>>       --> update_batch_size
>>>>           --> walk->nr_pages += delta
>>>>
>>>>                                 mem_cgroup_css_offline
>>>>                                 --> memcg_reparent_objcgs
>>>>                                     --> lock lruvec
>>>>                                         lru_gen_reparent_memcg
>>>>                                         --> reparent child folios to parent
>>>>                                         unlock lruvec
>>>>
>>>>       lock lruvec
>>>>       reset_batch_size
>>>>       --> child lrugen->nr_pages += delta
>>>
>>> The problem here is that, while grabbing a reference to memcg
>>> (via mem_cgroup_iter(), for example) makes sure that the memcg is not
>>> freed, it does not prevent offlining happening, and reset_batch_size()
>>> doesn't check whether the lruvec has been reparented, or the lruvec
>>> is going to be reparented.
>>>
>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>
>>>>      VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>                     sizeof(lruvec->lrugen.nr_pages)));
>>>>
>>>> To fix it, add lrugen->reparented to remember the new owner of a
>>>> reparented lruvec, and make reset_batch_size() charge pending deltas to
>>>> that owner.
>>>
>>> Could you please explain why it is unavoidable to introduce the new
>>> field and why checking whether the cgroup is dying (and charging deltas
>>> to non-dying parent) doesn't work?
>>
>> Peiyang tried doing this [1], but it doesn't work because
>> ss->css_offline() is called before clearing the CSS_ONLINE flag.
> 
> Right.
> 
>> I also considered using mem_cgroup_tryget_online(), but that only prevents
>> the memcg from being freed. It doesn't prevent offlining.
> 
> Right.
> 
> I think checking CSS_DYING under RCU and grabbing the lruvec
> of the first non-dying memcg should work (this pattern is already
> used where we use RCU to guarantee memcgs are not freed).
> 
> If we do not observe CSS_DYING flag, it is safe to charge deltas
> to the lruvec because RCU guarantees that reparenting cannot happen
> under us.
> 
> If we do observe CSS_DYING, we can walk up the hierarchy and charge
> deltas to the first non-dying memcg.

Checking CSS_DYING looks feasible, but the RCU read lock alone cannot
prevent reparenting. We need to recheck CSS_DYING after acquiring the
lruvec lock; otherwise we can run into the following race:

   CPU0 reset_batch_size              CPU1 memcg teardown
   =====================              ==================

   read !CSS_DYING

                                      set CSS_DYING
                                      memcg_reparent_objcgs()
                                      lock child lruvec
                                      move child to parent
                                      zero child nr_pages
                                      unlock child lruvec

   lock child lruvec
   charge stale delta to child

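To make the window concrete, here is a small userspace replay of the
interleaving above, run in program order on one thread. It is only a
sketch: struct toy_lruvec, toy_reparent() and toy_replay() are
hypothetical names made up for this illustration, not kernel code, and
the locking is implied rather than modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a per-memcg lruvec; not a kernel structure. */
struct toy_lruvec {
	bool dying;			/* models CSS_DYING on the owning memcg */
	long nr_pages;			/* models one lrugen->nr_pages counter */
	struct toy_lruvec *parent;
};

/* CPU1 in the diagram: mark the child dying, move its pages to the parent. */
static void toy_reparent(struct toy_lruvec *child)
{
	assert(child->parent);
	child->dying = true;
	child->parent->nr_pages += child->nr_pages;
	child->nr_pages = 0;
}

/*
 * Replay the interleaving and return what is left on the child afterwards.
 * With recheck == false, only the stale unlocked check is used.
 */
static long toy_replay(bool recheck, long delta)
{
	struct toy_lruvec parent = { .nr_pages = 0 };
	struct toy_lruvec child = { .nr_pages = 8, .parent = &parent };
	struct toy_lruvec *target = &child;
	bool saw_dying;

	saw_dying = child.dying;	/* CPU0: read !CSS_DYING */
	toy_reparent(&child);		/* CPU1: whole teardown runs in between */

	/* CPU0: "lock child lruvec" would happen here. */
	if (saw_dying || (recheck && child.dying))
		target = child.parent;
	target->nr_pages += delta;	/* CPU0: charge the batched delta */

	return child.nr_pages;
}
```

In this model, toy_replay(false, 3) leaves 3 pages on the already
reparented child, which is the non-zero state that would trip the
VM_WARN_ON_ONCE(), while toy_replay(true, 3) leaves the child at zero.
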
So it seems lock_batch_lruvec() should be implemented like this:

static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
{
	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);

	rcu_read_lock();
retry:
	/* Skip dying memcgs and walk up to the first non-dying ancestor. */
	while (memcg && css_is_dying(&memcg->css))
		memcg = parent_mem_cgroup(memcg);

	lruvec = mem_cgroup_lruvec(memcg, pgdat);
	spin_lock_irq(&lruvec->lru_lock);
	/* Recheck under the lock: offlining may have started meanwhile. */
	if (memcg && unlikely(css_is_dying(&memcg->css))) {
		spin_unlock_irq(&lruvec->lru_lock);
		goto retry;
	}

	rcu_read_unlock();

	return lruvec;
}

This way, there is no need to add lrugen->reparented, right?
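
For anyone who wants to poke at the retry logic outside the kernel, below
is a minimal single-threaded model of the check, lock, recheck pattern.
Everything in it is an assumption for illustration: struct toy_node, the
boolean "lock", and toy_race_hook (a one-shot hook that lets a test run
the teardown exactly in the window between the unlocked check and the
lock) are hypothetical and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a memcg plus its lruvec; not a kernel structure. */
struct toy_node {
	bool dying;			/* models CSS_DYING */
	bool locked;			/* models lruvec->lru_lock */
	long nr_pages;
	struct toy_node *parent;
};

/* One-shot hook fired between the unlocked check and taking the lock. */
static void (*toy_race_hook)(struct toy_node *node);

static void toy_lock(struct toy_node *n)   { assert(!n->locked); n->locked = true; }
static void toy_unlock(struct toy_node *n) { assert(n->locked); n->locked = false; }

/* Teardown side: set dying first, then reparent under the child's lock. */
static void toy_offline(struct toy_node *child)
{
	child->dying = true;
	toy_lock(child);
	child->parent->nr_pages += child->nr_pages;
	child->nr_pages = 0;
	toy_unlock(child);
}

/* Returns the first non-dying ancestor (or node itself), locked. */
static struct toy_node *toy_lock_batch(struct toy_node *node)
{
	for (;;) {
		while (node->dying) {
			assert(node->parent);	/* the root is assumed never to die */
			node = node->parent;
		}

		if (toy_race_hook) {
			void (*hook)(struct toy_node *) = toy_race_hook;

			toy_race_hook = NULL;	/* fire only once */
			hook(node);
		}

		toy_lock(node);
		if (!node->dying)
			return node;
		toy_unlock(node);		/* lost the race, walk up and retry */
	}
}

/* Fold a batched delta into whichever node currently owns the pages. */
static void toy_charge(struct toy_node *node, long delta)
{
	struct toy_node *owner = toy_lock_batch(node);

	owner->nr_pages += delta;
	toy_unlock(owner);
}
```

Setting toy_race_hook = toy_offline before calling toy_charge() on a
child forces the retry path: the first lock attempt sees the dying flag
under the lock, drops the lock, and the delta ends up on the parent.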

Thanks,
Qi
