Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Qi Zheng <qi.zheng@linux.dev>
To: Harry Yoo <harry@kernel.org>,
	akpm@linux-foundation.org, david@kernel.org, kasong@tencent.com,
	shakeel.butt@linux.dev, baohua@kernel.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	hannes@cmpxchg.org, muchun.song@linux.dev,
	peiyang_he@smail.nju.edu.cn, mhocko@kernel.org,
	roman.gushchin@linux.dev, ljs@kernel.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting
Date: Wed, 24 Jun 2026 15:11:56 +0800	[thread overview]
Message-ID: <1d638906-6d64-4e57-a181-4b77683652b5@linux.dev> (raw)
In-Reply-To: <dfe5d773-2992-448b-a6cb-ef633714a08f@kernel.org>

Hi Harry,

On 6/24/26 12:29 PM, Harry Yoo wrote:
> 
> 
> On 6/23/26 6:14 PM, Qi Zheng wrote:
>> Hi Harry,
>>
>> On 6/23/26 4:18 PM, Harry Yoo wrote:
>>> On 6/23/26 4:16 PM, Qi Zheng wrote:
>>>> Hi Harry,
>>>
>>> Hi Qi!
>>>
>>>> On 6/23/26 2:17 PM, Harry Yoo wrote:
>>>>> On 6/23/26 11:42 AM, Qi Zheng wrote:
>>>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>>>
>>>>>> The mglru page table walker batches per-generation size deltas in
>>>>>> walk->nr_pages while walking page tables without holding the lruvec
>>>>>> lock.
>>>>>> The reset_batch_size() later folds those deltas into walk->lruvec
>>>>>> under
>>>>>> the lruvec lock.
>>>>>
>>>>> Ouch.
>>>>>
>>>>> IIRC the user-visible impact of underestimated nr_pages in MGLRU
>>>>> was premature OOMs because MGLRU does not try to reclaim memory when
>>>>> nr_pages reaches zero, but there are still more pages.
>>>>>
>>>>> Perhaps worth mentioning in the changelog?
>>>>
>>>> Maybe this should be placed before "To fix it...".
>>>
>>> Thanks!
>>>
>>>>>> The page table walker can run concurrently with the memcg reparenting
>>>>>> path
>>>>>> as follows:
>>>>>>
>>>>>> CPU0                           CPU1
>>>>>> ====                           ====
>>>>>>
>>>>>> walk_mm
>>>>>> --> walk_page_range
>>>>>>        --> update_batch_size
>>>>>>            --> walk->nr_pages += delta
>>>>>>
>>>>>>                                  mem_cgroup_css_offline
>>>>>>                                  --> memcg_reparent_objcgs
>>>>>>                                      --> lock lruvec
>>>>>>                                          lru_gen_reparent_memcg
>>>>>>                                          --> reparent child folios to
>>>>>> parent
>>>>>>                                          unlock lruvec
>>>>>>
>>>>>>        lock lruvec
>>>>>>        reset_batch_size
>>>>>>        --> child lrugen->nr_pages += delta
>>>>>
>>>>> The problem here is that, while grabbing a reference to memcg
>>>>> (via mem_cgroup_iter(), for example) makes sure that the memcg is not
>>>>> freed, it does not prevent offlining happening, and reset_batch_size()
>>>>> doesn't check whether the lruvec has been reparented, or the lruvec
>>>>> is going to be reparented.
>>>>>
>>>>>> This will trigger the following warning in lru_gen_exit_memcg():
>>>>>>
>>>>>>       VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>>>>>>                      sizeof(lruvec->lrugen.nr_pages)));
>>>>>>
>>>>>> To fix it, add lrugen->reparented to remember the new owner of a
>>>>>> reparented lruvec, and make reset_batch_size() charge pending
>>>>>> deltas to
>>>>>> that owner.
>>>>>
>>>>> Could you please explain why it is unavoidable to introduce the new
>>>>> field and why checking whether the cgroup is dying (and charging deltas
>>>>> to non-dying parent) doesn't work?
>>>>
>>>> Peiyang tried doing this [1], but it doesn't work because
>>>> ss->css_offline() is called before clearing the CSS_ONLINE flag.
>>>
>>> Right.
>>>
>>>> I also considered using mem_cgroup_tryget_online(), but that only
>>>> prevent
>>>> the memcg from being freed. It's doesn't prevent the offlining.
>>>
>>> Right.
>>>
>>> I think checking CSS_DYING under RCU and grabbing the lruvec
>>> of the first non-dying memcg should work (this pattern is already
>>> used where we use RCU to guarantee memcgs are not freed).
>>>
>>> If we do not observe CSS_DYING flag, it is safe to charge deltas
>>> to the lruvec because RCU guarantees that reparenting cannot happen
>>> under us.
>>>
>>> If we do observe CSS_DYING, we can walk up the hierarchy and charge
>>> deltas to the first non-dying memcg.
>>
>> Checking CSS_DYING looks feasible, but the rcu lock alone cannot prevent
>> reparenting. We should recheck CSS_DYING after acquiring the lruvec
>> lock, otherwise we might run into the following race:
> 
> Haha, actually, I was thinking of checking CSS_DYING under both RCU and
> lruvec lock. (because that's the pattern)
> 
>>    CPU0 reset_batch_size              CPU1 memcg teardown
>>    =====================              ==================
>>
>>    read !CSS_DYING
>>
>>                                       set CSS_DYING
> 
> Oh, I thought the entire critical section is covered by RCU.
> (I see lock_batch_lruvec() you suggested below doesn't do that)
> 
> Isn't RCU enough to prevent reparenting because RCU guarantees that
> all readers who read !CSS_DYING complete before reparenting?

Oh, I think you are right.

I forgot that offlining is executed in the rcu work context.

Let's walk through this again:

cgroup_destroy_locked
--> kill_css_sync
     --> css->flags |= CSS_DYING;                    1)
     kill_css_finish
     --> css_killed_ref_fn
         --> css_killed_work_fn  <-- RCU work !!     2)
             --> offline_css
                 --> reparent memcg

So while holding the rcu lock, if CSS_DYING is not observed,
css_killed_work_fn() will not be called until rcu_read_unlock().

So lock_batch_lruvec() can be implemented like this:

#ifdef CONFIG_MEMCG
static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
{
	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
	struct mem_cgroup *memcg = lruvec_memcg(lruvec);

	rcu_read_lock();

	/*
	 * The memcg can be NULL when the memory controller is disabled.
	 * Otherwise, the caller keeps the memcg owning @lruvec alive.
	 */
	if (!memcg || !css_is_dying(&memcg->css))
		goto lock;

	do {
		memcg = parent_mem_cgroup(memcg);
	} while (memcg && css_is_dying(&memcg->css));
	lruvec = mem_cgroup_lruvec(memcg, pgdat);

lock:
	spin_lock_irq(&lruvec->lru_lock);

	return lruvec;
}
#else
static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
{
	lruvec_lock_irq(lruvec);

	return lruvec;
}
#endif

Does this make sense?

Thanks,
Qi

> 
> Now I'm confused. Is it strictly required to check CSS_DYING under
> lruvec lock? CSS_DYING is updated outside the lruvec lock anyway?
> 
>>                                       memcg_reparent_objcgs()
>>                                       lock child lruvec
>>                                       move child to parent
>>                                       zero child nr_pages
>>                                       unlock child lruvec
>>
>>    lock child lruvec
>>    charge stale delta to child
>>
>> So it seems lock_batch_lruvec() should be implemented like this:
>>
>> static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
>> {
>>      struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>
>>      rcu_read_lock();
>> retry:
>>      while (memcg && css_is_dying(&memcg->css))
>>          memcg = parent_mem_cgroup(memcg);
> 
> Isn't this loop unnecessary as spin_lock_irq() -> check CSS_DYING ->
> goto retry does the same thing? (of course, we need to fetch the parent
> memcg before retry then...)
> 
>>      lruvec = mem_cgroup_lruvec(memcg, pgdat);
>>      spin_lock_irq(&lruvec->lru_lock);
>>      if (memcg && unlikely(css_is_dying(&memcg->css))) {
>>          spin_unlock_irq(&lruvec->lru_lock);
>>          goto retry;
>>      }
>>
>>      rcu_read_unlock();
>>
>>      return lruvec;
>> }
> 
> Thanks!
>

     prev parent reply	other threads:[~2026-06-24  7:12 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-23  2:42 [PATCH v2] mm: mglru: fix stale batch updates after memcg reparenting Qi Zheng
2026-06-23  2:56 ` Qi Zheng
2026-06-23  4:03 ` Baolin Wang
2026-06-23  6:17 ` Harry Yoo
2026-06-23  7:16   ` Qi Zheng
2026-06-23  8:18     ` Harry Yoo
2026-06-23  9:14       ` Qi Zheng
2026-06-24  4:29         ` Harry Yoo
2026-06-24  7:11           ` Qi Zheng [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1d638906-6d64-4e57-a181-4b77683652b5@linux.dev \
    --to=qi.zheng@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=baohua@kernel.org \
    --cc=david@kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=harry@kernel.org \
    --cc=kasong@tencent.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ljs@kernel.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=peiyang_he@smail.nju.edu.cn \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeel.butt@linux.dev \
    --cc=stable@vger.kernel.org \
    --cc=weixugc@google.com \
    --cc=yuanchu@google.com \
    --cc=zhengqi.arch@bytedance.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.