From: Usama Arif
To: Qi Zheng
Cc: Usama Arif, akpm@linux-foundation.org, david@kernel.org,
	kasong@tencent.com, shakeel.butt@linux.dev, baohua@kernel.org,
	axelrasmussen@google.com, yuanchu@google.com, weixugc@google.com,
	hannes@cmpxchg.org, harry@kernel.org, muchun.song@linux.dev,
	peiyang_he@smail.nju.edu.cn, mhocko@kernel.org,
	roman.gushchin@linux.dev, ljs@kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Qi Zheng, stable@vger.kernel.org
Subject: Re: [PATCH v4] mm: mglru: fix stale batch updates after memcg reparenting
Date: Wed, 1 Jul 2026 07:57:35 -0700
Message-ID: <20260701145736.3785016-1-usama.arif@linux.dev>
In-Reply-To: <20260701075251.56413-1-qi.zheng@linux.dev>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, 1 Jul 2026 15:52:51 +0800 Qi Zheng wrote:

> From: Qi Zheng
>
> The mglru page table walker batches per-generation size deltas in
> walk->nr_pages while walking page tables without holding the lruvec lock.
> The reset_batch_size() later folds those deltas into walk->lruvec under
> the lruvec lock.
>
> The page table walker can run concurrently with the memcg reparenting path
> as follows:
>
> CPU0                                CPU1
> ====                                ====
>
> walk_mm
> --> walk_page_range
>     --> update_batch_size
>         --> walk->nr_pages += delta
>
>                                     mem_cgroup_css_offline
>                                     --> memcg_reparent_objcgs
>                                         --> lock lruvec
>                                             lru_gen_reparent_memcg
>                                             --> reparent child folios to parent
>                                             unlock lruvec
>
> lock lruvec
> reset_batch_size
> --> child lrugen->nr_pages += delta
>
> This will trigger the following warning in lru_gen_exit_memcg():
>
>   VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
>                              sizeof(lruvec->lrugen.nr_pages)));
>
> And the user-visible impact of underestimated nr_pages in MGLRU was
> premature OOMs because MGLRU does not try to reclaim memory when nr_pages
> reaches zero, but there are still more pages.
>
> To fix it, make reset_batch_size() check CSS_DYING under RCU before
> flushing the pending batch. A non-dying memcg keeps the original lruvec
> stable against RCU-delayed offlining; a dying memcg redirects the deltas
> to the first non-dying ancestor.
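
The redirect described above can be modeled in userspace as follows. This is
only a sketch to make the intended accounting concrete: struct model_memcg,
first_live_ancestor() and flush_batch() are made-up names, not kernel code,
and the model assumes the root group is never marked dying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a memcg; not the kernel struct. */
struct model_memcg {
	struct model_memcg *parent;	/* NULL for the root */
	bool dying;			/* models css_is_dying() */
	long nr_pages;			/* models one lrugen->nr_pages slot */
};

/* Walk up the hierarchy until a non-dying group (or the root) is found. */
static struct model_memcg *first_live_ancestor(struct model_memcg *memcg)
{
	while (memcg && memcg->dying && memcg->parent)
		memcg = memcg->parent;
	return memcg;
}

/*
 * Fold a batched delta into the first live ancestor instead of the
 * (possibly already reparented) group the walk started from, then
 * clear the batch. Returns the group that received the delta.
 */
static struct model_memcg *flush_batch(struct model_memcg *memcg, long *batched)
{
	struct model_memcg *target = first_live_ancestor(memcg);

	assert(target != NULL);
	target->nr_pages += *batched;
	*batched = 0;
	return target;
}
```

In this model a delta batched against a dying child ends up in its parent, so
the child's counter stays at zero, which is the state the VM_WARN_ON_ONCE()
quoted above checks for.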
>
> Reported-by: Peiyang He
> Closes: https://lore.kernel.org/all/5A9E929D82717101+12fcf643-efb8-4b9a-a53a-1e28cc894f0b@smail.nju.edu.cn
> Fixes: f304652609ea ("mm: vmscan: prepare for reparenting MGLRU folios")
> Cc:
> Signed-off-by: Qi Zheng
> Reviewed-by: Harry Yoo (Oracle)
> ---
> Changes in v4:
>  - re-implement lock_batch_lruvec() in a simpler way
>    (suggested by Johannes and Harry)
>  - collect Reviewed-by
>  - rebase onto the next-20260630
>
> Changes in v3:
>  - re-implement lock_batch_lruvec() by checking CSS_DYING under the RCU lock
>    (suggested by Harry)
>  - update the commit message (suggested by Harry)
>  - temporarily drop the previous Reviewed-by tags
>    (since the sync method has changed)
>  - rebase onto the next-20260624
>
> Changes in v2:
>  - update the commit message (pointed by Barry)
>  - collect Reviewed-by
>
>  mm/vmscan.c | 41 ++++++++++++++++++++++++++++++++++-------
>  1 file changed, 34 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 35c3bb15ae96..ca1e2a870d51 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3262,10 +3262,40 @@ static void update_batch_size(struct lru_gen_mm_walk *walk, struct folio *folio,
>  	walk->nr_pages[new_gen][type][zone] += delta;
>  }
>
> +#ifdef CONFIG_MEMCG
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> +	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> +	rcu_read_lock();
> +
> +	/*
> +	 * The memcg can be NULL when the memory controller is disabled.
> +	 * Otherwise, the caller keeps the memcg owning @lruvec alive.
> +	 */
> +	while (unlikely(memcg && css_is_dying(&memcg->css))) {
> +		memcg = parent_mem_cgroup(memcg);
> +		lruvec = mem_cgroup_lruvec(memcg, pgdat);
> +	}
> +
> +	spin_lock_irq(&lruvec->lru_lock);

Do we need an rcu_read_unlock() here? rcu_read_lock() is taken above, but I
don't see a matching unlock anywhere in this diff.
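
For illustration, one possible placement can be modeled in userspace. This is
only a sketch: every model_* name is made up, none of it is kernel code, and
it assumes that holding the lru_lock is enough to keep the chosen lruvec
stable once the RCU read section ends, which would need to be confirmed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
static int model_rcu_depth;	/* models RCU read-side nesting depth */

static void model_rcu_read_lock(void)
{
	model_rcu_depth++;
}

static void model_rcu_read_unlock(void)
{
	assert(model_rcu_depth > 0);
	model_rcu_depth--;
}

struct model_lruvec {
	struct model_lruvec *parent;	/* NULL for the root */
	bool dying;			/* models css_is_dying() on the owner */
	bool locked;			/* models lru_lock being held */
};

static struct model_lruvec *model_lock_batch_lruvec(struct model_lruvec *lruvec)
{
	model_rcu_read_lock();

	/* Skip dying owners, stopping at the root. */
	while (lruvec->dying && lruvec->parent)
		lruvec = lruvec->parent;

	lruvec->locked = true;	/* models spin_lock_irq(&lruvec->lru_lock) */

	/* Assumed placement: end the read section once the lock is held. */
	model_rcu_read_unlock();

	return lruvec;
}

static void model_unlock_batch_lruvec(struct model_lruvec *lruvec)
{
	assert(lruvec->locked);
	lruvec->locked = false;
}
```

With the unlock in place, the function returns with the lock held and the
read-side depth back at zero, so callers only have to release the lock.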

> +
> +	return lruvec;
> +}
> +#else
> +static struct lruvec *lock_batch_lruvec(struct lruvec *lruvec)
> +{
> +	lruvec_lock_irq(lruvec);
> +
> +	return lruvec;
> +}
> +#endif
> +
>  static void reset_batch_size(struct lru_gen_mm_walk *walk)
>  {
>  	int gen, type, zone;
> -	struct lruvec *lruvec = walk->lruvec;
> +	struct lruvec *lruvec = lock_batch_lruvec(walk->lruvec);
>  	struct lru_gen_folio *lrugen = &lruvec->lrugen;
>
>  	walk->batched = 0;
> @@ -3285,6 +3315,8 @@ static void reset_batch_size(struct lru_gen_mm_walk *walk)
>  			lru += LRU_ACTIVE;
>  		__update_lru_size(lruvec, lru, zone, delta);
>  	}
> +
> +	lruvec_unlock_irq(lruvec);
>  }
>
>  static int should_skip_vma(unsigned long start, unsigned long end, struct mm_walk *args)
> @@ -3779,11 +3811,8 @@ static void walk_mm(struct mm_struct *mm, struct lru_gen_mm_walk *walk)
>  			mmap_read_unlock(mm);
>  		}
>
> -		if (walk->batched) {
> -			lruvec_lock_irq(lruvec);
> +		if (walk->batched)
>  			reset_batch_size(walk);
> -			lruvec_unlock_irq(lruvec);
> -		}
>
>  		cond_resched();
>  	} while (err == -EAGAIN);
> @@ -4867,9 +4896,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	walk = current->reclaim_state->mm_walk;
>  	if (walk && walk->batched) {
>  		walk->lruvec = lruvec;
> -		lruvec_lock_irq(lruvec);
>  		reset_batch_size(walk);
> -		lruvec_unlock_irq(lruvec);
>  	}
>
>  	mod_lruvec_state(lruvec, PGDEMOTE_KSWAPD + reclaimer_offset(sc),
> --
> 2.54.0