Re: [PATCH v2 04/12] mm/mglru: restructure the reclaim loop

From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Kairui Song <kasong@tencent.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	 Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	 Johannes Weiner <hannes@cmpxchg.org>,
	David Hildenbrand <david@kernel.org>,
	 Michal Hocko <mhocko@kernel.org>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	 Shakeel Butt <shakeel.butt@linux.dev>,
	Lorenzo Stoakes <ljs@kernel.org>, Barry Song <baohua@kernel.org>,
	 David Stevens <stevensd@google.com>,
	Chen Ridong <chenridong@huaweicloud.com>,
	 Leno Hou <lenohou@gmail.com>, Yafang Shao <laoar.shao@gmail.com>,
	Yu Zhao <yuzhao@google.com>,
	 Zicheng Wang <wangzicheng@honor.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	 Suren Baghdasaryan <surenb@google.com>,
	Chris Li <chrisl@kernel.org>, Vernon Yang <vernon2gm@gmail.com>,
	 linux-kernel@vger.kernel.org, Qi Zheng <qi.zheng@linux.dev>,
	 Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: Re: [PATCH v2 04/12] mm/mglru: restructure the reclaim loop
Date: Sun, 29 Mar 2026 14:47:50 +0800
Message-ID: <aciejkdIHyXPNS9Y@KASONG-MC4>
In-Reply-To: <20260329-mglru-reclaim-v2-4-b53a3678513c@tencent.com>

So sashiko provided some review, I'll post it below:

https://sashiko.dev/#/patchset/20260329-mglru-reclaim-v2-0-b53a3678513c%40tencent.com?patch=12986

> From sashiko:
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4913,49 +4913,40 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  }
> >  
> >  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> > -			     int swappiness, unsigned long *nr_to_scan)
> > +			     struct scan_control *sc, int swappiness)
> >  {
> >  	DEFINE_MIN_SEQ(lruvec);
> >  
> > -	*nr_to_scan = 0;
> >  	/* have to run aging, since eviction is not possible anymore */
> >  	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
> >  		return true;
> >  
> > -	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> > +	/* try to get away with not aging at the default priority */
> > +	if (sc->priority == DEF_PRIORITY)
> > +		return false;
> 
> Is this a regression that unconditionally forces aging even at DEF_PRIORITY?
> In the original code, if eviction wasn't possible, it set need_aging to
> true, but the subsequent check against DEF_PRIORITY could still bypass aging
> and allow MGLRU to skip memcgs without cold pages.
> With the new logic, the mandatory aging check precedes the DEF_PRIORITY
> check, returning true immediately. Will this cause try_to_shrink_lruvec()
> to inappropriately force hot pages into older generations on the gentlest
> reclaim pass instead of trying other memcgs first?

Right, this part is indeed a bit confusing. I tried to split it
into two patches, but that seemed to make things messier.

This is more of an intentional change. When eviction is not possible
(fewer than MIN_NR_GENS generations), aging is the only way to make
progress. The old code always skipped aging at DEF_PRIORITY.
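
To make the ordering change concrete, here is a minimal user-space
sketch. It models only the mandatory-aging case; the remaining
heuristics in should_run_aging() are left out. The helper names are
hypothetical, and the constant values are assumed to match the kernel
definitions at the time of writing.

```c
#include <stdbool.h>

/* Assumed to match the kernel definitions at the time of writing. */
#define MIN_NR_GENS	2
#define DEF_PRIORITY	12

/* True when too few generations remain for eviction to proceed. */
static bool eviction_impossible(unsigned long min_seq, unsigned long max_seq)
{
	return min_seq + MIN_NR_GENS > max_seq;
}

/* Old flow (hypothetical model): aging was skipped at DEF_PRIORITY. */
static bool old_runs_mandatory_aging(unsigned long min_seq,
				     unsigned long max_seq, int priority)
{
	return eviction_impossible(min_seq, max_seq) &&
	       priority != DEF_PRIORITY;
}

/* New flow (hypothetical model): the mandatory check comes first. */
static bool new_runs_mandatory_aging(unsigned long min_seq,
				     unsigned long max_seq, int priority)
{
	(void)priority;	/* not consulted for the mandatory case */
	return eviction_impossible(min_seq, max_seq);
}
```

With min_seq = 3 and max_seq = 4 at DEF_PRIORITY, the old model defers
aging and the new model ages right away.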

For a single cgroup setup, this is not helpful in any way: it wastes a
reclaim iteration only to come back at a higher priority and age
then. It also causes over-reclaim of slab.

For a multi cgroup setup, deferring the aging and reclaiming other
cgroups first may seem like a good idea, but we will have to proceed
with aging anyway, potentially when all cgroups have run out of
gens, causing larger jitter and breaking reclaim balance.
The escalated priority may also cause over-reclaim.

And right now we still have 3 gens as the buffer.

Testing the multi cgroup setup using the JavaScript workload in the
cover letter showed that after this series, multi cgroup reclaim
balance and latency are good, and the total throughput seems better.

Testing the reclaim balance of a cgroup tree (see below) also
shows better results.

I think to avoid the latency of aging, we really need async
aging, not deferred aging. Async aging might be easier to
integrate here after this series.

I could add a few words on that to the commit message.

> From sashiko:
>
> > -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> > +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> > +			   struct mem_cgroup *memcg, int swappiness)
> >  {
> > -	bool need_aging;
> >  	unsigned long nr_to_scan;
> > -	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > -	DEFINE_MAX_SEQ(lruvec);
> > -
> > -	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> > -		return -1;
> > -
> > -	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> >  
> > +	nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> >  	/* try to scrape all its memory if this memcg was deleted */
> > -	if (nr_to_scan && !mem_cgroup_online(memcg))
> > +	if (!mem_cgroup_online(memcg))
> >  		return nr_to_scan;
>
> Will this cause a regression resulting in a permanent memory leak for
> offline memcgs that have zero evictable folios?
> If an offline memcg has all its folios in the unevictable youngest
> generation, lruvec_evictable_size() returns 0. The early return here
> then evaluates to 0.
> When try_to_shrink_lruvec() receives 0, its while (nr_to_scan > 0) loop
> is completely bypassed. Since the aging logic was moved inside that loop,
> the offline memcg is never aged, meaning its folios will remain in the
> unevictable youngest generation indefinitely.

This seems to be a hallucination. lruvec_evictable_size returns the
evictable folios across all gens, without excluding the youngest one,
since aging may make them evictable. Perhaps we can add a comment for
this; if it confuses AI, it may confuse humans too.

And aging is not out of the loop, it's just moved to a different
helper.
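
As an illustration of that point only, here is a simplified user-space
model. The struct and function below are hypothetical stand-ins, not
the real lruvec_evictable_size(), whose internals are not shown in
this thread. The sketch just counts every generation, the youngest
included, as described above.

```c
/* Assumed to match the kernel definition at the time of writing. */
#define MAX_NR_GENS	4

/* Hypothetical, simplified stand-in for per-generation folio counts. */
struct gen_sizes {
	unsigned long nr_folios[MAX_NR_GENS];	/* index 0 is the oldest */
	int nr_gens;				/* generations in use */
};

/* Sum all generations; the youngest one is not excluded. */
static unsigned long evictable_size_model(const struct gen_sizes *g)
{
	unsigned long total = 0;

	for (int i = 0; i < g->nr_gens; i++)
		total += g->nr_folios[i];

	return total;
}
```

Under this model, an offline memcg with all of its folios in the
youngest generation still reports a non-zero size, so the scan loop
is entered.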

> From sashiko:
> >
> >  	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> >  
> > -	/* try to get away with not aging at the default priority */
> > -	if (!need_aging || sc->priority == DEF_PRIORITY)
> > -		return nr_to_scan >> sc->priority;
> > -
> > -	/* stop scanning this lruvec as it's low on cold folios */
> > -	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
> > +	/*
> > +	 * Always respect scan priority, minimally target
> > +	 * SWAP_CLUSTER_MAX pages to keep reclaim moving forwards.
> > +	 */
> > +	nr_to_scan >>= sc->priority;
> > +	return max(nr_to_scan, SWAP_CLUSTER_MAX);
> >  }
>
> Does this regression break proportional reclaim and memory protection
> semantics for small memcgs by forcing a minimum scan size?

Actually this change was inspired by sashiko's review for V1:

https://sashiko.dev/#/patchset/20260318-mglru-reclaim-v1-0-2c46f9eb0508%40tencent.com?patch=2909

Without this, for cgroups smaller than 16M, a def priority scan will
just do nothing. That's still OK, but to make it more efficient I added
a minimal batch. Thinking about it again, though, it had better be:

if (!nr_to_scan)
	nr_to_scan = min(lruvec_evictable_size, SWAP_CLUSTER_MAX);

Using max() here could get very small cgroups over-reclaimed.
I did test V2 using test_memcg_min suggested by af827e090489:

Before:
Proportional reclaim results:
  c[0]     actual=  29069312 (27M)  ideal=  30408704 (29M)  err=4.4%
  c[1]     actual=  23257088 (22M)  ideal=  22020096 (21M)  err=5.6%
  c[2]     actual=   1552384 (1M)  (expected ~0)
  c[3]     actual=         0 (0M)  (expected =0)

After:
Proportional reclaim results:
  c[0]     actual=  31391744 (29M)  ideal=  30408704 (29M)  err=3.2%
  c[1]     actual=  21028864 (20M)  ideal=  22020096 (21M)  err=4.5%
  c[2]     actual=   1515520 (1M)  (expected ~0)
  c[3]     actual=         0 (0M)  (expected =0)
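
The err column is consistent with |actual - ideal| / ideal. That
formula is inferred from the numbers above, not taken from the test
source, so the helper below (a hypothetical name) is only a way to
re-check the figures. It returns the error in tenths of a percent,
rounded to nearest, to stay in integer arithmetic.

```c
/* Relative error in tenths of a percent, rounded to nearest. */
static unsigned long long err_tenths_pct(unsigned long long actual,
					 unsigned long long ideal)
{
	unsigned long long diff = actual > ideal ? actual - ideal
						 : ideal - actual;

	/* Adding ideal / 2 before dividing rounds to the nearest tenth. */
	return (diff * 1000 + ideal / 2) / ideal;
}
```

For c[0] before the series, err_tenths_pct(29069312, 30408704) gives
44, matching the reported 4.4%.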

In both cases the result is not very stable. I ran the test
7 times and used the median result. After this series the result
sometimes seems even better, but that is likely just noise. I
didn't see a regression.

The 32-folio minimal batch already seems small enough for
typical usage, but min(evictable_size, SWAP_CLUSTER_MAX) is definitely
better. Will send a V3 to update this.
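
A small user-space sketch of the difference between the two variants.
The function names are hypothetical, apply_proportional_protection()
is ignored, and the constants are assumed to match the kernel values
at the time of writing. With 4K pages (also an assumption), 16M is
4096 pages, and anything below that shifts down to 0 at DEF_PRIORITY.

```c
/* Assumed to match the kernel definitions at the time of writing. */
#define DEF_PRIORITY		12
#define SWAP_CLUSTER_MAX	32UL

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* V2 as posted: always scan at least SWAP_CLUSTER_MAX. */
static unsigned long scan_target_v2(unsigned long evictable, int priority)
{
	return max_ul(evictable >> priority, SWAP_CLUSTER_MAX);
}

/* V3 idea: only fall back when the shift yields 0, capped by the size. */
static unsigned long scan_target_v3(unsigned long evictable, int priority)
{
	unsigned long nr_to_scan = evictable >> priority;

	if (!nr_to_scan)
		nr_to_scan = min_ul(evictable, SWAP_CLUSTER_MAX);

	return nr_to_scan;
}
```

For a cgroup with only 10 evictable folios at DEF_PRIORITY, the V2
variant targets 32 folios while the V3 variant targets 10.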

I think none of the benchmarks or tests would be affected by this.
