* [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
@ 2026-04-23 17:43 Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
` (16 more replies)
0 siblings, 17 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, we see an up to ~30% throughput
increase in some workloads like MongoDB with YCSB, and a huge decrease
in file refaults, with no swap involved. Other common benchmarks show no
regression, LOC is reduced, and unexpected OOMs are less frequent, too.
Some of the problems were found in our production environment, and
others were mostly exposed while stress testing during the development
of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
the code base and fixes several performance issues, preparing for
further work.
MGLRU's reclaim loop is a bit complex, and hence these problems are
somewhat related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic is
quite different, making the reclaim loop hard to follow and the dirty
flush ineffective.
This series cleans up and improves these areas by introducing a scan
budget: the number of folios to scan is calculated once at the beginning
of the loop, and aging is decoupled from the reclaim calculation
helpers. The dirty flush logic is then moved inside the reclaim loop so
it can kick in more effectively.
Test results: All tests were done on a 48c96t, 2-node NUMA machine with
128G of memory, using NVMe as storage.
MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000, operationcount:6000000,
threads:32), which does 95% reads and 5% updates to generate mixed read
and dirty writeback load. MongoDB is set up in a 10G cgroup using Docker,
with the WiredTiger cache size set to 4.5G, using NVMe as storage.
No swap is used.
Before:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
pgpgout 5413332
workingset_refault_anon 0
workingset_refault_file 34522071
After:
Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
pgpgin 111093923 (-30.3%, lower is better)
pgpgout 5437456
workingset_refault_anon 0
workingset_refault_file 19566366 (-43.3%, lower is better)
We can see a significant performance improvement after this series.
The test was done on NVMe, and the performance gap would be even larger
for slower devices, such as HDD or network storage. We observed over
100% gain for some workloads with slow IO.
Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with
2 nodes and 128G of memory, using 256G of ZRAM as swap, and spawning
32 memcgs / 64 workers:
Before:
Total requests: 79915
Per-worker 95% CI (mean): [1233.9, 1263.5]
Per-worker stdev: 59.2
Jain's fairness: 0.997795 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 26859 33.61% 33.61%
[1,2)s 7818 9.78% 43.39%
[2,4)s 5532 6.92% 50.31%
[4,8)s 39706 49.69% 100.00%
After:
Total requests: 81382
Per-worker 95% CI (mean): [1241.9, 1301.3]
Per-worker stdev: 118.8
Jain's fairness: 0.991480 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 26696 32.80% 32.80%
[1,2)s 8745 10.75% 43.55%
[2,4)s 6865 8.44% 51.98%
[4,8)s 39076 48.02% 100.00%
Reclaim is still fair and effective, and the total request count is
slightly better.
OOM issue with aging and throttling
===================================
The throttling OOM issue can be easily reproduced using dd and a
cgroup limit, as demonstrated in patch 14, and is fixed by this series.
The aging OOM is a bit trickier; a specific reproducer can be used to
simulate what we encountered in a production environment [4]:
It spawns multiple workers that keep reading a given file using mmap,
pausing for 120ms after each file read batch. It also spawns another
set of workers that keep allocating and freeing a given amount of
anonymous memory. The total memory size exceeds the memory limit
(e.g. 14G anon + 8G file, i.e. 22G vs a 16G memcg limit).
- MGLRU disabled:
Finished 128 iterations.
- MGLRU enabled:
OOM with the following info after ~10-20 iterations:
[ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
[ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 62.640823] Memory cgroup stats for /demo:
[ 62.641017] anon 10604879872
[ 62.641941] file 6574858240
OOM occurs despite there still being evictable file folios.
- MGLRU enabled after this series:
Finished 128 iterations.
Worth noting, there is another OOM-related issue reported in v1 of
this series, which has been tested and looks OK now [5].
MySQL:
======
Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap, with the test command:
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
--tables=48 --table-size=2000000 --threads=48 --time=600 run
Before: 17260.781429 tps
After this series: 17266.842857 tps
MySQL is anon folio heavy and involves both writeback and file access,
and results still look good. Only noise-level changes, no regression.
FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
6 test runs each:
fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
--name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
--rw=randread --norandommap --time_based \
--ramp_time=1m --runtime=5m --group_reporting
Before: 9196.481429 MB/s
After this series: 9256.105000 MB/s
Also only noise-level changes; no regression, or slightly better.
Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, 12 test runs each.
Before: 2589.63s
After this series: 2543.58s
Also only noise-level changes; no regression, or very slightly better.
Android:
========
Xinyu reported a performance gain on Android with this series, too. The
test consisted of cold-starting multiple applications sequentially under
moderate system load. [6]
Before:
Launch Time Summary (all apps, all runs)
Mean 868.0ms
P50 888.0ms
P90 1274.2ms
P95 1399.0ms
After:
Launch Time Summary (all apps, all runs)
Mean 850.5ms (-2.07%)
P50 861.5ms (-3.04%)
P90 1179.0ms (-8.05%)
P95 1228.0ms (-12.2%)
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v6:
- Avoid potential over-rotation of tiny cgroups (<16M):
https://lore.kernel.org/linux-mm/CAMgjq7ArnmmoHOGRt6Wc8hu7tjx_t583-UVzJK+HOHgjjetQ9g@mail.gmail.com/
- Avoid a potentially skewed stat counter:
https://lore.kernel.org/linux-mm/CAMgjq7DCn8p_yMMhiejFjX6sdybZKYOw8qJbq=+OCsZ=AfJnFA@mail.gmail.com/
- Update a few comments and variable names as suggested by Barry Song.
- Tested over several days, also tested on my Android phone; everything
still matches the cover letter description. Added test results from Xinyu.
- Link to v5: https://patch.msgid.link/20260413-mglru-reclaim-v5-0-8eaeacbddc44@tencent.com
Changes in v5:
- Add back a more moderate minimal batch limit for each reclaim loop:
https://lore.kernel.org/linux-mm/adYP81AhpNf0znp3@KASONG-MC4/
- Collect Reviewed-by tags.
- Link to v4: https://patch.msgid.link/20260407-mglru-reclaim-v4-0-98cf3dc69519@tencent.com
Changes in v4:
- Remove the minimal scan batch limit, and add rotation for
unevictable memcgs as reported by sashiko:
https://lore.kernel.org/linux-mm/ac8xVN82LBLDZpIO@KASONG-MC4/
- Slightly improve a few commit messages.
- Reran the tests; results seem identical to before, so the data is unchanged.
- Collect Reviewed-by tags.
- Link to v3: https://patch.msgid.link/20260403-mglru-reclaim-v3-0-a285efd6ff91@tencent.com
Changes in v3:
- Don't force scanning at least SWAP_CLUSTER_MAX pages on each reclaim
loop. If the LRU is too small, adjust it accordingly. Now the
multi-cgroup scan balance looks even better for tiny cgroups:
https://lore.kernel.org/linux-mm/aciejkdIHyXPNS9Y@KASONG-MC4/
- Add one patch to remove the swap-constrained check in isolate_folio.
In theory it's fine, and neither stress testing nor performance testing
showed any issue:
https://lore.kernel.org/linux-mm/CAMgjq7C8TCsK99p85i3QzGCwgkXscTfFB6XCUTWQOcuqwHQa2Q@mail.gmail.com/
- I reran most tests; all seem identical, so most data is kept.
Intermediate test results are dropped. I ran tests on most patches
individually without problems, but the series is getting too long,
and posting all of them would make it harder to read, unnecessarily.
- Split the previous patch 8 into two patches as suggested [ Shakeel Butt ];
some test results were also collected to support the design:
https://lore.kernel.org/linux-mm/ac44BVOvOm8lhVvj@KASONG-MC4/#t
I kept Axel's Reviewed-by since the code is identical.
- Call try_to_inc_min_seq twice to avoid a stale empty gen, and drop
its return argument [ Baolin Wang ].
- Move a few lines of code between patches to where they fit better;
the final result is identical [ Baolin Wang ].
- Collect Tested-by and update the test setup [ Leno Hou ].
- Collect Reviewed-by tags.
- Update a few commit messages [ Shakeel Butt ].
- Link to v2: https://patch.msgid.link/20260329-mglru-reclaim-v2-0-b53a3678513c@tencent.com
Changes in v2:
- Rebase on top of mm-new, which includes the cgroup v1 fix from
[ Baolin Wang ].
- Add the dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
review suggested that we shouldn't leave the counter and reclaim
feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number limit of SWAP_CLUSTER_MAX in patch
"restructure the reclaim loop"; the change is trivial but might
help avoid livelock for tiny cgroups.
- Redo the tests; most are basically identical to before, but reran
them just in case, since the patch now also solves the throttling
issue, and discussed the reports from CachyOS.
- Add a separate patch for variable renaming as suggested by [ Barry
Song ]. No functional change.
- Improve several comments and code issues [ Axel Rasmussen ].
- Remove a no-longer-needed variable [ Axel Rasmussen ].
- Collect Reviewed-by tags.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com
---
Kairui Song (14):
mm/mglru: consolidate common code for retrieving evictable size
mm/mglru: rename variables related to aging and rotation
mm/mglru: relocate the LRU scan batch limit to callers
mm/mglru: restructure the reclaim loop
mm/mglru: scan and count the exact number of folios
mm/mglru: use a smaller batch for reclaim
mm/mglru: don't abort scan immediately right after aging
mm/mglru: remove redundant swap constrained check upon isolation
mm/mglru: use the common routine for dirty/writeback reactivation
mm/mglru: simplify and improve dirty writeback handling
mm/mglru: remove no longer used reclaim argument for folio protection
mm/vmscan: remove sc->file_taken
mm/vmscan: remove sc->unqueued_dirty
mm/vmscan: unify writeback reclaim statistic and throttling
mm/vmscan.c | 330 ++++++++++++++++++++++++++----------------------------------
1 file changed, 143 insertions(+), 187 deletions(-)
---
base-commit: 2bcc13c29c711381d815c1ba5d5b25737400c71a
change-id: 20260314-mglru-reclaim-1c9d45ac57f6
Best regards,
--
Kairui Song <kasong@tencent.com>
* [PATCH v6 01/14] mm/mglru: consolidate common code for retrieving evictable size
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 02/14] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
` (15 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Merge commonly used code for counting evictable folios in a lruvec.
No behavior change.
Acked-by: Yuanchu Xie <yuanchu@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 36 ++++++++++++++----------------------
1 file changed, 14 insertions(+), 22 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd1b1aa12581..6fa828c7c19d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4084,27 +4084,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
}
-static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+static unsigned long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
{
int gen, type, zone;
- unsigned long total = 0;
- int swappiness = get_swappiness(lruvec, sc);
+ unsigned long seq, total = 0;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MAX_SEQ(lruvec);
DEFINE_MIN_SEQ(lruvec);
for_each_evictable_type(type, swappiness) {
- unsigned long seq;
-
for (seq = min_seq[type]; seq <= max_seq; seq++) {
gen = lru_gen_from_seq(seq);
-
for (zone = 0; zone < MAX_NR_ZONES; zone++)
total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
}
}
+ return total;
+}
+
+static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+{
+ unsigned long total;
+ int swappiness = get_swappiness(lruvec, sc);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+ total = lruvec_evictable_size(lruvec, swappiness);
+
/* whether the size is big enough to be helpful */
return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
}
@@ -4909,9 +4915,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
int swappiness, unsigned long *nr_to_scan)
{
- int gen, type, zone;
- unsigned long size = 0;
- struct lru_gen_folio *lrugen = &lruvec->lrugen;
DEFINE_MIN_SEQ(lruvec);
*nr_to_scan = 0;
@@ -4919,18 +4922,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
return true;
- for_each_evictable_type(type, swappiness) {
- unsigned long seq;
-
- for (seq = min_seq[type]; seq <= max_seq; seq++) {
- gen = lru_gen_from_seq(seq);
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++)
- size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
- }
- }
-
- *nr_to_scan = size;
+ *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
/* better to run aging even though eviction is still possible */
return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
}
--
2.54.0
* [PATCH v6 02/14] mm/mglru: rename variables related to aging and rotation
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 03/14] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
` (14 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
The current variable names aren't helpful. Make them more meaningful.
Only naming change, no behavior change.
Suggested-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6fa828c7c19d..4623a5ac6bc7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4934,7 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
*/
static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
{
- bool success;
+ bool need_aging;
unsigned long nr_to_scan;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MAX_SEQ(lruvec);
@@ -4942,7 +4942,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
return -1;
- success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+ need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
/* try to scrape all its memory if this memcg was deleted */
if (nr_to_scan && !mem_cgroup_online(memcg))
@@ -4951,7 +4951,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
/* try to get away with not aging at the default priority */
- if (!success || sc->priority == DEF_PRIORITY)
+ if (!need_aging || sc->priority == DEF_PRIORITY)
return nr_to_scan >> sc->priority;
/* stop scanning this lruvec as it's low on cold folios */
@@ -5040,7 +5040,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
- bool success;
+ bool need_rotate;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5058,7 +5058,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
memcg_memory_event(memcg, MEMCG_LOW);
}
- success = try_to_shrink_lruvec(lruvec, sc);
+ need_rotate = try_to_shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
@@ -5068,10 +5068,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
flush_reclaim_state(sc);
- if (success && mem_cgroup_online(memcg))
+ if (need_rotate && mem_cgroup_online(memcg))
return MEMCG_LRU_YOUNG;
- if (!success && lruvec_is_sizable(lruvec, sc))
+ if (!need_rotate && lruvec_is_sizable(lruvec, sc))
return 0;
/* one retry if offlined or too small */
--
2.54.0
* [PATCH v6 03/14] mm/mglru: relocate the LRU scan batch limit to callers
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 02/14] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 04/14] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
` (13 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Same as the active / inactive LRU, MGLRU isolates and scans folios in
batches. The batch splitting is hidden deep in the helper, which
makes the code harder to follow. The helper's arguments are also
confusing, since callers usually request more folios than the batch
size, so the helper almost never processes the full requested amount.
Move the batch splitting into the top loop to make it cleaner; there
should be no behavior change.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4623a5ac6bc7..3c5a6ae92440 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4695,10 +4695,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
int scanned = 0;
int isolated = 0;
int skipped = 0;
- int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
- int remaining = scan_batch;
+ unsigned long remaining = nr_to_scan;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
VM_WARN_ON_ONCE(!list_empty(list));
if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
@@ -4751,7 +4751,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
mod_lruvec_state(lruvec, item, isolated);
mod_lruvec_state(lruvec, PGREFILL, sorted);
mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
- trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
+ trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
if (type == LRU_GEN_FILE)
@@ -4987,7 +4987,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
- long nr_to_scan;
+ long nr_batch, nr_to_scan;
unsigned long scanned = 0;
int swappiness = get_swappiness(lruvec, sc);
@@ -4998,7 +4998,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (nr_to_scan <= 0)
break;
- delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
+ nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+ delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
@@ -5623,6 +5624,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
int swappiness, unsigned long nr_to_reclaim)
{
+ int nr_batch;
DEFINE_MAX_SEQ(lruvec);
if (seq + MIN_NR_GENS > max_seq)
@@ -5639,8 +5641,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
if (sc->nr_reclaimed >= nr_to_reclaim)
return 0;
- if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
- swappiness))
+ nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
+ if (!evict_folios(nr_batch, lruvec, sc, swappiness))
return 0;
cond_resched();
--
2.54.0
* [PATCH v6 04/14] mm/mglru: restructure the reclaim loop
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (2 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 03/14] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-24 17:04 ` Kairui Song
2026-04-23 17:43 ` [PATCH v6 05/14] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
` (12 subsequent siblings)
16 siblings, 1 reply; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
The current loop calculates the scan number on each iteration. The
number of folios to scan is based on the LRU length, with some unclear
behaviors, e.g., the scan number is only shifted by the reclaim priority
when aging is not needed or when at the default priority, and it couples
the number calculation with aging and rotation.
Adjust and simplify it, and decouple aging and rotation. Calculate the
scan number only once at the beginning of the reclaim, always respect
the reclaim priority, and make the aging and rotation more explicit.
This slightly changes how aging and offline memcg reclaim works:
Previously, aging was skipped at DEF_PRIORITY even when eviction was
no longer possible, so the reclaimer wasted an iteration until the
priority escalated. Now aging runs immediately whenever it is needed
to make progress; the DEF_PRIORITY skip only applies when eviction is
still viable. This may avoid wasted iterations that over-reclaim slab
and break reclaim balance in multi-cgroup setups.
Similarly for offline memcgs. Previously, an offline memcg wouldn't be
aged unless it didn't have any evictable folios. Now, we might age
it if it has only 3 generations and the reclaim priority is below
DEF_PRIORITY, which should be fine. On the one hand, an offline memcg
might still hold long-term folios; in fact, a long-existing offline
memcg must be pinned by some long-term folios like shmem. These folios
might be used by other memcgs, so aging them like an ordinary memcg
seems correct. Besides, aging enables further reclaim of an offlined
memcg, which will certainly happen if we keep shrinking it. And offline
memcgs might soon no longer be an issue with reparenting.
Overall, the memcg LRU rotation, as described in mmzone.h,
remains the same.
Also apply a minimal batch factor when reclaim is running at a higher
priority, so small memcgs won't be over-protected.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 70 ++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 37 insertions(+), 33 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3c5a6ae92440..757beb605980 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4913,49 +4913,41 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
}
static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
- int swappiness, unsigned long *nr_to_scan)
+ struct scan_control *sc, int swappiness)
{
DEFINE_MIN_SEQ(lruvec);
- *nr_to_scan = 0;
/* have to run aging, since eviction is not possible anymore */
if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
return true;
- *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
+ /* try to avoid aging, do gentle reclaim at the default priority */
+ if (sc->priority == DEF_PRIORITY)
+ return false;
+
/* better to run aging even though eviction is still possible */
return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
}
-/*
- * For future optimizations:
- * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
- * reclaim.
- */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+ struct mem_cgroup *memcg, int swappiness)
{
- bool need_aging;
- unsigned long nr_to_scan;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- DEFINE_MAX_SEQ(lruvec);
+ unsigned long nr_to_scan, evictable;
- if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return -1;
-
- need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+ evictable = lruvec_evictable_size(lruvec, swappiness);
+ nr_to_scan = evictable;
/* try to scrape all its memory if this memcg was deleted */
- if (nr_to_scan && !mem_cgroup_online(memcg))
+ if (!mem_cgroup_online(memcg))
return nr_to_scan;
nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
+ nr_to_scan >>= sc->priority;
- /* try to get away with not aging at the default priority */
- if (!need_aging || sc->priority == DEF_PRIORITY)
- return nr_to_scan >> sc->priority;
+ if (!nr_to_scan && sc->priority < DEF_PRIORITY)
+ nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
- /* stop scanning this lruvec as it's low on cold folios */
- return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
+ return nr_to_scan;
}
static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -4985,31 +4977,44 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
return true;
}
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ * reclaim.
+ */
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
+ bool need_rotate = false;
long nr_batch, nr_to_scan;
- unsigned long scanned = 0;
int swappiness = get_swappiness(lruvec, sc);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- while (true) {
+ nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
+ while (nr_to_scan > 0) {
int delta;
+ DEFINE_MAX_SEQ(lruvec);
- nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
- if (nr_to_scan <= 0)
+ if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
+ need_rotate = true;
break;
+ }
+
+ if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
+ if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+ need_rotate = true;
+ /* stop scanning as it's low on cold folios */
+ break;
+ }
nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
- scanned += delta;
- if (scanned >= nr_to_scan)
- break;
-
if (should_abort_scan(lruvec, sc))
break;
+ nr_to_scan -= delta;
cond_resched();
}
@@ -5035,8 +5040,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
}
- /* whether this lruvec should be rotated */
- return nr_to_scan < 0;
+ return need_rotate;
}
static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
--
2.54.0
* [PATCH v6 05/14] mm/mglru: scan and count the exact number of folios
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (3 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 04/14] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 06/14] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
` (11 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Make the scan helpers return the exact number of folios being scanned
or isolated. Since the reclaim loop now has a natural scan budget that
controls the scan progress, returning the scan number and consuming
the budget makes the scan more accurate and easier to follow.
The number of folios scanned in each iteration is always greater than
0, unless the reclaim must stop for a forced aging, so there is no
longer any need for special handling when no progress is made:
- `return isolated || !remaining ? scanned : 0` in scan_folios: both
the function and its caller now just use the exact scan count,
combined with the scan budget introduced in the previous commit to
avoid livelock or under-scanning.
- `scanned += try_to_inc_min_seq` in evict_folios: adding a bool to the
scan count was confusing and is no longer needed, as the scan count
should never be zero as long as there are still evictable gens. We may
encounter an empty old gen that returns a scan count of 0; to avoid
that, do a try_to_inc_min_seq before isolation, which has little to no
overhead in most cases.
- `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios:
the per-type get_nr_gens == MIN_NR_GENS check in scan_folios
naturally returns 0 when only two gens remain and breaks the loop.
Also change try_to_inc_min_seq to return void, as its return value is
no longer used by any caller. Call it before isolate_folios to flush
any empty gens left by external folio freeing, and again after
isolate_folios, since scanning may have emptied the oldest gen by
moving or protecting folios.
The scan still stops if only two gens are left, as the scan number
will be zero. This matches the previous behavior. This forced gen
protection may be removed or softened later to improve reclaim
further.
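The exact-count contract described above can be modeled with a small
userspace C sketch (purely illustrative, not kernel code; scan_batch()
and drain() are hypothetical stand-ins for scan_folios() and the
reclaim loop): every call returns the exact number scanned, the caller
consumes the budget by that amount, and a return of 0 unambiguously
means no progress.

```c
/* Hypothetical stand-in for scan_folios(): returns the exact number of
 * entries scanned, never a synthetic 0 when nothing was isolated. */
static int scan_batch(int *pool, int budget)
{
	int scanned = *pool < budget ? *pool : budget;

	*pool -= scanned;
	return scanned;	/* 0 only when the pool is truly empty */
}

/* Hypothetical stand-in for the reclaim loop: the budget is consumed
 * by the exact scan count, so no special no-progress bookkeeping. */
static int drain(int pool, int budget)
{
	int total = 0;

	while (budget > 0) {
		int scanned = scan_batch(&pool, budget);

		if (!scanned)	/* exact count of 0: nothing left, stop */
			break;
		budget -= scanned;
		total += scanned;
	}
	return total;
}
```

The budget caps the scan when the pool is large, and pool exhaustion
stops the loop cleanly when it is small.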
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 58 +++++++++++++++++++++++++++++-----------------------------
1 file changed, 29 insertions(+), 29 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 757beb605980..f021dd1b84f8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3878,10 +3878,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
return true;
}
-static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
+static void try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
{
int gen, type, zone;
- bool success = false;
bool seq_inc_flag = false;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
DEFINE_MIN_SEQ(lruvec);
@@ -3907,11 +3906,10 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
/*
* If min_seq[type] of both anonymous and file is not increased,
- * we can directly return false to avoid unnecessary checking
- * overhead later.
+ * return here to avoid unnecessary checking overhead later.
*/
if (!seq_inc_flag)
- return success;
+ return;
/* see the comment on lru_gen_folio */
if (swappiness && swappiness <= MAX_SWAPPINESS) {
@@ -3929,10 +3927,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
reset_ctrl_pos(lruvec, type, true);
WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
- success = true;
}
-
- return success;
}
static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
@@ -4686,7 +4681,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int type, int tier,
- struct list_head *list)
+ struct list_head *list, int *isolatedp)
{
int i;
int gen;
@@ -4756,11 +4751,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
if (type == LRU_GEN_FILE)
sc->nr.file_taken += isolated;
- /*
- * There might not be eligible folios due to reclaim_idx. Check the
- * remaining to prevent livelock if it's not making progress.
- */
- return isolated || !remaining ? scanned : 0;
+
+ *isolatedp = isolated;
+ return scanned;
}
static int get_tier_idx(struct lruvec *lruvec, int type)
@@ -4804,33 +4797,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int swappiness,
- int *type_scanned, struct list_head *list)
+ struct list_head *list, int *isolated,
+ int *isolate_type, int *isolate_scanned)
{
int i;
+ int total_scanned = 0;
int type = get_type_to_scan(lruvec, swappiness);
for_each_evictable_type(i, swappiness) {
int scanned;
int tier = get_tier_idx(lruvec, type);
- *type_scanned = type;
+ scanned = scan_folios(nr_to_scan, lruvec, sc,
+ type, tier, list, isolated);
- scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
- if (scanned)
- return scanned;
+ total_scanned += scanned;
+ if (*isolated) {
+ *isolate_type = type;
+ *isolate_scanned = scanned;
+ break;
+ }
type = !type;
}
- return 0;
+ return total_scanned;
}
static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int swappiness)
{
- int type;
- int scanned;
- int reclaimed;
LIST_HEAD(list);
LIST_HEAD(clean);
struct folio *folio;
@@ -4838,19 +4834,23 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
enum node_stat_item item;
struct reclaim_stat stat;
struct lru_gen_mm_walk *walk;
+ int scanned, reclaimed;
+ int isolated = 0, type, type_scanned;
bool skip_retry = false;
- struct lru_gen_folio *lrugen = &lruvec->lrugen;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
lruvec_lock_irq(lruvec);
- scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
+ /* In case folio deletion left empty old gens, flush them */
+ try_to_inc_min_seq(lruvec, swappiness);
- scanned += try_to_inc_min_seq(lruvec, swappiness);
+ scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
+ &list, &isolated, &type, &type_scanned);
- if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
- scanned = 0;
+ /* Scanning may have emptied the oldest gen, flush it */
+ if (scanned)
+ try_to_inc_min_seq(lruvec, swappiness);
lruvec_unlock_irq(lruvec);
@@ -4861,7 +4861,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
- scanned, reclaimed, &stat, sc->priority,
+ type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
--
2.54.0
* [PATCH v6 06/14] mm/mglru: use a smaller batch for reclaim
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (4 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 05/14] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 07/14] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
` (10 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that a fixed number of folios to reclaim is calculated at the
beginning, making each subsequent batch smaller should reduce lock
contention and avoid over-aggressive reclaim, as the loop aborts
sooner once the target number of folios has been reclaimed.
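A toy model (not kernel code; overshoot() is a hypothetical stand-in
for the batched eviction loop) shows why a smaller batch bounds how far
reclaim can run past its target:

```c
/* Each iteration models one evict_folios() call reclaiming a full
 * batch; the loop aborts once the target is met, so any overshoot is
 * bounded by the batch size. */
static int overshoot(int target, int batch)
{
	int reclaimed = 0;

	while (reclaimed < target)
		reclaimed += batch;
	return reclaimed - target;	/* folios reclaimed past the target */
}
```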
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f021dd1b84f8..f6ee7ccf4e81 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5006,7 +5006,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
break;
}
- nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+ nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
--
2.54.0
* [PATCH v6 07/14] mm/mglru: don't abort scan immediately right after aging
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (5 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 06/14] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 08/14] mm/mglru: remove redundant swap constrained check upon isolation Kairui Song via B4 Relay
` (9 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Right now, if eviction triggers aging, the reclaimer will abort. This is
not the optimal strategy, for several reasons.
Aborting the reclaim early wastes a reclaim cycle when under pressure,
and for concurrent reclaim, if the LRU is under aging, all concurrent
reclaimers might fail. And if the aging has just finished, the new cold
folios it exposed are not reclaimed until the next reclaim iteration.
What's more, the current aging trigger is quite lenient: having 3 gens
with a reclaim priority lower than the default will trigger aging and
block reclaiming from one memcg. This easily wastes reclaim retry
cycles. In the worst case, if reclaim is making slow progress and all
following attempts fail due to being blocked by aging, it triggers an
unexpected early OOM.
Besides, a lruvec requiring aging doesn't mean it's hot. The lruvec
could instead have been idle for quite a while, and hence might contain
lots of cold folios to be reclaimed.
While it's helpful to rotate the memcg LRU after aging for global
reclaim, since global reclaim fairness is coupled with the rotation in
shrink_many, memcg fairness is instead handled by the cgroup iteration
in shrink_node_memcgs. So, for memcg-level pressure, this abort is not
the key part of keeping fairness. In most cases, there is no need to
age, and fairness must be achieved by upper-level reclaim control.
So instead, just keep the scan going unless one whole batch of folios
failed to be isolated or enough folios have been scanned, which is
signaled by evict_folios returning 0. Only abort global reclaim after
one more batch, so when there are fewer memcgs, progress is still
made, and the fairness mechanism described above still works fine.
In most cases, that one extra batch attempt for global reclaim might
be just enough to satisfy what the reclaimer needs, improving global
reclaim performance by reducing reclaim retry cycles.
Rotation still happens after the reclaim is done, which follows the
comment in mmzone.h, and fairness still looks good.
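The changed control flow can be sketched as a hypothetical userspace
model (batches_run() is illustrative only; needs_aging stands in for
should_run_aging()): aging no longer aborts the loop immediately,
cgroup reclaim keeps scanning, and global reclaim breaks only after
completing the current batch:

```c
#include <stdbool.h>

/* Illustrative model: each iteration is one evict_folios() batch. */
static int batches_run(bool is_root_reclaim, bool needs_aging, int budget)
{
	int batches = 0;

	while (budget-- > 0) {
		batches++;	/* evict one batch regardless of aging */
		if (is_root_reclaim && needs_aging)
			break;	/* rotate memcg LRU for global fairness */
	}
	return batches;
}
```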
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6ee7ccf4e81..084c6ea8910c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4984,7 +4984,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
*/
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
- bool need_rotate = false;
+ bool need_rotate = false, should_age = false;
long nr_batch, nr_to_scan;
int swappiness = get_swappiness(lruvec, sc);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5002,8 +5002,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
need_rotate = true;
- /* stop scanning as it's low on cold folios */
- break;
+ should_age = true;
}
nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
@@ -5014,6 +5013,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (should_abort_scan(lruvec, sc))
break;
+ /*
+ * Root reclaim needs rotation when low on cold folio for better
+ * fairness. Cgroup reclaim gets fairness from the iterator.
+ */
+ if (root_reclaim(sc) && should_age)
+ break;
+
nr_to_scan -= delta;
cond_resched();
}
--
2.54.0
* [PATCH v6 08/14] mm/mglru: remove redundant swap constrained check upon isolation
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (6 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 07/14] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 09/14] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
` (8 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Remove the swap-constrained early reject check upon isolation. This
check is a micro optimization when swap IO is not allowed, so folios are
rejected early. But it is redundant and overly broad since
shrink_folio_list() already handles all these cases with proper
granularity.
Notably, this check wrongly rejected lazyfree folios, and it doesn't
cover all rejection cases anyway. shrink_folio_list() uses
may_enter_fs(), which distinguishes non-SWP_FS_OPS devices from
filesystem-backed swap, and does all the checks after the folio is
locked, so flags like the swap cache flag are stable.
This check also covers dirty file folios, which are not a problem now
since sort_folio() already bumps dirty file folios to the next
generation, but causes trouble for unifying dirty folio writeback
handling.
There should be no performance impact from removing it: we may have
lost a micro-optimization, but we unblocked lazyfree reclaim for NOIO
contexts, which is not a common case in the first place.
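To see why the removed predicate was overly broad, here is a
hypothetical boolean model of it (a sketch, not the kernel function
itself): a clean lazyfree folio is anon and not in the swap cache, so
it was rejected even though discarding it needs no IO at all.

```c
#include <stdbool.h>

/* Model of the removed early-reject check: under !__GFP_IO it rejected
 * any dirty folio, and any anon folio not yet in the swap cache. */
static bool old_reject(bool may_io, bool dirty, bool anon, bool swapcache)
{
	return !may_io && (dirty || (anon && !swapcache));
}
```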
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 084c6ea8910c..35e3352a53e3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4650,12 +4650,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
{
bool success;
- /* swap constrained */
- if (!(sc->gfp_mask & __GFP_IO) &&
- (folio_test_dirty(folio) ||
- (folio_test_anon(folio) && !folio_test_swapcache(folio))))
- return false;
-
/* raced with release_pages() */
if (!folio_try_get(folio))
return false;
--
2.54.0
* [PATCH v6 09/14] mm/mglru: use the common routine for dirty/writeback reactivation
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (7 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 08/14] mm/mglru: remove redundant swap constrained check upon isolation Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-24 19:05 ` Kairui Song
2026-04-23 17:43 ` [PATCH v6 10/14] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
` (7 subsequent siblings)
16 siblings, 1 reply; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, MGLRU moves dirty and writeback folios to the second
oldest gen instead of reactivating them like the classical LRU. This
might help reduce LRU contention since it skips the isolation, but as
a result these folios show up at the LRU tail more frequently, leading
to inefficient reclaim.
Besides, the dirty / writeback check after isolation in
shrink_folio_list is more accurate and covers more cases. So instead,
just drop the special handling for dirty and writeback folios, and use
the common routine to reactivate them like the classical LRU.
This should in theory improve scan efficiency. These folios will be
rotated back to the LRU tail once writeback is done, so there is no
risk of hotness inversion, and each reclaim loop will now have a
higher success rate. This also prepares for unifying the writeback and
throttling mechanism with the classical LRU: by keeping these folios
far from the tail, detecting the tail batch will follow a pattern
similar to the classical LRU.
The micro optimization that avoids LRU contention by skipping the
isolation is gone, which should be fine. Compared to IO and writeback
cost, the isolation overhead is trivial.
Using the common routine also keeps the folio's referenced bits
(tier bits), which could improve metrics in the long term. There is
also no longer any need to clear the reclaim bit, as the common
routine will make use of it.
Note the common routine updates a few throttling and writeback counters,
which are not used, and never have been for the MGLRU case. We will
start making use of these in later commits.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 19 -------------------
1 file changed, 19 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 35e3352a53e3..74255efc4ad9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4578,7 +4578,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
- bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4628,21 +4627,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
return true;
}
- dirty = folio_test_dirty(folio);
- writeback = folio_test_writeback(folio);
- if (type == LRU_GEN_FILE && dirty) {
- sc->nr.file_taken += delta;
- if (!writeback)
- sc->nr.unqueued_dirty += delta;
- }
-
- /* waiting for writeback */
- if (writeback || (type == LRU_GEN_FILE && dirty)) {
- gen = folio_inc_gen(lruvec, folio, true);
- list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
- return true;
- }
-
return false;
}
@@ -4664,9 +4648,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
if (!folio_test_referenced(folio))
set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
- /* for shrink_folio_list() */
- folio_clear_reclaim(folio);
-
success = lru_gen_del_folio(lruvec, folio, true);
VM_WARN_ON_ONCE_FOLIO(!success, folio);
--
2.54.0
* [PATCH v6 10/14] mm/mglru: simplify and improve dirty writeback handling
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (8 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 09/14] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 11/14] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
` (6 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Right now the flusher wakeup mechanism for MGLRU is less responsive
and less likely to trigger compared to the classical LRU. The classical
LRU wakes the flusher if one batch of folios passed to shrink_folio_list
is unevictable due to being under writeback; MGLRU instead checks and
handles this only after the whole reclaim loop is done.
We previously even saw OOM problems due to the passive flusher; they
were fixed, but the fix is still not perfect [1].
Having just unified the dirty folio counting and activation routine,
now move the dirty flush into the loop, right after shrink_folio_list.
This improves performance a lot for workloads involving heavy
writeback, and prepares for throttling too.
Test with YCSB workloadb showed a major performance improvement:
Before this series:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
workingset_refault_file 34522071
After this commit:
Throughput(ops/sec): 80857.08510208207
AverageLatency(us): 386.653262968934
pgpgin 112233121
workingset_refault_file 19516246
Performance is much better, with significantly lower refault. We also
observed similar or larger gains for other real-world workloads.
We were concerned that the dirty flush could cause more SSD wear. That
should not be a problem here, since the wakeup condition is that dirty
folios have been pushed to the tail of the LRU, which indicates that
memory pressure is already so high that writeback is blocking the
workload.
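The wakeup condition moved inside the loop can be expressed as a small
predicate (a hypothetical sketch mirroring the
`stat.nr_unqueued_dirty == isolated` check; the isolated > 0 guard is
added here only for the toy model): the flusher is woken when every
folio isolated in the batch was dirty and not yet queued for writeback.

```c
#include <stdbool.h>

/* Wake the flusher only when the whole isolated batch consists of
 * dirty folios that have not been queued for writeback yet. */
static bool should_wake_flusher(int nr_unqueued_dirty, int isolated)
{
	return isolated > 0 && nr_unqueued_dirty == isolated;
}
```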
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 41 ++++++++++++++++-------------------------
1 file changed, 16 insertions(+), 25 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74255efc4ad9..d7a72a60c894 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4724,8 +4724,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- if (type == LRU_GEN_FILE)
- sc->nr.file_taken += isolated;
*isolatedp = isolated;
return scanned;
@@ -4833,12 +4831,27 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
return scanned;
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
- sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
+ /*
+ * If too many file cache in the coldest generation can't be evicted
+ * due to being dirty, wake up the flusher.
+ */
+ if (stat.nr_unqueued_dirty == isolated) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -4999,28 +5012,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
return need_rotate;
}
--
2.54.0
* [PATCH v6 11/14] mm/mglru: remove no longer used reclaim argument for folio protection
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (9 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 10/14] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 12/14] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
` (5 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now dirty reclaim folios are handled after isolation, not before,
since dirty reactivation must take the folio off the LRU first, which
helps unify the dirty handling logic.
So this argument is no longer needed; just remove it.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d7a72a60c894..f6cb1a4b6a31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3220,7 +3220,7 @@ static int folio_update_gen(struct folio *folio, int gen)
}
/* protect pages accessed multiple times through file descriptors */
-static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio)
{
int type = folio_is_file_lru(folio);
struct lru_gen_folio *lrugen = &lruvec->lrugen;
@@ -3239,9 +3239,6 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
- /* for folio_end_writeback() */
- if (reclaiming)
- new_flags |= BIT(PG_reclaim);
} while (!try_cmpxchg(&folio->flags.f, &old_flags, new_flags));
lru_gen_update_size(lruvec, folio, old_gen, new_gen);
@@ -3855,7 +3852,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
- new_gen = folio_inc_gen(lruvec, folio, false);
+ new_gen = folio_inc_gen(lruvec, folio);
list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
/* don't count the workingset being lazily promoted */
@@ -4607,7 +4604,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* protected */
if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
/* don't count the workingset being lazily promoted */
@@ -4622,7 +4619,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* ineligible */
if (zone > sc->reclaim_idx) {
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
}
--
2.54.0
* [PATCH v6 12/14] mm/vmscan: remove sc->file_taken
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (10 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 11/14] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 13/14] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
` (4 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
No one is using it now; just remove it.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f6cb1a4b6a31..7bec0ae51465 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -173,7 +173,6 @@ struct scan_control {
unsigned int congested;
unsigned int writeback;
unsigned int immediate;
- unsigned int file_taken;
unsigned int taken;
} nr;
@@ -2040,8 +2039,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
sc->nr.taken += nr_taken;
- if (file)
- sc->nr.file_taken += nr_taken;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
--
2.54.0
* [PATCH v6 13/14] mm/vmscan: remove sc->unqueued_dirty
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (11 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 12/14] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 14/14] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
` (3 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
No one is using it now; just remove it.
Suggested-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7bec0ae51465..6df5ab625e6a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,7 +169,6 @@ struct scan_control {
struct {
unsigned int dirty;
- unsigned int unqueued_dirty;
unsigned int congested;
unsigned int writeback;
unsigned int immediate;
@@ -2035,7 +2034,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
sc->nr.dirty += stat.nr_dirty;
sc->nr.congested += stat.nr_congested;
- sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
sc->nr.taken += nr_taken;
--
2.54.0
* [PATCH v6 14/14] mm/vmscan: unify writeback reclaim statistic and throttling
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (12 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 13/14] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
@ 2026-04-23 17:43 ` Kairui Song via B4 Relay
2026-04-23 18:14 ` [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Andrew Morton
` (2 subsequent siblings)
16 siblings, 0 replies; 29+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-23 17:43 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, MGLRU and non-MGLRU handle reclaim statistics and writeback
very differently, especially throttling: MGLRU basically ignores the
throttling part.
Unify this with a helper that deduplicates the code, so both setups
share the same behavior.
Tested using the following bash reproducer:
echo "Setup a slow device using dm delay"
dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
LOOP=$(losetup --show -f /var/tmp/backing)
mkfs.ext4 -q $LOOP
echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
dmsetup create slow_dev
mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
echo "Start writeback pressure"
sync && echo 3 > /proc/sys/vm/drop_caches
mkdir /sys/fs/cgroup/test_wb
echo 128M > /sys/fs/cgroup/test_wb/memory.max
(echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
echo "Clean up"
echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
dmsetup resume slow_dev
umount -l /mnt/slow && sync
dmsetup remove slow_dev
Before this commit, `dd` gets OOM killed immediately when MGLRU is
enabled; the classic LRU is fine.
After this commit, throttling is effective, with no more spinning on
the LRU or premature OOM. Stress tests on other workloads also look good.
Global throttling is not handled here yet; we will fix that separately.
Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
Tested-by: Leno Hou <lenohou@gmail.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 92 +++++++++++++++++++++++++++++--------------------------------
1 file changed, 43 insertions(+), 49 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6df5ab625e6a..910420f7754d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
return !(current->flags & PF_LOCAL_THROTTLE);
}
+static void handle_reclaim_writeback(unsigned long nr_taken,
+ struct pglist_data *pgdat,
+ struct scan_control *sc,
+ struct reclaim_stat *stat)
+{
+ /*
+ * If dirty folios are scanned that are not queued for IO, it
+ * implies that flushers are not doing their job. This can
+ * happen when memory pressure pushes dirty folios to the end of
+ * the LRU before the dirty limits are breached and the dirty
+ * data has expired. It can also happen when the proportion of
+ * dirty folios grows not through writes but through memory
+ * pressure reclaiming all the clean cache. And in some cases,
+ * the flushers simply cannot keep up with the allocation
+ * rate. Nudge the flusher threads in case they are asleep.
+ */
+ if (stat->nr_unqueued_dirty == nr_taken) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ *
+ * Flusher may not be able to issue writeback quickly
+ * enough for cgroupv1 writeback throttling to work
+ * on a large system.
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
+ sc->nr.dirty += stat->nr_dirty;
+ sc->nr.congested += stat->nr_congested;
+ sc->nr.writeback += stat->nr_writeback;
+ sc->nr.immediate += stat->nr_immediate;
+ sc->nr.taken += nr_taken;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
lruvec_lock_irq(lruvec);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
-
- /*
- * If dirty folios are scanned that are not queued for IO, it
- * implies that flushers are not doing their job. This can
- * happen when memory pressure pushes dirty folios to the end of
- * the LRU before the dirty limits are breached and the dirty
- * data has expired. It can also happen when the proportion of
- * dirty folios grows not through writes but through memory
- * pressure reclaiming all the clean cache. And in some cases,
- * the flushers simply cannot keep up with the allocation
- * rate. Nudge the flusher threads in case they are asleep.
- */
- if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
- sc->nr.dirty += stat.nr_dirty;
- sc->nr.congested += stat.nr_congested;
- sc->nr.writeback += stat.nr_writeback;
- sc->nr.immediate += stat.nr_immediate;
- sc->nr.taken += nr_taken;
-
+ handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
@@ -4824,26 +4830,13 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr_reclaimed += reclaimed;
+ /* Retry pass is only meant for clean folios without new isolation */
+ if (isolated)
+ handle_reclaim_writeback(isolated, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (stat.nr_unqueued_dirty == isolated) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -4886,6 +4879,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!list_empty(&list)) {
skip_retry = true;
+ isolated = 0;
goto retry;
}
--
2.54.0
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (13 preceding siblings ...)
2026-04-23 17:43 ` [PATCH v6 14/14] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
@ 2026-04-23 18:14 ` Andrew Morton
2026-04-24 10:32 ` Barry Song
2026-04-24 13:36 ` Andrew Morton
16 siblings, 0 replies; 29+ messages in thread
From: Andrew Morton @ 2026-04-23 18:14 UTC (permalink / raw)
To: kasong
Cc: Kairui Song via B4 Relay, linux-mm, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng, Baolin Wang
On Fri, 24 Apr 2026 01:43:11 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
Thanks, I queued this up in mm-new.
And thanks to all the testers and reviewers - it's so good to see MGLRU
getting some attention.
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (14 preceding siblings ...)
2026-04-23 18:14 ` [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Andrew Morton
@ 2026-04-24 10:32 ` Barry Song
2026-04-24 11:58 ` Barry Song
2026-04-24 13:36 ` Andrew Morton
16 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2026-04-24 10:32 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Fri, Apr 24, 2026 at 1:43 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
>
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress testing during the development
> of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> the code base and fixes several performance issues, preparing for
> further work.
>
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective.
>
> This series slightly cleans up and improves these issues using a scan
> budget by calculating the number of folios to scan at the beginning of
> the loop, and decouples aging from the reclaim calculation helpers.
> Then, move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
>
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and a 128G memory machine using NVME as storage.
>
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.
>
> Not using SWAP.
>
> Before:
> Throughput(ops/sec): 62485.02962831822
> AverageLatency(us): 500.9746963330107
> pgpgin 159347462
> pgpgout 5413332
> workingset_refault_anon 0
> workingset_refault_file 34522071
>
> After:
> Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> pgpgin 111093923 (-30.3%, lower is better)
> pgpgout 5437456
> workingset_refault_anon 0
> workingset_refault_file 19566366 (-43.3%, lower is better)
>
> We can see a significant performance improvement after this series.
> The test is done on NVME and the performance gap would be even larger
> for slow devices, such as HDD or network storage. We observed over
> 100% gain for some workloads with slow IO.
>
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
>
> Before:
> Total requests: 79915
> Per-worker 95% CI (mean): [1233.9, 1263.5]
> Per-worker stdev: 59.2
> Jain's fairness: 0.997795 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 26859 33.61% 33.61%
> [1,2)s 7818 9.78% 43.39%
> [2,4)s 5532 6.92% 50.31%
> [4,8)s 39706 49.69% 100.00%
>
> After:
> Total requests: 81382
> Per-worker 95% CI (mean): [1241.9, 1301.3]
> Per-worker stdev: 118.8
> Jain's fairness: 0.991480 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 26696 32.80% 32.80%
> [1,2)s 8745 10.75% 43.55%
> [2,4)s 6865 8.44% 51.98%
> [4,8)s 39076 48.02% 100.00%
>
> Reclaim is still fair and effective, total requests number seems
> slightly better.
>
> OOM issue with aging and throttling
> ===================================
> For the throttling OOM issue, it can be easily reproduced using dd and
> cgroup limit as demonstrated in patch 14, and fixed by this series.
>
> The aging OOM is a bit tricky, a specific reproducer can be used to
> simulate what we encountered in production environment [4]:
> Spawns multiple workers that keep reading the given file using mmap,
> and pauses for 120ms after one file read batch. It also spawns another
> set of workers that keep allocating and freeing a given size of
> anonymous memory. The total memory size exceeds the memory limit
> (eg. 14G anon + 8G file, which is 22G vs a 16G memcg limit).
>
> - MGLRU disabled:
> Finished 128 iterations.
>
> - MGLRU enabled:
> OOM with following info after about ~10-20 iterations:
> [ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> [ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
> [ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> [ 62.640823] Memory cgroup stats for /demo:
> [ 62.641017] anon 10604879872
> [ 62.641941] file 6574858240
>
> OOM occurs despite there being still evictable file folios.
>
> - MGLRU enabled after this series:
> Finished 128 iterations.
>
> Worth noting there is another OOM related issue reported in V1 of
> this series, which is tested and looking OK now [5].
>
> MySQL:
> ======
>
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> --tables=48 --table-size=2000000 --threads=48 --time=600 run
>
> Before: 17260.781429 tps
> After this series: 17266.842857 tps
>
> MySQL is anon folios heavy, involves writeback and file and still
> looking good. Seems only noise level changes, no regression.
>
> FIO:
> ====
> Testing with the following command, where /mnt/ramdisk is a
> 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> 6 test run each:
>
> fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
> --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
> --rw=randread --norandommap --time_based \
> --ramp_time=1m --runtime=5m --group_reporting
>
> Before: 9196.481429 MB/s
> After this series: 9256.105000 MB/s
>
> Also seem only noise level changes and no regression or slightly better.
>
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 12 test run each.
>
> Before: 2589.63s
> After this series: 2543.58s
>
> Also seem only noise level changes, no regression or very slightly better.
>
> Android:
> ========
> Xinyu reported a performance gain on Android, too, with this series. The
> test consisted of cold-starting multiple applications sequentially under
> moderate system load. [6]
>
> Before:
> Launch Time Summary (all apps, all runs)
> Mean 868.0ms
> P50 888.0ms
> P90 1274.2ms
> P95 1399.0ms
>
> After:
> Launch Time Summary (all apps, all runs)
> Mean 850.5ms (-2.07%)
> P50 861.5ms (-3.04%)
> P90 1179.0ms (-8.05%)
> P95 1228.0ms (-12.2%)
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
> Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
> Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
> Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
> Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
> Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6]
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
Hi Kairui,
I haven't identified the exact commit, but this patchset seems to
make MGLRU's swappiness behavior more erratic.
In mainline, MGLRU does not show as strong an effect as the
active/inactive LRU, but it still behaves roughly linearly: higher
swappiness leads to more swap activity and fewer file refaults.
With this patchset, however, the behavior becomes non-monotonic as
swappiness increases. I observed clear up-and-down fluctuations.
I reproduced this by running a kernel build in a memcg limited to
1GB, with swappiness set to 35, 70, 105, 140, and 175.
this is mainline using MGLRU:
*** Executing round 1 ***
set swappiness to 35
real 1m49.247s
user 25m30.484s
sys 3m37.203s
pswpin: 933731
pswpout: 3365968
pgpgin: 5649320
pgpgout: 13786572
swpout_zero: 794960
swpin_zero: 10594
refault_file: 354998
refault_anon: 944323
*** Executing round 2 ***
set swappiness to 70
real 1m49.313s
user 25m31.643s
sys 3m40.661s
pswpin: 1049052
pswpout: 3565887
pgpgin: 5694288
pgpgout: 14582200
swpout_zero: 840947
swpin_zero: 12029
refault_file: 242973
refault_anon: 1061033
*** Executing round 3 ***
set swappiness to 105
real 1m48.611s
user 25m32.198s
sys 3m37.210s
pswpin: 981095
pswpout: 3396069
pgpgin: 5283940
pgpgout: 13898988
swpout_zero: 795932
swpin_zero: 11249
refault_file: 202432
refault_anon: 992295
*** Executing round 4 ***
set swappiness to 140
real 1m49.398s
user 25m35.650s
sys 3m50.656s
pswpin: 1222881
pswpout: 3935186
pgpgin: 6165024
pgpgout: 16056664
swpout_zero: 913808
swpin_zero: 13251
refault_file: 191564
refault_anon: 1236083
*** Executing round 5 ***
set swappiness to 175
real 1m49.513s
user 25m35.442s
sys 3m55.869s
pswpin: 1343139
pswpout: 4256014
pgpgin: 6557152
pgpgout: 17341452
swpout_zero: 998107
swpin_zero: 15692
refault_file: 175795
refault_anon: 1358782
this is mm-new using MGLRU:
*** Executing round 1 ***
set swappiness to 35
real 1m51.804s
user 25m38.070s
sys 4m16.301s
pswpin: 1587728
pswpout: 4932011
pgpgin: 8788688
pgpgout: 20062761
swpout_zero: 1129975
swpin_zero: 17944
refault_file: 487923
refault_anon: 1605670
*** Executing round 2 ***
set swappiness to 70
real 1m51.503s
user 25m37.581s
sys 4m18.161s
pswpin: 1743890
pswpout: 5214587
pgpgin: 8676728
pgpgout: 21178716
swpout_zero: 1185453
swpin_zero: 20016
refault_file: 317993
refault_anon: 1763904
*** Executing round 3 ***
set swappiness to 105
real 1m51.154s
user 25m37.956s
sys 4m15.017s
pswpin: 1687517
pswpout: 5073825
pgpgin: 8173036
pgpgout: 20608932
swpout_zero: 1161806
swpin_zero: 20069
refault_file: 249769
refault_anon: 1707538
*** Executing round 4 ***
set swappiness to 140
real 1m50.732s
user 25m37.686s
sys 4m16.066s
pswpin: 1671678
pswpout: 5118895
pgpgin: 7929960
pgpgout: 20790468
swpout_zero: 1171029
swpin_zero: 19596
refault_file: 193421
refault_anon: 1691228
*** Executing round 5 ***
set swappiness to 175
real 1m49.518s
user 25m37.653s
sys 4m12.619s
pswpin: 1506888
pswpout: 4789793
pgpgin: 7270448
pgpgout: 19479188
swpout_zero: 1119251
swpin_zero: 16699
refault_file: 187304
refault_anon: 1523585
The final one is classic active/inactive LRU:
*** Executing round 1 ***
set swappiness to 35
real 1m50.038s
user 25m21.911s
sys 3m42.798s
pswpin: 476994
pswpout: 2258185
pgpgin: 5247280
pgpgout: 9354640
swpout_zero: 684759
swpin_zero: 6387
refault_file: 750021
refault_anon: 483334
*** Executing round 2 ***
set swappiness to 70
real 1m48.781s
user 25m25.682s
sys 3m37.854s
pswpin: 515470
pswpout: 2306901
pgpgin: 4265500
pgpgout: 9547436
swpout_zero: 706437
swpin_zero: 6960
refault_file: 459740
refault_anon: 522381
*** Executing round 3 ***
set swappiness to 105
real 1m48.233s
user 25m26.623s
sys 3m38.843s
pswpin: 519540
pswpout: 2343897
pgpgin: 3628788
pgpgout: 9696500
swpout_zero: 743576
swpin_zero: 7782
refault_file: 303701
refault_anon: 527273
*** Executing round 4 ***
set swappiness to 140
real 1m48.800s
user 25m32.067s
sys 3m50.751s
pswpin: 605537
pswpout: 2615227
pgpgin: 3470540
pgpgout: 10776312
swpout_zero: 825446
swpin_zero: 9055
refault_file: 173236
refault_anon: 614544
*** Executing round 5 ***
set swappiness to 175
real 1m52.356s
user 25m29.727s
sys 3m55.664s
pswpin: 698228
pswpout: 2908292
pgpgin: 3602884
pgpgout: 11945332
swpout_zero: 912127
swpin_zero: 10298
refault_file: 117625
refault_anon: 708478
The build script is available here if you want to have a try:
https://git.kernel.org/pub/scm/linux/kernel/git/baohua/linux.git/diff/tools/mm/build-kernel-with-increasing-swappiness.sh?h=zram-async-gc&id=d47888e9
I am also debugging this. One possibility is that placing
dirty pages in the youngest generation may have affected
lruvec_evictable_size()?
Thanks
Barry
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-24 10:32 ` Barry Song
@ 2026-04-24 11:58 ` Barry Song
2026-04-24 12:55 ` Kairui Song
0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2026-04-24 11:58 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Fri, Apr 24, 2026 at 6:32 PM Barry Song <baohua@kernel.org> wrote:
>
> On Fri, Apr 24, 2026 at 1:43 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > [...]
>
> Hi Kairui,
>
> I haven't identified the exact commit, but this patchset seems to
> make MGLRU's swappiness behavior more erratic.
>
> In mainline, MGLRU does not show as strong an effect as the
> active/inactive LRU, but it still behaves roughly linearly: higher
> swappiness leads to more swap activity and fewer file refaults.
>
> With this patchset, however, the behavior becomes non-monotonic as
> swappiness increases. I observed clear up-and-down fluctuations.
>
> I reproduced this by running a kernel build in a memcg limited to
> 1GB, with swappiness set to 35, 70, 105, 140, and 175.
>
> this is mainline using MGLRU:
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m49.247s
> user 25m30.484s
> sys 3m37.203s
> pswpin: 933731
> pswpout: 3365968
> pgpgin: 5649320
> pgpgout: 13786572
> swpout_zero: 794960
> swpin_zero: 10594
> refault_file: 354998
> refault_anon: 944323
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m49.313s
> user 25m31.643s
> sys 3m40.661s
> pswpin: 1049052
> pswpout: 3565887
> pgpgin: 5694288
> pgpgout: 14582200
> swpout_zero: 840947
> swpin_zero: 12029
> refault_file: 242973
> refault_anon: 1061033
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m48.611s
> user 25m32.198s
> sys 3m37.210s
> pswpin: 981095
> pswpout: 3396069
> pgpgin: 5283940
> pgpgout: 13898988
> swpout_zero: 795932
> swpin_zero: 11249
> refault_file: 202432
> refault_anon: 992295
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m49.398s
> user 25m35.650s
> sys 3m50.656s
> pswpin: 1222881
> pswpout: 3935186
> pgpgin: 6165024
> pgpgout: 16056664
> swpout_zero: 913808
> swpin_zero: 13251
> refault_file: 191564
> refault_anon: 1236083
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m49.513s
> user 25m35.442s
> sys 3m55.869s
> pswpin: 1343139
> pswpout: 4256014
> pgpgin: 6557152
> pgpgout: 17341452
> swpout_zero: 998107
> swpin_zero: 15692
> refault_file: 175795
> refault_anon: 1358782
>
> this is mm-new using MGLRU:
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m51.804s
> user 25m38.070s
> sys 4m16.301s
> pswpin: 1587728
> pswpout: 4932011
> pgpgin: 8788688
> pgpgout: 20062761
> swpout_zero: 1129975
> swpin_zero: 17944
> refault_file: 487923
> refault_anon: 1605670
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m51.503s
> user 25m37.581s
> sys 4m18.161s
> pswpin: 1743890
> pswpout: 5214587
> pgpgin: 8676728
> pgpgout: 21178716
> swpout_zero: 1185453
> swpin_zero: 20016
> refault_file: 317993
> refault_anon: 1763904
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m51.154s
> user 25m37.956s
> sys 4m15.017s
> pswpin: 1687517
> pswpout: 5073825
> pgpgin: 8173036
> pgpgout: 20608932
> swpout_zero: 1161806
> swpin_zero: 20069
> refault_file: 249769
> refault_anon: 1707538
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m50.732s
> user 25m37.686s
> sys 4m16.066s
> pswpin: 1671678
> pswpout: 5118895
> pgpgin: 7929960
> pgpgout: 20790468
> swpout_zero: 1171029
> swpin_zero: 19596
> refault_file: 193421
> refault_anon: 1691228
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m49.518s
> user 25m37.653s
> sys 4m12.619s
> pswpin: 1506888
> pswpout: 4789793
> pgpgin: 7270448
> pgpgout: 19479188
> swpout_zero: 1119251
> swpin_zero: 16699
> refault_file: 187304
> refault_anon: 1523585
>
> The final one is classic active/inactive LRU:
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m50.038s
> user 25m21.911s
> sys 3m42.798s
> pswpin: 476994
> pswpout: 2258185
> pgpgin: 5247280
> pgpgout: 9354640
> swpout_zero: 684759
> swpin_zero: 6387
> refault_file: 750021
> refault_anon: 483334
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m48.781s
> user 25m25.682s
> sys 3m37.854s
> pswpin: 515470
> pswpout: 2306901
> pgpgin: 4265500
> pgpgout: 9547436
> swpout_zero: 706437
> swpin_zero: 6960
> refault_file: 459740
> refault_anon: 522381
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m48.233s
> user 25m26.623s
> sys 3m38.843s
> pswpin: 519540
> pswpout: 2343897
> pgpgin: 3628788
> pgpgout: 9696500
> swpout_zero: 743576
> swpin_zero: 7782
> refault_file: 303701
> refault_anon: 527273
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m48.800s
> user 25m32.067s
> sys 3m50.751s
> pswpin: 605537
> pswpout: 2615227
> pgpgin: 3470540
> pgpgout: 10776312
> swpout_zero: 825446
> swpin_zero: 9055
> refault_file: 173236
> refault_anon: 614544
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m52.356s
> user 25m29.727s
> sys 3m55.664s
> pswpin: 698228
> pswpout: 2908292
> pgpgin: 3602884
> pgpgout: 11945332
> swpout_zero: 912127
> swpin_zero: 10298
> refault_file: 117625
> refault_anon: 708478
>
>
> The build script is available here if you want to have a try:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/baohua/linux.git/diff/tools/mm/build-kernel-with-increasing-swappiness.sh?h=zram-async-gc&id=d47888e9
>
> I am also debugging this. One possibility is that placing
> dirty pages in the youngest generation may have affected
> lruvec_evictable_size()?
I reverted the six commits below, but swappiness behavior is still
very unusual on mm-new.
4ce85c040e0a mm/vmscan: unify writeback reclaim statistic and throttling
f80a81552f50 mm/vmscan: remove sc->unqueued_dirty
9381a541a759 mm/vmscan: remove sc->file_taken
f2e2a7ae7660 mm/mglru: remove no longer used reclaim argument for folio protection
b052c4a752a5 mm/mglru: simplify and improve dirty writeback handling
831409284da1 mm/mglru: use the common routine for dirty/writeback reactivation
After reverting patch 9-14:
*** Executing round 1 ***
set swappiness to 35
real 2m6.982s
user 24m59.930s
sys 9m1.374s
pswpin: 1973368
pswpout: 4792167
pgpgin: 12471516
pgpgout: 19490361
swpout_zero: 992543
swpin_zero: 48166
refault_file: 1002114
refault_anon: 2021486
*** Executing round 2 ***
set swappiness to 70
real 1m56.011s
user 25m24.954s
sys 5m31.730s
pswpin: 1788750
pswpout: 4869145
pgpgin: 9745888
pgpgout: 19799848
swpout_zero: 1009680
swpin_zero: 35920
refault_file: 540060
refault_anon: 1824622
*** Executing round 3 ***
set swappiness to 105
real 1m52.184s
user 25m29.605s
sys 5m19.031s
pswpin: 1894596
pswpout: 5220326
pgpgin: 9844536
pgpgout: 21251668
swpout_zero: 1107839
swpin_zero: 33253
refault_file: 453966
refault_anon: 1927801
*** Executing round 4 ***
set swappiness to 140
real 1m56.725s
user 25m26.667s
sys 6m7.878s
pswpin: 2366033
pswpout: 5584223
pgpgin: 11962872
pgpgout: 22660564
swpout_zero: 1167419
swpin_zero: 56513
refault_file: 442744
refault_anon: 2422500
*** Executing round 5 ***
set swappiness to 175
real 2m16.219s
user 24m32.728s
sys 12m26.124s
pswpin: 1990093
pswpout: 4568372
pgpgin: 13571748
pgpgout: 18604592
swpout_zero: 977963
swpin_zero: 52072
refault_file: 1289471
refault_anon: 2042117
So it is likely caused by an earlier commit than the six above.
I need to get some sleep.
Could this be because get_nr_to_scan() was moved out of the loop by
[PATCH v6 04/14] mm/mglru: restructure the reclaim loop,
while in mainline it is re-evaluated in each iteration?
Will take a look tomorrow or the day after.
Thanks
Barry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-24 11:58 ` Barry Song
@ 2026-04-24 12:55 ` Kairui Song
2026-04-25 12:18 ` Barry Song
0 siblings, 1 reply; 29+ messages in thread
From: Kairui Song @ 2026-04-24 12:55 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Fri, Apr 24, 2026 at 7:58 PM Barry Song <baohua@kernel.org> wrote:
> Could this be because get_nr_to_scan() was moved out of the loop by
> [PATCH v6 04/14] mm/mglru: restructure the reclaim loop,
> while in mainline it is re-evaluated in each iteration?
>
> Will take a look tomorrow or the day after.
>
> Thanks
> Barry
>
Hi Barry,
I ran your test script a few times, and strangely I can't reproduce
it. Swappiness behaves similarly before and after this series. I
directly checked out the mm-new commit of this series (4ce85c040e0a)
and compared it to the mm-new commit right before this series
(31a112f05f62). I also extended your script a bit to test more
swappiness values:
Before:
uname -a
Linux localhost 7.0.0.mm-new-g31a112f05f62 #305 SMP PREEMPT_DYNAMIC
Fri Apr 24 19:07:30 CST 2026 x86_64 GNU/Linux
*** Executing swappiness 5 ***
set swappiness to 5
Running as unit: run-p6675-i151640.scope; invocation ID:
e8b79eab2854418b81c2ad62c3d121c4
real 1m50.652s
user 38m58.129s
sys 21m41.763s
pswpin: 15917615
pswpout: 41199479
pgpgin: 76113864
pgpgout: 165640720
swpout_zero: 1546131
swpin_zero: 285359
refault_file: 1648508
refault_anon: 15829347
*** Executing swappiness 35 ***
set swappiness to 35
Running as unit: run-p44771-i189005.scope; invocation ID:
562deebdc09647a290a2165870295ef7
real 1m50.623s
user 38m50.505s
sys 21m45.217s
pswpin: 15915841
pswpout: 41297235
pgpgin: 75425648
pgpgout: 166025592
swpout_zero: 1546865
swpin_zero: 277756
refault_file: 1546926
refault_anon: 15778795
*** Executing swappiness 70 ***
set swappiness to 70
Running as unit: run-p82859-i1819.scope; invocation ID:
82aead8ff78742bba6bf401a6534588e
real 1m49.605s
user 38m30.262s
sys 21m49.417s
pswpin: 16088273
pswpout: 41592875
pgpgin: 71469324
pgpgout: 167370708
swpout_zero: 1657002
swpin_zero: 319600
refault_file: 857694
refault_anon: 16175619
*** Executing swappiness 105 ***
set swappiness to 105
Running as unit: run-p120948-i250916.scope; invocation ID:
e7bc4e739cab4e8bbc5efb06b50ead14
real 1m48.951s
user 38m30.080s
sys 21m48.815s
pswpin: 16265236
pswpout: 41884071
pgpgin: 70923620
pgpgout: 168654260
swpout_zero: 1649763
swpin_zero: 316386
refault_file: 691818
refault_anon: 16368960
*** Executing swappiness 140 ***
set swappiness to 140
Running as unit: run-p159035-i42797.scope; invocation ID:
ea31084e41bf4c7b9427800bf78500fb
real 1m49.596s
user 38m32.881s
sys 21m50.914s
pswpin: 16434435
pswpout: 42315789
pgpgin: 70803660
pgpgout: 170344784
swpout_zero: 1659821
swpin_zero: 314116
refault_file: 563047
refault_anon: 16557648
*** Executing swappiness 175 ***
set swappiness to 175
Running as unit: run-p197122-i325303.scope; invocation ID:
181af70e74b74bff901e10ed97d32c50
real 1m49.091s
user 38m31.101s
sys 21m56.002s
pswpin: 16470200
pswpout: 42370706
pgpgin: 70388644
pgpgout: 170452728
swpout_zero: 1689922
swpin_zero: 330216
refault_file: 487950
refault_anon: 16613124
*** Executing swappiness 200 ***
set swappiness to 200
Running as unit: run-p235206-i68603.scope; invocation ID:
d6af085ecd9a47a88ddecfb326af4d58
real 1m49.458s
user 38m37.696s
sys 22m7.194s
pswpin: 16473742
pswpout: 42413098
pgpgin: 70351964
pgpgout: 170454504
swpout_zero: 1653981
swpin_zero: 316654
refault_file: 539774
refault_anon: 16620315
After:
uname -a
Linux localhost 7.0.0.ptch-g4ce85c040e0a #2733 SMP PREEMPT_DYNAMIC Fri
Apr 24 19:07:08 CST 2026 x86_64 GNU/Linux
*** Executing swappiness 5 ***
set swappiness to 5
Running as unit: run-p10510-i123098.scope; invocation ID:
7773cea9690140378786e496d7bf0523
real 1m50.042s
user 38m59.555s
sys 21m25.018s
pswpin: 15913149
pswpout: 41183479
pgpgin: 76136184
pgpgout: 165910472
swpout_zero: 1557330
swpin_zero: 282842
refault_file: 1599524
refault_anon: 15876245
*** Executing swappiness 35 ***
set swappiness to 35
Running as unit: run-p48606-i45593.scope; invocation ID:
66799f770f0944558b258fa42a62cd28
real 1m50.479s
user 38m59.363s
sys 21m27.155s
pswpin: 15865488
pswpout: 41087641
pgpgin: 75103236
pgpgout: 165212060
swpout_zero: 1557445
swpin_zero: 287809
refault_file: 1421868
refault_anon: 15790633
*** Executing swappiness 70 ***
set swappiness to 70
Running as unit: run-p86689-i246403.scope; invocation ID:
ac7303701c524901a64c9a5b945f9e20
real 1m49.409s
user 38m36.037s
sys 21m37.421s
pswpin: 16198083
pswpout: 41876332
pgpgin: 71719832
pgpgout: 168490208
swpout_zero: 1622494
swpin_zero: 304799
refault_file: 847114
refault_anon: 16283181
*** Executing swappiness 105 ***
set swappiness to 105
Running as unit: run-p124782-i205752.scope; invocation ID:
591727ffbc1d40c1b79a5a20567e1616
real 1m48.638s
user 38m28.237s
sys 21m31.590s
pswpin: 16278124
pswpout: 41920058
pgpgin: 70591876
pgpgout: 168801512
swpout_zero: 1660539
swpin_zero: 321367
refault_file: 651376
refault_anon: 16400826
*** Executing swappiness 140 ***
set swappiness to 140
Running as unit: run-p162871-i189610.scope; invocation ID:
3be1d782d9724eb79e7803eb72c47e4a
real 1m48.602s
user 38m31.708s
sys 21m39.510s
pswpin: 16376932
pswpout: 41921230
pgpgin: 70503000
pgpgout: 168528328
swpout_zero: 1684494
swpin_zero: 333271
refault_file: 524031
refault_anon: 16513067
*** Executing swappiness 175 ***
set swappiness to 175
Running as unit: run-p200958-i133720.scope; invocation ID:
12da6771f02b485ea6cf0c6842bd5f73
real 1m48.905s
user 38m28.887s
sys 21m31.592s
pswpin: 16535604
pswpout: 42244753
pgpgin: 70727968
pgpgout: 170088056
swpout_zero: 1675139
swpin_zero: 327897
refault_file: 489367
refault_anon: 16669662
*** Executing swappiness 200 ***
set swappiness to 200
Running as unit: run-p239041-i341992.scope; invocation ID:
31ff32be856f41f4a3ad60ea7878a230
real 1m48.746s
user 38m25.722s
sys 21m50.843s
pswpin: 16644335
pswpout: 42660348
pgpgin: 70911344
pgpgout: 171749444
swpout_zero: 1675432
swpin_zero: 321988
refault_file: 501717
refault_anon: 16775111
Since you mentioned it's mm-new vs mainline, and the problem is still
there after reverting part of this series, could it be related to
something else in mm-new? I'll keep testing with more stress and
workloads to dig deeper too. Or maybe the swappiness behavior just
changed slightly, so it may perform better or worse depending on
timing and workload? Swappiness on MGLRU currently only works as a
factor in calculating the refault and reclaim balance of anon vs.
file, so it may behave a bit unpredictably. There is no proportional
calculation like in the active/inactive LRU. That's a problem too, and
we might fix it later.
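As a loose illustration of that difference (a hypothetical model with made-up names, not the kernel's actual code): the classic LRU turns swappiness directly into an anon:file scan proportion, while in MGLRU swappiness is only one weight inside a binary which-type-refaults-less comparison, so its effect on swap volume need not be linear:

```c
#define MAX_SWAPPINESS 200

/* Classic LRU style: swappiness sets a direct anon:file scan proportion. */
static void proportional_split(long nr, int swappiness,
			       long *nr_anon, long *nr_file)
{
	*nr_anon = nr * swappiness / MAX_SWAPPINESS;
	*nr_file = nr - *nr_anon;
}

/*
 * MGLRU style (simplified): swappiness is only a weight in a binary
 * refault-rate comparison, so a small swappiness change may not change
 * the decision at all, and the response is not proportional.
 */
static int evict_file_first(long anon_refaulted, long anon_scanned,
			    long file_refaulted, long file_scanned,
			    int swappiness)
{
	/* cross-multiplied refault-rate comparison to avoid division */
	return (long long)file_refaulted * anon_scanned * swappiness <=
	       (long long)anon_refaulted * file_scanned *
			  (MAX_SWAPPINESS - swappiness);
}
```

With identical refault rates on both sides, swappiness 0 picks file eviction and swappiness 200 picks anon eviction, but swappiness only shifts the crossover point of a yes/no decision rather than producing a proportional split.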
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (15 preceding siblings ...)
2026-04-24 10:32 ` Barry Song
@ 2026-04-24 13:36 ` Andrew Morton
2026-04-24 14:16 ` Kairui Song
16 siblings, 1 reply; 29+ messages in thread
From: Andrew Morton @ 2026-04-24 13:36 UTC (permalink / raw)
To: kasong
Cc: Kairui Song via B4 Relay, linux-mm, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng, Baolin Wang
On Fri, 24 Apr 2026 01:43:11 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
Sashiko:
https://sashiko.dev/#/patchset/20260424-mglru-reclaim-v6-0-a57622d770c3@tencent.com
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-24 13:36 ` Andrew Morton
@ 2026-04-24 14:16 ` Kairui Song
0 siblings, 0 replies; 29+ messages in thread
From: Kairui Song @ 2026-04-24 14:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song via B4 Relay, linux-mm, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng, Baolin Wang
On Fri, Apr 24, 2026 at 9:42 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 24 Apr 2026 01:43:11 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:
>
> > This series cleans up and slightly improves MGLRU's reclaim loop and
> > dirty writeback handling. As a result, we can see an up to ~30% increase
> > in some workloads like MongoDB with YCSB and a huge decrease in file
> > refault, no swap involved. Other common benchmarks have no regression,
> > and LOC is reduced, with less unexpected OOM, too.
>
> Sashiko:
> https://sashiko.dev/#/patchset/20260424-mglru-reclaim-v6-0-a57622d770c3@tencent.com
>
Thanks. It's interesting that I followed sashiko's previous
suggestion, since it's a trivial part, and now sashiko is suggesting
changing it back to what I was doing :)
Nothing critical, and I did see a false positive, but anyway, I'll
forward the review to the corresponding patch. I do prefer to change
it back to the previous design, and maybe make some other slight
improvements too. Let me send an update in a few days after
double-checking the swappiness issue reported by Barry.
* Re: [PATCH v6 04/14] mm/mglru: restructure the reclaim loop
2026-04-23 17:43 ` [PATCH v6 04/14] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-04-24 17:04 ` Kairui Song
0 siblings, 0 replies; 29+ messages in thread
From: Kairui Song @ 2026-04-24 17:04 UTC (permalink / raw)
To: linux-mm
Cc: Kairui Song, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang
Forwarding sashiko's review:
> commit b0a64f8c3f76ce5cebb30c1caa6305fc617888cf
> Author: Kairui Song <kasong@tencent.com>
> mm/mglru: restructure the reclaim loop
> The current loop calculates the scan number on each iteration, which couples
> it with aging and rotation. This patch decouples aging and rotation by
> calculating the scan number once at the beginning of reclaim, respecting
> reclaim priority, and making aging explicit.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3c5a6ae9244099..757beb605980d4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [ ... ]
> > @@ -4940,11 +4932,10 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> > return nr_to_scan;
> >
> > nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> > + nr_to_scan >>= sc->priority;
> >
> > - /* try to get away with not aging at the default priority */
> > - if (!need_aging || sc->priority == DEF_PRIORITY)
> > - return nr_to_scan >> sc->priority;
> > + if (!nr_to_scan && sc->priority < DEF_PRIORITY)
> > + nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
> Does this code introduce a priority inversion where higher reclaim priorities
> result in fewer pages being scanned than lower priorities?
> If a small memcg has an evictable size of 100 and SWAP_CLUSTER_MAX is 32,
> at sc->priority == 7, 100 >> 7 is 0, so the fallback sets nr_to_scan to 32.
> At sc->priority == 6 (higher urgency), 100 >> 6 is 1. The fallback is
> bypassed and nr_to_scan remains 1.
> Would it be better for this fallback to apply a floor regardless of whether
> the shifted result is exactly zero?
Right, sashiko has been very concerned about this part many times;
it wasn't happy with any of the designs (unconditional floor, no
handling at all, or this version). But this is not a functional
regression. For tiny cgroups, the old code before this patch could
over-reclaim. Now the over-reclaiming is gone, but there could be a
trivial priority inversion, which I don't think is a practically
observable problem. But to be academically ideal, the following change
would be better:
nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
nr_to_scan >>= sc->priority;
/* make escalated reclaim progress and balance with slab shrinking */
if (sc->priority < DEF_PRIORITY)
nr_to_scan = max(nr_to_scan, min(evictable, SWAP_CLUSTER_MAX));
No inversion, and no under- or over-reclaim.
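To make the inversion concrete, here is a standalone toy model of the two fallbacks (illustrative only, not the kernel code; DEF_PRIORITY is 12 in the kernel and SWAP_CLUSTER_MAX is typically 32):

```c
#define SWAP_CLUSTER_MAX 32
#define DEF_PRIORITY 12

static long min_l(long a, long b) { return a < b ? a : b; }
static long max_l(long a, long b) { return a > b ? a : b; }

/* Fallback in patch 04: floor applies only when the shift hits exactly 0. */
static long scan_v6(long evictable, int priority)
{
	long nr_to_scan = evictable >> priority;

	if (!nr_to_scan && priority < DEF_PRIORITY)
		nr_to_scan = min_l(evictable, SWAP_CLUSTER_MAX);
	return nr_to_scan;
}

/* Proposed fallback: unconditional floor below DEF_PRIORITY. */
static long scan_fixed(long evictable, int priority)
{
	long nr_to_scan = evictable >> priority;

	if (priority < DEF_PRIORITY)
		nr_to_scan = max_l(nr_to_scan,
				   min_l(evictable, SWAP_CLUSTER_MAX));
	return nr_to_scan;
}
```

With evictable == 100, scan_v6() returns 32 at priority 7 but only 1 at the more urgent priority 6 (the inversion sashiko flagged), while scan_fixed() returns 32 in both cases and degrades smoothly toward scanning everything as priority drops.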
> [ ... ]
> > @@ -4985,31 +4977,44 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> > return true;
> > }
> >
> > +/*
> > + * For future optimizations:
> > + * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> > + * reclaim.
> > + */
> > static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > {
> > + bool need_rotate = false;
> > long nr_batch, nr_to_scan;
> > - unsigned long scanned = 0;
> > int swappiness = get_swappiness(lruvec, sc);
> > + struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >
> > - while (true) {
> > + nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
> > + while (nr_to_scan > 0) {
> > int delta;
> > + DEFINE_MAX_SEQ(lruvec);
> >
> > - nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
> > - if (nr_to_scan <= 0)
> > + if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
> > + need_rotate = true;
> > break;
> > + }
> > +
> > + if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
> > + if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> > + need_rotate = true;
> > + /* stop scanning as it's low on cold folios */
> > + break;
> > + }
> Can this bypass critical memcg protection checks and aging when nr_to_scan
> is 0?
> If get_nr_to_scan() returns 0 (which happens if evictable == 0 or when
> nr_to_scan is shifted to 0 at DEF_PRIORITY), the loop body is bypassed
> entirely.
> This means mem_cgroup_below_min() is never evaluated in those cases,
> preventing need_rotate from being set to true for protected memcgs.
> The reclaimer relies on this flag to rotate away from protected memcgs.
> Additionally, does this skip aging for empty offline memcgs
> (where evictable == 0), since they never enter the loop where
> should_run_aging() is called?
> The commit message states that offline memcgs should now be aged
> to enable further reclaim, but this bypass seems to prevent that.
Pretty much a false positive. Bypassing the loop body is not a problem
at DEF_PRIORITY, as there was no aging there before either; the rotation
for below-min memcgs is already handled by shrink_one before entering
the loop in the global reclaim path, and it is inconsequential in the
memcg reclaim path.
And for offline memcgs, nr_to_scan is never zero unless the memcg is
completely empty. If we have a lot of empty memcgs slowing down the
memcg iterator due to missing rotation, that's a different issue,
not related to or introduced by this patch.
* Re: [PATCH v6 09/14] mm/mglru: use the common routine for dirty/writeback reactivation
2026-04-23 17:43 ` [PATCH v6 09/14] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
@ 2026-04-24 19:05 ` Kairui Song
0 siblings, 0 replies; 29+ messages in thread
From: Kairui Song @ 2026-04-24 19:05 UTC (permalink / raw)
To: linux-mm
Cc: kasong, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang
Forwarding sashiko's review (the format doesn't seem right, maybe a sashiko bug):
> commit 1f5e2fcf6aa79d5fe07e850245e1292c205ab0fb
> Author: Kairui Song <kasong@tencent.com>
> mm/mglru: use the common routine for dirty/writeback reactivation
> This commit removes MGLRU-specific handling of dirty and writeback folios,
> instead falling back to the classical LRU routine to reactivate them. The
> intention is to improve scan efficiency and keep these folios away from the
> LRU tail until writeback completes.
> > And using the common routine also keeps the folio's referenced bits
> > (tier bits), which could improve metrics in the long term. Also no
> > more need to clean reclaim bit as the common routine will make use
> > of it.
> The commit message notes that this approach keeps the folio's referenced
> (tier) bits.
> When folios are processed by the common routine, they may be marked active
> and subsequently returned to MGLRU via move_folios_to_lru() ->
> lruvec_add_folio() -> lru_gen_add_folio().
> While lru_gen_add_folio() updates the generation, it does not clear
> LRU_REFS_FLAGS. Should the tier bits be explicitly cleared when a folio
> moves to a new generation so tier tracking starts fresh?
> If stale tier bits are retained across generation movements, could this
> artificially inflate access counts and disrupt the eviction algorithm?
I don't think that's an issue: writeback folios are moved to the tail
(oldest gen) once writeback is done. And in real tests this actually
seems to be an improvement.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 35e3352a53e3bc..74255efc4ad9dd 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [ ... ]
> > @@ -4628,21 +4627,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > return true;
> > }
> >
> > - dirty = folio_test_dirty(folio);
> > - writeback = folio_test_writeback(folio);
> > - if (type == LRU_GEN_FILE && dirty) {
> > - sc->nr.file_taken += delta;
> > - if (!writeback)
> > - sc->nr.unqueued_dirty += delta;
> > - }
> > -
> > - /* waiting for writeback */
> > - if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > - gen = folio_inc_gen(lruvec, folio, true);
> > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > - return true;
> > - }
> > -
> > return false;
> > }
> The removed folio_inc_gen() path previously cleared LRU_REFS_FLAGS when
> updating the generation. Can the new fallback approach cause regressions in
> tier tracking by skipping this clearing step?
Same as above: that's not an issue but expected, and even an
improvement in many cases.
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-24 12:55 ` Kairui Song
@ 2026-04-25 12:18 ` Barry Song
2026-04-25 13:29 ` Kairui Song
0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2026-04-25 12:18 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Fri, Apr 24, 2026 at 8:56 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Apr 24, 2026 at 7:58 PM Barry Song <baohua@kernel.org> wrote:
> > Could this be because get_nr_to_scan() was moved out of the loop by
> > [PATCH v6 04/14] mm/mglru: restructure the reclaim loop,
> > while in mainline it is re-evaluated in each iteration?
> >
> > Will take a look tomorrow or the day after.
> >
> > Thanks
> > Barry
> >
>
> Hi Barry,
>
> I ran your test script a few times, and strangely I can't reproduce
> it. Swappiness behaves similarly before and after this series. I
> directly checked out the mm-new commit of this series (4ce85c040e0a)
> and compared it to the mm-new commit right before this series
> (31a112f05f62). I also extended your script a bit to test more
> swappiness values:
Hi Kairui,
I reset the repository to commit 4ce85c040e0a using
git reset --hard, and I can still reproduce the
swappiness issue. My machine is:
barry@barry-desktop:~$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 39 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz
CPU family: 6
Model: 165
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 1
Stepping: 5
CPU max MHz: 2800.0000
CPU min MHz: 800.0000
BogoMIPS: 5599.85
swap is zRAM only:
barry@barry-desktop:~$ cat /proc/swaps
Filename Type Size Used Priority
/dev/zram0 partition 12582908 280940 5
The data is as below,
*** Executing round 1 ***
set swappiness to 35
real 1m51.699s
user 25m31.134s
sys 4m13.127s
pswpin: 1562949
pswpout: 4840525
pgpgin: 8751872
pgpgout: 19741097
swpout_zero: 1095783
swpin_zero: 18079
refault_file: 515292
refault_anon: 1580980
*** Executing round 2 ***
set swappiness to 70
real 1m51.603s
user 25m33.600s
sys 4m21.738s
pswpin: 1786413
pswpout: 5350804
pgpgin: 8833652
pgpgout: 21715596
swpout_zero: 1230981
swpin_zero: 21051
refault_file: 313099
refault_anon: 1807417
*** Executing round 3 ***
set swappiness to 105
real 1m50.315s
user 25m40.863s
sys 4m12.446s
pswpin: 1555289
pswpout: 4911737
pgpgin: 7597548
pgpgout: 19956948
swpout_zero: 1125969
swpin_zero: 17594
refault_file: 237475
refault_anon: 1572835
*** Executing round 4 ***
set swappiness to 140
real 1m50.992s
user 25m34.774s
sys 4m14.068s
pswpin: 1642575
pswpout: 5027730
pgpgin: 7937214
pgpgout: 20426400
swpout_zero: 1155712
swpin_zero: 20248
refault_file: 215237
refault_anon: 1662775
*** Executing round 5 ***
set swappiness to 175
real 1m50.207s
user 25m38.244s
sys 4m7.655s
pswpin: 1522633
pswpout: 4788104
pgpgin: 7307172
pgpgout: 19464984
swpout_zero: 1109281
swpin_zero: 18085
refault_file: 186203
refault_anon: 1540669
I disabled turbo for the test:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
My bisect shows that the commit causing the swappiness issue is:
[PATCH v6 05/14] mm/mglru: scan and count the exact number of folios
Before that, swappiness behaves as expected, and there is
also less swap-out/in activity (and a much shorter sys time).
*** Executing round 1 ***
set swappiness to 35
real 1m49.406s
user 25m28.458s
sys 3m41.098s
pswpin: 984605
pswpout: 3329809
pgpgin: 5985696
pgpgout: 13648560
swpout_zero: 780136
swpin_zero: 11379
refault_file: 367629
refault_anon: 995982
*** Executing round 2 ***
set swappiness to 70
real 1m48.577s
user 25m34.994s
sys 3m42.694s
pswpin: 985650
pswpout: 3450097
pgpgin: 5468828
pgpgout: 14116020
swpout_zero: 820143
swpin_zero: 11808
refault_file: 245353
refault_anon: 997410
*** Executing round 3 ***
set swappiness to 105
real 1m49.262s
user 25m34.871s
sys 3m41.633s
pswpin: 998178
pswpout: 3553741
pgpgin: 5328896
pgpgout: 14535068
swpout_zero: 840706
swpin_zero: 10393
refault_file: 205514
refault_anon: 1008525
*** Executing round 4 ***
set swappiness to 140
real 1m49.417s
user 25m35.395s
sys 3m47.169s
pswpin: 1138043
pswpout: 3756034
pgpgin: 5807584
pgpgout: 15345816
swpout_zero: 884539
swpin_zero: 12652
refault_file: 185767
refault_anon: 1150649
*** Executing round 5 ***
set swappiness to 175
real 1m49.654s
user 25m35.244s
sys 3m53.330s
pswpin: 1235427
pswpout: 4058085
pgpgin: 6108792
pgpgout: 16547764
swpout_zero: 974086
swpin_zero: 14280
refault_file: 170452
refault_anon: 1249705
It’s too late today; I’ll continue debugging tomorrow.
[...]
> Since you mentioned it's mm-new vs mainline, and you have reverted
> part of this series and the problem is still there. Could it be
> related to something else in mm-new? I'll keep testing more stress and
> workload to dig deeper too. Or maybe the swappiness behavior just
> changed slightly, so it may perform better or worse depending on
> timing and workload? Swappiness on MGLRU currently only works as a
> factor for calculating the refault and reclaim balance of anon / file
> so it may behave a bit unpredictably. There isn't a proportional
> calculation like active / inactive LRU. That's a problem too, and we
> might fix that later.
read_ctrl_pos() should also bias towards swappiness, as
both sp and pv gains are affected by it. Yes, we need to
fix the swappiness for mglru.
Best Regards
Barry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-25 12:18 ` Barry Song
@ 2026-04-25 13:29 ` Kairui Song
2026-04-25 20:57 ` Barry Song (Xiaomi)
0 siblings, 1 reply; 29+ messages in thread
From: Kairui Song @ 2026-04-25 13:29 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Sat, Apr 25, 2026 at 8:18 PM Barry Song <baohua@kernel.org> wrote:
>
> On Fri, Apr 24, 2026 at 8:56 PM Kairui Song <ryncsn@gmail.com> wrote:
> > Hi Barry,
> >
> > I ran your test script a few times, and strangely I can't reproduce
> > it. Swappiness behaves similarly before and after this series. I
> > directly checked out the mm-new commit of this series (4ce85c040e0a)
> > and compared it to the mm-new commit right before this series
> > (31a112f05f62). I also extended your script a bit to test more
> > swappiness:
>
> Hi Kairui,
> I reset the repository to commit 4ce85c040e0a using
> git reset --hard, and I can still reproduce the
> swappiness issue. My machine is:
>
> barry@barry-desktop:~$ lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Address sizes: 39 bits physical, 48 bits virtual
> Byte Order: Little Endian
> CPU(s): 20
> On-line CPU(s) list: 0-19
> Vendor ID: GenuineIntel
> Model name: Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz
> CPU family: 6
> Model: 165
> Thread(s) per core: 2
> Core(s) per socket: 10
> Socket(s): 1
> Stepping: 5
> CPU max MHz: 2800.0000
> CPU min MHz: 800.0000
> BogoMIPS: 5599.85
>
>
> swap is zRAM only:
> barry@barry-desktop:~$ cat /proc/swaps
> Filename Type Size Used Priority
> /dev/zram0 partition 12582908 280940 5
>
> The data is as below,
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m51.699s
> user 25m31.134s
> sys 4m13.127s
> pswpin: 1562949
> pswpout: 4840525
> pgpgin: 8751872
> pgpgout: 19741097
> swpout_zero: 1095783
> swpin_zero: 18079
> refault_file: 515292
> refault_anon: 1580980
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m51.603s
> user 25m33.600s
> sys 4m21.738s
> pswpin: 1786413
> pswpout: 5350804
> pgpgin: 8833652
> pgpgout: 21715596
> swpout_zero: 1230981
> swpin_zero: 21051
> refault_file: 313099
> refault_anon: 1807417
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m50.315s
> user 25m40.863s
> sys 4m12.446s
> pswpin: 1555289
> pswpout: 4911737
> pgpgin: 7597548
> pgpgout: 19956948
> swpout_zero: 1125969
> swpin_zero: 17594
> refault_file: 237475
> refault_anon: 1572835
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m50.992s
> user 25m34.774s
> sys 4m14.068s
> pswpin: 1642575
> pswpout: 5027730
> pgpgin: 7937214
> pgpgout: 20426400
> swpout_zero: 1155712
> swpin_zero: 20248
> refault_file: 215237
> refault_anon: 1662775
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m50.207s
> user 25m38.244s
> sys 4m7.655s
> pswpin: 1522633
> pswpout: 4788104
> pgpgin: 7307172
> pgpgout: 19464984
> swpout_zero: 1109281
> swpin_zero: 18085
> refault_file: 186203
> refault_anon: 1540669
Hmm, but reading the result you just posted, isn't swappiness actually
working as expected? Here is a summary of that data:
swappiness: 35 refault_file/anon: 515292 1580980
swappiness: 70 refault_file/anon: 313099 1807417
swappiness: 105 refault_file/anon: 237475 1572835
swappiness: 140 refault_file/anon: 215237 1662775
swappiness: 175 refault_file/anon: 186203 1540669
The higher the swappiness, the lower the file refaults.
> My bisect shows that the commit causing the swappiness issue is:
>
> [PATCH v6 05/14] mm/mglru: scan and count the exact number of folios
>
> Before that, swappiness behaves as expected, and there is
> also less swap-out/in activity (and much shorter sys time).
>
> *** Executing round 1 ***
> set swappiness to 35
>
> real 1m49.406s
> user 25m28.458s
> sys 3m41.098s
> pswpin: 984605
> pswpout: 3329809
> pgpgin: 5985696
> pgpgout: 13648560
> swpout_zero: 780136
> swpin_zero: 11379
> refault_file: 367629
> refault_anon: 995982
>
> *** Executing round 2 ***
> set swappiness to 70
>
> real 1m48.577s
> user 25m34.994s
> sys 3m42.694s
> pswpin: 985650
> pswpout: 3450097
> pgpgin: 5468828
> pgpgout: 14116020
> swpout_zero: 820143
> swpin_zero: 11808
> refault_file: 245353
> refault_anon: 997410
>
> *** Executing round 3 ***
> set swappiness to 105
>
> real 1m49.262s
> user 25m34.871s
> sys 3m41.633s
> pswpin: 998178
> pswpout: 3553741
> pgpgin: 5328896
> pgpgout: 14535068
> swpout_zero: 840706
> swpin_zero: 10393
> refault_file: 205514
> refault_anon: 1008525
>
> *** Executing round 4 ***
> set swappiness to 140
>
> real 1m49.417s
> user 25m35.395s
> sys 3m47.169s
> pswpin: 1138043
> pswpout: 3756034
> pgpgin: 5807584
> pgpgout: 15345816
> swpout_zero: 884539
> swpin_zero: 12652
> refault_file: 185767
> refault_anon: 1150649
>
> *** Executing round 5 ***
> set swappiness to 175
>
> real 1m49.654s
> user 25m35.244s
> sys 3m53.330s
> pswpin: 1235427
> pswpout: 4058085
> pgpgin: 6108792
> pgpgout: 16547764
> swpout_zero: 974086
> swpin_zero: 14280
> refault_file: 170452
> refault_anon: 1249705
>
> It’s too late today; I’ll continue debugging tomorrow.
Checking the data you just posted before that commit:
swappiness: 35 refault_file/anon: 367629 995982
swappiness: 70 refault_file/anon: 245353 997410
swappiness: 105 refault_file/anon: 205514 1008525
swappiness: 140 refault_file/anon: 185767 1150649
swappiness: 175 refault_file/anon: 170452 1249705
And after that commit is:
swappiness: 35 refault_file/anon: 515292 1580980
swappiness: 70 refault_file/anon: 313099 1807417
swappiness: 105 refault_file/anon: 237475 1572835
swappiness: 140 refault_file/anon: 215237 1662775
swappiness: 175 refault_file/anon: 186203 1540669
So I think the problem is not swappiness, but there are more anon
refaults after that commit.
Before:
pswpin: 998178
pswpout: 3553741
pgpgin: 5328896
pgpgout: 14535068
After:
pswpin: 1555289
pswpout: 4911737
pgpgin: 7597548
pgpgout: 19956948
I just ran a matrix of four kernels (mainline, mm-new HEAD, before this
series, after this series) X 3 different memcg configs (-j96 3G, -j48
2G, -j24 1G), and none of these showed any regression but all
improvement. That's really odd.
One possibility is that I removed the:
	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
		scanned = 0;
Which will make the reclaim loop go further and trigger aging.
Previously, if reclaim drained the LRU's cold gens, it might go
reclaim slab instead. So idle inodes would be dropped along with
their mappings, reclaiming more file pages, and we won't see any
refault data from that since the mapping itself is gone. Sys will be
lower too, as IO isn't counted as sys. Checking your data, despite
sys being higher, real is actually lower, which matches my guess.
Will the following patch help? I'm not sure if this is the problem,
but this added back that early abort. Personally, I don't think this
really makes much sense, as it's more like a workaround for other
issues, but if that helps we'd better keep it.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30a89224117b..c1e7c65ff3b9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4837,6 +4837,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int scanned, reclaimed;
 	int isolated = 0, type, type_scanned;
 	bool skip_retry = false;
+	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -4852,6 +4853,10 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (scanned)
 		try_to_inc_min_seq(lruvec, swappiness);
 
+	/* Out of cold folios, return 0 to abort early and also trigger shrinkers beside LRU */
+	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
+		scanned = 0;
+
 	lruvec_unlock_irq(lruvec);
 
 	if (list_empty(&list))
And this could cause early OOM; we have observed that several times
due to the early return. So maybe we'd better check sc->priority
too, or move this to should_abort_scan()?
Or perhaps we should just restore the behavior of never running aging
at DEF_PRIORITY, which seems better and safer, like below:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 30a89224117b..2080522ea924 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4917,14 +4917,14 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 {
 	DEFINE_MIN_SEQ(lruvec);
 
-	/* have to run aging, since eviction is not possible anymore */
-	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
-		return true;
-
 	/* try to avoid aging, do gentle reclaim at the default priority */
 	if (sc->priority == DEF_PRIORITY)
 		return false;
 
+	/* have to run aging, since eviction is not possible anymore */
+	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
+		return true;
+
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
I'll keep testing with more FS and setups. What FS are you using? This
might be related to FS-side reclaim as well, if it's caused by shrinker
balance.
> [...]
> > Since you mentioned it's mm-new vs mainline, and you have reverted
> > part of this series and the problem is still there. Could it be
> > related to something else in mm-new? I'll keep testing more stress and
> > workload to dig deeper too. Or maybe the swappiness behavior just
> > changed slightly, so it may perform better or worse depending on
> > timing and workload? Swappiness on MGLRU currently only works as a
> > factor for calculating the refault and reclaim balance of anon / file
> > so it may behave a bit unpredictably. There isn't a proportional
> > calculation like active / inactive LRU. That's a problem too, and we
> > might fix that later.
>
> read_ctrl_pos() should also bias towards swappiness, as
> both sp and pv gains are affected by it. Yes, we need to
> fix the swappiness for mglru.
Yes, read_ctrl_pos is the helper for calculating the refault and
reclaim balance that I was talking about.
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-25 13:29 ` Kairui Song
@ 2026-04-25 20:57 ` Barry Song (Xiaomi)
2026-04-26 6:59 ` Kairui Song
0 siblings, 1 reply; 29+ messages in thread
From: Barry Song (Xiaomi) @ 2026-04-25 20:57 UTC (permalink / raw)
To: ryncsn
Cc: akpm, axelrasmussen, baohua, baolin.wang, chenridong, chrisl,
david, hannes, kaleshsingh, laoar.shao, lenohou, linux-kernel,
linux-mm, ljs, mhocko, qi.zheng, shakeel.butt, stevensd, surenb,
vernon2gm, wangzicheng, weixugc, yuanchu, yuzhao, zhengqi.arch
On Sat, Apr 25, 2026 at 9:30 PM Kairui Song <ryncsn@gmail.com> wrote:
[...]
>
> I just ran a matrix of four kernels (mainline, mm-new HEAD, before this
> series, after this series) X 3 different memcg configs (-j96 3G, -j48
> 2G, -j24 1G), and none of these showed any regression but all
> improvement. That's really odd.
>
> One possibility is that I removed the:
>
> if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS >
> lrugen->max_seq)
> scanned = 0;
>
> Which will make the reclaim loop go further and trigger aging.
> Previously if reclaim drained the LRU's cold gens, it may go reclaim
> slab instead. So idle inodes will be dropped with the mapping and
> reclaim more file, and we won't see any refault data from that since
> the mapping itself is gone. Sys will be lower too, as IO isn't counted
> as sys. Checking your data, despite sys being higher, real is actually
> lower, which matches my guess.
>
> Will the following patch help? I'm not sure if this is the problem,
> but this added back that early abort, personally I don't think this
> really makes much sense as it's more like a workaround for other
> issues, but if that helps we'd better keep it.
Hi Kairui, after five hours of sleep I’m feeling much more
refreshed and should have identified the issue. I think the
problem will be clear once you look at the patch below.
Feel free to include it in the new version if you find it
helpful.
From e3a0b50dc53a3a5f2ef1adfb73111336e3c2d08b Mon Sep 17 00:00:00 2001
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
Date: Sun, 26 Apr 2026 08:34:21 +1200
Subject: [PATCH] mm/mglru: Avoid falling back to another type when
scan_folios() leaves no remaining pages
When remaining reaches 0 in scan_folios(), we quickly fall back
to the other type in isolate_folios(). This is incorrect, as the
current type may still have sufficient folios. Falling back can
undermine the positive_ctrl_err() result from get_type_to_scan(),
which is derived from swappiness.
A simple fix is to continue scanning this type for another round.
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
mm/vmscan.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ef45f3a4aa38..169fbbe17c7b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4788,8 +4788,13 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			*isolate_scanned = scanned;
 			break;
 		}
-
-		type = !type;
+		/*
+		 * If scanned > 0 and isolated == 0, avoid falling back to the
+		 * other type, as this type remains sufficient. Falling back
+		 * too readily can disrupt the positive_ctrl_err() bias.
+		 */
+		if (!scanned)
+			type = !type;
 	}
 
 	return total_scanned;
--
2.34.1
Thanks
Barry
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-25 20:57 ` Barry Song (Xiaomi)
@ 2026-04-26 6:59 ` Kairui Song
2026-04-26 8:34 ` Barry Song
0 siblings, 1 reply; 29+ messages in thread
From: Kairui Song @ 2026-04-26 6:59 UTC (permalink / raw)
To: Barry Song (Xiaomi)
Cc: akpm, axelrasmussen, baolin.wang, chenridong, chrisl, david,
hannes, kaleshsingh, laoar.shao, lenohou, linux-kernel, linux-mm,
ljs, mhocko, qi.zheng, shakeel.butt, stevensd, surenb, vernon2gm,
wangzicheng, weixugc, yuanchu, yuzhao, zhengqi.arch
On Sun, Apr 26, 2026 at 4:58 AM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
>
> On Sat, Apr 25, 2026 at 9:30 PM Kairui Song <ryncsn@gmail.com> wrote:
> [...]
> >
> > I just ran a matrix of four kernels (mainline, mm-new HEAD, before this
> > series, after this series) X 3 different memcg configs (-j96 3G, -j48
> > 2G, -j24 1G), and none of these showed any regression but all
> > improvement. That's really odd.
> >
> > One possibility is that I removed the:
> >
> > if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS >
> > lrugen->max_seq)
> > scanned = 0;
> >
> > Which will make the reclaim loop go further and trigger aging.
> > Previously if reclaim drained the LRU's cold gens, it may go reclaim
> > slab instead. So idle inodes will be dropped with the mapping and
> > reclaim more file, and we won't see any refault data from that since
> > the mapping itself is gone. Sys will be lower too, as IO isn't counted
> > as sys. Checking your data, despite sys being higher, real is actually
> > lower, which matches my guess.
> >
> > Will the following patch help? I'm not sure if this is the problem,
> > but this added back that early abort, personally I don't think this
> > really makes much sense as it's more like a workaround for other
> > issues, but if that helps we'd better keep it.
>
> Hi Kairui, after five hours of sleep I’m feeling much more
> refreshed and should have identified the issue. I think the
> problem will be clear once you look at the patch below.
> Feel free to include it in the new version if you find it
> helpful.
>
> From e3a0b50dc53a3a5f2ef1adfb73111336e3c2d08b Mon Sep 17 00:00:00 2001
> From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> Date: Sun, 26 Apr 2026 08:34:21 +1200
> Subject: [PATCH] mm/mglru: Avoid falling back to another type when
> scan_folios() leaves no remaining pages
>
> When remaining reaches 0 in scan_folios(), we quickly fall back
> to the other type in isolate_folios(). This is incorrect, as the
> current type may still have sufficient folios. Falling back can
> undermine the positive_ctrl_err() result from get_type_to_scan(),
> which is derived from swappiness.
>
> A simple fix is to continue scanning this type for another round.
>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
> mm/vmscan.c | 9 +++++++--
> 1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ef45f3a4aa38..169fbbe17c7b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4788,8 +4788,13 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> *isolate_scanned = scanned;
> break;
> }
> -
> - type = !type;
> + /*
> + * If scanned > 0 and isolated == 0, avoid falling back to the
> + * other type, as this type remains sufficient. Falling back
> + * too readily can disrupt the positive_ctrl_err() bias.
> + */
> + if (!scanned)
> + type = !type;
> }
Thanks so much for catching this! The fix makes perfect sense; it
restores the pre-patch behavior. I was too aggressive with the
fallback; positive_ctrl_err() should win whenever the preferred type
is actually productive.
Would you prefer I squash this into the original patch (with a
Co-developed-by for you?), or keep it as a standalone fix on top? I'm
fine either way.
One related thought for a follow-up: scanned == 0 from scan_folios()
is essentially driven by get_nr_gens(lruvec, type) == MIN_NR_GENS. So
when the type get_type_to_scan() picked is the one that's
gen-exhausted, maybe the right response is also to age that type
rather than fall back. Right now should_run_aging() only looks at the
combined evictable_min_seq, so the ctrl_err-preferred type can
silently stay starved while we evict from the other one. That could be
an old issue we can fix later.
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-26 6:59 ` Kairui Song
@ 2026-04-26 8:34 ` Barry Song
2026-04-26 14:04 ` Kairui Song
0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2026-04-26 8:34 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, axelrasmussen, baolin.wang, chenridong, chrisl, david,
hannes, kaleshsingh, laoar.shao, lenohou, linux-kernel, linux-mm,
ljs, mhocko, qi.zheng, shakeel.butt, stevensd, surenb, vernon2gm,
wangzicheng, weixugc, yuanchu, yuzhao, zhengqi.arch
On Sun, Apr 26, 2026 at 2:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sun, Apr 26, 2026 at 4:58 AM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
> >
> > On Sat, Apr 25, 2026 at 9:30 PM Kairui Song <ryncsn@gmail.com> wrote:
> > [...]
> > >
> > > I just ran a matrix of four kernels (mainline, mm-new HEAD, before this
> > > series, after this series) X 3 different memcg configs (-j96 3G, -j48
> > > 2G, -j24 1G), and none of these showed any regression but all
> > > improvement. That's really odd.
> > >
> > > One possibility is that I removed the:
> > >
> > > if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS >
> > > lrugen->max_seq)
> > > scanned = 0;
> > >
> > > Which will make the reclaim loop go further and trigger aging.
> > > Previously if reclaim drained the LRU's cold gens, it may go reclaim
> > > slab instead. So idle inodes will be dropped with the mapping and
> > > reclaim more file, and we won't see any refault data from that since
> > > the mapping itself is gone. Sys will be lower too, as IO isn't counted
> > > as sys. Checking your data, despite sys being higher, real is actually
> > > lower, which matches my guess.
> > >
> > > Will the following patch help? I'm not sure if this is the problem,
> > > but this added back that early abort, personally I don't think this
> > > really makes much sense as it's more like a workaround for other
> > > issues, but if that helps we'd better keep it.
> >
> > Hi Kairui, after five hours of sleep I’m feeling much more
> > refreshed and should have identified the issue. I think the
> > problem will be clear once you look at the patch below.
> > Feel free to include it in the new version if you find it
> > helpful.
> >
> > From e3a0b50dc53a3a5f2ef1adfb73111336e3c2d08b Mon Sep 17 00:00:00 2001
> > From: "Barry Song (Xiaomi)" <baohua@kernel.org>
> > Date: Sun, 26 Apr 2026 08:34:21 +1200
> > Subject: [PATCH] mm/mglru: Avoid falling back to another type when
> > scan_folios() leaves no remaining pages
> >
> > When remaining reaches 0 in scan_folios(), we quickly fall back
> > to the other type in isolate_folios(). This is incorrect, as the
> > current type may still have sufficient folios. Falling back can
> > undermine the positive_ctrl_err() result from get_type_to_scan(),
> > which is derived from swappiness.
> >
> > A simple fix is to continue scanning this type for another round.
> >
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
> > mm/vmscan.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ef45f3a4aa38..169fbbe17c7b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4788,8 +4788,13 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > *isolate_scanned = scanned;
> > break;
> > }
> > -
> > - type = !type;
> > + /*
> > + * If scanned > 0 and isolated == 0, avoid falling back to the
> > + * other type, as this type remains sufficient. Falling back
> > + * too readily can disrupt the positive_ctrl_err() bias.
> > + */
> > + if (!scanned)
> > + type = !type;
> > }
>
> Thanks so much for catching this! The fix makes perfect sense, it
> restores the pre-patch behavior. I was too aggressive with the
> fallback; positive_ctrl_err() should win whenever the preferred type
> is actually productive.
>
> Would you prefer I squash this into the original patch (with a
> Co-developed-by for you?), or keep it as a standalone fix on top? I'm
> fine either way.
Either approach is fine—use whichever works better for organizing v7.
My boss would definitely be happier with the latter, as it would better
support and encourage more active engagement in upstream development.
It’s rare these days to find a boss like this who is open-minded and
insightful about upstream contributions, especially with everyone
moving toward AI. :-)
>
> One related thought for a follow-up: scanned == 0 from scan_folios()
> is essentially driven by get_nr_gens(lruvec, type) == MIN_NR_GENS. So
> when the type get_type_to_scan() picked is the one that's
> gen-exhausted, maybe the right response is also to age that type
> rather than fall back. Right now should_run_aging() only looks at the
> combined evictable_min_seq, so the ctrl_err preferred type can
> silently stay starved while we evict from the other one, that could be
> an old issue we can fix later.
Yep, I’ve been thinking about the same thing for quite a
few days. This might also help address swappiness. On the
other hand, it could lead to more aging, but it seems like
a necessary trade-off if we want a simple solution that
doesn’t require separate max_seq for file and anon to fix
swappiness for mglru.
I’m also trying another approach. For example, if the
number of folios in the old generation is too low relative
to swappiness, should_run_aging() could return true—similar
to Yu’s earliest patch as below, but with a swappiness bias.
+ /*
+ * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1)
+ * of the total number of pages for each generation. A reasonable range
+ * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The
+ * aging cares about the upper bound of hot pages, while the eviction
+ * cares about the lower bound of cold pages.
+ */
+ if (young * MIN_NR_GENS > total)
+ return true;
+ if (old * (MIN_NR_GENS + 2) < total)
+ return true;
Hopefully, the above can resolve the problem before we have to
resort to separate max_seq, which would break the shared time
axis between file and anon.
Maybe I will send an RFC before LSF/MM/BPF if we have enough
time.
Thanks
Barry
* Re: [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-26 8:34 ` Barry Song
@ 2026-04-26 14:04 ` Kairui Song
0 siblings, 0 replies; 29+ messages in thread
From: Kairui Song @ 2026-04-26 14:04 UTC (permalink / raw)
To: Barry Song
Cc: akpm, axelrasmussen, baolin.wang, chenridong, chrisl, david,
hannes, kaleshsingh, laoar.shao, lenohou, linux-kernel, linux-mm,
ljs, mhocko, qi.zheng, shakeel.butt, stevensd, surenb, vernon2gm,
wangzicheng, weixugc, yuanchu, yuzhao, zhengqi.arch
On Sun, Apr 26, 2026 at 4:35 PM Barry Song <baohua@kernel.org> wrote:
>
>
> Either approach is fine—use whichever works better for organizing v7.
> My boss would definitely be happier with the latter, as it would better
> support and encourage more active engagement in upstream development.
> It’s rare these days to find a boss like this who is open-minded and
> insightful about upstream contributions, especially with everyone
> moving toward AI. :-)
Will attach your patch as a standalone one right next to patch 5 then;
the effect is not easy to hit, I think, and it shouldn't break any
bisect. I just ran a set of tests in case it had any unexpected
behavior; it seems identical to what we have in V5 with all my test
cases:
FIO:
Before: 8968.76 MB/s
V5 (this series): 9005.89 MB/s
V6 (with your fix and nitpick from sashiko): 8995.63 MB/s
Build kernel:
Before: 2873.52
V5 (this series): 2816.44s
V6 (with your fix and nitpick from sashiko): 2811.88s
MySQL:
Before: 17303.414444 tps
V5 (this series): 17310.528750 tps
V6 (with your fix and nitpick from sashiko): 17291.500000 tps
MongoDB with YCSB workload b:
Before: 59227.33 ops/s
V6: 79928.53 ops/s
Still matches the V5 cover letter and everything looks great; it also
seems alright on my Android phone. Will send V6 later.
>
> >
> > One related thought for a follow-up: scanned == 0 from scan_folios()
> > is essentially driven by get_nr_gens(lruvec, type) == MIN_NR_GENS. So
> > when the type get_type_to_scan() picked is the one that's
> > gen-exhausted, maybe the right response is also to age that type
> > rather than fall back. Right now should_run_aging() only looks at the
> > combined evictable_min_seq, so the ctrl_err preferred type can
> > silently stay starved while we evict from the other one, that could be
> > an old issue we can fix later.
>
> Yep, I’ve been thinking about the same thing for quite a
> few days. This might also help address swappiness. On the
> other hand, it could lead to more aging, but it seems like
> a necessary trade-off if we want a simple solution that
> doesn’t require separate max_seq for file and anon to fix
> swappiness for mglru.
>
> I’m also trying another approach. For example, if the
> number of folios in the old generation is too low relative
> to swappiness, should_run_aging() could return true—similar
> to Yu’s earliest patch as below, but with a swappiness bias.
>
> + /*
> + * It's also ideal to spread pages out evenly, i.e., 1/(MIN_NR_GENS+1)
> + * of the total number of pages for each generation. A reasonable range
> + * for this average portion is [1/MIN_NR_GENS, 1/(MIN_NR_GENS+2)]. The
> + * aging cares about the upper bound of hot pages, while the eviction
> + * cares about the lower bound of cold pages.
> + */
> + if (young * MIN_NR_GENS > total)
> + return true;
> + if (old * (MIN_NR_GENS + 2) < total)
> + return true;
>
> Hopefully, the above can resolve the problem before we have to
> resort to separate max_seq, which would break the shared time
> axis between file and anon.
>
> Maybe I will send an RFC before LSF/MM/BPF if we have enough
> time.
Good idea, that can also be discussed there.
end of thread, other threads: [~2026-04-26 14:04 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-23 17:43 [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 02/14] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 03/14] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 04/14] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
2026-04-24 17:04 ` Kairui Song
2026-04-23 17:43 ` [PATCH v6 05/14] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 06/14] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 07/14] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 08/14] mm/mglru: remove redundant swap constrained check upon isolation Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 09/14] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
2026-04-24 19:05 ` Kairui Song
2026-04-23 17:43 ` [PATCH v6 10/14] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 11/14] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 12/14] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 13/14] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
2026-04-23 17:43 ` [PATCH v6 14/14] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
2026-04-23 18:14 ` [PATCH v6 00/14] mm/mglru: improve reclaim loop and dirty folio handling Andrew Morton
2026-04-24 10:32 ` Barry Song
2026-04-24 11:58 ` Barry Song
2026-04-24 12:55 ` Kairui Song
2026-04-25 12:18 ` Barry Song
2026-04-25 13:29 ` Kairui Song
2026-04-25 20:57 ` Barry Song (Xiaomi)
2026-04-26 6:59 ` Kairui Song
2026-04-26 8:34 ` Barry Song
2026-04-26 14:04 ` Kairui Song
2026-04-24 13:36 ` Andrew Morton
2026-04-24 14:16 ` Kairui Song