* [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
@ 2026-04-27 18:06 Kairui Song via B4 Relay
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, we see up to a ~30% throughput
increase in some workloads, such as MongoDB with YCSB, and a large
decrease in file refaults, with no swap involved. Other common benchmarks
show no regression, LOC is reduced, and unexpected OOMs are less frequent.
Some of the problems were found in our production environment, and
others were mostly exposed while stress testing during the development
of the LSF/MM/BPF topic on improving MGLRU [1]. This series cleans up
the code base and fixes several performance issues, preparing for
further work.
MGLRU's reclaim loop is a bit complex, and hence these problems are
related to each other. Aging, scan count calculation, and the reclaim
loop are coupled together, and the dirty folio handling logic is quite
different from the classical LRU's, making the reclaim loop hard to
follow and the dirty flush ineffective.
This series cleans up and improves these areas by using a scan budget:
the number of folios to scan is calculated once at the beginning of the
loop, and aging is decoupled from the reclaim calculation helpers. The
dirty flush logic is then moved inside the reclaim loop so it can kick in
more effectively (see the sketch below).
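For reference, after this series the core loop of try_to_shrink_lruvec()
roughly takes the following shape (a simplified sketch combining patches
04, 07, and 08; rotation and root-reclaim handling are omitted, see the
diffs for the exact code):

    nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
    while (nr_to_scan > 0) {
            /* the budget is fixed up front; aging no longer resizes it */
            if (should_run_aging(lruvec, max_seq, sc, swappiness))
                    try_to_inc_max_seq(lruvec, max_seq, swappiness, false);
            delta = evict_folios(min(nr_to_scan, MIN_LRU_BATCH),
                                 lruvec, sc, swappiness);
            if (!delta || should_abort_scan(lruvec, sc))
                    break;
            nr_to_scan -= delta;
            cond_resched();
    }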
Test results: All tests were done on a 48c96t machine with 2 NUMA nodes
and 128G of memory, using NVMe as storage.
MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
threads:32), which does 95% read and 5% update to generate mixed read
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
the WiredTiger cache size is set to 4.5G, with NVMe as storage.
No swap is used.
Before:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
pgpgout 5413332
workingset_refault_anon 0
workingset_refault_file 34522071
After:
Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
pgpgin 111093923 (-30.3%, lower is better)
pgpgout 5437456
workingset_refault_anon 0
workingset_refault_file 19566366 (-43.3%, lower is better)
We can see a significant performance improvement after this series.
The test was done on NVMe, and the performance gap would be even larger
for slower devices, such as HDD or network storage. We observed over
100% gain for some workloads with slow IO.
Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], tested on an x86_64 machine with 2 NUMA
nodes and 128G of memory, using 256G of ZRAM as swap and spawning 32
memcgs and 64 workers:
Before:
Total requests: 79915
Per-worker 95% CI (mean): [1233.9, 1263.5]
Per-worker stdev: 59.2
Jain's fairness: 0.997795 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 26859 33.61% 33.61%
[1,2)s 7818 9.78% 43.39%
[2,4)s 5532 6.92% 50.31%
[4,8)s 39706 49.69% 100.00%
After:
Total requests: 81382
Per-worker 95% CI (mean): [1241.9, 1301.3]
Per-worker stdev: 118.8
Jain's fairness: 0.991480 (1.0 = perfectly fair)
Latency:
Bucket Count Pct Cumul
[0,1)s 26696 32.80% 32.80%
[1,2)s 8745 10.75% 43.55%
[2,4)s 6865 8.44% 51.98%
[4,8)s 39076 48.02% 100.00%
Reclaim is still fair and effective, and the total number of requests is
slightly better.
OOM issue with aging and throttling
===================================
The throttling OOM issue can be easily reproduced using dd and a cgroup
limit, as demonstrated and fixed by a later patch in this series.
The aging OOM is a bit trickier; a specific reproducer can be used to
simulate what we encountered in the production environment [4]:
It spawns multiple workers that keep reading the given file using mmap,
pausing for 120ms after each file read batch. It also spawns another
set of workers that keep allocating and freeing a given amount of
anonymous memory. The total memory size exceeds the memory limit
(e.g. 14G anon + 8G file, which is 22G vs a 16G memcg limit). A minimal
sketch of both worker types follows.
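The two worker types are roughly the following (simplified from the
reproducer at [4]; error handling is omitted, and file_size/anon_size
are illustrative placeholders):

    /* file reader: touch every page of the mapping, then pause 120ms */
    char *map = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
    for (;;) {
            volatile char c;
            for (size_t off = 0; off < file_size; off += 4096)
                    c = map[off];
            usleep(120 * 1000);
    }

    /* anon worker: repeatedly allocate, touch, and free anonymous memory */
    for (;;) {
            char *buf = mmap(NULL, anon_size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            memset(buf, 1, anon_size);
            munmap(buf, anon_size);
    }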
- MGLRU disabled:
Finished 128 iterations.
- MGLRU enabled:
OOM with the following info after about 10-20 iterations:
[ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
[ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 62.640823] Memory cgroup stats for /demo:
[ 62.641017] anon 10604879872
[ 62.641941] file 6574858240
OOM occurs despite there still being evictable file folios.
- MGLRU enabled after this series:
Finished 128 iterations.
Worth noting, another OOM-related issue was reported against V1 of
this series; it has been retested and looks OK now [5].
MySQL:
======
Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
--tables=48 --table-size=2000000 --threads=48 --time=600 run
Before: 17303.41 tps
After this series: 17291.50 tps
Only noise-level changes; no regression.
FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
with 6 runs per test:
fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
--name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
--rw=randread --norandommap --time_based \
--ramp_time=1m --runtime=5m --group_reporting
Before: 8968.76 MB/s
After this series: 8995.63 MB/s
Again, only noise-level changes: no regression, or slightly better.
Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, with 12 runs per test.
Before: 2873.52s
After this series: 2811.88s
Again, only noise-level changes: no regression, or very slightly better.
Android:
========
Xinyu reported a performance gain on Android with this series, too. The
test consisted of cold-starting multiple applications sequentially under
moderate system load [6].
Before:
Launch Time Summary (all apps, all runs)
Mean 868.0ms
P50 888.0ms
P90 1274.2ms
P95 1399.0ms
After:
Launch Time Summary (all apps, all runs)
Mean 850.5ms (-2.07%)
P50 861.5ms (-3.04%)
P90 1179.0ms (-8.05%)
P95 1228.0ms (-12.2%)
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
Link: https://lore.kernel.org/linux-mm/20260417025123.2971253-1-wxy2009nrrr@163.com/ [6]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v7:
- Fix swappiness not being effective, with a standalone fix patch
from Barry Song. It's OK as a standalone fix since that is not a
major bug but an unexpected behavior change, and it shouldn't affect
bisecting. I slightly adjusted the commit message, as the subject was too
long and was getting truncated in mail:
https://lore.kernel.org/linux-mm/20260425205759.1701-1-baohua@kernel.org/
- Remove the min limit for calculating nr_to_scan:
https://lore.kernel.org/linux-mm/aet1hd9DfRH4aSOO@KASONG-MC4/
Instead just revert to V1:
https://sashiko.dev/#/message/20260318-mglru-reclaim-v1-3-2c46f9eb0508%40tencent.com
Everyone was fine with that; the min limit in later versions was
introduced to address sashiko's review of V1, but thinking about it again,
that behavior is actually not a bug and could even be beneficial. The min
check doesn't always make sense, and no practical issue has been observed.
- Retests still look very good in every case.
- Link to v6: https://patch.msgid.link/20260424-mglru-reclaim-v6-0-a57622d770c3@tencent.com
Changes in v6:
- Avoid potential over-rotation of tiny cgroups (<16M):
https://lore.kernel.org/linux-mm/CAMgjq7ArnmmoHOGRt6Wc8hu7tjx_t583-UVzJK+HOHgjjetQ9g@mail.gmail.com/
- Avoid potentially skewed stat counter:
https://lore.kernel.org/linux-mm/CAMgjq7DCn8p_yMMhiejFjX6sdybZKYOw8qJbq=+OCsZ=AfJnFA@mail.gmail.com/
- Update a few comments and variable names as suggested by Barry Song.
- Tested over several days, including on my Android phone; everything
still matches the cover letter description. Also added test results from Xinyu.
- Link to v5: https://patch.msgid.link/20260413-mglru-reclaim-v5-0-8eaeacbddc44@tencent.com
Changes in v5:
- Add back a more moderate minimal batch limit for each reclaim loop:
https://lore.kernel.org/linux-mm/adYP81AhpNf0znp3@KASONG-MC4/
- Collect Reviewed-by tags.
- Link to v4: https://patch.msgid.link/20260407-mglru-reclaim-v4-0-98cf3dc69519@tencent.com
Changes in v4:
- Remove the minimal scan batch limit, and add rotation for
unevictable memcgs, as reported by sashiko:
https://lore.kernel.org/linux-mm/ac8xVN82LBLDZpIO@KASONG-MC4/
- Slightly improve a few commit messages.
- Reran the tests; the results seem identical to before, so the data is unchanged.
- Collect Reviewed-by tags.
- Link to v3: https://patch.msgid.link/20260403-mglru-reclaim-v3-0-a285efd6ff91@tencent.com
Changes in v3:
- Don't force scan at least SWAP_CLUSTER_MAX pages for each reclaim
loop. If the LRU is too small, adjust it accordingly. Now the
multi-cgroup scan balance looked even better for tiny cgroups:
https://lore.kernel.org/linux-mm/aciejkdIHyXPNS9Y@KASONG-MC4/
- Add one patch to remove the swap constraint check in isolate_folio. In
theory, it's fine, and both stress test and performance test didn't
show any issue:
https://lore.kernel.org/linux-mm/CAMgjq7C8TCsK99p85i3QzGCwgkXscTfFB6XCUTWQOcuqwHQa2Q@mail.gmail.com/
- I reran most tests; all seem identical, so most data is kept.
Intermediate test results are dropped. I ran tests on most patches
individually and found no problems, but the series is getting too
long, and posting all of them would make it harder to read, and is unnecessary.
- Split the previous patch 8 into two patches as suggested [ Shakeel Butt ];
some test results were also collected to support the design:
https://lore.kernel.org/linux-mm/ac44BVOvOm8lhVvj@KASONG-MC4/#t
I kept Axel's Reviewed-by since the code is identical.
- Call try_to_inc_min_seq twice to avoid stale empty gen and drop
its return argument [ Baolin Wang ]
- Move a few lines of code between patches to where they fit better;
the final result is identical [ Baolin Wang ].
- Collect Tested-by and update the test setup [ Leno Hou ].
- Collect Reviewed-by tags.
- Update a few commit messages [ Shakeel Butt ].
- Link to v2: https://patch.msgid.link/20260329-mglru-reclaim-v2-0-b53a3678513c@tencent.com
Changes in v2:
- Rebase on top of mm-new which includes Cgroup V1 fix from
[ Baolin Wang ].
- Added dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
review suggested that we shouldn't leave the counter and reclaim
feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number of SWAP_CLUSTER_MAX limit in patch
"restructure the reclaim loop", the change is trivial but might
help avoid livelock for tiny cgroups.
- Redo the tests; most results are basically identical to before, but
redone just in case, since the series also solves the throttling issue
now, as discussed with reports from CachyOS.
- Add a separate patch for variable renaming as suggested by [ Barry
Song ]. No feature change.
- Improve several comments and code issues [ Axel Rasmussen ].
- Remove a no-longer-needed variable [ Axel Rasmussen ].
- Collect Reviewed-by tags.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com
---
Barry Song (Xiaomi) (1):
mm/mglru: avoid reclaim type fall back when isolation makes no progress
Kairui Song (14):
mm/mglru: consolidate common code for retrieving evictable size
mm/mglru: rename variables related to aging and rotation
mm/mglru: relocate the LRU scan batch limit to callers
mm/mglru: restructure the reclaim loop
mm/mglru: scan and count the exact number of folios
mm/mglru: use a smaller batch for reclaim
mm/mglru: don't abort scan immediately right after aging
mm/mglru: remove redundant swap constrained check upon isolation
mm/mglru: use the common routine for dirty/writeback reactivation
mm/mglru: simplify and improve dirty writeback handling
mm/mglru: remove no longer used reclaim argument for folio protection
mm/vmscan: remove sc->file_taken
mm/vmscan: remove sc->unqueued_dirty
mm/vmscan: unify writeback reclaim statistic and throttling
mm/vmscan.c | 341 ++++++++++++++++++++++++++----------------------------------
1 file changed, 149 insertions(+), 192 deletions(-)
---
base-commit: 22f2053a471467342c51eb2e4ffd7daf601118d2
change-id: 20260314-mglru-reclaim-1c9d45ac57f6
Best regards,
--
Kairui Song <kasong@tencent.com>
* [PATCH v7 01/15] mm/mglru: consolidate common code for retrieving evictable size
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Merge commonly used code for counting evictable folios in a lruvec.
No behavior change.
Acked-by: Yuanchu Xie <yuanchu@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 36 ++++++++++++++----------------------
1 file changed, 14 insertions(+), 22 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2d89ed69d22..b80fbc4fc285 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4084,27 +4084,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
}
-static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+static unsigned long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
{
int gen, type, zone;
- unsigned long total = 0;
- int swappiness = get_swappiness(lruvec, sc);
+ unsigned long seq, total = 0;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MAX_SEQ(lruvec);
DEFINE_MIN_SEQ(lruvec);
for_each_evictable_type(type, swappiness) {
- unsigned long seq;
-
for (seq = min_seq[type]; seq <= max_seq; seq++) {
gen = lru_gen_from_seq(seq);
-
for (zone = 0; zone < MAX_NR_ZONES; zone++)
total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
}
}
+ return total;
+}
+
+static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+{
+ unsigned long total;
+ int swappiness = get_swappiness(lruvec, sc);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+ total = lruvec_evictable_size(lruvec, swappiness);
+
/* whether the size is big enough to be helpful */
return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
}
@@ -4909,9 +4915,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
int swappiness, unsigned long *nr_to_scan)
{
- int gen, type, zone;
- unsigned long size = 0;
- struct lru_gen_folio *lrugen = &lruvec->lrugen;
DEFINE_MIN_SEQ(lruvec);
*nr_to_scan = 0;
@@ -4919,18 +4922,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
return true;
- for_each_evictable_type(type, swappiness) {
- unsigned long seq;
-
- for (seq = min_seq[type]; seq <= max_seq; seq++) {
- gen = lru_gen_from_seq(seq);
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++)
- size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
- }
- }
-
- *nr_to_scan = size;
+ *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
/* better to run aging even though eviction is still possible */
return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
}
--
2.54.0
* [PATCH v7 02/15] mm/mglru: rename variables related to aging and rotation
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
The current variable names aren't helpful. Make them more meaningful.
Only naming change, no behavior change.
Suggested-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b80fbc4fc285..7f011ff4c478 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4934,7 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
*/
static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
{
- bool success;
+ bool need_aging;
unsigned long nr_to_scan;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MAX_SEQ(lruvec);
@@ -4942,7 +4942,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
return -1;
- success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+ need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
/* try to scrape all its memory if this memcg was deleted */
if (nr_to_scan && !mem_cgroup_online(memcg))
@@ -4951,7 +4951,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
/* try to get away with not aging at the default priority */
- if (!success || sc->priority == DEF_PRIORITY)
+ if (!need_aging || sc->priority == DEF_PRIORITY)
return nr_to_scan >> sc->priority;
/* stop scanning this lruvec as it's low on cold folios */
@@ -5040,7 +5040,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
- bool success;
+ bool need_rotate;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5058,7 +5058,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
memcg_memory_event(memcg, MEMCG_LOW);
}
- success = try_to_shrink_lruvec(lruvec, sc);
+ need_rotate = try_to_shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
@@ -5068,10 +5068,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
flush_reclaim_state(sc);
- if (success && mem_cgroup_online(memcg))
+ if (need_rotate && mem_cgroup_online(memcg))
return MEMCG_LRU_YOUNG;
- if (!success && lruvec_is_sizable(lruvec, sc))
+ if (!need_rotate && lruvec_is_sizable(lruvec, sc))
return 0;
/* one retry if offlined or too small */
--
2.54.0
* [PATCH v7 03/15] mm/mglru: relocate the LRU scan batch limit to callers
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Like the active / inactive LRU, MGLRU isolates and scans folios in batches.
The batch splitting is hidden deep in the helper, which makes the code
harder to follow. The helper's arguments are also confusing, since callers
usually request more folios than the batch size, so the helper almost
never processes the full requested amount.
Move the batch splitting into the top-level loop to make it cleaner; there
should be no behavior change.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f011ff4c478..a011733a6392 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4695,10 +4695,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
int scanned = 0;
int isolated = 0;
int skipped = 0;
- int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
- int remaining = scan_batch;
+ unsigned long remaining = nr_to_scan;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
VM_WARN_ON_ONCE(!list_empty(list));
if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
@@ -4751,7 +4751,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
mod_lruvec_state(lruvec, item, isolated);
mod_lruvec_state(lruvec, PGREFILL, sorted);
mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
- trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
+ trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
if (type == LRU_GEN_FILE)
@@ -4987,7 +4987,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
- long nr_to_scan;
+ long nr_batch, nr_to_scan;
unsigned long scanned = 0;
int swappiness = get_swappiness(lruvec, sc);
@@ -4998,7 +4998,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (nr_to_scan <= 0)
break;
- delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
+ nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+ delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
@@ -5623,6 +5624,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
int swappiness, unsigned long nr_to_reclaim)
{
+ int nr_batch;
DEFINE_MAX_SEQ(lruvec);
if (seq + MIN_NR_GENS > max_seq)
@@ -5639,8 +5641,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
if (sc->nr_reclaimed >= nr_to_reclaim)
return 0;
- if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
- swappiness))
+ nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
+ if (!evict_folios(nr_batch, lruvec, sc, swappiness))
return 0;
cond_resched();
--
2.54.0
* [PATCH v7 04/15] mm/mglru: restructure the reclaim loop
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
The current loop recalculates the scan count on each iteration. The
number of folios to scan is based on the LRU length, with some unclear
behaviors: e.g., the scan count is only shifted by the reclaim priority
when aging is not needed or when at the default priority, and the count
calculation is coupled with aging and rotation.
Adjust and simplify it, and decouple aging and rotation. Calculate the
scan count once at the beginning of the reclaim, always respect the
reclaim priority, and make the aging and rotation more explicit.
This slightly changes how aging and offline memcg reclaim work:
Previously, aging was skipped at DEF_PRIORITY even when eviction was no
longer possible, so the reclaimer wasted an iteration until the priority
escalated. Now aging runs immediately whenever it is needed to make
progress; the DEF_PRIORITY skip only applies when eviction is still
viable. This may avoid wasted iterations that over-reclaim slab and break
reclaim balance in multi-cgroup setups.
The same applies to offline memcgs. Previously, an offline memcg wouldn't
be aged unless it had no evictable folios. Now, we might age it if it
has only 3 generations, which should be fine. On one hand, an offline
memcg might still hold long-term folios; in fact, a long-existing offline
memcg must be pinned by some long-term folios, like shmem. These folios
might be used by other memcgs, so aging it like an ordinary memcg seems
correct. Besides, aging enables further reclaim of an offline memcg,
which will certainly happen if we keep shrinking it. And offline memcgs
might soon no longer be an issue with reparenting.
Overall, the memcg LRU rotation, as described in mmzone.h, remains the
same.
Note that because the scan budget is now pinned at loop entry, a tiny
lruvec might skip this reclaim pass, also skipping aging. This could be
beneficial, as aging wouldn't help there: the lruvec would still be
unreclaimable after aging. Reclaim will go on as usual once the priority
escalates.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 72 ++++++++++++++++++++++++++++++-------------------------------
1 file changed, 36 insertions(+), 36 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a011733a6392..b247f216f28b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4913,49 +4913,37 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
}
static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
- int swappiness, unsigned long *nr_to_scan)
+ struct scan_control *sc, int swappiness)
{
DEFINE_MIN_SEQ(lruvec);
- *nr_to_scan = 0;
/* have to run aging, since eviction is not possible anymore */
if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
return true;
- *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
+ /* try to avoid aging, do gentle reclaim at the default priority */
+ if (sc->priority == DEF_PRIORITY)
+ return false;
+
/* better to run aging even though eviction is still possible */
return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
}
-/*
- * For future optimizations:
- * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
- * reclaim.
- */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+ struct mem_cgroup *memcg, int swappiness)
{
- bool need_aging;
- unsigned long nr_to_scan;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- DEFINE_MAX_SEQ(lruvec);
+ unsigned long nr_to_scan, evictable;
- if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return -1;
-
- need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+ evictable = lruvec_evictable_size(lruvec, swappiness);
/* try to scrape all its memory if this memcg was deleted */
- if (nr_to_scan && !mem_cgroup_online(memcg))
- return nr_to_scan;
-
- nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
+ if (!mem_cgroup_online(memcg))
+ return evictable;
- /* try to get away with not aging at the default priority */
- if (!need_aging || sc->priority == DEF_PRIORITY)
- return nr_to_scan >> sc->priority;
+ nr_to_scan = apply_proportional_protection(memcg, sc, evictable);
+ nr_to_scan >>= sc->priority;
- /* stop scanning this lruvec as it's low on cold folios */
- return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
+ return nr_to_scan;
}
static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -4985,31 +4973,44 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
return true;
}
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ * reclaim.
+ */
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
+ bool need_rotate = false;
long nr_batch, nr_to_scan;
- unsigned long scanned = 0;
int swappiness = get_swappiness(lruvec, sc);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- while (true) {
+ nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
+ while (nr_to_scan > 0) {
int delta;
+ DEFINE_MAX_SEQ(lruvec);
- nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
- if (nr_to_scan <= 0)
+ if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
+ need_rotate = true;
break;
+ }
+
+ if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
+ if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+ need_rotate = true;
+ /* stop scanning as it's low on cold folios */
+ break;
+ }
nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
- scanned += delta;
- if (scanned >= nr_to_scan)
- break;
-
if (should_abort_scan(lruvec, sc))
break;
+ nr_to_scan -= delta;
cond_resched();
}
@@ -5035,8 +5036,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
}
- /* whether this lruvec should be rotated */
- return nr_to_scan < 0;
+ return need_rotate;
}
static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
--
2.54.0
* [PATCH v7 05/15] mm/mglru: scan and count the exact number of folios
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Make the scan helpers return the exact number of folios being scanned or
isolated. Since the reclaim loop now has a natural scan budget that
controls the scan progress, returning the scan number and consuming the
budget makes the scan more accurate and easier to follow.
The number of folios scanned in each iteration is always larger than 0,
unless the reclaim must stop for a forced aging, so there is no longer any
need for special handling when no progress is made:
- `return isolated || !remaining ? scanned : 0` in scan_folios: both
the function and its caller now just return the exact scan count,
combined with the scan budget introduced in the previous commit to
avoid livelock or under-scanning.
- `scanned += try_to_inc_min_seq` in evict_folios: adding a bool to
the scan count was confusing and is no longer needed, as the scan
count should never be zero as long as there are still evictable
gens. We may encounter an empty old gen that returns a 0 scan count;
to avoid that, do a try_to_inc_min_seq before the isolation, which
has little to no overhead in most cases.
- `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios:
the per-type get_nr_gens == MIN_NR_GENS check in scan_folios
naturally returns 0 when only two gens remain and breaks the loop.
Also change try_to_inc_min_seq to return void, as its return value is no
longer used by any caller. Call it before isolate_folios to flush any
empty gens left by external folio freeing, and again after isolate_folios,
since scanning (moving or protecting folios) may have emptied the oldest gen.
The scan still stops if only two gens are left, as the scan number will be
zero. This matches the previous behavior. This forced gen protection may
be removed or softened later to improve reclaim further.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 58 +++++++++++++++++++++++++++++-----------------------------
1 file changed, 29 insertions(+), 29 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b247f216f28b..2dbd39e29dfc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3878,10 +3878,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
return true;
}
-static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
+static void try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
{
int gen, type, zone;
- bool success = false;
bool seq_inc_flag = false;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
DEFINE_MIN_SEQ(lruvec);
@@ -3907,11 +3906,10 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
/*
* If min_seq[type] of both anonymous and file is not increased,
- * we can directly return false to avoid unnecessary checking
- * overhead later.
+ * return here to avoid unnecessary checking overhead later.
*/
if (!seq_inc_flag)
- return success;
+ return;
/* see the comment on lru_gen_folio */
if (swappiness && swappiness <= MAX_SWAPPINESS) {
@@ -3929,10 +3927,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
reset_ctrl_pos(lruvec, type, true);
WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
- success = true;
}
-
- return success;
}
static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
@@ -4686,7 +4681,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int type, int tier,
- struct list_head *list)
+ struct list_head *list, int *isolatedp)
{
int i;
int gen;
@@ -4756,11 +4751,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
if (type == LRU_GEN_FILE)
sc->nr.file_taken += isolated;
- /*
- * There might not be eligible folios due to reclaim_idx. Check the
- * remaining to prevent livelock if it's not making progress.
- */
- return isolated || !remaining ? scanned : 0;
+
+ *isolatedp = isolated;
+ return scanned;
}
static int get_tier_idx(struct lruvec *lruvec, int type)
@@ -4804,33 +4797,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int swappiness,
- int *type_scanned, struct list_head *list)
+ struct list_head *list, int *isolated,
+ int *isolate_type, int *isolate_scanned)
{
int i;
+ int total_scanned = 0;
int type = get_type_to_scan(lruvec, swappiness);
for_each_evictable_type(i, swappiness) {
int scanned;
int tier = get_tier_idx(lruvec, type);
- *type_scanned = type;
+ scanned = scan_folios(nr_to_scan, lruvec, sc,
+ type, tier, list, isolated);
- scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
- if (scanned)
- return scanned;
+ total_scanned += scanned;
+ if (*isolated) {
+ *isolate_type = type;
+ *isolate_scanned = scanned;
+ break;
+ }
type = !type;
}
- return 0;
+ return total_scanned;
}
static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int swappiness)
{
- int type;
- int scanned;
- int reclaimed;
LIST_HEAD(list);
LIST_HEAD(clean);
struct folio *folio;
@@ -4838,19 +4834,23 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
enum node_stat_item item;
struct reclaim_stat stat;
struct lru_gen_mm_walk *walk;
+ int scanned, reclaimed;
+ int isolated = 0, type, type_scanned;
bool skip_retry = false;
- struct lru_gen_folio *lrugen = &lruvec->lrugen;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
lruvec_lock_irq(lruvec);
- scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
+ /* In case folio deletion left empty old gens, flush them */
+ try_to_inc_min_seq(lruvec, swappiness);
- scanned += try_to_inc_min_seq(lruvec, swappiness);
+ scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
+ &list, &isolated, &type, &type_scanned);
- if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
- scanned = 0;
+ /* Scanning may have emptied the oldest gen, flush it */
+ if (scanned)
+ try_to_inc_min_seq(lruvec, swappiness);
lruvec_unlock_irq(lruvec);
@@ -4861,7 +4861,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
- scanned, reclaimed, &stat, sc->priority,
+ type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
--
2.54.0
* [PATCH v7 06/15] mm/mglru: avoid reclaim type fall back when isolation makes no progress
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: "Barry Song (Xiaomi)" <baohua@kernel.org>
When isolation makes no progress in scan_folios(), we quickly fall back
to the other type in isolate_folios(). This is incorrect, as the current
type may still have sufficient folios. Falling back can undermine the
positive_ctrl_err() result from get_type_to_scan(), which is derived
from swappiness.
So just continue scanning this type for another round.
Worth noting, if the cold generations are all reclaimed, the scan will no
longer make any progress either, which may undermine the swappiness
again. This is not a new issue and is hence better fixed later [1].
Link: https://lore.kernel.org/linux-mm/CAGsJ_4zjdOYEtuO6gNjABm7NDxW0skzBFNRNee-k2D6VwsYEQA@mail.gmail.com/ [1]
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Reviewed-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 9 +++++++--
1 file changed, 7 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2dbd39e29dfc..ac9d2d4f8e65 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4817,8 +4817,13 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
*isolate_scanned = scanned;
break;
}
-
- type = !type;
+ /*
+ * If scanned > 0 and isolated == 0, avoid falling back to the
+ * other type, as this type remains sufficient. Falling back
+ * too readily can disrupt the positive_ctrl_err() bias.
+ */
+ if (!scanned)
+ type = !type;
}
return total_scanned;
--
2.54.0
* [PATCH v7 07/15] mm/mglru: use a smaller batch for reclaim
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
With a fixed number to reclaim calculated at the beginning, making each
following step smaller should reduce lock contention and avoid
over-aggressive reclaim of folios, as the loop will abort earlier once the
target number of folios has been reclaimed.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ac9d2d4f8e65..2a607546277c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5007,7 +5007,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
break;
}
- nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+ nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
--
2.54.0
* [PATCH v7 08/15] mm/mglru: don't abort scan immediately right after aging
From: Kairui Song via B4 Relay @ 2026-04-27 18:06 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Right now, if eviction triggers aging, the reclaimer will abort. This is
not the optimal strategy for several reasons.
Aborting the reclaim early wastes a reclaim cycle when under pressure,
and for concurrent reclaim, if the LRU is under aging, all concurrent
reclaimers might fail. And if the aging has just finished, new cold folios
exposed by the aging are not reclaimed until the next reclaim iteration.
What's more, the current aging trigger is quite lenient: having 3 gens
with a reclaim priority lower than the default will trigger aging and
block reclaiming from that memcg. This easily wastes reclaim retry cycles.
In the worst case, if reclaim is making slow progress and all following
attempts fail because they are blocked by aging, it triggers an unexpected
early OOM.
A lruvec requiring aging doesn't mean it's hot. Instead, the lruvec could
have been idle for quite a while and hence might contain lots of cold
folios to be reclaimed.
While rotating the memcg LRU after aging is helpful for global reclaim,
as global reclaim fairness is coupled with the rotation in shrink_many,
memcg fairness is instead handled by the cgroup iteration in
shrink_node_memcgs. So, for memcg-level pressure, this abort is not the
key to keeping fairness. In most cases, there is no need to age, and
fairness must be achieved by upper-level reclaim control.
So instead, just keep the scanning going unless one whole batch of folios
failed to be isolated or enough folios have been scanned, which is
triggered by evict_folios returning 0. And only abort for global reclaim
after one batch, so when there are fewer memcgs, progress is still made,
and the fairness mechanism described above still works fine.
And in most cases, the one more batch attempt for global reclaim might
just be enough to satisfy what the reclaimer needs, hence improving global
reclaim performance by reducing reclaim retry cycles.
Rotation still happens after the reclaim is done, which still follows the
comment in mmzone.h, and fairness still looks good.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 12 +++++++++---
1 file changed, 9 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2a607546277c..42ccc6eb0748 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4985,7 +4985,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
*/
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
- bool need_rotate = false;
+ bool need_rotate = false, should_age = false;
long nr_batch, nr_to_scan;
int swappiness = get_swappiness(lruvec, sc);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5003,8 +5003,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
need_rotate = true;
- /* stop scanning as it's low on cold folios */
- break;
+ should_age = true;
}
nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
@@ -5015,6 +5014,13 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (should_abort_scan(lruvec, sc))
break;
+ /*
+ * Root reclaim needs rotation when low on cold folio for better
+ * fairness. Cgroup reclaim gets fairness from the iterator.
+ */
+ if (root_reclaim(sc) && should_age)
+ break;
+
nr_to_scan -= delta;
cond_resched();
}
--
2.54.0
* [PATCH v7 09/15] mm/mglru: remove redundant swap constrained check upon isolation
From: Kairui Song via B4 Relay @ 2026-04-27 18:07 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Remove the swap-constrained early-reject check upon isolation. This check
is a micro-optimization that rejects folios early when swap IO is not
allowed, but it is redundant and overly broad, since shrink_folio_list()
already handles all these cases with proper granularity.
Notably, this check wrongly rejected lazyfree folios, and it doesn't cover
all rejection cases. shrink_folio_list() uses may_enter_fs(), which
distinguishes non-SWP_FS_OPS devices from filesystem-backed swap and does
all the checks after the folio is locked, so flags like swap cache are stable.
This check also covers dirty file folios, which are not a problem now
since sort_folio() already bumps dirty file folios to the next generation,
but causes trouble for unifying dirty folio writeback handling.
There should be no performance impact from removing it. We may have
lost a micro-optimization, but we unblocked lazyfree reclaim for NOIO
contexts, and NOIO reclaim is not a common case in the first place.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42ccc6eb0748..ea86297b604c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4650,12 +4650,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
{
bool success;
- /* swap constrained */
- if (!(sc->gfp_mask & __GFP_IO) &&
- (folio_test_dirty(folio) ||
- (folio_test_anon(folio) && !folio_test_swapcache(folio))))
- return false;
-
/* raced with release_pages() */
if (!folio_try_get(folio))
return false;
--
2.54.0
* [PATCH v7 10/15] mm/mglru: use the common routine for dirty/writeback reactivation
From: Kairui Song via B4 Relay @ 2026-04-27 18:07 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Currently, MGLRU moves dirty or writeback folios to the second-oldest
gen instead of reactivating them like the classical LRU does. This might
help reduce LRU contention, as it skips the isolation, but as a result,
we see these folios at the LRU tail more frequently, leading to
inefficient reclaim.
Besides, the dirty / writeback check after isolation in shrink_folio_list
is more accurate and covers more cases. So instead, just drop the special
handling for dirty and writeback folios, use the common routine, and
reactivate them like the classical LRU does.
This should in theory improve scan efficiency. These folios will be
rotated back to the LRU tail once writeback is done, so there is no risk
of hotness inversion, and each reclaim loop will now have a higher success
rate. This also prepares for unifying the writeback and throttling
mechanism with the classical LRU: keeping these folios far from the tail
means detecting the tail batch follows a similar pattern to the classical
LRU.
The micro optimization that avoids LRU contention by skipping the
isolation is gone, which should be fine. Compared to IO and writeback
cost, the isolation overhead is trivial.
Using the common routine also keeps the folio's referenced bits (tier
bits), which could improve metrics in the long term. There is also no more
need to clear the reclaim bit, as the common routine will make use of it.
Note the common routine updates a few throttling and writeback counters,
which are not used, and never have been for the MGLRU case. We will start
making use of these in later commits.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 19 -------------------
1 file changed, 19 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ea86297b604c..bb7e2cecf48e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4578,7 +4578,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
- bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4628,21 +4627,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
return true;
}
- dirty = folio_test_dirty(folio);
- writeback = folio_test_writeback(folio);
- if (type == LRU_GEN_FILE && dirty) {
- sc->nr.file_taken += delta;
- if (!writeback)
- sc->nr.unqueued_dirty += delta;
- }
-
- /* waiting for writeback */
- if (writeback || (type == LRU_GEN_FILE && dirty)) {
- gen = folio_inc_gen(lruvec, folio, true);
- list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
- return true;
- }
-
return false;
}
@@ -4664,9 +4648,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
if (!folio_test_referenced(folio))
set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
- /* for shrink_folio_list() */
- folio_clear_reclaim(folio);
-
success = lru_gen_del_folio(lruvec, folio, true);
VM_WARN_ON_ONCE_FOLIO(!success, folio);
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v7 11/15] mm/mglru: simplify and improve dirty writeback handling
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (9 preceding siblings ...)
2026-04-27 18:07 ` [PATCH v7 10/15] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
@ 2026-04-27 18:07 ` Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 12/15] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
` (5 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-27 18:07 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Right now the flusher wakeup mechanism for MGLRU is less responsive and
less likely to trigger compared to the classical LRU. The classical LRU
wakes the flusher when a whole batch of folios passed to
shrink_folio_list is unevictable because they are dirty but not queued
for writeback. MGLRU instead checks and handles this only after the
whole reclaim loop is done.
We previously even saw OOM problems due to the passive flusher; these
were fixed, but the fix is still not perfect [1].
We have just unified the dirty folio counting and activation routine, so
now just move the dirty flush into the loop, right after
shrink_folio_list. This improves performance a lot for workloads
involving heavy writeback, and prepares for throttling too.
Testing with YCSB workloadb showed a major performance improvement:
Before this series:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
workingset_refault_file 34522071
After this commit:
Throughput(ops/sec): 80857.08510208207
AverageLatency(us): 386.653262968934
pgpgin 112233121
workingset_refault_file 19516246
Performance is much better, with significantly fewer refaults. We also
observed similar or higher gains for other real-world workloads.
We were concerned that the dirty flush could cause more wear on SSDs;
that should not be a problem here, since the flusher is only woken when
dirty folios have been pushed to the tail of the LRU, which indicates
memory pressure is already so high that writeback is blocking the
workload.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 41 ++++++++++++++++-------------------------
1 file changed, 16 insertions(+), 25 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bb7e2cecf48e..244cdae99573 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4724,8 +4724,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- if (type == LRU_GEN_FILE)
- sc->nr.file_taken += isolated;
*isolatedp = isolated;
return scanned;
@@ -4838,12 +4836,27 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
return scanned;
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
- sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
+ /*
+ * If too many file cache in the coldest generation can't be evicted
+ * due to being dirty, wake up the flusher.
+ */
+ if (stat.nr_unqueued_dirty == isolated) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -5000,28 +5013,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
return need_rotate;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v7 12/15] mm/mglru: remove no longer used reclaim argument for folio protection
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (10 preceding siblings ...)
2026-04-27 18:07 ` [PATCH v7 11/15] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
@ 2026-04-27 18:07 ` Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 13/15] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
` (4 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-27 18:07 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Now dirty folios are handled after isolation, not before: dirty
reactivation must take the folio off the LRU first, which also helps
unify the dirty handling logic.
So the reclaiming argument of folio_inc_gen() is no longer needed; just
remove it.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 244cdae99573..eb7eb2ed1830 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3220,7 +3220,7 @@ static int folio_update_gen(struct folio *folio, int gen)
}
/* protect pages accessed multiple times through file descriptors */
-static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio)
{
int type = folio_is_file_lru(folio);
struct lru_gen_folio *lrugen = &lruvec->lrugen;
@@ -3239,9 +3239,6 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
- /* for folio_end_writeback() */
- if (reclaiming)
- new_flags |= BIT(PG_reclaim);
} while (!try_cmpxchg(&folio->flags.f, &old_flags, new_flags));
lru_gen_update_size(lruvec, folio, old_gen, new_gen);
@@ -3855,7 +3852,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
- new_gen = folio_inc_gen(lruvec, folio, false);
+ new_gen = folio_inc_gen(lruvec, folio);
list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
/* don't count the workingset being lazily promoted */
@@ -4607,7 +4604,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* protected */
if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
/* don't count the workingset being lazily promoted */
@@ -4622,7 +4619,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* ineligible */
if (zone > sc->reclaim_idx) {
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v7 13/15] mm/vmscan: remove sc->file_taken
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (11 preceding siblings ...)
2026-04-27 18:07 ` [PATCH v7 12/15] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
@ 2026-04-27 18:07 ` Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 14/15] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
` (3 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-27 18:07 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
No one is using it now; just remove it.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eb7eb2ed1830..a071f7444232 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -173,7 +173,6 @@ struct scan_control {
unsigned int congested;
unsigned int writeback;
unsigned int immediate;
- unsigned int file_taken;
unsigned int taken;
} nr;
@@ -2040,8 +2039,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
sc->nr.taken += nr_taken;
- if (file)
- sc->nr.file_taken += nr_taken;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v7 14/15] mm/vmscan: remove sc->unqueued_dirty
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (12 preceding siblings ...)
2026-04-27 18:07 ` [PATCH v7 13/15] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
@ 2026-04-27 18:07 ` Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 15/15] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
` (2 subsequent siblings)
16 siblings, 0 replies; 21+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-27 18:07 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
No one is using it now; just remove it.
Suggested-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a071f7444232..902ca52ca381 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,7 +169,6 @@ struct scan_control {
struct {
unsigned int dirty;
- unsigned int unqueued_dirty;
unsigned int congested;
unsigned int writeback;
unsigned int immediate;
@@ -2035,7 +2034,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
sc->nr.dirty += stat.nr_dirty;
sc->nr.congested += stat.nr_congested;
- sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
sc->nr.taken += nr_taken;
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v7 15/15] mm/vmscan: unify writeback reclaim statistic and throttling
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (13 preceding siblings ...)
2026-04-27 18:07 ` [PATCH v7 14/15] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
@ 2026-04-27 18:07 ` Kairui Song via B4 Relay
2026-04-27 18:22 ` [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Andrew Morton
2026-05-11 18:51 ` Shakeel Butt
16 siblings, 0 replies; 21+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-27 18:07 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu, Kairui Song,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
linux-kernel, Kairui Song, Qi Zheng
From: Kairui Song <kasong@tencent.com>
Currently, MGLRU and the classical LRU handle reclaim statistics and
writeback very differently, especially throttling; MGLRU basically
ignores the throttling part.
Let's unify this: use a helper to deduplicate the code so both setups
share the same behavior.
Tested using the following bash reproducer:
echo "Setup a slow device using dm delay"
dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
LOOP=$(losetup --show -f /var/tmp/backing)
mkfs.ext4 -q $LOOP
echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
dmsetup create slow_dev
mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
echo "Start writeback pressure"
sync && echo 3 > /proc/sys/vm/drop_caches
mkdir /sys/fs/cgroup/test_wb
echo 128M > /sys/fs/cgroup/test_wb/memory.max
(echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
echo "Clean up"
echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
dmsetup resume slow_dev
umount -l /mnt/slow && sync
dmsetup remove slow_dev
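(In the dm-delay table above, reads pass through with a 0 ms delay while
writes are delayed by 1000 ms, simulating a device with very slow
writeback.)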
Before this commit, `dd` gets OOM killed immediately if MGLRU is
enabled, while the classical LRU is fine.
After this commit, throttling is effective: no more spinning on the LRU
or premature OOM. Stress tests on other workloads also look good.
Global throttling is not handled here yet; we will fix that separately
later.
Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
Tested-by: Leno Hou <lenohou@gmail.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 92 +++++++++++++++++++++++++++++--------------------------------
1 file changed, 43 insertions(+), 49 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 902ca52ca381..e452cb043d46 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
return !(current->flags & PF_LOCAL_THROTTLE);
}
+static void handle_reclaim_writeback(unsigned long nr_taken,
+ struct pglist_data *pgdat,
+ struct scan_control *sc,
+ struct reclaim_stat *stat)
+{
+ /*
+ * If dirty folios are scanned that are not queued for IO, it
+ * implies that flushers are not doing their job. This can
+ * happen when memory pressure pushes dirty folios to the end of
+ * the LRU before the dirty limits are breached and the dirty
+ * data has expired. It can also happen when the proportion of
+ * dirty folios grows not through writes but through memory
+ * pressure reclaiming all the clean cache. And in some cases,
+ * the flushers simply cannot keep up with the allocation
+ * rate. Nudge the flusher threads in case they are asleep.
+ */
+ if (stat->nr_unqueued_dirty == nr_taken) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ *
+ * Flusher may not be able to issue writeback quickly
+ * enough for cgroupv1 writeback throttling to work
+ * on a large system.
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
+ sc->nr.dirty += stat->nr_dirty;
+ sc->nr.congested += stat->nr_congested;
+ sc->nr.writeback += stat->nr_writeback;
+ sc->nr.immediate += stat->nr_immediate;
+ sc->nr.taken += nr_taken;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
lruvec_lock_irq(lruvec);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
-
- /*
- * If dirty folios are scanned that are not queued for IO, it
- * implies that flushers are not doing their job. This can
- * happen when memory pressure pushes dirty folios to the end of
- * the LRU before the dirty limits are breached and the dirty
- * data has expired. It can also happen when the proportion of
- * dirty folios grows not through writes but through memory
- * pressure reclaiming all the clean cache. And in some cases,
- * the flushers simply cannot keep up with the allocation
- * rate. Nudge the flusher threads in case they are asleep.
- */
- if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
- sc->nr.dirty += stat.nr_dirty;
- sc->nr.congested += stat.nr_congested;
- sc->nr.writeback += stat.nr_writeback;
- sc->nr.immediate += stat.nr_immediate;
- sc->nr.taken += nr_taken;
-
+ handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
@@ -4829,26 +4835,13 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr_reclaimed += reclaimed;
+ /* Retry pass is only meant for clean folios without new isolation */
+ if (isolated)
+ handle_reclaim_writeback(isolated, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (stat.nr_unqueued_dirty == isolated) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -4891,6 +4884,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!list_empty(&list)) {
skip_retry = true;
+ isolated = 0;
goto retry;
}
--
2.54.0
^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (14 preceding siblings ...)
2026-04-27 18:07 ` [PATCH v7 15/15] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
@ 2026-04-27 18:22 ` Andrew Morton
2026-05-11 18:51 ` Shakeel Butt
16 siblings, 0 replies; 21+ messages in thread
From: Andrew Morton @ 2026-04-27 18:22 UTC (permalink / raw)
To: kasong
Cc: Kairui Song via B4 Relay, linux-mm, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Baolin Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Kairui Song, Qi Zheng
On Tue, 28 Apr 2026 02:06:51 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote:
> From: Kairui Song <kasong@tencent.com>
>
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
Thanks, I've updated mm.git's mm-new branch to this version.
> Changes in v7:
> - Fix swappiness not being effective with a standalone fix patch
> from Barry Song. It's OK to be a standalone fix since that is not a
> major bug but an unexpected behavior change, and shouldn't affect any
> bisecting. I slightly adjusted the commit message as the subject is too
> long and was getting truncated for mail:
> https://lore.kernel.org/linux-mm/20260425205759.1701-1-baohua@kernel.org/
> - Remove the min limit for calculating nr_to_scan:
> https://lore.kernel.org/linux-mm/aet1hd9DfRH4aSOO@KASONG-MC4/
> Instead just revert to V1:
> https://sashiko.dev/#/message/20260318-mglru-reclaim-v1-3-2c46f9eb0508%40tencent.com
> Everyone was fine with that; the min limit in the later version was
> introduced to address sashiko's review of V1, but thinking about it
> again, that's actually not a bug and could instead be beneficial. This min
> check doesn't always make sense, and there isn't any practical issue observed.
> - Retests still look very good in every case.
Here's how v7 altered mm.git. (Looks small - did I mess this up?)
--- a/mm/vmscan.c~b
+++ a/mm/vmscan.c
@@ -4788,8 +4788,13 @@ static int isolate_folios(unsigned long
*isolate_scanned = scanned;
break;
}
-
- type = !type;
+ /*
+ * If scanned > 0 and isolated == 0, avoid falling back to the
+ * other type, as this type remains sufficient. Falling back
+ * too readily can disrupt the positive_ctrl_err() bias.
+ */
+ if (!scanned)
+ type = !type;
}
return total_scanned;
@@ -4909,18 +4914,14 @@ static long get_nr_to_scan(struct lruvec
unsigned long nr_to_scan, evictable;
evictable = lruvec_evictable_size(lruvec, swappiness);
- nr_to_scan = evictable;
/* try to scrape all its memory if this memcg was deleted */
if (!mem_cgroup_online(memcg))
- return nr_to_scan;
+ return evictable;
- nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
+ nr_to_scan = apply_proportional_protection(memcg, sc, evictable);
nr_to_scan >>= sc->priority;
- if (!nr_to_scan && sc->priority < DEF_PRIORITY)
- nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
-
return nr_to_scan;
}
_
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v7 06/15] mm/mglru: avoid reclaim type fall back when isolation makes no progress
2026-04-27 18:06 ` [PATCH v7 06/15] mm/mglru: avoid reclaim type fall back when isolation makes no progress Kairui Song via B4 Relay
@ 2026-04-28 4:18 ` Kairui Song
0 siblings, 0 replies; 21+ messages in thread
From: Kairui Song @ 2026-04-28 4:18 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Shakeel Butt,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng
Forwarding the sashiko review, which I think raises only false
positives, to make things clear:
> commit dd4c6bd11860d1bc69ef77e9f0d457abec425fc6
> Author: Barry Song (Xiaomi) <baohua@kernel.org>
> mm/mglru: avoid reclaim type fall back when isolation makes no progress
> This commit modifies isolate_folios() to only fall back to the alternate LRU
> type if no folios were scanned. This prevents falling back when folios are
> scanned but none are isolated.
> Link: https://lore.kernel.org/linux-mm/CAGsJ_4zjdOYEtuO6gNjABm7NDxW0skzBFNRNee-k2D6VwsYEQA@mail.gmail.com/ [1]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 43c0f22f15d5..304d5a4704ec 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4817,8 +4817,13 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > *isolate_scanned = scanned;
> > break;
> > }
> > -
> > - type = !type;
> > + /*
> > + * If scanned > 0 and isolated == 0, avoid falling back to the
> > + * other type, as this type remains sufficient. Falling back
> > + * too readily can disrupt the positive_ctrl_err() bias.
> > + */
> > + if (!scanned)
> > + type = !type;
> > }
> >
> > return total_scanned;
> Could skipping this fallback cause a livelock when folios are evictable but
> un-isolatable?
No. The total scan budget in try_to_shrink_lruvec() bounds the loop, and
scanned folios are moved to a newer gen. We used to skip the fallback
here just fine; this commit merely restores that behavior.
> For example, in constrained direct reclaim like GFP_NOIO, isolate_folio()
> can early-reject anonymous folios due to missing __GFP_IO and return false.
> When scan_folios() places these rejected folios into a temporary list and
> splices them back at the head of the oldest generation's list, this code
> path will see scanned > 0 but isolated == 0.
> Because the fallback is skipped, isolate_folios() scans the exact same LRU
> type on the next iteration. It encounters the identical un-isolatable
> folios at the head of the list, skips them again, and returns 0 isolated
> folios.
> When try_to_shrink_lruvec() decrements nr_to_scan and retries, no memory
> has been reclaimed, so the MGLRU statistics are unchanged.
> get_type_to_scan() will then select the exact same blocked LRU type again.
> Will this starve the alternate LRU type and potentially trigger premature
> OOMs?
No. Folios stuck in a non-reclaimable state (GFP_NOIO anon,
writeback-waiting file, etc.) are a pre-existing concern orthogonal to
this change, and it applies to the fallback type as well.
Falling back to the other type is still done if we run out of gens
(scanning consumes the cold gen). The writeback throttling introduced in
a following commit improves the stuck-in-non-reclaimable-state issue,
but that is not related to this commit.
> I note this behavior is addressed later in the patch series by commit
> 8992ac501e543b77bf0d5e24175632b6eba8086c, which removes the redundant swap
> constraint check. That allows the folios to be isolated and promoted, emptying
> the oldest generation so scanned can become 0. However, does this patch
> introduce an intermediate bisection regression for the LRU fallback mechanism?
No, as said above, that's not related.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (15 preceding siblings ...)
2026-04-27 18:22 ` [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Andrew Morton
@ 2026-05-11 18:51 ` Shakeel Butt
2026-05-12 5:08 ` Kairui Song
16 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2026-05-11 18:51 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Lorenzo Stoakes,
Barry Song, David Stevens, Chen Ridong, Leno Hou, Yafang Shao,
Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel,
Kairui Song, Qi Zheng
Hi Kairui,
On Tue, Apr 28, 2026 at 02:06:51AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
>
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress testing during the development
> of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> the code base and fixes several performance issues, preparing for
> further work.
>
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective.
>
> This series slightly cleans up and improves these issues using a scan
> budget by calculating the number of folios to scan at the beginning of
> the loop, and decouples aging from the reclaim calculation helpers.
> Then, move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
>
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and a 128G memory machine using NVME as storage.
Please include traditional LRU results for all of the following experiments as
well (where it makes sense).
>
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.
Can you add a sentence here on why this workload was chosen and why it
is important for evaluation?
>
> Not using SWAP.
Any specific reason to not have swap in this test?
>
> Before:
> Throughput(ops/sec): 62485.02962831822
> AverageLatency(us): 500.9746963330107
> pgpgin 159347462
> pgpgout 5413332
> workingset_refault_anon 0
> workingset_refault_file 34522071
>
> After:
> Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> pgpgin 111093923 (-30.3%, lower is better)
> pgpgout 5437456
> workingset_refault_anon 0
> workingset_refault_file 19566366 (-43.3%, lower is better)
>
> We can see a significant performance improvement after this series.
> The test is done on NVME and the performance gap would be even larger
> for slow devices, such as HDD or network storage. We observed over
> 100% gain for some workloads with slow IO.
>
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
>
> Before:
> Total requests: 79915
> Per-worker 95% CI (mean): [1233.9, 1263.5]
> Per-worker stdev: 59.2
> Jain's fairness: 0.997795 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 26859 33.61% 33.61%
> [1,2)s 7818 9.78% 43.39%
> [2,4)s 5532 6.92% 50.31%
> [4,8)s 39706 49.69% 100.00%
>
> After:
> Total requests: 81382
> Per-worker 95% CI (mean): [1241.9, 1301.3]
> Per-worker stdev: 118.8
> Jain's fairness: 0.991480 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 26696 32.80% 32.80%
> [1,2)s 8745 10.75% 43.55%
> [2,4)s 6865 8.44% 51.98%
> [4,8)s 39076 48.02% 100.00%
>
> Reclaim is still fair and effective, total requests number seems
> slightly better.
Please add a reference to Jain's fairness and a sentence on why we should care
about it.
>
> OOM issue with aging and throttling
> ===================================
> For the throttling OOM issue, it can be easily reproduced using dd and
> cgroup limit as demonstrated and fixed by a later patch in this series.
>
> The aging OOM is a bit tricky, a specific reproducer can be used to
> simulate what we encountered in production environment [4]:
> Spawns multiple workers that keep reading the given file using mmap,
> and pauses for 120ms after one file read batch. It also spawns another
> set of workers that keep allocating and freeing a given size of
> anonymous memory. The total memory size exceeds the memory limit
> (eg. 14G anon + 8G file, which is 22G vs a 16G memcg limit).
>
> - MGLRU disabled:
> Finished 128 iterations.
>
> - MGLRU enabled:
> OOM with following info after about ~10-20 iterations:
> [ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> [ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
> [ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> [ 62.640823] Memory cgroup stats for /demo:
> [ 62.641017] anon 10604879872
> [ 62.641941] file 6574858240
>
> OOM occurs despite there being still evictable file folios.
>
> - MGLRU enabled after this series:
> Finished 128 iterations.
>
> Worth noting there is another OOM related issue reported in V1 of
> this series, which is tested and looking OK now [5].
Oh this is good as it seems like you are already running traditional LRU.
>
> MySQL:
> ======
>
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> --tables=48 --table-size=2000000 --threads=48 --time=600 run
>
> Before: 17303.41 tps
> After this series: 17291.50 tps
>
> Seems only noise level changes, no regression.
>
Please add a sentence on why these specific params were chosen.
> FIO:
> ====
> Testing with the following command, where /mnt/ramdisk is a
> 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> 6 test run each:
>
> fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
> --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
> --rw=randread --norandommap --time_based \
> --ramp_time=1m --runtime=5m --group_reporting
>
> Before: 8968.76 MB/s
> After this series: 8995.63 MB/s
>
> Also seem only noise level changes and no regression or slightly better.
Same here.
>
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 12 test run each.
>
> Before: 2873.52s
> After this series: 2811.88s
>
> Also seem only noise level changes, no regression or very slightly better.
So, the kernel source code is on tmpfs, right? Also 3G memcg means memory.max is
3G, correct?
>
> Android:
> ========
> Xinyu reported a performance gain on Android, too, with this series. The
> test consisted of cold-starting multiple applications sequentially under
> moderate system load. [6]
>
> Before:
> Launch Time Summary (all apps, all runs)
> Mean 868.0ms
> P50 888.0ms
> P90 1274.2ms
> P95 1399.0ms
>
> After:
> Launch Time Summary (all apps, all runs)
> Mean 850.5ms (-2.07%)
> P50 861.5ms (-3.04%)
> P90 1179.0ms (-8.05%)
> P95 1228.0ms (-12.2%)
It would be awesome if Xinyu could gather traditional LRU numbers, but
if not, that's fine.
^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
2026-05-11 18:51 ` Shakeel Butt
@ 2026-05-12 5:08 ` Kairui Song
2026-05-12 5:56 ` Shakeel Butt
0 siblings, 1 reply; 21+ messages in thread
From: Kairui Song @ 2026-05-12 5:08 UTC (permalink / raw)
To: Shakeel Butt
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Lorenzo Stoakes,
Barry Song, David Stevens, Chen Ridong, Leno Hou, Yafang Shao,
Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng
On Tue, May 12, 2026 at 2:51 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
>
> Hi Kairui,
Hello,
>
> On Tue, Apr 28, 2026 at 02:06:51AM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> > and a 128G memory machine using NVME as storage.
>
> Please include traditional LRU results for all of the following experiments as
> well (where it makes sense).
Sure, I've spawned a few test instances; I was busy travelling last
week, and that specific test machine is occupied, so it might take a
while. A systematic test run takes roughly one or two days to complete
for one kernel version or config, e.g. the JS test takes at least 2
hours to finish. Comparing versions/setups takes more time.
>
> >
> > MongoDB
> > =======
> > Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> > threads:32), which does 95% read and 5% update to generate mixed read
> > and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> > the WiredTiger cache size is set to 4.5G, using NVME as storage.
>
> Can you add a sentence here on why this workload is chosen and is important for
> evaluation?
Because that's exactly the workload where we observed the regression: it
involves mixed read and dirty writeback, and it's a practical case.
>
> >
> > Not using SWAP.
>
> Any specific reason to not have swap in this test?
Because we are testing writeback here, which is not related to SWAP;
this avoids noise and irrelevant parts.
A longer history involving SWAP is explained here:
https://lore.kernel.org/linux-mm/20230920190244.16839-1-ryncsn@gmail.com/
And a longer discussion on that:
https://lore.kernel.org/linux-mm/CAMgjq7BRaRgYLf2+8=+=nWtzkrHFKmudZPRm41PR6W+A+L=AKA@mail.gmail.com/
Both are not easy to reproduce, though. YCSB with MongoDB seems close
enough, and I believe we are on the right track.
In an internal workload, we observed that patched MGLRU is about 20%
faster than the classical LRU with MongoDB. Upstream MGLRU is still
slightly behind the classical LRU at this point, and will hopefully
catch up soon with the RFC I posted:
https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/
>
> >
> > Before:
> > Throughput(ops/sec): 62485.02962831822
> > AverageLatency(us): 500.9746963330107
> > pgpgin 159347462
> > pgpgout 5413332
> > workingset_refault_anon 0
> > workingset_refault_file 34522071
> >
> > After:
> > Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> > AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> > pgpgin 111093923 (-30.3%, lower is better)
> > pgpgout 5437456
> > workingset_refault_anon 0
> > workingset_refault_file 19566366 (-43.3%, lower is better)
> >
> > We can see a significant performance improvement after this series.
> > The test is done on NVME and the performance gap would be even larger
> > for slow devices, such as HDD or network storage. We observed over
> > 100% gain for some workloads with slow IO.
> >
> > Chrome & Node.js [3]
> > ====================
> > Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> > nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> > workers:
> >
> > Before:
> > Total requests: 79915
> > Per-worker 95% CI (mean): [1233.9, 1263.5]
> > Per-worker stdev: 59.2
> > Jain's fairness: 0.997795 (1.0 = perfectly fair)
> > Latency:
> > Bucket Count Pct Cumul
> > [0,1)s 26859 33.61% 33.61%
> > [1,2)s 7818 9.78% 43.39%
> > [2,4)s 5532 6.92% 50.31%
> > [4,8)s 39706 49.69% 100.00%
> >
> > After:
> > Total requests: 81382
> > Per-worker 95% CI (mean): [1241.9, 1301.3]
> > Per-worker stdev: 118.8
> > Jain's fairness: 0.991480 (1.0 = perfectly fair)
> > Latency:
> > Bucket Count Pct Cumul
> > [0,1)s 26696 32.80% 32.80%
> > [1,2)s 8745 10.75% 43.55%
> > [2,4)s 6865 8.44% 51.98%
> > [4,8)s 39076 48.02% 100.00%
> >
> > Reclaim is still fair and effective, total requests number seems
> > slightly better.
>
> Please add a reference to Jain's fairness and a sentence on why we should care
> about it.
So first, here is the previous test setup for that:
https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
The basic idea is simple: if all memcgs are under similar pressure,
they should be reclaimed equally, which seems fair.
The fairness index measures the equality of resource allocation among
users. It is commonly used to evaluate network bandwidth distribution
among multiple users under pressure, which seems suitable here: we are
likewise measuring the reclaim ratio for multiple users under pressure.
I'm using a numeric index so I don't need to post 500 lines of raw test
results every time:
https://www.sciencedirect.com/topics/computer-science/fairness-index
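For reference, the index used here is the standard definition, computed
over the per-worker request totals x_1, ..., x_n:

	J(x_1, ..., x_n) = (sum_i x_i)^2 / (n * sum_i x_i^2)

It is 1.0 when every worker gets identical throughput, and it degrades
toward 1/n as the distribution gets more skewed.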
Also, here is a longer version of the test results I collected over the
past few days. The test closely mirrors real-world usage, from desktop
to web services.
Classical LRU:
------------------------------------------------------------------------
THROUGHPUT
------------------------------------------------------------------------
Total requests: 60226
Per-worker mean: 941.0
Per-worker median: 931
Per-worker min: 678
Per-worker max: 1252
Per-worker stdev: 131.3
95% CI (mean): [ 908.2, 973.9]
------------------------------------------------------------------------
LATENCY DISTRIBUTION (all workers aggregated)
------------------------------------------------------------------------
Bucket Count Pct Cumul
[0,1)s 19493 32.37% 32.37%
[1,2)s 2024 3.36% 35.73%
[2,4)s 5621 9.33% 45.06%
[4,8)s 32881 54.60% 99.66%
[8,16)s 207 0.34% 100.00%
[16,32)s 0 0.00% 100.00%
[32,64)s 0 0.00% 100.00%
[64,128)s 0 0.00% 100.00%
[128,inf)s 0 0.00% 100.00%
------------------------------------------------------------------------
FAIRNESS (per-worker total requests)
------------------------------------------------------------------------
Jain's fairness index: 0.981188 (1.0 = perfectly fair)
Coeff of variation: 0.1396 (0.0 = perfectly fair)
Min/Max ratio: 0.5415
P10: 780
P25: 855
P50 (median): 931
P75: 1040
P90: 1112
------------------------------------------------------------------------
PER-MEMCG BREAKDOWN (sorted by total, top/bottom 5)
------------------------------------------------------------------------
Memcgs: 32 mean=1882.1 95%CI=[1799.8, 1964.4] Jain=0.9860
--- Top 5 ---
memcg 6: 2423 requests
memcg 10: 2364 requests
memcg 31: 2213 requests
memcg 20: 2207 requests
memcg 30: 2156 requests
--- Bottom 5 ---
memcg 27: 1658 requests
memcg 19: 1645 requests
memcg 12: 1610 requests
memcg 0: 1566 requests
memcg 28: 1533 requests
Raw Result:
client: 8047 total: 984, 0: 293, 1: 44, 2: 108, 4:
538, 8: 1, 16: 0, 32: 0, 64: 0, 128: 0
client: 8058 total: 882, 0: 289, 1: 18, 2: 63, 4:
507, 8: 5, 16: 0, 32: 0, 64: 0, 128: 0
client: 8017 total: 1051, 0: 347, 1: 43, 2: 133, 4:
528, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8059 total: 952, 0: 274, 1: 41, 2: 92, 4:
545, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8005 total: 921, 0: 230, 1: 43, 2: 113, 4:
535, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8063 total: 1173, 0: 459, 1: 50, 2: 161, 4:
503, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8051 total: 986, 0: 296, 1: 34, 2: 122, 4:
534, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8043 total: 949, 0: 260, 1: 53, 2: 90, 4:
546, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8045 total: 1069, 0: 362, 1: 46, 2: 143, 4:
518, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8008 total: 857, 0: 259, 1: 25, 2: 69, 4:
500, 8: 4, 16: 0, 32: 0, 64: 0, 128: 0
client: 8023 total: 1049, 0: 348, 1: 44, 2: 136, 4:
521, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8015 total: 895, 0: 221, 1: 34, 2: 105, 4:
534, 8: 1, 16: 0, 32: 0, 64: 0, 128: 0
client: 8027 total: 899, 0: 219, 1: 42, 2: 96, 4:
542, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8061 total: 1093, 0: 396, 1: 31, 2: 157, 4:
509, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8038 total: 737, 0: 174, 1: 7, 2: 46, 4:
501, 8: 9, 16: 0, 32: 0, 64: 0, 128: 0
client: 8056 total: 678, 0: 133, 1: 5, 2: 32, 4:
501, 8: 7, 16: 0, 32: 0, 64: 0, 128: 0
client: 8040 total: 1039, 0: 423, 1: 37, 2: 93, 4:
477, 8: 9, 16: 0, 32: 0, 64: 0, 128: 0
client: 8036 total: 766, 0: 202, 1: 7, 2: 54, 4:
494, 8: 9, 16: 0, 32: 0, 64: 0, 128: 0
client: 8000 total: 697, 0: 136, 1: 13, 2: 48, 4:
495, 8: 5, 16: 0, 32: 0, 64: 0, 128: 0
client: 8030 total: 804, 0: 232, 1: 14, 2: 53, 4:
501, 8: 4, 16: 0, 32: 0, 64: 0, 128: 0
client: 8006 total: 852, 0: 267, 1: 18, 2: 62, 4:
495, 8: 10, 16: 0, 32: 0, 64: 0, 128: 0
client: 8062 total: 1040, 0: 437, 1: 43, 2: 61, 4:
489, 8: 10, 16: 0, 32: 0, 64: 0, 128: 0
client: 8014 total: 833, 0: 254, 1: 14, 2: 58, 4:
497, 8: 10, 16: 0, 32: 0, 64: 0, 128: 0
client: 8060 total: 1063, 0: 465, 1: 23, 2: 81, 4:
485, 8: 9, 16: 0, 32: 0, 64: 0, 128: 0
client: 8046 total: 814, 0: 244, 1: 18, 2: 40, 4:
508, 8: 4, 16: 0, 32: 0, 64: 0, 128: 0
client: 8049 total: 1080, 0: 388, 1: 40, 2: 123, 4:
529, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8022 total: 1001, 0: 422, 1: 22, 2: 62, 4:
486, 8: 9, 16: 0, 32: 0, 64: 0, 128: 0
client: 8019 total: 988, 0: 304, 1: 36, 2: 116, 4:
532, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8026 total: 780, 0: 218, 1: 12, 2: 47, 4:
500, 8: 3, 16: 0, 32: 0, 64: 0, 128: 0
client: 8024 total: 719, 0: 163, 1: 7, 2: 43, 4:
501, 8: 5, 16: 0, 32: 0, 64: 0, 128: 0
client: 8053 total: 1045, 0: 360, 1: 38, 2: 120, 4:
527, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8034 total: 873, 0: 286, 1: 19, 2: 57, 4:
508, 8: 3, 16: 0, 32: 0, 64: 0, 128: 0
client: 8048 total: 889, 0: 301, 1: 26, 2: 59, 4:
497, 8: 6, 16: 0, 32: 0, 64: 0, 128: 0
client: 8055 total: 871, 0: 199, 1: 40, 2: 89, 4:
543, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8001 total: 869, 0: 196, 1: 35, 2: 95, 4:
543, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8003 total: 1051, 0: 369, 1: 42, 2: 103, 4:
537, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8011 total: 1118, 0: 398, 1: 53, 2: 156, 4:
511, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8018 total: 762, 0: 192, 1: 15, 2: 45, 4:
503, 8: 7, 16: 0, 32: 0, 64: 0, 128: 0
client: 8021 total: 1112, 0: 410, 1: 41, 2: 145, 4:
516, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8050 total: 869, 0: 276, 1: 21, 2: 71, 4:
496, 8: 5, 16: 0, 32: 0, 64: 0, 128: 0
client: 8032 total: 823, 0: 238, 1: 21, 2: 54, 4:
505, 8: 5, 16: 0, 32: 0, 64: 0, 128: 0
client: 8044 total: 1030, 0: 433, 1: 31, 2: 66, 4:
496, 8: 4, 16: 0, 32: 0, 64: 0, 128: 0
client: 8035 total: 965, 0: 283, 1: 42, 2: 112, 4:
528, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8025 total: 891, 0: 212, 1: 43, 2: 90, 4:
546, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8039 total: 908, 0: 241, 1: 36, 2: 86, 4:
545, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8009 total: 963, 0: 286, 1: 36, 2: 108, 4:
533, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8037 total: 917, 0: 227, 1: 45, 2: 100, 4:
545, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8020 total: 1252, 0: 607, 1: 51, 2: 115, 4:
477, 8: 2, 16: 0, 32: 0, 64: 0, 128: 0
client: 8004 total: 818, 0: 245, 1: 16, 2: 47, 4:
501, 8: 9, 16: 0, 32: 0, 64: 0, 128: 0
client: 8052 total: 870, 0: 285, 1: 20, 2: 52, 4:
507, 8: 6, 16: 0, 32: 0, 64: 0, 128: 0
client: 8033 total: 925, 0: 269, 1: 28, 2: 83, 4:
545, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8010 total: 931, 0: 334, 1: 29, 2: 62, 4:
500, 8: 6, 16: 0, 32: 0, 64: 0, 128: 0
client: 8016 total: 990, 0: 388, 1: 27, 2: 70, 4:
500, 8: 5, 16: 0, 32: 0, 64: 0, 128: 0
client: 8012 total: 1173, 0: 556, 1: 51, 2: 78, 4:
480, 8: 8, 16: 0, 32: 0, 64: 0, 128: 0
client: 8028 total: 837, 0: 253, 1: 32, 2: 47, 4:
500, 8: 5, 16: 0, 32: 0, 64: 0, 128: 0
client: 8031 total: 992, 0: 315, 1: 33, 2: 119, 4:
525, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8041 total: 1168, 0: 452, 1: 52, 2: 162, 4:
502, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8054 total: 787, 0: 212, 1: 15, 2: 58, 4:
493, 8: 9, 16: 0, 32: 0, 64: 0, 128: 0
client: 8042 total: 799, 0: 227, 1: 13, 2: 45, 4:
508, 8: 6, 16: 0, 32: 0, 64: 0, 128: 0
client: 8002 total: 1034, 0: 449, 1: 32, 2: 59, 4:
488, 8: 6, 16: 0, 32: 0, 64: 0, 128: 0
client: 8057 total: 855, 0: 184, 1: 47, 2: 81, 4:
543, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8007 total: 965, 0: 269, 1: 36, 2: 135, 4:
525, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8013 total: 1250, 0: 536, 1: 53, 2: 143, 4:
518, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8029 total: 973, 0: 290, 1: 41, 2: 102, 4:
539, 8: 1, 16: 0, 32: 0, 64: 0, 128: 0
MGLRU (after this series; results before the series are similar, with
seemingly slightly lower throughput, or maybe just noise, see the cover
letter):
------------------------------------------------------------------------
THROUGHPUT
------------------------------------------------------------------------
Total requests: 83926
Per-worker mean: 1311.3
Per-worker median: 1306
Per-worker min: 1170
Per-worker max: 1466
Per-worker stdev: 70.8
95% CI (mean): [ 1293.6, 1329.0]
------------------------------------------------------------------------
LATENCY DISTRIBUTION (all workers aggregated)
------------------------------------------------------------------------
Bucket Count Pct Cumul
[0,1)s 27929 33.28% 33.28%
[1,2)s 9075 10.81% 44.09%
[2,4)s 8558 10.20% 54.29%
[4,8)s 38364 45.71% 100.00%
[8,16)s 0 0.00% 100.00%
[16,32)s 0 0.00% 100.00%
[32,64)s 0 0.00% 100.00%
[64,128)s 0 0.00% 100.00%
[128,inf)s 0 0.00% 100.00%
------------------------------------------------------------------------
FAIRNESS (per-worker total requests)
------------------------------------------------------------------------
Jain's fairness index: 0.997138 (1.0 = perfectly fair)
Coeff of variation: 0.0540 (0.0 = perfectly fair)
Min/Max ratio: 0.7981
P10: 1220
P25: 1253
P50 (median): 1306
P75: 1367
P90: 1398
------------------------------------------------------------------------
PER-MEMCG BREAKDOWN (sorted by total, top/bottom 5)
------------------------------------------------------------------------
Memcgs: 32 mean=2622.7 95%CI=[2601.4, 2643.9] Jain=0.9995
--- Top 5 ---
memcg 24: 2719 requests
memcg 5: 2711 requests
memcg 16: 2703 requests
memcg 0: 2696 requests
memcg 19: 2689 requests
--- Bottom 5 ---
memcg 20: 2550 requests
memcg 21: 2534 requests
memcg 23: 2521 requests
memcg 22: 2514 requests
memcg 27: 2514 requests
Raw result:
client: 8028 total: 1252, 0: 410, 1: 132, 2: 106, 4:
604, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8026 total: 1220, 0: 390, 1: 107, 2: 106, 4:
617, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8036 total: 1260, 0: 403, 1: 154, 2: 92, 4:
611, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8038 total: 1322, 0: 475, 1: 150, 2: 90, 4:
607, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8002 total: 1220, 0: 384, 1: 137, 2: 82, 4:
617, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8008 total: 1264, 0: 410, 1: 138, 2: 108, 4:
608, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8044 total: 1180, 0: 339, 1: 123, 2: 94, 4:
624, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8062 total: 1267, 0: 428, 1: 125, 2: 111, 4:
603, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8050 total: 1197, 0: 351, 1: 131, 2: 113, 4:
602, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8057 total: 1379, 0: 480, 1: 142, 2: 158, 4:
599, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8048 total: 1301, 0: 454, 1: 142, 2: 101, 4:
604, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8034 total: 1266, 0: 422, 1: 140, 2: 98, 4:
606, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8020 total: 1282, 0: 425, 1: 153, 2: 98, 4:
606, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8000 total: 1245, 0: 404, 1: 137, 2: 88, 4:
616, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8030 total: 1282, 0: 411, 1: 164, 2: 104, 4:
603, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8045 total: 1334, 0: 424, 1: 147, 2: 168, 4:
595, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8053 total: 1359, 0: 462, 1: 139, 2: 161, 4:
597, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8060 total: 1240, 0: 375, 1: 158, 2: 110, 4:
597, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8043 total: 1338, 0: 437, 1: 138, 2: 171, 4:
592, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8041 total: 1323, 0: 438, 1: 124, 2: 155, 4: 606, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8025 total: 1331, 0: 435, 1: 130, 2: 180, 4: 586, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8040 total: 1227, 0: 389, 1: 133, 2: 92, 4: 613, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8022 total: 1240, 0: 393, 1: 139, 2: 100, 4: 608, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8049 total: 1418, 0: 510, 1: 145, 2: 172, 4: 591, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8012 total: 1205, 0: 373, 1: 120, 2: 93, 4: 619, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8059 total: 1375, 0: 462, 1: 171, 2: 152, 4: 590, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8037 total: 1412, 0: 513, 1: 144, 2: 171, 4: 584, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8001 total: 1451, 0: 536, 1: 160, 2: 191, 4: 564, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8009 total: 1356, 0: 451, 1: 133, 2: 182, 4: 590, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8039 total: 1367, 0: 456, 1: 144, 2: 186, 4: 581, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8042 total: 1196, 0: 345, 1: 134, 2: 97, 4: 620, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8013 total: 1409, 0: 519, 1: 134, 2: 172, 4: 584, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8021 total: 1392, 0: 478, 1: 156, 2: 169, 4: 589, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8031 total: 1373, 0: 477, 1: 135, 2: 174, 4: 587, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8014 total: 1271, 0: 419, 1: 152, 2: 96, 4: 604, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8015 total: 1305, 0: 398, 1: 139, 2: 179, 4: 589, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8024 total: 1251, 0: 390, 1: 167, 2: 78, 4: 616, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8033 total: 1335, 0: 408, 1: 169, 2: 172, 4: 586, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8004 total: 1245, 0: 398, 1: 129, 2: 107, 4: 611, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8003 total: 1394, 0: 494, 1: 144, 2: 154, 4: 602, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8052 total: 1296, 0: 444, 1: 154, 2: 106, 4: 592, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8061 total: 1353, 0: 455, 1: 142, 2: 147, 4: 609, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8017 total: 1355, 0: 451, 1: 153, 2: 166, 4: 585, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8063 total: 1367, 0: 474, 1: 136, 2: 152, 4: 605, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8018 total: 1225, 0: 379, 1: 132, 2: 97, 4: 617, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8029 total: 1345, 0: 460, 1: 129, 2: 152, 4: 604, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8027 total: 1398, 0: 518, 1: 121, 2: 158, 4: 601, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8007 total: 1253, 0: 375, 1: 118, 2: 124, 4: 636, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8047 total: 1302, 0: 414, 1: 126, 2: 170, 4: 592, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8005 total: 1397, 0: 488, 1: 151, 2: 161, 4: 597, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8019 total: 1347, 0: 437, 1: 145, 2: 178, 4: 587, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8035 total: 1361, 0: 453, 1: 151, 2: 179, 4: 578, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8011 total: 1416, 0: 517, 1: 147, 2: 172, 4: 580, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8023 total: 1385, 0: 473, 1: 161, 2: 172, 4: 579, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8046 total: 1219, 0: 388, 1: 123, 2: 91, 4: 617, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8056 total: 1304, 0: 463, 1: 135, 2: 95, 4: 611, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8006 total: 1306, 0: 442, 1: 147, 2: 130, 4: 587, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8054 total: 1170, 0: 321, 1: 136, 2: 101, 4: 612, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8055 total: 1344, 0: 447, 1: 141, 2: 169, 4: 587, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8010 total: 1295, 0: 429, 1: 164, 2: 100, 4: 602, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8016 total: 1292, 0: 448, 1: 140, 2: 108, 4: 596, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8051 total: 1466, 0: 555, 1: 152, 2: 200, 4: 559, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8058 total: 1278, 0: 430, 1: 144, 2: 86, 4: 618, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
client: 8032 total: 1368, 0: 502, 1: 168, 2: 113, 4: 585, 8: 0, 16: 0, 32: 0, 64: 0, 128: 0
The test is rebased on 7.1-rc; MGLRU seems ~30% faster than classical
LRU, with a better latency distribution and better fairness too.
On my x86 machine the gain is not as large as the one Yu posted
for ARM, but it still looks very good.
Ridong also reproduced it with a much better result, where MGLRU seems
to be much faster than classical LRU on ARM (or maybe using a different
time period?):
https://lore.kernel.org/linux-mm/20260120134256.2271710-1-chenridong@huaweicloud.com/
During one or two test runs, a single memcg might achieve much higher
throughput with MGLRU, causing fairness to look slightly worse;
however, overall performance still seems much better than classical
LRU. I suspect improvements are needed for aging or the random bucket
part, but I think that's a separate topic for now.
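For reference, the per-client fields above appear to be the same
power-of-two latency buckets as in the summary tables (the bucket
counts sum to each client's total), and Jain's fairness index over n
per-client totals x_i is (sum x_i)^2 / (n * sum x_i^2), which equals
1.0 when every client gets an equal share. A minimal sketch to
recompute it from a saved dump (assuming the "client: ... total: N, ..."
format above, saved as clients.txt; the file name is just for
illustration):

awk '$1 == "client:" { x = $4 + 0; n++; s += x; ss += x * x }
     END { printf "%d clients, Jain fairness: %.6f\n", n, s * s / (n * ss) }' \
    clients.txt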
> >
> > MySQL:
> > ======
> >
> > Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> > ZRAM as swap and test command:
> >
> > sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> > --tables=48 --table-size=2000000 --threads=48 --time=600 run
> >
> > Before: 17303.41 tps
> > After this series: 17291.50 tps
> >
> > Seems like only noise-level changes; no regression.
> >
>
> Please add a sentence on why these specific params were chosen.
>
> > FIO:
> > ====
> > Testing with the following command, where /mnt/ramdisk is a
> > 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> > 6 test run each:
> >
> > fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
> > --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
> > --rw=randread --norandommap --time_based \
> > --ramp_time=1m --runtime=5m --group_reporting
> >
> > Before: 8968.76 MB/s
> > After this series: 8995.63 MB/s
> >
> > Also seems like only noise-level changes; no regression, or slightly better.
>
> Same here.
I tested the page cache performance with buffered reads. There is
another test involving classical LRU, where MGLRU seems to
significantly outperform classical LRU. The case was provided by the
CachyOS community; I didn't include it here because the cover letter
is already getting tediously long.
https://lore.kernel.org/all/acgNCzRDVmSbXrOE@KASONG-MC4/
MGLRU seems to have significantly lower jitter and better performance in that test.
BTW I also disabled OOMD and any related daemons to avoid noise during
that test. I repeated the test several times, and recorded one test
run as well, since it's meant as a desktop test and I was discussing
with distro communities at that time. MGLRU TTL can completely avoid
the jitter; however, it's not enabled during the test to prevent
confusion.
Classical LRU:
https://www.youtube.com/watch?v=pujboGNcBNI
MGLRU:
https://www.youtube.com/watch?v=ffnFUeaBQ_0
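As for the 64G EXT4 ramdisk used in the FIO test above, a rough sketch
of one way to create it (assuming the brd module; not necessarily the
exact setup used in the test):

modprobe brd rd_nr=1 rd_size=67108864   # rd_size is in KiB, so 64G /dev/ram0
mkfs.ext4 /dev/ram0
mount /dev/ram0 /mnt/ramdisk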
>
> >
> > Build kernel:
> > =============
> > Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> > using make -j96 and defconfig, measuring system time, 12 test run each.
> >
> > Before: 2873.52s
> > After this series: 2811.88s
> >
> > Also seems like only noise-level changes; no regression, or very slightly better.
>
> So, the kernel source code is on tmpfs, right? Also 3G memcg means memory.max is
> 3G, correct?
Right. That's to avoid I/O noise. I also tested with the source code on
disk; I didn't post that because I think the MySQL test already shows
a workload with mixed anon / file memory.
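For reference, a rough sketch of that setup (sizes, paths, and the
cgroup v2 layout here are assumptions, not the exact scripts used):

zramctl --find --size 32G                    # e.g. allocates /dev/zram0
mkswap /dev/zram0 && swapon /dev/zram0
mount -t tmpfs -o size=16G tmpfs /mnt/build  # source tree on tmpfs
mkdir /sys/fs/cgroup/build
echo 3G > /sys/fs/cgroup/build/memory.max    # the 3G memcg limit
echo $$ > /sys/fs/cgroup/build/cgroup.procs  # move this shell into the memcg
cd /mnt/build/linux && make defconfig
/usr/bin/time make -j96                      # compare the reported system time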
* Re: [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling
2026-05-12 5:08 ` Kairui Song
@ 2026-05-12 5:56 ` Shakeel Butt
0 siblings, 0 replies; 21+ messages in thread
From: Shakeel Butt @ 2026-05-12 5:56 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Lorenzo Stoakes,
Barry Song, David Stevens, Chen Ridong, Leno Hou, Yafang Shao,
Yu Zhao, Zicheng Wang, Baolin Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng
On Tue, May 12, 2026 at 01:08:49PM +0800, Kairui Song wrote:
> On Tue, May 12, 2026 at 2:51 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> >
> > Hi Kairui,
>
> Hello,
>
> >
> > On Tue, Apr 28, 2026 at 02:06:51AM +0800, Kairui Song via B4 Relay wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> > > and a 128G memory machine using NVME as storage.
> >
> > Please include traditional LRU results for all of the following experiments as
> > well (where it makes sense).
>
> Sure, I've spawned a few test instances; I was busy travelling last week.
> That specific test machine is occupied, so it might take a while.
>
> A systematic test run takes roughly one or two days to complete for
> one kernel version or config, e.g. the JS test takes at least 2 hours
> to finish. Comparing versions/setups takes more time.
>
No worries, we have a couple of weeks before the next merge window, so there is
no urgency. I will go through the series in depth; hopefully there will not be a
need for a next version, and in that case, please just resend the cover letter
with the information you provided below and don't worry about the length of the
cover letter.
> >
> > >
> > > MongoDB
> > > =======
> > > Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> > > threads:32), which does 95% read and 5% update to generate mixed read
> > > and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> > > the WiredTiger cache size is set to 4.5G, using NVME as storage.
> >
> > Can you add a sentence here on why this workload is chosen and is important for
> > evaluation?
>
> Because that's exactly the workload where we observed the regression, since
> it involves mixed writeback, and it's a practical case.
>
Sure, add this sentence in the cover letter.
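For reference, the YCSB invocation implied by those parameters would
look roughly like this (a sketch only; the MongoDB connection and
Docker options are omitted):

./bin/ycsb load mongodb -s -P workloads/workloadb \
    -p recordcount=20000000 -threads 32
./bin/ycsb run mongodb -s -P workloads/workloadb \
    -p recordcount=20000000 -p operationcount=6000000 -threads 32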
> >
> > >
> > > Not using SWAP.
> >
> > Any specific reason to not have swap in this test?
>
> Because we are testing writeback here, which is not related to SWAP; this is
> just to avoid noise and irrelevant parts.
>
> A longer history involving SWAP is explained here:
> https://lore.kernel.org/linux-mm/20230920190244.16839-1-ryncsn@gmail.com/
>
> And a longer discussion on that:
> https://lore.kernel.org/linux-mm/CAMgjq7BRaRgYLf2+8=+=nWtzkrHFKmudZPRm41PR6W+A+L=AKA@mail.gmail.com/
>
> Both are not easy to reproduce, though. YCSB with MongoDB seems close
> enough, and I believe we are on the right track.
>
> In an internal workload, we observed that patched MGLRU is about 20%
> faster than classical LRU with MongoDB. Upstream MGLRU is still
> slightly behind classical LRU at this point, and will hopefully be
> patched soon; that is the RFC I posted:
> https://lore.kernel.org/linux-mm/20260502-mglru-fg-v1-0-913619b014d9@tencent.com/
>
Same here but don't need to go in such details.
> >
> > >
> > > Before:
> > > Throughput(ops/sec): 62485.02962831822
> > > AverageLatency(us): 500.9746963330107
> > > pgpgin 159347462
> > > pgpgout 5413332
> > > workingset_refault_anon 0
> > > workingset_refault_file 34522071
> > >
> > > After:
> > > Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> > > AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> > > pgpgin 111093923 (-30.3%, lower is better)
> > > pgpgout 5437456
> > > workingset_refault_anon 0
> > > workingset_refault_file 19566366 (-43.3%, lower is better)
> > >
> > > We can see a significant performance improvement after this series.
> > > The test is done on NVME and the performance gap would be even larger
> > > for slow devices, such as HDD or network storage. We observed over
> > > 100% gain for some workloads with slow IO.
> > >
> > > Chrome & Node.js [3]
> > > ====================
> > > Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> > > nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> > > workers:
> > >
> > > Before:
> > > Total requests: 79915
> > > Per-worker 95% CI (mean): [1233.9, 1263.5]
> > > Per-worker stdev: 59.2
> > > Jain's fairness: 0.997795 (1.0 = perfectly fair)
> > > Latency:
> > > Bucket Count Pct Cumul
> > > [0,1)s 26859 33.61% 33.61%
> > > [1,2)s 7818 9.78% 43.39%
> > > [2,4)s 5532 6.92% 50.31%
> > > [4,8)s 39706 49.69% 100.00%
> > >
> > > After:
> > > Total requests: 81382
> > > Per-worker 95% CI (mean): [1241.9, 1301.3]
> > > Per-worker stdev: 118.8
> > > Jain's fairness: 0.991480 (1.0 = perfectly fair)
> > > Latency:
> > > Bucket Count Pct Cumul
> > > [0,1)s 26696 32.80% 32.80%
> > > [1,2)s 8745 10.75% 43.55%
> > > [2,4)s 6865 8.44% 51.98%
> > > [4,8)s 39076 48.02% 100.00%
> > >
> > > Reclaim is still fair and effective; the total number of requests seems
> > > slightly better.
> >
> > Please add a reference to Jain's fairness and a sentence on why we should care
> > about it.
>
> So first, here is the previous test setup for that:
> https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/
>
> The basic idea is simple: if all memcgs are under similar pressure,
> they should be reclaimed equally, which seems fair.
I think this is too much information. Just summarize it in a couple of sentences
in the cover letter. You can refer to your email in the cover letter for more
details.
[...]
> > >
> > > MySQL:
> > > ======
> > >
> > > Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> > > ZRAM as swap and test command:
> > >
> > > sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> > > --tables=48 --table-size=2000000 --threads=48 --time=600 run
> > >
> > > Before: 17303.41 tps
> > > After this series: 17291.50 tps
> > >
> > > Seems like only noise-level changes; no regression.
> > >
> >
> > Please add a sentence on why these specific params were chosen.
> >
> > > FIO:
> > > ====
> > > Testing with the following command, where /mnt/ramdisk is a
> > > 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> > > 6 test run each:
> > >
> > > fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
> > > --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
> > > --rw=randread --norandommap --time_based \
> > > --ramp_time=1m --runtime=5m --group_reporting
> > >
> > > Before: 8968.76 MB/s
> > > After this series: 8995.63 MB/s
> > >
> > > Also seems like only noise-level changes; no regression, or slightly better.
> >
> > Same here.
>
> I tested the page cache performance with buffered reads. There is
> another test involving classical LRU, where MGLRU seems to
> significantly outperform classical LRU. The case was provided by the
> CachyOS community; I didn't include it here because the cover letter
> is already getting tediously long.
>
> https://lore.kernel.org/all/acgNCzRDVmSbXrOE@KASONG-MC4/
>
> MGLRU seems to have significantly lower jitter and better performance in that test.
>
> BTW I also disabled OOMD and any related daemons to avoid noise during
> that test. I repeated the test several times, and recorded one test
> run as well, since it's meant as a desktop test and I was discussing
> with distro communities at that time. MGLRU TTL can completely avoid
> the jitter; however, it's not enabled during the test to prevent
> confusion.
>
> Classical LRU:
> https://www.youtube.com/watch?v=pujboGNcBNI
>
> MGLRU:
> https://www.youtube.com/watch?v=ffnFUeaBQ_0
The point is not which one is better, but documenting the performance
difference between them for the given workload.
At a high level, I am just asking that for a given benchmark/workload, we add a
sentence on why we think this specific workload is important for measuring and
evaluating the reclaim mechanism.
Thread overview: 21+ messages
2026-04-27 18:06 [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-04-27 18:06 ` [PATCH v7 01/15] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
2026-04-27 18:06 ` [PATCH v7 02/15] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
2026-04-27 18:06 ` [PATCH v7 03/15] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
2026-04-27 18:06 ` [PATCH v7 04/15] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
2026-04-27 18:06 ` [PATCH v7 05/15] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
2026-04-27 18:06 ` [PATCH v7 06/15] mm/mglru: avoid reclaim type fall back when isolation makes no progress Kairui Song via B4 Relay
2026-04-28 4:18 ` Kairui Song
2026-04-27 18:06 ` [PATCH v7 07/15] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
2026-04-27 18:06 ` [PATCH v7 08/15] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 09/15] mm/mglru: remove redundant swap constrained check upon isolation Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 10/15] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 11/15] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 12/15] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 13/15] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 14/15] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
2026-04-27 18:07 ` [PATCH v7 15/15] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
2026-04-27 18:22 ` [PATCH v7 00/15] mm/mglru: improve reclaim loop and dirty folio handling Andrew Morton
2026-05-11 18:51 ` Shakeel Butt
2026-05-12 5:08 ` Kairui Song
2026-05-12 5:56 ` Shakeel Butt