* [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling
@ 2026-03-28 19:52 Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 01/12] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
` (12 more replies)
0 siblings, 13 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
This series is based on mm-new to avoid conflict with Baolin's Cgroup V1
MGLRU fix.
This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, we see up to a ~30% throughput
increase in some workloads, such as MongoDB with YCSB, and a huge
decrease in file refaults, with no swap involved. Other common
benchmarks show no regression, LOC is reduced, and unexpected OOMs are
less frequent.
Some of the problems were found in our production environment; others
were mostly exposed while stress testing during development of the
LSF/MM/BPF topic on improving MGLRU [1]. This series cleans up the
code base and fixes several performance issues, preparing for further
work.
MGLRU's reclaim loop is a bit complex, so these problems are related
to each other. The aging, scan count calculation, and reclaim loop are
coupled together, and the dirty folio handling logic is quite
different from the rest, making the reclaim loop hard to follow and
the dirty flush ineffective.
This series cleans up and improves these areas: it introduces a scan
budget by calculating the number of folios to scan once at the
beginning of the loop, decouples aging from the reclaim calculation
helpers, and then moves the dirty flush logic inside the reclaim loop
so it can kick in more effectively.
Test results: all tests were done on a 48c96t NUMA machine with 2
nodes and 128G of memory, using NVMe as storage.
MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000, operationcount:6000000,
threads:32), which does 95% reads and 5% updates to generate mixed reads
and dirty writeback. MongoDB is set up in a 10G cgroup using Docker,
with the WiredTiger cache size set to 4.5G, using NVMe as storage.
No swap is used.
Median of 3 test runs; results are stable.
Before:
Throughput(ops/sec): 63050.37725142389
AverageLatency(us): 497.0088950307069
pgpgin 164636727
pgpgout 5551216
workingset_refault_anon 0
workingset_refault_file 34365441
After:
Throughput(ops/sec): 79937.11613530689 (+26.7%, higher is better)
AverageLatency(us): 390.1616943501661 (-21.5%, lower is better)
pgpgin 108820685 (-33.9%, lower is better)
pgpgout 5406292
workingset_refault_anon 0
workingset_refault_file 18934364 (-44.9%, lower is better)
We can see a significant performance improvement after this series.
The test was done on NVMe, and the performance gap would be even
larger on slower devices, such as HDDs or network storage. We observed
over 100% gains for some workloads with slow IO.
Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine
with 2 nodes and 128G of memory, using 256G of ZRAM as swap and
spawning 32 memcgs and 64 workers:
Before:
Total requests: 81832
Per-worker 95% CI (mean): [1248.8, 1308.4]
Per-worker stdev: 119.1
Jain's fairness: 0.991530 (1.0 = perfectly fair)
Latency:
[0,1)s 27951 34.16% 34.16%
[1,2)s 7495 9.16% 43.32%
[2,4)s 8140 9.95% 53.26%
[4,8)s 38246 46.74% 100.00%
After:
Total requests: 82413
Per-worker 95% CI (mean): [1241.4, 1334.0]
Per-worker stdev: 185.3
Jain's fairness: 0.980016 (1.0 = perfectly fair)
Latency:
[0,1)s 27940 33.90% 33.90%
[1,2)s 8772 10.64% 44.55%
[2,4)s 6827 8.28% 52.83%
[4,8)s 38874 47.17% 100.00%
Results look nearly identical: reclaim is still fair and effective,
and the total request count is slightly better.
OOM issue with aging and throttling
===================================
The throttling OOM issue can be easily reproduced using dd and a
cgroup limit, as demonstrated in patch 12, and is fixed by this series.
The aging OOM is trickier; a specific reproducer can be used to
simulate what we encountered in our production environment [4]: it
spawns multiple workers that keep reading a given file using mmap,
pausing for 120ms after each file read batch. It also spawns another
set of workers that keep allocating and freeing a given amount of
anonymous memory. The total memory size exceeds the memory limit
(e.g., 44G anon + 8G file, which is 52G vs a 48G memcg limit).
- MGLRU disabled:
Finished 128 iterations.
- MGLRU enabled:
OOM with the following info after ~10-20 iterations:
[ 154.365634] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 154.366456] memory: usage 50331648kB, limit 50331648kB, failcnt 354207
[ 154.378941] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 154.379408] Memory cgroup stats for /demo:
[ 154.379544] anon 44352327680
[ 154.380079] file 7187271680
OOM occurs despite there still being evictable file folios.
- MGLRU enabled after this series:
Finished 128 iterations.
Worth noting, another OOM-related issue was reported against v1 of
this series; it has been retested and looks OK now [5].
MySQL:
======
Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap and test command:
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
--tables=48 --table-size=2000000 --threads=48 --time=600 run
Before: 17237.570000 tps
After patch 5: 17259.975714 tps
After patch 6: 17230.475714 tps
After patch 7: 17250.316667 tps
After patch 8: 17278.933333 tps
After this series: 17265.361667 tps (+0.2%, higher is better)
MySQL is heavy on anon folios but also involves file and writeback IO,
and it still looks good. Only noise-level changes, with no regression
at any step.
FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, 6 test runs each in a
12G memcg:
fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
--name=cached --numjobs=16 --buffered=1 --ioengine=mmap \
--rw=randread --random_distribution=zipf:1.2 --norandommap \
--time_based --ramp_time=1m --runtime=5m --group_reporting
Before: 75912.75 MB/s
After this series: 75907.46 MB/s
Again, only noise-level changes and no regression.
Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, 12 test runs each.
Before: 2604.29s
After this series: 2538.90s
Again mostly noise-level changes; no regression, possibly slightly better.
Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v2:
- Rebase on top of mm-new, which includes the Cgroup V1 fix from
  [ Baolin Wang ].
- Added the dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
  review suggested that we shouldn't leave the counter and reclaim
  feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number of SWAP_CLUSTER_MAX in patch
  "restructure the reclaim loop"; the change is trivial but might
  help avoid livelock for tiny cgroups.
- Redo the tests. Most results are basically identical to before, but
  redone just in case, since the series now also solves the throttling
  issue, and following discussion of the reports from CachyOS.
- Add a separate patch for variable renaming, as suggested by [ Barry
  Song ]. No functional change.
- Improve several comments and code issues [ Axel Rasmussen ].
- Remove a no longer needed variable [ Axel Rasmussen ].
- Collect Reviewed-by tags.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com
---
Kairui Song (12):
mm/mglru: consolidate common code for retrieving evictable size
mm/mglru: rename variables related to aging and rotation
mm/mglru: relocate the LRU scan batch limit to callers
mm/mglru: restructure the reclaim loop
mm/mglru: scan and count the exact number of folios
mm/mglru: use a smaller batch for reclaim
mm/mglru: don't abort scan immediately right after aging
mm/mglru: simplify and improve dirty writeback handling
mm/mglru: remove no longer used reclaim argument for folio protection
mm/vmscan: remove sc->file_taken
mm/vmscan: remove sc->unqueued_dirty
mm/vmscan: unify writeback reclaim statistic and throttling
mm/vmscan.c | 308 ++++++++++++++++++++++++++----------------------------------
1 file changed, 132 insertions(+), 176 deletions(-)
---
base-commit: e4b3c4494ae831396aded19f30132826a0d63031
change-id: 20260314-mglru-reclaim-1c9d45ac57f6
Best regards,
--
Kairui Song <kasong@tencent.com>
* [PATCH v2 01/12] mm/mglru: consolidate common code for retrieving evictable size
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
` (11 subsequent siblings)
12 siblings, 0 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Merge the commonly used code for counting evictable folios in a lruvec.
No behavior change.
Return unsigned long instead of long, as suggested by [ Axel Rasmussen ].
Acked-by: Yuanchu Xie <yuanchu@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 36 ++++++++++++++----------------------
1 file changed, 14 insertions(+), 22 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a8c8fcccbfc..adc07501a137 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4084,27 +4084,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
}
-static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+static unsigned long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
{
int gen, type, zone;
- unsigned long total = 0;
- int swappiness = get_swappiness(lruvec, sc);
+ unsigned long seq, total = 0;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MAX_SEQ(lruvec);
DEFINE_MIN_SEQ(lruvec);
for_each_evictable_type(type, swappiness) {
- unsigned long seq;
-
for (seq = min_seq[type]; seq <= max_seq; seq++) {
gen = lru_gen_from_seq(seq);
-
for (zone = 0; zone < MAX_NR_ZONES; zone++)
total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
}
}
+ return total;
+}
+
+static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+{
+ unsigned long total;
+ int swappiness = get_swappiness(lruvec, sc);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+ total = lruvec_evictable_size(lruvec, swappiness);
+
/* whether the size is big enough to be helpful */
return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
}
@@ -4909,9 +4915,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
int swappiness, unsigned long *nr_to_scan)
{
- int gen, type, zone;
- unsigned long size = 0;
- struct lru_gen_folio *lrugen = &lruvec->lrugen;
DEFINE_MIN_SEQ(lruvec);
*nr_to_scan = 0;
@@ -4919,18 +4922,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
return true;
- for_each_evictable_type(type, swappiness) {
- unsigned long seq;
-
- for (seq = min_seq[type]; seq <= max_seq; seq++) {
- gen = lru_gen_from_seq(seq);
-
- for (zone = 0; zone < MAX_NR_ZONES; zone++)
- size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
- }
- }
-
- *nr_to_scan = size;
+ *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
/* better to run aging even though eviction is still possible */
return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
}
--
2.53.0
* [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 01/12] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-30 1:57 ` Chen Ridong
` (2 more replies)
2026-03-28 19:52 ` [PATCH v2 03/12] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
` (10 subsequent siblings)
12 siblings, 3 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
The current variable names aren't helpful. Make them more meaningful.
Naming change only, no behavior change.
Suggested-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index adc07501a137..f336f89a2de6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4934,7 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
*/
static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
{
- bool success;
+ bool need_aging;
unsigned long nr_to_scan;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
DEFINE_MAX_SEQ(lruvec);
@@ -4942,7 +4942,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
return -1;
- success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+ need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
/* try to scrape all its memory if this memcg was deleted */
if (nr_to_scan && !mem_cgroup_online(memcg))
@@ -4951,7 +4951,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
/* try to get away with not aging at the default priority */
- if (!success || sc->priority == DEF_PRIORITY)
+ if (!need_aging || sc->priority == DEF_PRIORITY)
return nr_to_scan >> sc->priority;
/* stop scanning this lruvec as it's low on cold folios */
@@ -5040,7 +5040,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
- bool success;
+ bool need_rotate;
unsigned long scanned = sc->nr_scanned;
unsigned long reclaimed = sc->nr_reclaimed;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5058,7 +5058,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
memcg_memory_event(memcg, MEMCG_LOW);
}
- success = try_to_shrink_lruvec(lruvec, sc);
+ need_rotate = try_to_shrink_lruvec(lruvec, sc);
shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
@@ -5068,10 +5068,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
flush_reclaim_state(sc);
- if (success && mem_cgroup_online(memcg))
+ if (need_rotate && mem_cgroup_online(memcg))
return MEMCG_LRU_YOUNG;
- if (!success && lruvec_is_sizable(lruvec, sc))
+ if (!need_rotate && lruvec_is_sizable(lruvec, sc))
return 0;
/* one retry if offlined or too small */
--
2.53.0
* [PATCH v2 03/12] mm/mglru: relocate the LRU scan batch limit to callers
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 01/12] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-30 8:14 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 04/12] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
` (9 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Like the active/inactive LRU, MGLRU isolates and scans folios in
batches. The batch splitting is hidden deep in the helper, which makes
the code harder to follow. The helper's arguments are also confusing,
since callers usually request more folios than the batch size, so the
helper almost never processes the full requested amount.
Move the batch splitting into the top-level loop to make it cleaner;
there should be no behavior change.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f336f89a2de6..963362523782 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4695,10 +4695,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
int scanned = 0;
int isolated = 0;
int skipped = 0;
- int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
- int remaining = scan_batch;
+ unsigned long remaining = nr_to_scan;
struct lru_gen_folio *lrugen = &lruvec->lrugen;
+ VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
VM_WARN_ON_ONCE(!list_empty(list));
if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
@@ -4751,7 +4751,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
mod_lruvec_state(lruvec, item, isolated);
mod_lruvec_state(lruvec, PGREFILL, sorted);
mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
- trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
+ trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
if (type == LRU_GEN_FILE)
@@ -4987,7 +4987,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
- long nr_to_scan;
+ long nr_batch, nr_to_scan;
unsigned long scanned = 0;
int swappiness = get_swappiness(lruvec, sc);
@@ -4998,7 +4998,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (nr_to_scan <= 0)
break;
- delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
+ nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+ delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
@@ -5623,6 +5624,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
int swappiness, unsigned long nr_to_reclaim)
{
+ int nr_batch;
DEFINE_MAX_SEQ(lruvec);
if (seq + MIN_NR_GENS > max_seq)
@@ -5639,8 +5641,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
if (sc->nr_reclaimed >= nr_to_reclaim)
return 0;
- if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
- swappiness))
+ nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
+ if (!evict_folios(nr_batch, lruvec, sc, swappiness))
return 0;
cond_resched();
--
2.53.0
* [PATCH v2 04/12] mm/mglru: restructure the reclaim loop
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (2 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 03/12] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-29 6:47 ` Kairui Song
2026-03-28 19:52 ` [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
` (8 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
The current loop recalculates the scan number on each iteration. The
number of folios to scan is based on the LRU length, with some unclear
behaviors, e.g., it only shifts the scan number by the reclaim
priority at the default priority, and it couples the number
calculation with aging and rotation.
Simplify it and decouple aging and rotation: calculate the scan number
once at the beginning of reclaim, always respect the reclaim priority,
and make the aging and rotation more explicit.
This slightly changes how offline memcg aging works: previously, an
offline memcg wouldn't be aged unless it had no evictable folios. Now
we might age it if it has only 3 generations and the reclaim priority
is less than DEF_PRIORITY, which should be fine. On one hand, an
offline memcg might still hold long-term folios; in fact, a
long-existing offline memcg must be pinned by some long-term folios,
like shmem. These folios might be used by other memcgs, so aging them
like an ordinary memcg doesn't seem wrong. Besides, aging enables
further reclaim of an offlined memcg, which will certainly happen if
we keep shrinking it. And offline memcgs may soon no longer be an
issue once reparenting is ready.
Overall, the memcg LRU rotation, as described in mmzone.h,
remains the same.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 70 +++++++++++++++++++++++++++++++------------------------------
1 file changed, 36 insertions(+), 34 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 963362523782..ab81ffdb241a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4913,49 +4913,40 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
}
static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
- int swappiness, unsigned long *nr_to_scan)
+ struct scan_control *sc, int swappiness)
{
DEFINE_MIN_SEQ(lruvec);
- *nr_to_scan = 0;
/* have to run aging, since eviction is not possible anymore */
if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
return true;
- *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
+ /* try to get away with not aging at the default priority */
+ if (sc->priority == DEF_PRIORITY)
+ return false;
+
/* better to run aging even though eviction is still possible */
return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
}
-/*
- * For future optimizations:
- * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
- * reclaim.
- */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+ struct mem_cgroup *memcg, int swappiness)
{
- bool need_aging;
unsigned long nr_to_scan;
- struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- DEFINE_MAX_SEQ(lruvec);
-
- if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
- return -1;
-
- need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+ nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
/* try to scrape all its memory if this memcg was deleted */
- if (nr_to_scan && !mem_cgroup_online(memcg))
+ if (!mem_cgroup_online(memcg))
return nr_to_scan;
nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
- /* try to get away with not aging at the default priority */
- if (!need_aging || sc->priority == DEF_PRIORITY)
- return nr_to_scan >> sc->priority;
-
- /* stop scanning this lruvec as it's low on cold folios */
- return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
+ /*
+ * Always respect scan priority, minimally target
+ * SWAP_CLUSTER_MAX pages to keep reclaim moving forwards.
+ */
+ nr_to_scan >>= sc->priority;
+ return max(nr_to_scan, SWAP_CLUSTER_MAX);
}
static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -4985,31 +4976,43 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
return true;
}
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ * reclaim.
+ */
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
+ bool need_rotate = false;
long nr_batch, nr_to_scan;
- unsigned long scanned = 0;
int swappiness = get_swappiness(lruvec, sc);
+ struct mem_cgroup *memcg = lruvec_memcg(lruvec);
- while (true) {
+ nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
+ while (nr_to_scan > 0) {
int delta;
+ DEFINE_MAX_SEQ(lruvec);
- nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
- if (nr_to_scan <= 0)
+ if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
+ need_rotate = true;
break;
+ }
+
+ if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
+ if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+ need_rotate = true;
+ break;
+ }
nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
- scanned += delta;
- if (scanned >= nr_to_scan)
- break;
-
if (should_abort_scan(lruvec, sc))
break;
+ nr_to_scan -= delta;
cond_resched();
}
@@ -5035,8 +5038,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
}
- /* whether this lruvec should be rotated */
- return nr_to_scan < 0;
+ return need_rotate;
}
static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
--
2.53.0
* [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (3 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 04/12] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-31 8:04 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 06/12] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
` (7 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Make the scan helpers return the exact number of folios scanned or
isolated. Since the reclaim loop now has a natural scan budget that
controls scan progress, returning the scan count directly makes the
scan more accurate and easier to follow.
The number of scanned folios in each iteration is always positive,
unless reclaim must stop for a forced aging, so there is no more need
for special handling when no progress is made:
- `return isolated || !remaining ? scanned : 0` in scan_folios: both
  the function and its caller now just use the exact scan count,
  combined with the scan budget introduced in the previous commit, to
  avoid livelock or under-scanning.
- `scanned += try_to_inc_min_seq` in evict_folios: adding a bool to
  the scan count was confusing and is no longer needed either, as the
  scan number will never be zero even if none of the folios in the
  oldest generation are isolated.
- `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios:
  the per-type get_nr_gens == MIN_NR_GENS check in scan_folios
  naturally returns 0 when only two gens remain and breaks the loop.
Also move try_to_inc_min_seq before isolate_folios, so that any empty
gens created by external folio freeing are also skipped.
The scan still stops when only two gens are left, as the scan number
will be zero; this behavior is the same as before. This forced
generation protection may be removed or softened later to improve
reclaim a bit more.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ab81ffdb241a..c5361efa6776 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4686,7 +4686,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int type, int tier,
- struct list_head *list)
+ struct list_head *list, int *isolatedp)
{
int i;
int gen;
@@ -4756,11 +4756,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
if (type == LRU_GEN_FILE)
sc->nr.file_taken += isolated;
- /*
- * There might not be eligible folios due to reclaim_idx. Check the
- * remaining to prevent livelock if it's not making progress.
- */
- return isolated || !remaining ? scanned : 0;
+
+ *isolatedp = isolated;
+ return scanned;
}
static int get_tier_idx(struct lruvec *lruvec, int type)
@@ -4804,33 +4802,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int swappiness,
- int *type_scanned, struct list_head *list)
+ struct list_head *list, int *isolated,
+ int *isolate_type, int *isolate_scanned)
{
int i;
+ int scanned = 0;
int type = get_type_to_scan(lruvec, swappiness);
for_each_evictable_type(i, swappiness) {
- int scanned;
+ int type_scan;
int tier = get_tier_idx(lruvec, type);
- *type_scanned = type;
+ type_scan = scan_folios(nr_to_scan, lruvec, sc,
+ type, tier, list, isolated);
- scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
- if (scanned)
- return scanned;
+ scanned += type_scan;
+ if (*isolated) {
+ *isolate_type = type;
+ *isolate_scanned = type_scan;
+ break;
+ }
type = !type;
}
- return 0;
+ return scanned;
}
static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, int swappiness)
{
- int type;
- int scanned;
- int reclaimed;
LIST_HEAD(list);
LIST_HEAD(clean);
struct folio *folio;
@@ -4838,19 +4839,18 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
enum node_stat_item item;
struct reclaim_stat stat;
struct lru_gen_mm_walk *walk;
+ int scanned, reclaimed;
+ int isolated = 0, type, type_scanned;
bool skip_retry = false;
- struct lru_gen_folio *lrugen = &lruvec->lrugen;
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
lruvec_lock_irq(lruvec);
- scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
-
- scanned += try_to_inc_min_seq(lruvec, swappiness);
+ try_to_inc_min_seq(lruvec, swappiness);
- if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
- scanned = 0;
+ scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
+ &list, &isolated, &type, &type_scanned);
lruvec_unlock_irq(lruvec);
@@ -4861,7 +4861,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
- scanned, reclaimed, &stat, sc->priority,
+ type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 06/12] mm/mglru: use a smaller batch for reclaim
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (4 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-31 8:08 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 07/12] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
` (6 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
With a fixed number of folios to reclaim calculated at the beginning,
making each following step smaller should reduce lock contention and
avoid over-aggressive reclaim, since the loop aborts as soon as the
target number of reclaimed folios is reached.
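The effect of the one-line change can be sketched in plain C. This is
illustrative only: pick_batch() is a made-up name, and MIN_LRU_BATCH is
assumed to be 64 (it is BITS_PER_LONG upstream, so 64 on 64-bit
kernels). Capping each eviction step at a small batch means the loop
re-checks the remaining target after every batch and can stop early.

```c
#include <assert.h>

/*
 * Illustrative sketch of min(nr_to_scan, MIN_LRU_BATCH): cap each
 * eviction step at a small batch so the loop re-checks the remaining
 * reclaim target frequently and aborts as soon as it is satisfied.
 * The helper name is hypothetical; MIN_LRU_BATCH is assumed 64.
 */
#define MIN_LRU_BATCH 64UL

static unsigned long pick_batch(unsigned long nr_to_scan)
{
	return nr_to_scan < MIN_LRU_BATCH ? nr_to_scan : MIN_LRU_BATCH;
}
```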
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c5361efa6776..e3ca38d0c4cd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5004,7 +5004,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
break;
}
- nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+ nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
delta = evict_folios(nr_batch, lruvec, sc, swappiness);
if (!delta)
break;
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 07/12] mm/mglru: don't abort scan immediately right after aging
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (5 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 06/12] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
` (5 subsequent siblings)
12 siblings, 0 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Right now, if eviction triggers aging, the reclaimer will abort. This is
not the optimal strategy for several reasons.
Aborting the reclaim early wastes a reclaim cycle under pressure, and
for concurrent reclaim, if the LRU is being aged, all concurrent
reclaimers might fail. And if the aging has just finished, new cold
folios exposed by it are not reclaimed until the next reclaim iteration.
What's more, the current aging trigger is quite lenient: having 3 gens
with a reclaim priority lower than the default will trigger aging and
block reclaiming from that memcg. This easily wastes reclaim retry
cycles. And in the worst case, if reclaim is making slow progress and
all following attempts fail because they are blocked by aging, it
triggers an unexpected early OOM.
And if a lruvec requires aging, that doesn't mean it's hot. Instead, the
lruvec could have been idle for quite a while, and hence it might
contain lots of cold folios to be reclaimed.
While it's helpful to rotate the memcg LRU after aging for global
reclaim, as global reclaim fairness is coupled with the rotation in
shrink_many, memcg fairness is instead handled by the cgroup iteration
in shrink_node_memcgs. So, for memcg-level pressure, this abort is not
the key to keeping fairness. And in most cases, there is no need to
age, and fairness must be achieved by upper-level reclaim control.
So instead, keep the scan going unless a whole batch of folios fails to
be isolated or enough folios have been scanned, both signaled by
evict_folios returning 0. And only abort for global reclaim after one
more batch, so when there are fewer memcgs, progress is still made,
and the fairness mechanism described above still works fine.
And in most cases, that one extra batch attempt for global reclaim
might just be enough to satisfy what the reclaimer needs, hence
improving global reclaim performance by reducing reclaim retry cycles.
Rotation still happens after the reclaim is done, which still follows
the comment in mmzone.h. And fairness still looks good.
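The new loop-control rule can be simulated with a small hypothetical
sketch (all names and types below are made up for illustration): each
step models one evict_folios() batch; a zero delta (whole batch failed
to isolate) ends the loop, and global (root) reclaim bails out at most
one batch after aging ran, instead of aborting immediately.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical simulation of the loop-control rule: deltas[i] is the
 * reclaim progress of batch i, aged[i] says whether aging was
 * triggered before that batch. Returns how many batches ran.
 */
static int count_batches(const int *deltas, const bool *aged, int n,
			 bool root)
{
	bool should_age = false;
	int steps = 0;

	for (int i = 0; i < n; i++) {
		if (aged[i])
			should_age = true;	/* aging was triggered */
		steps++;
		if (!deltas[i])		/* whole batch failed to isolate */
			break;
		if (root && should_age)	/* global reclaim: one batch max */
			break;
	}
	return steps;
}
```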
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e3ca38d0c4cd..8de5c8d5849e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4983,7 +4983,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
*/
static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
- bool need_rotate = false;
+ bool need_rotate = false, should_age = false;
long nr_batch, nr_to_scan;
int swappiness = get_swappiness(lruvec, sc);
struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5001,7 +5001,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
need_rotate = true;
- break;
+ should_age = true;
}
nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
@@ -5012,6 +5012,10 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
if (should_abort_scan(lruvec, sc))
break;
+ /* Cgroup reclaim fairness not guarded by rotate */
+ if (root_reclaim(sc) && should_age)
+ break;
+
nr_to_scan -= delta;
cond_resched();
}
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (6 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 07/12] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-29 8:21 ` Kairui Song
` (2 more replies)
2026-03-28 19:52 ` [PATCH v2 09/12] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
` (4 subsequent siblings)
12 siblings, 3 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
The current handling of dirty writeback folios is not working well for
file-page-heavy workloads: dirty folios are protected and moved to the
next gen upon isolation, instead of getting throttled or reactivated
upon pageout (shrink_folio_list).
This might help reduce LRU lock contention slightly, but as a result,
the ping-pong effect of folios between the head and tail of the last
two gens is serious, as the shrinker runs into protected dirty
writeback folios more frequently compared to activation. The dirty
flush wakeup condition is also much more passive compared to the
active/inactive LRU.
The active/inactive LRU wakes the flusher if one batch of folios passed
to shrink_folio_list is unevictable due to being under writeback, but
MGLRU instead has to check this after the whole reclaim loop is done,
and then compare the isolation-protection count against the total
reclaim number.
And we previously saw OOM problems with it, too, which were fixed but
still not perfect [1].
So drop the special handling for dirty writeback folios and just
re-activate them like the active/inactive LRU does. Also move the dirty
flush wakeup check to right after shrink_folio_list. This should
improve both throttling and performance.
Test with YCSB workloadb showed a major performance improvement:
Before this series:
Throughput(ops/sec): 61642.78008938203
AverageLatency(us): 507.11127774145166
pgpgin 158190589
pgpgout 5880616
workingset_refault 7262988
After this commit:
Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
pgpgin 101871227 (-35.6%, lower is better)
pgpgout 5770028
workingset_refault 3418186 (-52.9%, lower is better)
The refault rate is ~50% lower, and throughput is ~30% higher, which
is a huge gain. We also observed significant performance gains for
other real-world workloads.
We were concerned that the dirty flush could cause more wear on SSDs:
that should not be a problem here, since the wakeup condition is that
dirty folios have been pushed to the tail of the LRU, which indicates
memory pressure is so high that writeback is already blocking the
workload.
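The wakeup condition after shrink_folio_list can be sketched as a tiny
predicate (the helper name is made up; in the diff below the check is
`stat.nr_unqueued_dirty == isolated`, and patch 12 later adds an
explicit non-zero guard via `nr_taken`, which this sketch includes):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical sketch: wake the flusher only when every folio isolated
 * from the coldest generation was dirty but not yet queued for IO,
 * i.e. pressure has pushed unwritten dirty cache to the LRU tail.
 */
static bool should_wake_flusher(unsigned long isolated,
				unsigned long nr_unqueued_dirty)
{
	return isolated && nr_unqueued_dirty == isolated;
}
```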
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 57 ++++++++++++++++-----------------------------------------
1 file changed, 16 insertions(+), 41 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8de5c8d5849e..17b5318fad39 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
int tier_idx)
{
bool success;
- bool dirty, writeback;
int gen = folio_lru_gen(folio);
int type = folio_is_file_lru(folio);
int zone = folio_zonenum(folio);
@@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
return true;
}
- dirty = folio_test_dirty(folio);
- writeback = folio_test_writeback(folio);
- if (type == LRU_GEN_FILE && dirty) {
- sc->nr.file_taken += delta;
- if (!writeback)
- sc->nr.unqueued_dirty += delta;
- }
-
- /* waiting for writeback */
- if (writeback || (type == LRU_GEN_FILE && dirty)) {
- gen = folio_inc_gen(lruvec, folio, true);
- list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
- return true;
- }
-
return false;
}
@@ -4754,8 +4738,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
scanned, skipped, isolated,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- if (type == LRU_GEN_FILE)
- sc->nr.file_taken += isolated;
*isolatedp = isolated;
return scanned;
@@ -4858,12 +4840,27 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
return scanned;
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
- sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr_reclaimed += reclaimed;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
+ /*
+ * If too many file cache in the coldest generation can't be evicted
+ * due to being dirty, wake up the flusher.
+ */
+ if (stat.nr_unqueued_dirty == isolated) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -5020,28 +5017,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
cond_resched();
}
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
- struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
return need_rotate;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 09/12] mm/mglru: remove no longer used reclaim argument for folio protection
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (7 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 10/12] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
` (3 subsequent siblings)
12 siblings, 0 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now dirty folios under reclaim are handled after isolation, not before,
since dirty reactivation must take the folio off the LRU first, and
that helps unify the dirty handling logic.
So this argument is no longer needed; just remove it.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 17b5318fad39..07667649c5e2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3220,7 +3220,7 @@ static int folio_update_gen(struct folio *folio, int gen)
}
/* protect pages accessed multiple times through file descriptors */
-static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio)
{
int type = folio_is_file_lru(folio);
struct lru_gen_folio *lrugen = &lruvec->lrugen;
@@ -3239,9 +3239,6 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
- /* for folio_end_writeback() */
- if (reclaiming)
- new_flags |= BIT(PG_reclaim);
} while (!try_cmpxchg(&folio->flags.f, &old_flags, new_flags));
lru_gen_update_size(lruvec, folio, old_gen, new_gen);
@@ -3855,7 +3852,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
- new_gen = folio_inc_gen(lruvec, folio, false);
+ new_gen = folio_inc_gen(lruvec, folio);
list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
/* don't count the workingset being lazily promoted */
@@ -4612,7 +4609,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* protected */
if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
/* don't count the workingset being lazily promoted */
@@ -4627,7 +4624,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
/* ineligible */
if (zone > sc->reclaim_idx) {
- gen = folio_inc_gen(lruvec, folio, false);
+ gen = folio_inc_gen(lruvec, folio);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 10/12] mm/vmscan: remove sc->file_taken
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (8 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 09/12] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-31 8:49 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 11/12] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
` (2 subsequent siblings)
12 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
No one is using it now, just remove it.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 07667649c5e2..603be5ef3ef2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -173,7 +173,6 @@ struct scan_control {
unsigned int congested;
unsigned int writeback;
unsigned int immediate;
- unsigned int file_taken;
unsigned int taken;
} nr;
@@ -2040,8 +2039,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
sc->nr.taken += nr_taken;
- if (file)
- sc->nr.file_taken += nr_taken;
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 11/12] mm/vmscan: remove sc->unqueued_dirty
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (9 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 10/12] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-31 8:51 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
2026-04-01 5:18 ` [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Leno Hou
12 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
No one is using it now, just remove it.
Suggested-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 603be5ef3ef2..1783da54ada1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,7 +169,6 @@ struct scan_control {
struct {
unsigned int dirty;
- unsigned int unqueued_dirty;
unsigned int congested;
unsigned int writeback;
unsigned int immediate;
@@ -2035,7 +2034,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
sc->nr.dirty += stat.nr_dirty;
sc->nr.congested += stat.nr_congested;
- sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
sc->nr.writeback += stat.nr_writeback;
sc->nr.immediate += stat.nr_immediate;
sc->nr.taken += nr_taken;
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (10 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 11/12] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
@ 2026-03-28 19:52 ` Kairui Song via B4 Relay
2026-03-31 9:24 ` Baolin Wang
` (2 more replies)
2026-04-01 5:18 ` [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Leno Hou
12 siblings, 3 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-28 19:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang, Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently, MGLRU and the non-MGLRU path handle reclaim statistics and
writeback throttling very differently; basically, MGLRU just ignores
the throttling part.
Let's unify this part: use a helper to deduplicate the code so both
setups share the same behavior. Also remove the folio_clear_reclaim in
isolate_folio, which was actively invalidating the congestion control.
PG_reclaim is now handled by shrink_folio_list; keeping it in
isolate_folio is not helpful.
Test using the following bash reproducer:
echo "Setup a slow device using dm delay"
dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
LOOP=$(losetup --show -f /var/tmp/backing)
mkfs.ext4 -q $LOOP
echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
dmsetup create slow_dev
mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
echo "Start writeback pressure"
sync && echo 3 > /proc/sys/vm/drop_caches
mkdir /sys/fs/cgroup/test_wb
echo 128M > /sys/fs/cgroup/test_wb/memory.max
(echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
echo "Clean up"
echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
dmsetup resume slow_dev
umount -l /mnt/slow && sync
dmsetup remove slow_dev
Before this commit, `dd` gets OOM killed immediately if MGLRU is
enabled; the classic LRU is fine.
After this commit, congestion control is effective, and there is no
more spinning on the LRU or premature OOM.
Stress tests on other workloads also look good.
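The decision logic of the unified helper can be sketched as follows.
This is a simplified illustration, not the real code: the actual
handle_reclaim_writeback() in the diff below also updates the sc->nr
counters and calls wakeup_flusher_threads(); here, throttling_sane
stands in for writeback_throttling_sane(sc).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simplified sketch of the throttle decision shared by both LRU
 * implementations: throttle only when every taken folio was unqueued
 * dirty (flushers are behind) and cgroup v1 writeback throttling is
 * in effect (throttling_sane is false).
 */
static bool should_throttle(unsigned long nr_taken,
			    unsigned long nr_unqueued_dirty,
			    bool throttling_sane)
{
	if (!nr_taken || nr_unqueued_dirty != nr_taken)
		return false;	/* no flusher wakeup needed at all */
	/* cgroup v1: wait on writeback issued by the flusher */
	return !throttling_sane;
}
```

Note the `!nr_taken` guard: evict_folios resets `isolated` to 0 on the
retry path in the diff below, so a retried list cannot re-trigger the
throttle.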
Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/vmscan.c | 93 +++++++++++++++++++++++++++----------------------------------
1 file changed, 41 insertions(+), 52 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1783da54ada1..83c8fdf8fdc4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
return !(current->flags & PF_LOCAL_THROTTLE);
}
+static void handle_reclaim_writeback(unsigned long nr_taken,
+ struct pglist_data *pgdat,
+ struct scan_control *sc,
+ struct reclaim_stat *stat)
+{
+ /*
+ * If dirty folios are scanned that are not queued for IO, it
+ * implies that flushers are not doing their job. This can
+ * happen when memory pressure pushes dirty folios to the end of
+ * the LRU before the dirty limits are breached and the dirty
+ * data has expired. It can also happen when the proportion of
+ * dirty folios grows not through writes but through memory
+ * pressure reclaiming all the clean cache. And in some cases,
+ * the flushers simply cannot keep up with the allocation
+ * rate. Nudge the flusher threads in case they are asleep.
+ */
+ if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+ wakeup_flusher_threads(WB_REASON_VMSCAN);
+ /*
+ * For cgroupv1 dirty throttling is achieved by waking up
+ * the kernel flusher here and later waiting on folios
+ * which are in writeback to finish (see shrink_folio_list()).
+ *
+ * Flusher may not be able to issue writeback quickly
+ * enough for cgroupv1 writeback throttling to work
+ * on a large system.
+ */
+ if (!writeback_throttling_sane(sc))
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+ }
+
+ sc->nr.dirty += stat->nr_dirty;
+ sc->nr.congested += stat->nr_congested;
+ sc->nr.writeback += stat->nr_writeback;
+ sc->nr.immediate += stat->nr_immediate;
+ sc->nr.taken += nr_taken;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
@@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
lruvec_lock_irq(lruvec);
lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
nr_scanned - nr_reclaimed);
-
- /*
- * If dirty folios are scanned that are not queued for IO, it
- * implies that flushers are not doing their job. This can
- * happen when memory pressure pushes dirty folios to the end of
- * the LRU before the dirty limits are breached and the dirty
- * data has expired. It can also happen when the proportion of
- * dirty folios grows not through writes but through memory
- * pressure reclaiming all the clean cache. And in some cases,
- * the flushers simply cannot keep up with the allocation
- * rate. Nudge the flusher threads in case they are asleep.
- */
- if (stat.nr_unqueued_dirty == nr_taken) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- *
- * Flusher may not be able to issue writeback quickly
- * enough for cgroupv1 writeback throttling to work
- * on a large system.
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
- sc->nr.dirty += stat.nr_dirty;
- sc->nr.congested += stat.nr_congested;
- sc->nr.writeback += stat.nr_writeback;
- sc->nr.immediate += stat.nr_immediate;
- sc->nr.taken += nr_taken;
-
+ handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
@@ -4651,9 +4657,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
if (!folio_test_referenced(folio))
set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
- /* for shrink_folio_list() */
- folio_clear_reclaim(folio);
-
success = lru_gen_del_folio(lruvec, folio, true);
VM_WARN_ON_ONCE_FOLIO(!success, folio);
@@ -4833,26 +4836,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
retry:
reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
sc->nr_reclaimed += reclaimed;
+ handle_reclaim_writeback(isolated, pgdat, sc, &stat);
trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
type_scanned, reclaimed, &stat, sc->priority,
type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
- /*
- * If too many file cache in the coldest generation can't be evicted
- * due to being dirty, wake up the flusher.
- */
- if (stat.nr_unqueued_dirty == isolated) {
- wakeup_flusher_threads(WB_REASON_VMSCAN);
-
- /*
- * For cgroupv1 dirty throttling is achieved by waking up
- * the kernel flusher here and later waiting on folios
- * which are in writeback to finish (see shrink_folio_list()).
- */
- if (!writeback_throttling_sane(sc))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
- }
-
list_for_each_entry_safe_reverse(folio, next, &list, lru) {
DEFINE_MIN_SEQ(lruvec);
@@ -4895,6 +4883,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
if (!list_empty(&list)) {
skip_retry = true;
+ isolated = 0;
goto retry;
}
--
2.53.0
^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH v2 04/12] mm/mglru: restructure the reclaim loop
2026-03-28 19:52 ` [PATCH v2 04/12] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-03-29 6:47 ` Kairui Song
0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-29 6:47 UTC (permalink / raw)
To: linux-mm
Cc: Kairui Song, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang
So sashiko provided some review; I'll post it below:
https://sashiko.dev/#/patchset/20260329-mglru-reclaim-v2-0-b53a3678513c%40tencent.com?patch=12986
> From sashiko:
>
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4913,49 +4913,40 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > }
> >
> > static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> > - int swappiness, unsigned long *nr_to_scan)
> > + struct scan_control *sc, int swappiness)
> > {
> > DEFINE_MIN_SEQ(lruvec);
> >
> > - *nr_to_scan = 0;
> > /* have to run aging, since eviction is not possible anymore */
> > if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
> > return true;
> >
> > - *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> > + /* try to get away with not aging at the default priority */
> > + if (sc->priority == DEF_PRIORITY)
> > + return false;
>
> Is this a regression that unconditionally forces aging even at DEF_PRIORITY?
> In the original code, if eviction wasn't possible, it set need_aging to
> true, but the subsequent check against DEF_PRIORITY could still bypass aging
> and allow MGLRU to skip memcgs without cold pages.
> With the new logic, the mandatory aging check precedes the DEF_PRIORITY
> check, returning true immediately. Will this cause try_to_shrink_lruvec()
> to inappropriately force hot pages into older generations on the gentlest
> reclaim pass instead of trying other memcgs first?
Right, this part is a bit confusing indeed. I tried to split it
into two patches, but it seemed that could be even messier.
This is more of an intentional change. When eviction is not possible
(fewer than MIN_NR_GENS generations), aging is the only way to make
progress. The old code always skipped aging at DEF_PRIORITY.
For a single cgroup setup, this is not helpful in any way: it wastes a
reclaim iteration only to come back at a higher priority and age
then. It also causes over-reclaim of slab.
For a multi-cgroup setup, deferring the aging and reclaiming other
cgroups first may seem like a good idea, but we will have to proceed
with aging anyway, potentially when all cgroups have run out of
gens, causing larger jitter and breaking reclaim balance.
The escalated priority may also cause over-reclaim.
And right now we still have 3 gens as the buffer.
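To illustrate the over-reclaim point with a userspace toy (DEF_PRIORITY
and the priority shift are as in the kernel; the lruvec size is made up):
the scan target for the same lruvec grows exponentially as reclaim
escalates priority, so deferring aging only to come back at a higher
priority ends up far more aggressive than aging at DEF_PRIORITY:

```c
#include <assert.h>

#define DEF_PRIORITY 12

/* toy model of the nr_to_scan >> sc->priority scaling */
static unsigned long toy_scan_target(unsigned long evictable, int priority)
{
	return evictable >> priority;
}
```

For a lruvec with 1M evictable folios, DEF_PRIORITY targets 256 folios
per pass, while an escalated priority of 4 targets 65536.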
Testing the multi-cgroup setup using the JavaScript workload in the
cover letter showed that after this series, multi-cgroup reclaim
balance and latency are good, and the total throughput seems better.
Testing the reclaim balance of a cgroup tree (see below) also
looks better.
I think to avoid the latency of aging, we really need async
aging, not deferred aging. Async aging might be easier to
integrate here now after this series.
I could add some words on that in the commit message.
> From sashiko:
>
> > -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> > +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> > + struct mem_cgroup *memcg, int swappiness)
> > {
> > - bool need_aging;
> > unsigned long nr_to_scan;
> > - struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > - DEFINE_MAX_SEQ(lruvec);
> > -
> > - if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> > - return -1;
> > -
> > - need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> >
> > + nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> > /* try to scrape all its memory if this memcg was deleted */
> > - if (nr_to_scan && !mem_cgroup_online(memcg))
> > + if (!mem_cgroup_online(memcg))
> > return nr_to_scan;
>
> Will this cause a regression resulting in a permanent memory leak for
> offline memcgs that have zero evictable folios?
> If an offline memcg has all its folios in the unevictable youngest
> generation, lruvec_evictable_size() returns 0. The early return here
> then evaluates to 0.
> When try_to_shrink_lruvec() receives 0, its while (nr_to_scan > 0) loop
> is completely bypassed. Since the aging logic was moved inside that loop,
> the offline memcg is never aged, meaning its folios will remain in the
> unevictable youngest generation indefinitely.
Seems like a hallucination. lruvec_evictable_size() returns evictable
folios across all gens, not excluding the youngest one, since aging
may make them evictable. Perhaps we can add a comment for this;
if it confuses AI, it may confuse humans too.
And aging is not outside the loop, it's just moved to a different
helper.
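A toy model of what I mean (illustrative names only, not the actual
kernel data structures): the evictable size sums folio counts across
every generation, including the youngest, so an offline memcg whose
folios all sit in the youngest gen still reports a non-zero size and
enters the reclaim loop where aging now happens:

```c
#include <assert.h>

#define TOY_MAX_NR_GENS 4

/* illustrative per-generation folio counts for one LRU type */
struct toy_lrugen {
	unsigned long nr_folios[TOY_MAX_NR_GENS];
};

/* sums all gens; the youngest is deliberately NOT excluded */
static unsigned long toy_evictable_size(const struct toy_lrugen *lrugen)
{
	unsigned long total = 0;
	int gen;

	for (gen = 0; gen < TOY_MAX_NR_GENS; gen++)
		total += lrugen->nr_folios[gen];
	return total;
}
```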
> From sashiko:
> >
> > nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> >
> > - /* try to get away with not aging at the default priority */
> > - if (!need_aging || sc->priority == DEF_PRIORITY)
> > - return nr_to_scan >> sc->priority;
> > -
> > - /* stop scanning this lruvec as it's low on cold folios */
> > - return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
> > + /*
> > + * Always respect scan priority, minimally target
> > + * SWAP_CLUSTER_MAX pages to keep reclaim moving forwards.
> > + */
> > + nr_to_scan >>= sc->priority;
> > + return max(nr_to_scan, SWAP_CLUSTER_MAX);
> > }
>
> Does this regression break proportional reclaim and memory protection
> semantics for small memcgs by forcing a minimum scan size?
Actually this change was inspired by sashiko's review of V1:
https://sashiko.dev/#/patchset/20260318-mglru-reclaim-v1-0-2c46f9eb0508%40tencent.com?patch=2909
Without this, for cgroups smaller than 16M, a DEF_PRIORITY scan will
just do nothing. That's still OK, but to make it more efficient I added
a minimal batch. Thinking about it again, it had better be:
if (!nr_to_scan)
        nr_to_scan = min(lruvec_evictable_size, SWAP_CLUSTER_MAX);
Using max() here could get very small cgroups over-reclaimed.
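To make the difference concrete (a userspace sketch; SWAP_CLUSTER_MAX
and DEF_PRIORITY values are as in the kernel, the helpers are toys):
with max(), a tiny cgroup whose shifted scan target is 0 always gets a
forced 32-folio scan, possibly larger than the whole lruvec, while the
min() fallback stays bounded by what the cgroup actually has:

```c
#include <assert.h>

#define SWAP_CLUSTER_MAX 32UL
#define DEF_PRIORITY 12

/* v2 behavior: always target at least SWAP_CLUSTER_MAX folios */
static unsigned long scan_with_max(unsigned long evictable, int priority)
{
	unsigned long nr = evictable >> priority;

	return nr > SWAP_CLUSTER_MAX ? nr : SWAP_CLUSTER_MAX;
}

/* proposed v3 behavior: minimal batch capped by the evictable size */
static unsigned long scan_with_min(unsigned long evictable, int priority)
{
	unsigned long nr = evictable >> priority;

	if (!nr)
		nr = evictable < SWAP_CLUSTER_MAX ? evictable : SWAP_CLUSTER_MAX;
	return nr;
}
```

For a 64K cgroup (16 4K folios), scan_with_max() targets 32 folios,
twice the cgroup's entire evictable size, while scan_with_min()
targets 16.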
I did test V2 using the test_memcg_min selftest suggested by commit af827e090489:
Before:
Proportional reclaim results:
c[0] actual= 29069312 (27M) ideal= 30408704 (29M) err=4.4%
c[1] actual= 23257088 (22M) ideal= 22020096 (21M) err=5.6%
c[2] actual= 1552384 (1M) (expected ~0)
c[3] actual= 0 (0M) (expected =0)
After:
Proportional reclaim results:
c[0] actual= 31391744 (29M) ideal= 30408704 (29M) err=3.2%
c[1] actual= 21028864 (20M) ideal= 22020096 (21M) err=4.5%
c[2] actual= 1515520 (1M) (expected ~0)
c[3] actual= 0 (0M) (expected =0)
In both cases the result is somewhat unstable; I ran the test
7 times and took the median stable result. After this series the
result sometimes seems even better, but that is likely just noise,
and I didn't see a regression.
The 32-folio minimal batch already seems small enough for
typical usage, but min(evictable_size, SWAP_CLUSTER_MAX) is definitely
better. I'll send a V3 to update this.
I think none of the benchmarks or tests would be affected by this.
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-28 19:52 ` [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
@ 2026-03-29 8:21 ` Kairui Song
2026-03-29 8:46 ` Kairui Song
2026-03-31 8:42 ` Baolin Wang
2026-04-01 23:37 ` Shakeel Butt
2 siblings, 1 reply; 44+ messages in thread
From: Kairui Song @ 2026-03-29 8:21 UTC (permalink / raw)
To: linux-mm
Cc: kasong, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang
Forwarding sashiko's review:
https://sashiko.dev/#/message/20260329-mglru-reclaim-v2-8-b53a3678513c%40tencent.com
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8de5c8d5849e..17b5318fad39 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [ ... ]
> > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > return true;
> > }
> >
> > - dirty = folio_test_dirty(folio);
> > - writeback = folio_test_writeback(folio);
> > - if (type == LRU_GEN_FILE && dirty) {
> > - sc->nr.file_taken += delta;
> > - if (!writeback)
> > - sc->nr.unqueued_dirty += delta;
> > - }
> > -
> > - /* waiting for writeback */
> > - if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > - gen = folio_inc_gen(lruvec, folio, true);
> > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > - return true;
> > - }
> > -
> > return false;
> > }
>
> Could removing this proactive promotion cause a livelock and premature OOM
> during GFP_NOIO or GFP_NOFS direct reclaim?
> Without this check, dirty file folios fall through to isolate_folio(). If
> the reclaim does not have __GFP_IO, isolate_folio() will skip them.
> These skipped folios are spliced back to the front of the oldest generation
> list in scan_folios(). Because isolate_folio() isolates 0 folios,
> evict_folios() returns early.
> This completely bypasses the new wakeup_flusher_threads() call, leaving the
> dirty folios unqueued for writeback, and the oldest generation can never
> advance.
This is a nice find. For GFP_NOIO (or actually !__GFP_IO), we also need
to activate and set reclaim on the dirty folios. It's a narrow case, though.
The following update should work:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8170aee096e9..342ba3afe77c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4641,8 +4641,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
/* swap constrained */
if (!(sc->gfp_mask & __GFP_IO) &&
- (folio_test_dirty(folio) ||
- (folio_test_anon(folio) && !folio_test_swapcache(folio))))
+ (folio_test_anon(folio) && !folio_test_swapcache(folio)))
return false;
> [ ... ]
> > @@ -4858,12 +4840,27 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > return scanned;
> > retry:
> > reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> > - sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> > sc->nr_reclaimed += reclaimed;
> > trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> > type_scanned, reclaimed, &stat, sc->priority,
> > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >
> > + /*
> > + * If too many file cache in the coldest generation can't be evicted
> > + * due to being dirty, wake up the flusher.
> > + */
> > + if (stat.nr_unqueued_dirty == isolated) {
>
> Is the isolated variable stale when evaluated on the retry path?
> If evict_folios() jumps back to the retry label, shrink_folio_list()
> processes a smaller list of only clean folios. The isolated variable retains
> the size of the original list, while stat.nr_unqueued_dirty can only be as
> large as the new, smaller list.
> Does this logically impossible condition cause any unintended behavior, or
> should the check be moved outside the retry loop to avoid confusion?
This is fine; stat.nr_unqueued_dirty can never exceed isolated. The
"retry" label above is only used to handle folios that failed to be
reclaimed after isolation. Meanwhile, I do think we should clean up this
retry logic, as it will also confuse the tracepoint. Better to do it later.
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-29 8:21 ` Kairui Song
@ 2026-03-29 8:46 ` Kairui Song
0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-29 8:46 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang
On Sun, Mar 29, 2026 at 4:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Forwarding sashiko review:
>
> https://sashiko.dev/#/message/20260329-mglru-reclaim-v2-8-b53a3678513c%40tencent.com
>
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 8de5c8d5849e..17b5318fad39 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > [ ... ]
> > > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > return true;
> > > }
> > >
> > > - dirty = folio_test_dirty(folio);
> > > - writeback = folio_test_writeback(folio);
> > > - if (type == LRU_GEN_FILE && dirty) {
> > > - sc->nr.file_taken += delta;
> > > - if (!writeback)
> > > - sc->nr.unqueued_dirty += delta;
> > > - }
> > > -
> > > - /* waiting for writeback */
> > > - if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > > - gen = folio_inc_gen(lruvec, folio, true);
> > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > - return true;
> > > - }
> > > -
> > > return false;
> > > }
> >
> > Could removing this proactive promotion cause a livelock and premature OOM
> > during GFP_NOIO or GFP_NOFS direct reclaim?
> > Without this check, dirty file folios fall through to isolate_folio(). If
> > the reclaim does not have __GFP_IO, isolate_folio() will skip them.
> > These skipped folios are spliced back to the front of the oldest generation
> > list in scan_folios(). Because isolate_folio() isolates 0 folios,
> > evict_folios() returns early.
> > This completely bypasses the new wakeup_flusher_threads() call, leaving the
> > dirty folios unqueued for writeback, and the oldest generation can never
> > advance.
>
> This is a nice found. For GFP_NOIO (or actually !__GFP_IO), we also need
> to active and set reclaim for the dirty folios. It's a narrow case though.
>
> Following update should work:
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8170aee096e9..342ba3afe77c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4641,8 +4641,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
>
> /* swap constrained */
> if (!(sc->gfp_mask & __GFP_IO) &&
> - (folio_test_dirty(folio) ||
> - (folio_test_anon(folio) && !folio_test_swapcache(folio))))
> + (folio_test_anon(folio) && !folio_test_swapcache(folio)))
Or this check should just be removed. shrink_folio_list() already has a
swap check and a more accurate may_enter_fs check.
* Re: [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation
2026-03-28 19:52 ` [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
@ 2026-03-30 1:57 ` Chen Ridong
2026-03-30 7:59 ` Baolin Wang
2026-04-01 0:00 ` Barry Song
2 siblings, 0 replies; 44+ messages in thread
From: Chen Ridong @ 2026-03-30 1:57 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On 2026/3/29 3:52, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> The current variable name isn't helpful. Make the variable names more
> meaningful.
>
> Only naming change, no behavior change.
>
> Suggested-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 14 +++++++-------
> 1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index adc07501a137..f336f89a2de6 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4934,7 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> */
> static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> {
> - bool success;
> + bool need_aging;
> unsigned long nr_to_scan;
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> DEFINE_MAX_SEQ(lruvec);
> @@ -4942,7 +4942,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
> if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> return -1;
>
> - success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> + need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
>
> /* try to scrape all its memory if this memcg was deleted */
> if (nr_to_scan && !mem_cgroup_online(memcg))
> @@ -4951,7 +4951,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
> nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
>
> /* try to get away with not aging at the default priority */
> - if (!success || sc->priority == DEF_PRIORITY)
> + if (!need_aging || sc->priority == DEF_PRIORITY)
> return nr_to_scan >> sc->priority;
>
> /* stop scanning this lruvec as it's low on cold folios */
> @@ -5040,7 +5040,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>
> static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> {
> - bool success;
> + bool need_rotate;
> unsigned long scanned = sc->nr_scanned;
> unsigned long reclaimed = sc->nr_reclaimed;
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> @@ -5058,7 +5058,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
> memcg_memory_event(memcg, MEMCG_LOW);
> }
>
> - success = try_to_shrink_lruvec(lruvec, sc);
> + need_rotate = try_to_shrink_lruvec(lruvec, sc);
>
> shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>
> @@ -5068,10 +5068,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>
> flush_reclaim_state(sc);
>
> - if (success && mem_cgroup_online(memcg))
> + if (need_rotate && mem_cgroup_online(memcg))
> return MEMCG_LRU_YOUNG;
>
> - if (!success && lruvec_is_sizable(lruvec, sc))
> + if (!need_rotate && lruvec_is_sizable(lruvec, sc))
> return 0;
>
> /* one retry if offlined or too small */
>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
--
Best regards,
Ridong
* Re: [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation
2026-03-28 19:52 ` [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
2026-03-30 1:57 ` Chen Ridong
@ 2026-03-30 7:59 ` Baolin Wang
2026-04-01 0:00 ` Barry Song
2 siblings, 0 replies; 44+ messages in thread
From: Baolin Wang @ 2026-03-30 7:59 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> The current variable name isn't helpful. Make the variable names more
> meaningful.
>
> Only naming change, no behavior change.
>
> Suggested-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
* Re: [PATCH v2 03/12] mm/mglru: relocate the LRU scan batch limit to callers
2026-03-28 19:52 ` [PATCH v2 03/12] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
@ 2026-03-30 8:14 ` Baolin Wang
2026-04-01 0:20 ` Barry Song
0 siblings, 1 reply; 44+ messages in thread
From: Baolin Wang @ 2026-03-30 8:14 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Same as active / inactive LRU, MGLRU isolates and scans folios in
> batches. The batch split is done hidden deep in the helper, which
> makes the code harder to follow. The helper's arguments are also
> confusing since callers usually request more folios than the batch
> size, so the helper almost never processes the full requested amount.
>
> Move the batch splitting into the top loop to make it cleaner, there
> should be no behavior change.
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
Some nits as follows, otherwise LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> mm/vmscan.c | 16 +++++++++-------
> 1 file changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f336f89a2de6..963362523782 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4695,10 +4695,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> int scanned = 0;
> int isolated = 0;
> int skipped = 0;
> - int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
> - int remaining = scan_batch;
> + unsigned long remaining = nr_to_scan;
> struct lru_gen_folio *lrugen = &lruvec->lrugen;
>
> + VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
> VM_WARN_ON_ONCE(!list_empty(list));
>
> if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
> @@ -4751,7 +4751,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> mod_lruvec_state(lruvec, item, isolated);
> mod_lruvec_state(lruvec, PGREFILL, sorted);
> mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
> - trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
> + trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
> scanned, skipped, isolated,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> if (type == LRU_GEN_FILE)
> @@ -4987,7 +4987,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
>
> static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> {
> - long nr_to_scan;
> + long nr_batch, nr_to_scan;
Nit: Since evict_folios() expects an unsigned long, why not define
'unsigned long nr_batch'?
> unsigned long scanned = 0;
> int swappiness = get_swappiness(lruvec, sc);
>
> @@ -4998,7 +4998,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> if (nr_to_scan <= 0)
> break;
>
> - delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
> + nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> + delta = evict_folios(nr_batch, lruvec, sc, swappiness);
> if (!delta)
> break;
>
> @@ -5623,6 +5624,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
> static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
> int swappiness, unsigned long nr_to_reclaim)
> {
> + int nr_batch;
Nit: since 'nr_to_reclaim' is unsigned long, better to use unsigned long
for 'nr_batch'.
> DEFINE_MAX_SEQ(lruvec);
>
> if (seq + MIN_NR_GENS > max_seq)
> @@ -5639,8 +5641,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
> if (sc->nr_reclaimed >= nr_to_reclaim)
> return 0;
>
> - if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
> - swappiness))
> + nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
> + if (!evict_folios(nr_batch, lruvec, sc, swappiness))
> return 0;
>
> cond_resched();
>
* Re: [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios
2026-03-28 19:52 ` [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
@ 2026-03-31 8:04 ` Baolin Wang
2026-03-31 9:01 ` Kairui Song
0 siblings, 1 reply; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 8:04 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Make the scan helpers return the exact number of folios being scanned
> or isolated. Since the reclaim loop now has a natural scan budget that
> controls the scan progress, returning the scan number directly should
> make the scan more accurate and easier to follow.
>
> The number of scanned folios for each iteration is always positive and
> larger than 0, unless the reclaim must stop for a forced aging, so
> there is no more need for any special handling when there is no
> progress made:
>
> - `return isolated || !remaining ? scanned : 0` in scan_folios: both
> the function and the call now just return the exact scan count,
> combined with the scan budget introduced in the previous commit to
> avoid livelock or under scan.
Make sense to me.
>
> - `scanned += try_to_inc_min_seq` in evict_folios: adding a bool as a
> scan count was kind of confusing and no longer needed too, as scan
> number will never be zero even if none of the folio in oldest
> generation is isolated.
Yes, agree.
>
> - `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios:
> the per-type get_nr_gens == MIN_NR_GENS check in scan_folios
> naturally returns 0 when only two gens remain and breaks the loop.
>
> Also move try_to_inc_min_seq before isolate_folios, so that any empty
> gens created by external folio freeing are also skipped.
This part is somewhat confusing. You probably mean the case where the
list of that gen becomes empty via isolate_folio(), right?
If that's the case, the original logic would remove the empty gens
produced by isolate_folio() after calling try_to_inc_min_seq().
However, with your changes, this removal won't happen until the next
eviction. Does this provide any additional benefits? Or could you
describe how this change impacts your testing?
> The scan still stops if there are only two gens left as the scan number
> will be zero, this behavior is same as before. This force gen protection
> may get removed or softened later to improve the reclaim a bit more.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
> 1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ab81ffdb241a..c5361efa6776 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4686,7 +4686,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
>
> static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> struct scan_control *sc, int type, int tier,
> - struct list_head *list)
> + struct list_head *list, int *isolatedp)
> {
> int i;
> int gen;
> @@ -4756,11 +4756,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> if (type == LRU_GEN_FILE)
> sc->nr.file_taken += isolated;
> - /*
> - * There might not be eligible folios due to reclaim_idx. Check the
> - * remaining to prevent livelock if it's not making progress.
> - */
> - return isolated || !remaining ? scanned : 0;
> +
> + *isolatedp = isolated;
> + return scanned;
> }
>
> static int get_tier_idx(struct lruvec *lruvec, int type)
> @@ -4804,33 +4802,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
>
> static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> struct scan_control *sc, int swappiness,
> - int *type_scanned, struct list_head *list)
> + struct list_head *list, int *isolated,
> + int *isolate_type, int *isolate_scanned)
> {
8 parameters:), can we reduce some of them?
> int i;
> + int scanned = 0;
> int type = get_type_to_scan(lruvec, swappiness);
>
> for_each_evictable_type(i, swappiness) {
> - int scanned;
> + int type_scan;
> int tier = get_tier_idx(lruvec, type);
>
> - *type_scanned = type;
> + type_scan = scan_folios(nr_to_scan, lruvec, sc,
> + type, tier, list, isolated);
>
> - scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
> - if (scanned)
> - return scanned;
> + scanned += type_scan;
> + if (*isolated) {
> + *isolate_type = type;
> + *isolate_scanned = type_scan;
> + break;
> + }
>
> type = !type;
> }
>
> - return 0;
> + return scanned;
> }
>
> static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> struct scan_control *sc, int swappiness)
> {
> - int type;
> - int scanned;
> - int reclaimed;
> LIST_HEAD(list);
> LIST_HEAD(clean);
> struct folio *folio;
> @@ -4838,19 +4839,18 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> enum node_stat_item item;
> struct reclaim_stat stat;
> struct lru_gen_mm_walk *walk;
> + int scanned, reclaimed;
> + int isolated = 0, type, type_scanned;
> bool skip_retry = false;
> - struct lru_gen_folio *lrugen = &lruvec->lrugen;
> struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
> lruvec_lock_irq(lruvec);
>
> - scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
> -
> - scanned += try_to_inc_min_seq(lruvec, swappiness);
> + try_to_inc_min_seq(lruvec, swappiness);
>
> - if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
> - scanned = 0;
> + scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
> + &list, &isolated, &type, &type_scanned);
>
> lruvec_unlock_irq(lruvec);
>
> @@ -4861,7 +4861,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
> sc->nr_reclaimed += reclaimed;
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> - scanned, reclaimed, &stat, sc->priority,
> + type_scanned, reclaimed, &stat, sc->priority,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>
* Re: [PATCH v2 06/12] mm/mglru: use a smaller batch for reclaim
2026-03-28 19:52 ` [PATCH v2 06/12] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
@ 2026-03-31 8:08 ` Baolin Wang
0 siblings, 0 replies; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 8:08 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> With a fixed number to reclaim calculated at the beginning, making each
> following step smaller should reduce the lock contention and avoid
> over-aggressive reclaim of folios, as it will abort earlier when the
> number of folios to be reclaimed is reached.
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-28 19:52 ` [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-03-29 8:21 ` Kairui Song
@ 2026-03-31 8:42 ` Baolin Wang
2026-03-31 9:18 ` Kairui Song
2026-04-01 23:37 ` Shakeel Butt
2 siblings, 1 reply; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 8:42 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> The current handling of dirty writeback folios is not working well for
> file page heavy workloads: Dirty folios are protected and move to next
> gen upon isolation of getting throttled or reactivation upon pageout
> (shrink_folio_list).
>
> This might help to reduce the LRU lock contention slightly, but as a
> result, the ping-pong effect of folios between head and tail of last two
> gens is serious as the shrinker will run into protected dirty writeback
> folios more frequently compared to activation. The dirty flush wakeup
> condition is also much more passive compared to active/inactive LRU.
> Active / inactve LRU wakes the flusher if one batch of folios passed to
> shrink_folio_list is unevictable due to under writeback, but MGLRU
> instead has to check this after the whole reclaim loop is done, and then
> count the isolation protection number compared to the total reclaim
> number.
>
> And we previously saw OOM problems with it, too, which were fixed but
> still not perfect [1].
>
> So instead, just drop the special handling for dirty writeback folios
> and re-activate them like the active/inactive LRU does. Also move the
> dirty flush wakeup check right after shrink_folio_list. This should
> improve both throttling and performance.
>
> Test with YCSB workloadb showed a major performance improvement:
>
> Before this series:
> Throughput(ops/sec): 61642.78008938203
> AverageLatency(us): 507.11127774145166
> pgpgin 158190589
> pgpgout 5880616
> workingset_refault 7262988
>
> After this commit:
> Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
> AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
> pgpgin 101871227 (-35.6%, lower is better)
> pgpgout 5770028
> workingset_refault 3418186 (-52.9%, lower is better)
>
> The refault rate is ~50% lower, and throughput is ~30% higher, which
> is a huge gain. We also observed significant performance gain for
> other real-world workloads.
>
> We were concerned that the dirty flush could cause more wear for SSD:
> that should not be the problem here, since the wakeup condition is when
> the dirty folios have been pushed to the tail of LRU, which indicates
> that memory pressure is so high that writeback is blocking the workload
> already.
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 57 ++++++++++++++++-----------------------------------------
> 1 file changed, 16 insertions(+), 41 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 8de5c8d5849e..17b5318fad39 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> int tier_idx)
> {
> bool success;
> - bool dirty, writeback;
> int gen = folio_lru_gen(folio);
> int type = folio_is_file_lru(folio);
> int zone = folio_zonenum(folio);
> @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> return true;
> }
>
> - dirty = folio_test_dirty(folio);
> - writeback = folio_test_writeback(folio);
> - if (type == LRU_GEN_FILE && dirty) {
> - sc->nr.file_taken += delta;
> - if (!writeback)
> - sc->nr.unqueued_dirty += delta;
> - }
> -
> - /* waiting for writeback */
> - if (writeback || (type == LRU_GEN_FILE && dirty)) {
> - gen = folio_inc_gen(lruvec, folio, true);
> - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> - return true;
> - }
I'm a bit concerned about the handling of dirty folios.
In the original logic, if we encounter a dirty folio, we increment its
generation counter by 1 and move it to the *second oldest generation*.
However, with your patch, shrink_folio_list() will activate the dirty
folio by calling folio_set_active(). Then, evict_folios() ->
move_folios_to_lru() will put the dirty folio back into the MGLRU list.
But because folio_test_active() is true for this dirty folio, the
dirty folio will now be placed into the *second youngest generation*
(see lru_gen_folio_seq()).
As a result, during the next eviction, these dirty folios won't be
scanned again (because they are in the second youngest generation).
Wouldn't this lead to a situation where the flusher cannot be woken up
in time, making OOM more likely?
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 10/12] mm/vmscan: remove sc->file_taken
2026-03-28 19:52 ` [PATCH v2 10/12] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
@ 2026-03-31 8:49 ` Baolin Wang
0 siblings, 0 replies; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 8:49 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> No one is using it now, just remove it.
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 11/12] mm/vmscan: remove sc->unqueued_dirty
2026-03-28 19:52 ` [PATCH v2 11/12] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
@ 2026-03-31 8:51 ` Baolin Wang
0 siblings, 0 replies; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 8:51 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> No one is using it now, just remove it.
>
> Suggested-by: Axel Rasmussen <axelrasmussen@google.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios
2026-03-31 8:04 ` Baolin Wang
@ 2026-03-31 9:01 ` Kairui Song
2026-03-31 9:52 ` Baolin Wang
0 siblings, 1 reply; 44+ messages in thread
From: Kairui Song @ 2026-03-31 9:01 UTC (permalink / raw)
To: Baolin Wang
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On Tue, Mar 31, 2026 at 04:04:30PM +0800, Baolin Wang wrote:
>
>
> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Make the scan helpers return the exact number of folios being scanned
> > or isolated. Since the reclaim loop now has a natural scan budget that
> > controls the scan progress, returning the scan number directly should
> > make the scan more accurate and easier to follow.
> >
> > The number of scanned folios for each iteration is always larger
> > than zero, unless the reclaim must stop for a forced aging, so
> > there is no longer any need for special handling when no progress
> > is made:
> >
> > - `return isolated || !remaining ? scanned : 0` in scan_folios: both
> > the function and the call now just return the exact scan count,
> > combined with the scan budget introduced in the previous commit to
> > avoid livelock or under scan.
>
> Make sense to me.
>
> >
> > - `scanned += try_to_inc_min_seq` in evict_folios: adding a bool as a
> > scan count was kind of confusing and is no longer needed, as the scan
> > count will never be zero even if none of the folios in the oldest
> > generation are isolated.
>
> Yes, agree.
>
> >
> > - `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios:
> > the per-type get_nr_gens == MIN_NR_GENS check in scan_folios
> > naturally returns 0 when only two gens remain and breaks the loop.
> >
> > Also move try_to_inc_min_seq before isolate_folios, so that any empty
> > gens created by external folio freeing are also skipped.
>
> This part is somewhat confusing. You probably mean the case where the list
> of that gen becomes empty via isolate_folio(), right?
>
> If that's the case, the original logic would remove the empty gens produced
> by isolate_folio() after calling try_to_inc_min_seq().
>
> However, with your changes, this removal won't happen until the next
> eviction. Does this provide any additional benefits? Or could you describe
> how this change impacts your testing?
Hi Baolin, thanks for the review.
Yeah, I also noticed this issue while doing more self-review after
sending this.
So I did some tests with the patch below:
static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
@@ -4818,11 +4814,15 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
lruvec_lock_irq(lruvec);
+ /* In case folio deletion created empty gen, flush them */
try_to_inc_min_seq(lruvec, swappiness);
scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
&list, &isolated, &type, &type_scanned);
+ /* Isolation might created empty gen, flush them */
+ try_to_inc_min_seq(lruvec, swappiness);
+
lruvec_unlock_irq(lruvec);
if (list_empty(&list))
The return value of try_to_inc_min_seq can also be dropped
since it's no longer used, and the function call should be cheap.
Kernel build test using 3G memory and make -j96, with ZRAM as swap;
system time in seconds, average of 12 runs each:
mm-new:
9136.055833
After V2:
8819.932222
After V2, with above patch:
8783.944444
After V2, without above patch but move try_to_inc_min_seq
back to after isolate_folios:
8807.874444
This series is looking good; this inc_min change seems trivial,
but in theory it does have a real effect.
- Moving try_to_inc_min_seq to after isolate_folios may result in a
  wasted isolate_folios call and an early abort of the reclaim loop if
  there is a stale empty oldest gen created by folio deletion.
- Moving try_to_inc_min_seq to before isolate_folios may leave an
  empty gen after isolation. Usually that's fine because the next
  eviction will still handle it. But in the window before the next
  eviction, new file folios could be added to the oldest gen and get
  reclaimed too early. That looks like a real problem.
  This may be trivial, since MGLRU itself can suffer the same problem
  when the oldest gen is just too short, which is a much more common
  case (the short-oldest-gen issue can be solved later).
- Having try_to_inc_min_seq both before and after isolate_folios
  seems the best choice here, which matches the benchmark results
  above, though the difference is very close to the noise level.
Well, I only tested one case, while the cover letter described a
larger matrix. It's still all good with this series, and I'm not
100% sure how this particular change affects them; I guess
it's still trivial.
The try_to_inc_min_seq call should be cheap enough, since it's
called only once per batch of 64 folios, and the non-inc path
only reads a few lists.
What do you think about just calling it twice here?
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ab81ffdb241a..c5361efa6776 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4686,7 +4686,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
> > static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > struct scan_control *sc, int type, int tier,
> > - struct list_head *list)
> > + struct list_head *list, int *isolatedp)
> > {
> > int i;
> > int gen;
> > @@ -4756,11 +4756,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > if (type == LRU_GEN_FILE)
> > sc->nr.file_taken += isolated;
> > - /*
> > - * There might not be eligible folios due to reclaim_idx. Check the
> > - * remaining to prevent livelock if it's not making progress.
> > - */
> > - return isolated || !remaining ? scanned : 0;
> > +
> > + *isolatedp = isolated;
> > + return scanned;
> > }
> > static int get_tier_idx(struct lruvec *lruvec, int type)
> > @@ -4804,33 +4802,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
> > static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > struct scan_control *sc, int swappiness,
> > - int *type_scanned, struct list_head *list)
> > + struct list_head *list, int *isolated,
> > + int *isolate_type, int *isolate_scanned)
> > {
>
> 8 parameters:), can we reduce some of them?
I'm not too concerned about this yet since it's a static function
with only one caller, so in most cases it's inlined.
It's a bit long indeed :). I haven't found a good way to shorten it,
since the tracepoint below needs the isolation type and count. Maybe
it can be simplified later by unfolding the function here or with
some other refactor.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-31 8:42 ` Baolin Wang
@ 2026-03-31 9:18 ` Kairui Song
2026-04-01 2:52 ` Baolin Wang
2026-04-02 0:11 ` Barry Song
0 siblings, 2 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-31 9:18 UTC (permalink / raw)
To: Baolin Wang
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
>
>
> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current handling of dirty writeback folios is not working well for
> > file-page-heavy workloads: dirty folios are protected, either moved to
> > the next gen upon isolation instead of being throttled, or reactivated
> > upon pageout (shrink_folio_list).
> >
> > This might help reduce LRU lock contention slightly, but as a
> > result, the ping-pong effect of folios between the head and tail of the
> > last two gens is serious, as the shrinker runs into protected dirty
> > writeback folios more frequently than activation would. The dirty flush
> > wakeup condition is also much more passive compared to the
> > active/inactive LRU: the active/inactive LRU wakes the flusher if one
> > batch of folios passed to shrink_folio_list is unevictable due to being
> > under writeback, but MGLRU instead has to check this after the whole
> > reclaim loop is done, and then compare the isolation protection count
> > against the total reclaim count.
> >
> > And we previously saw OOM problems with it, too, which were fixed but
> > still not perfect [1].
> >
> > So instead, just drop the special handling for dirty writeback folios
> > and re-activate them like the active/inactive LRU does. Also move the
> > dirty flush wakeup check right after shrink_folio_list. This should
> > improve both throttling and performance.
> >
> > Test with YCSB workloadb showed a major performance improvement:
> >
> > Before this series:
> > Throughput(ops/sec): 61642.78008938203
> > AverageLatency(us): 507.11127774145166
> > pgpgin 158190589
> > pgpgout 5880616
> > workingset_refault 7262988
> >
> > After this commit:
> > Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
> > AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
> > pgpgin 101871227 (-35.6%, lower is better)
> > pgpgout 5770028
> > workingset_refault 3418186 (-52.9%, lower is better)
> >
> > The refault rate is ~50% lower, and throughput is ~30% higher, which
> > is a huge gain. We also observed significant performance gain for
> > other real-world workloads.
> >
> > We were concerned that the dirty flush could cause more wear for SSD:
> > that should not be the problem here, since the wakeup condition is when
> > the dirty folios have been pushed to the tail of LRU, which indicates
> > that memory pressure is so high that writeback is blocking the workload
> > already.
> >
> > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/vmscan.c | 57 ++++++++++++++++-----------------------------------------
> > 1 file changed, 16 insertions(+), 41 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 8de5c8d5849e..17b5318fad39 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > int tier_idx)
> > {
> > bool success;
> > - bool dirty, writeback;
> > int gen = folio_lru_gen(folio);
> > int type = folio_is_file_lru(folio);
> > int zone = folio_zonenum(folio);
> > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > return true;
> > }
> > - dirty = folio_test_dirty(folio);
> > - writeback = folio_test_writeback(folio);
> > - if (type == LRU_GEN_FILE && dirty) {
> > - sc->nr.file_taken += delta;
> > - if (!writeback)
> > - sc->nr.unqueued_dirty += delta;
> > - }
> > -
> > - /* waiting for writeback */
> > - if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > - gen = folio_inc_gen(lruvec, folio, true);
> > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > - return true;
> > - }
>
> I'm a bit concerned about the handling of dirty folios.
>
> In the original logic, if we encounter a dirty folio, we increment its
> generation counter by 1 and move it to the *second oldest generation*.
>
> However, with your patch, shrink_folio_list() will activate the dirty folio
> by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
> will put the dirty folio back into the MGLRU list.
>
> But because folio_test_active() is true for this dirty folio, the dirty
> folio will now be placed into the *second youngest generation* (see
> lru_gen_folio_seq()).
Yeah, and that's exactly what we want. Otherwise, these folios will
stay in the oldest gen, and following scans will keep seeing them and
bouncing them again and again into a younger gen, since they are
not reclaimable.
The writeback completion callback (folio_rotate_reclaimable) will move
them back to the tail once they are actually reclaimable. So we are
not losing any ability to reclaim them. Am I missing anything?
>
> As a result, during the next eviction, these dirty folios won't be scanned
> again (because they are in the second youngest generation). Wouldn't this
> lead to a situation where the flusher cannot be woken up in time, making OOM
> more likely?
No? The flusher has already been woken up by the time they are seen
for the first time. If we see these folios again very soon, the LRU
is congested; a following patch handles the congested case too, by
throttling (which was completely missing previously). And we are now
actually a bit more proactive about waking up the flusher, since the
wakeup hook is moved inside the loop instead of after the whole loop
is finished.
The two behavior changes above basically just unify MGLRU with what
the classic LRU has been doing for years, and the results look
really good.
The global congestion handling is still missing after this series
though. I'll have to fix that later, I guess...
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-03-28 19:52 ` [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
@ 2026-03-31 9:24 ` Baolin Wang
2026-03-31 9:29 ` Kairui Song
2026-04-01 5:01 ` Leno Hou
2026-04-02 2:39 ` Shakeel Butt
2 siblings, 1 reply; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 9:24 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently, MGLRU and non-MGLRU handle reclaim statistics and
> writeback very differently, especially throttling.
> Basically, MGLRU just ignores the throttling part.
>
> Let's unify this part: use a helper to deduplicate the code
> so both setups share the same behavior. Also remove the
> folio_clear_reclaim in isolate_folio, which was actively defeating
> the congestion control. PG_reclaim is now handled by shrink_folio_list;
> keeping it in isolate_folio is not helpful.
>
> Test with the following bash reproducer:
>
> echo "Setup a slow device using dm delay"
> dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> LOOP=$(losetup --show -f /var/tmp/backing)
> mkfs.ext4 -q $LOOP
> echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> dmsetup create slow_dev
> mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
> echo "Start writeback pressure"
> sync && echo 3 > /proc/sys/vm/drop_caches
> mkdir /sys/fs/cgroup/test_wb
> echo 128M > /sys/fs/cgroup/test_wb/memory.max
> (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
> echo "Clean up"
> echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> dmsetup resume slow_dev
> umount -l /mnt/slow && sync
> dmsetup remove slow_dev
>
> Before this commit, `dd` gets OOM killed immediately if
> MGLRU is enabled. The classic LRU is fine.
>
> After this commit, congestion control is effective: no more
> spinning on the LRU or premature OOM.
>
> Stress tests on other workloads also look good.
>
> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 93 +++++++++++++++++++++++++++----------------------------------
> 1 file changed, 41 insertions(+), 52 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1783da54ada1..83c8fdf8fdc4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
> return !(current->flags & PF_LOCAL_THROTTLE);
> }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> + struct pglist_data *pgdat,
> + struct scan_control *sc,
> + struct reclaim_stat *stat)
> +{
> + /*
> + * If dirty folios are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty folios to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty folios grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> + /*
> + * For cgroupv1 dirty throttling is achieved by waking up
> + * the kernel flusher here and later waiting on folios
> + * which are in writeback to finish (see shrink_folio_list()).
> + *
> + * Flusher may not be able to issue writeback quickly
> + * enough for cgroupv1 writeback throttling to work
> + * on a large system.
> + */
> + if (!writeback_throttling_sane(sc))
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> + }
> +
> + sc->nr.dirty += stat->nr_dirty;
> + sc->nr.congested += stat->nr_congested;
> + sc->nr.writeback += stat->nr_writeback;
> + sc->nr.immediate += stat->nr_immediate;
> + sc->nr.taken += nr_taken;
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> * of reclaimed pages
> @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> lruvec_lock_irq(lruvec);
> lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
> nr_scanned - nr_reclaimed);
> -
> - /*
> - * If dirty folios are scanned that are not queued for IO, it
> - * implies that flushers are not doing their job. This can
> - * happen when memory pressure pushes dirty folios to the end of
> - * the LRU before the dirty limits are breached and the dirty
> - * data has expired. It can also happen when the proportion of
> - * dirty folios grows not through writes but through memory
> - * pressure reclaiming all the clean cache. And in some cases,
> - * the flushers simply cannot keep up with the allocation
> - * rate. Nudge the flusher threads in case they are asleep.
> - */
> - if (stat.nr_unqueued_dirty == nr_taken) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - *
> - * Flusher may not be able to issue writeback quickly
> - * enough for cgroupv1 writeback throttling to work
> - * on a large system.
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> - sc->nr.dirty += stat.nr_dirty;
> - sc->nr.congested += stat.nr_congested;
> - sc->nr.writeback += stat.nr_writeback;
> - sc->nr.immediate += stat.nr_immediate;
> - sc->nr.taken += nr_taken;
> -
> + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> return nr_reclaimed;
> @@ -4651,9 +4657,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
> if (!folio_test_referenced(folio))
> set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
>
> - /* for shrink_folio_list() */
> - folio_clear_reclaim(folio);
IMO, moving this change into patch 8 would make more sense. Otherwise LGTM.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-03-31 9:24 ` Baolin Wang
@ 2026-03-31 9:29 ` Kairui Song
2026-03-31 9:36 ` Baolin Wang
0 siblings, 1 reply; 44+ messages in thread
From: Kairui Song @ 2026-03-31 9:29 UTC (permalink / raw)
To: Baolin Wang
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On Tue, Mar 31, 2026 at 05:24:39PM +0800, Baolin Wang wrote:
>
>
> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, MGLRU and non-MGLRU handle reclaim statistics and
> > writeback very differently, especially throttling.
> > Basically, MGLRU just ignores the throttling part.
> >
> > Let's unify this part: use a helper to deduplicate the code
> > so both setups share the same behavior. Also remove the
> > folio_clear_reclaim in isolate_folio, which was actively defeating
> > the congestion control. PG_reclaim is now handled by shrink_folio_list;
> > keeping it in isolate_folio is not helpful.
> >
> > Test with the following bash reproducer:
> >
> > echo "Setup a slow device using dm delay"
> > dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> > LOOP=$(losetup --show -f /var/tmp/backing)
> > mkfs.ext4 -q $LOOP
> > echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> > dmsetup create slow_dev
> > mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
> >
> > echo "Start writeback pressure"
> > sync && echo 3 > /proc/sys/vm/drop_caches
> > mkdir /sys/fs/cgroup/test_wb
> > echo 128M > /sys/fs/cgroup/test_wb/memory.max
> > (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> > dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
> >
> > echo "Clean up"
> > echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> > dmsetup resume slow_dev
> > umount -l /mnt/slow && sync
> > dmsetup remove slow_dev
> >
> > Before this commit, `dd` gets OOM killed immediately if
> > MGLRU is enabled. The classic LRU is fine.
> >
> > After this commit, congestion control is effective: no more
> > spinning on the LRU or premature OOM.
> >
> > Stress tests on other workloads also look good.
> >
> > Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/vmscan.c | 93 +++++++++++++++++++++++++++----------------------------------
> > 1 file changed, 41 insertions(+), 52 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 1783da54ada1..83c8fdf8fdc4 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
> > return !(current->flags & PF_LOCAL_THROTTLE);
> > }
> > +static void handle_reclaim_writeback(unsigned long nr_taken,
> > + struct pglist_data *pgdat,
> > + struct scan_control *sc,
> > + struct reclaim_stat *stat)
> > +{
> > + /*
> > + * If dirty folios are scanned that are not queued for IO, it
> > + * implies that flushers are not doing their job. This can
> > + * happen when memory pressure pushes dirty folios to the end of
> > + * the LRU before the dirty limits are breached and the dirty
> > + * data has expired. It can also happen when the proportion of
> > + * dirty folios grows not through writes but through memory
> > + * pressure reclaiming all the clean cache. And in some cases,
> > + * the flushers simply cannot keep up with the allocation
> > + * rate. Nudge the flusher threads in case they are asleep.
> > + */
> > + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > + /*
> > + * For cgroupv1 dirty throttling is achieved by waking up
> > + * the kernel flusher here and later waiting on folios
> > + * which are in writeback to finish (see shrink_folio_list()).
> > + *
> > + * Flusher may not be able to issue writeback quickly
> > + * enough for cgroupv1 writeback throttling to work
> > + * on a large system.
> > + */
> > + if (!writeback_throttling_sane(sc))
> > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> > + }
> > +
> > + sc->nr.dirty += stat->nr_dirty;
> > + sc->nr.congested += stat->nr_congested;
> > + sc->nr.writeback += stat->nr_writeback;
> > + sc->nr.immediate += stat->nr_immediate;
> > + sc->nr.taken += nr_taken;
> > +}
> > +
> > /*
> > * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> > * of reclaimed pages
> > @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > lruvec_lock_irq(lruvec);
> > lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
> > nr_scanned - nr_reclaimed);
> > -
> > - /*
> > - * If dirty folios are scanned that are not queued for IO, it
> > - * implies that flushers are not doing their job. This can
> > - * happen when memory pressure pushes dirty folios to the end of
> > - * the LRU before the dirty limits are breached and the dirty
> > - * data has expired. It can also happen when the proportion of
> > - * dirty folios grows not through writes but through memory
> > - * pressure reclaiming all the clean cache. And in some cases,
> > - * the flushers simply cannot keep up with the allocation
> > - * rate. Nudge the flusher threads in case they are asleep.
> > - */
> > - if (stat.nr_unqueued_dirty == nr_taken) {
> > - wakeup_flusher_threads(WB_REASON_VMSCAN);
> > - /*
> > - * For cgroupv1 dirty throttling is achieved by waking up
> > - * the kernel flusher here and later waiting on folios
> > - * which are in writeback to finish (see shrink_folio_list()).
> > - *
> > - * Flusher may not be able to issue writeback quickly
> > - * enough for cgroupv1 writeback throttling to work
> > - * on a large system.
> > - */
> > - if (!writeback_throttling_sane(sc))
> > - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> > - }
> > -
> > - sc->nr.dirty += stat.nr_dirty;
> > - sc->nr.congested += stat.nr_congested;
> > - sc->nr.writeback += stat.nr_writeback;
> > - sc->nr.immediate += stat.nr_immediate;
> > - sc->nr.taken += nr_taken;
> > -
> > + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
> > trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> > nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> > return nr_reclaimed;
> > @@ -4651,9 +4657,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
> > if (!folio_test_referenced(folio))
> > set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
> > - /* for shrink_folio_list() */
> > - folio_clear_reclaim(folio);
>
> IMO, moving this change into patch 8 would make more sense. Otherwise LGTM.
Thanks for the review! I made it a separate patch so we can better
identify which part had the performance gain, and patch 8 can keep
its Reviewed-by. Patch 8 is still fine without this; a few counters
are updated with no user, which is a bit wasteful but harmless.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-03-31 9:29 ` Kairui Song
@ 2026-03-31 9:36 ` Baolin Wang
2026-03-31 9:40 ` Kairui Song
0 siblings, 1 reply; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 9:36 UTC (permalink / raw)
To: Kairui Song
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On 3/31/26 5:29 PM, Kairui Song wrote:
> On Tue, Mar 31, 2026 at 05:24:39PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Currently MGLRU and non-MGLRU handle reclaim statistics and
>>> writeback very differently, especially throttling.
>>> Basically, MGLRU just ignored the throttling part.
>>>
>>> Let's just unify this part: use a helper to deduplicate the code
>>> so both setups share the same behavior. Also remove the
>>> folio_clear_reclaim in isolate_folio, which was actively invalidating
>>> the congestion control. PG_reclaim is now handled by shrink_folio_list;
>>> keeping it in isolate_folio is not helpful.
>>>
>>> Tested using the following bash reproducer:
>>>
>>> echo "Setup a slow device using dm delay"
>>> dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
>>> LOOP=$(losetup --show -f /var/tmp/backing)
>>> mkfs.ext4 -q $LOOP
>>> echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
>>> dmsetup create slow_dev
>>> mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>>>
>>> echo "Start writeback pressure"
>>> sync && echo 3 > /proc/sys/vm/drop_caches
>>> mkdir /sys/fs/cgroup/test_wb
>>> echo 128M > /sys/fs/cgroup/test_wb/memory.max
>>> (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
>>> dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>>>
>>> echo "Clean up"
>>> echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
>>> dmsetup resume slow_dev
>>> umount -l /mnt/slow && sync
>>> dmsetup remove slow_dev
>>>
>>> Before this commit, `dd` gets OOM killed immediately if
>>> MGLRU is enabled. Classic LRU is fine.
>>>
>>> After this commit, congestion control is effective, with no more
>>> spinning on the LRU or premature OOMs.
>>>
>>> Stress tests on other workloads also look good.
>>>
>>> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/vmscan.c | 93 +++++++++++++++++++++++++++----------------------------------
>>> 1 file changed, 41 insertions(+), 52 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 1783da54ada1..83c8fdf8fdc4 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
>>> return !(current->flags & PF_LOCAL_THROTTLE);
>>> }
>>> +static void handle_reclaim_writeback(unsigned long nr_taken,
>>> + struct pglist_data *pgdat,
>>> + struct scan_control *sc,
>>> + struct reclaim_stat *stat)
>>> +{
>>> + /*
>>> + * If dirty folios are scanned that are not queued for IO, it
>>> + * implies that flushers are not doing their job. This can
>>> + * happen when memory pressure pushes dirty folios to the end of
>>> + * the LRU before the dirty limits are breached and the dirty
>>> + * data has expired. It can also happen when the proportion of
>>> + * dirty folios grows not through writes but through memory
>>> + * pressure reclaiming all the clean cache. And in some cases,
>>> + * the flushers simply cannot keep up with the allocation
>>> + * rate. Nudge the flusher threads in case they are asleep.
>>> + */
>>> + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
>>> + wakeup_flusher_threads(WB_REASON_VMSCAN);
>>> + /*
>>> + * For cgroupv1 dirty throttling is achieved by waking up
>>> + * the kernel flusher here and later waiting on folios
>>> + * which are in writeback to finish (see shrink_folio_list()).
>>> + *
>>> + * Flusher may not be able to issue writeback quickly
>>> + * enough for cgroupv1 writeback throttling to work
>>> + * on a large system.
>>> + */
>>> + if (!writeback_throttling_sane(sc))
>>> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>>> + }
>>> +
>>> + sc->nr.dirty += stat->nr_dirty;
>>> + sc->nr.congested += stat->nr_congested;
>>> + sc->nr.writeback += stat->nr_writeback;
>>> + sc->nr.immediate += stat->nr_immediate;
>>> + sc->nr.taken += nr_taken;
>>> +}
>>> +
>>> /*
>>> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
>>> * of reclaimed pages
>>> @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>>> lruvec_lock_irq(lruvec);
>>> lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
>>> nr_scanned - nr_reclaimed);
>>> -
>>> - /*
>>> - * If dirty folios are scanned that are not queued for IO, it
>>> - * implies that flushers are not doing their job. This can
>>> - * happen when memory pressure pushes dirty folios to the end of
>>> - * the LRU before the dirty limits are breached and the dirty
>>> - * data has expired. It can also happen when the proportion of
>>> - * dirty folios grows not through writes but through memory
>>> - * pressure reclaiming all the clean cache. And in some cases,
>>> - * the flushers simply cannot keep up with the allocation
>>> - * rate. Nudge the flusher threads in case they are asleep.
>>> - */
>>> - if (stat.nr_unqueued_dirty == nr_taken) {
>>> - wakeup_flusher_threads(WB_REASON_VMSCAN);
>>> - /*
>>> - * For cgroupv1 dirty throttling is achieved by waking up
>>> - * the kernel flusher here and later waiting on folios
>>> - * which are in writeback to finish (see shrink_folio_list()).
>>> - *
>>> - * Flusher may not be able to issue writeback quickly
>>> - * enough for cgroupv1 writeback throttling to work
>>> - * on a large system.
>>> - */
>>> - if (!writeback_throttling_sane(sc))
>>> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
>>> - }
>>> -
>>> - sc->nr.dirty += stat.nr_dirty;
>>> - sc->nr.congested += stat.nr_congested;
>>> - sc->nr.writeback += stat.nr_writeback;
>>> - sc->nr.immediate += stat.nr_immediate;
>>> - sc->nr.taken += nr_taken;
>>> -
>>> + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
>>> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>>> nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>>> return nr_reclaimed;
>>> @@ -4651,9 +4657,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
>>> if (!folio_test_referenced(folio))
>>> set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
>>> - /* for shrink_folio_list() */
>>> - folio_clear_reclaim(folio);
>>
>> IMO, moving this change into patch 8 would make more sense. Otherwise LGTM.
>
> Thanks for the review! I made it a separate patch so we can better
> identify which part had the performance gain, and patch 8 can keep
> its Reviewed-by. Patch 8 is still fine without this; a few counters
> are updated with no user, which is a bit wasteful but harmless.
I’m not referring to all the above changes. What I mean is that the
'folio_clear_reclaim' removal should belong in patch 8. Since
shrink_folio_list() in patch 8 will handle the writeback logic,
folio_clear_reclaim() should also be removed in the same patch.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-03-31 9:36 ` Baolin Wang
@ 2026-03-31 9:40 ` Kairui Song
0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-31 9:40 UTC (permalink / raw)
To: Baolin Wang
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On Tue, Mar 31, 2026 at 05:36:26PM +0800, Baolin Wang wrote:
>
>
> On 3/31/26 5:29 PM, Kairui Song wrote:
> > On Tue, Mar 31, 2026 at 05:24:39PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > Currently MGLRU and non-MGLRU handle reclaim statistics and
> > > > writeback very differently, especially throttling.
> > > > Basically, MGLRU just ignored the throttling part.
> > > >
> > > > Let's just unify this part: use a helper to deduplicate the code
> > > > so both setups share the same behavior. Also remove the
> > > > folio_clear_reclaim in isolate_folio, which was actively invalidating
> > > > the congestion control. PG_reclaim is now handled by shrink_folio_list;
> > > > keeping it in isolate_folio is not helpful.
> > > >
> > > > Tested using the following bash reproducer:
> > > >
> > > > echo "Setup a slow device using dm delay"
> > > > dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> > > > LOOP=$(losetup --show -f /var/tmp/backing)
> > > > mkfs.ext4 -q $LOOP
> > > > echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> > > > dmsetup create slow_dev
> > > > mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
> > > >
> > > > echo "Start writeback pressure"
> > > > sync && echo 3 > /proc/sys/vm/drop_caches
> > > > mkdir /sys/fs/cgroup/test_wb
> > > > echo 128M > /sys/fs/cgroup/test_wb/memory.max
> > > > (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> > > > dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
> > > >
> > > > echo "Clean up"
> > > > echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> > > > dmsetup resume slow_dev
> > > > umount -l /mnt/slow && sync
> > > > dmsetup remove slow_dev
> > > >
> > > > Before this commit, `dd` gets OOM killed immediately if
> > > > MGLRU is enabled. Classic LRU is fine.
> > > >
> > > > After this commit, congestion control is effective, with no more
> > > > spinning on the LRU or premature OOMs.
> > > >
> > > > Stress tests on other workloads also look good.
> > > >
> > > > Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > ---
> > > > mm/vmscan.c | 93 +++++++++++++++++++++++++++----------------------------------
> > > > 1 file changed, 41 insertions(+), 52 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 1783da54ada1..83c8fdf8fdc4 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
> > > > return !(current->flags & PF_LOCAL_THROTTLE);
> > > > }
> > > > +static void handle_reclaim_writeback(unsigned long nr_taken,
> > > > + struct pglist_data *pgdat,
> > > > + struct scan_control *sc,
> > > > + struct reclaim_stat *stat)
> > > > +{
> > > > + /*
> > > > + * If dirty folios are scanned that are not queued for IO, it
> > > > + * implies that flushers are not doing their job. This can
> > > > + * happen when memory pressure pushes dirty folios to the end of
> > > > + * the LRU before the dirty limits are breached and the dirty
> > > > + * data has expired. It can also happen when the proportion of
> > > > + * dirty folios grows not through writes but through memory
> > > > + * pressure reclaiming all the clean cache. And in some cases,
> > > > + * the flushers simply cannot keep up with the allocation
> > > > + * rate. Nudge the flusher threads in case they are asleep.
> > > > + */
> > > > + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> > > > + wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > > + /*
> > > > + * For cgroupv1 dirty throttling is achieved by waking up
> > > > + * the kernel flusher here and later waiting on folios
> > > > + * which are in writeback to finish (see shrink_folio_list()).
> > > > + *
> > > > + * Flusher may not be able to issue writeback quickly
> > > > + * enough for cgroupv1 writeback throttling to work
> > > > + * on a large system.
> > > > + */
> > > > + if (!writeback_throttling_sane(sc))
> > > > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> > > > + }
> > > > +
> > > > + sc->nr.dirty += stat->nr_dirty;
> > > > + sc->nr.congested += stat->nr_congested;
> > > > + sc->nr.writeback += stat->nr_writeback;
> > > > + sc->nr.immediate += stat->nr_immediate;
> > > > + sc->nr.taken += nr_taken;
> > > > +}
> > > > +
> > > > /*
> > > > * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> > > > * of reclaimed pages
> > > > @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > > > lruvec_lock_irq(lruvec);
> > > > lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
> > > > nr_scanned - nr_reclaimed);
> > > > -
> > > > - /*
> > > > - * If dirty folios are scanned that are not queued for IO, it
> > > > - * implies that flushers are not doing their job. This can
> > > > - * happen when memory pressure pushes dirty folios to the end of
> > > > - * the LRU before the dirty limits are breached and the dirty
> > > > - * data has expired. It can also happen when the proportion of
> > > > - * dirty folios grows not through writes but through memory
> > > > - * pressure reclaiming all the clean cache. And in some cases,
> > > > - * the flushers simply cannot keep up with the allocation
> > > > - * rate. Nudge the flusher threads in case they are asleep.
> > > > - */
> > > > - if (stat.nr_unqueued_dirty == nr_taken) {
> > > > - wakeup_flusher_threads(WB_REASON_VMSCAN);
> > > > - /*
> > > > - * For cgroupv1 dirty throttling is achieved by waking up
> > > > - * the kernel flusher here and later waiting on folios
> > > > - * which are in writeback to finish (see shrink_folio_list()).
> > > > - *
> > > > - * Flusher may not be able to issue writeback quickly
> > > > - * enough for cgroupv1 writeback throttling to work
> > > > - * on a large system.
> > > > - */
> > > > - if (!writeback_throttling_sane(sc))
> > > > - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> > > > - }
> > > > -
> > > > - sc->nr.dirty += stat.nr_dirty;
> > > > - sc->nr.congested += stat.nr_congested;
> > > > - sc->nr.writeback += stat.nr_writeback;
> > > > - sc->nr.immediate += stat.nr_immediate;
> > > > - sc->nr.taken += nr_taken;
> > > > -
> > > > + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
> > > > trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> > > > nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> > > > return nr_reclaimed;
> > > > @@ -4651,9 +4657,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
> > > > if (!folio_test_referenced(folio))
> > > > set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
> > > > - /* for shrink_folio_list() */
> > > > - folio_clear_reclaim(folio);
> > >
> > > IMO, moving this change into patch 8 would make more sense. Otherwise LGTM.
> >
> > Thanks for the review! I made it a separate patch so we can better
> > identify which part had the performance gain, and patch 8 can keep
> > its Reviewed-by. Patch 8 is still fine without this; a few counters
> > are updated with no user, which is a bit wasteful but harmless.
>
> I’m not referring to all the above changes. What I mean is that the
> 'folio_clear_reclaim' removal should belong in patch 8. Since
> shrink_folio_list() in patch 8 will handle the writeback logic,
> folio_clear_reclaim() should also be removed in the same patch.
Ah, that's a very good point then. I can move it in V3, thanks!
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios
2026-03-31 9:01 ` Kairui Song
@ 2026-03-31 9:52 ` Baolin Wang
0 siblings, 0 replies; 44+ messages in thread
From: Baolin Wang @ 2026-03-31 9:52 UTC (permalink / raw)
To: Kairui Song
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On 3/31/26 5:01 PM, Kairui Song wrote:
> On Tue, Mar 31, 2026 at 04:04:30PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Make the scan helpers return the exact number of folios being scanned
>>> or isolated. Since the reclaim loop now has a natural scan budget that
>>> controls the scan progress, returning the scan number directly should
>>> make the scan more accurate and easier to follow.
>>>
>>> The number of scanned folios for each iteration is always larger
>>> than zero, unless the reclaim must stop for a forced aging, so
>>> there is no longer any need for special handling when no progress
>>> is made:
>>>
>>> - `return isolated || !remaining ? scanned : 0` in scan_folios: both
>>> the function and the call now just return the exact scan count,
>>> combined with the scan budget introduced in the previous commit to
>>> avoid livelock or under-scanning.
>>
>> Make sense to me.
>>
>>>
>>> - `scanned += try_to_inc_min_seq` in evict_folios: adding a bool as a
>>> scan count was kind of confusing and is no longer needed either, as
>>> the scan number will never be zero even if none of the folios in the
>>> oldest generation are isolated.
>>
>> Yes, agree.
>>
>>>
>>> - `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios:
>>> the per-type get_nr_gens == MIN_NR_GENS check in scan_folios
>>> naturally returns 0 when only two gens remain and breaks the loop.
>>>
>>> Also move try_to_inc_min_seq before isolate_folios, so that any empty
>>> gens created by external folio freeing are also skipped.
>>
>> This part is somewhat confusing. You probably mean the case where the list
>> of that gen becomes empty via isolate_folio(), right?
>>
>> If that's the case, the original logic would remove the empty gens produced
>> by isolate_folio() after calling try_to_inc_min_seq().
>>
>> However, with your changes, this removal won't happen until the next
>> eviction. Does this provide any additional benefits? Or could you describe
>> how this change impacts your testing?
>
> Hi Baolin, thanks for the review.
>
> Yeah, I also noticed this issue after sending this, while doing more
> self-review.
>
> So I did some testing with the patch below:
>
> static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
> @@ -4818,11 +4814,15 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
> lruvec_lock_irq(lruvec);
>
> + /* In case folio deletion created empty gen, flush them */
> try_to_inc_min_seq(lruvec, swappiness);
>
> scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
> &list, &isolated, &type, &type_scanned);
>
> + /* Isolation might created empty gen, flush them */
> + try_to_inc_min_seq(lruvec, swappiness);
> +
> lruvec_unlock_irq(lruvec);
>
> if (list_empty(&list))
>
> The return value of try_to_inc_min_seq can also be dropped
> since it's no longer used, and the function call should be cheap.
>
> System time of a kernel build using 3G memory and make -j96,
> with ZRAM as swap, in seconds, averaged over 12 runs each:
>
> mm-new:
> 9136.055833
>
> After V2:
> 8819.932222
>
> After V2, with above patch:
> 8783.944444
>
> After V2, without the above patch but with try_to_inc_min_seq
> moved back to after isolate_folios:
> 8807.874444
>
> This series is looking good. This inc_min change seems trivial,
> but in theory it does have a real effect.
>
> - Moving the try_to_inc_min_seq after isolate_folios may result in a
> wasted isolate_folios call and an early abort of the reclaim loop if
> there is a stalled oldest gen created by folio deletion.
Indeed.
> - Moving the try_to_inc_min_seq before isolate_folios may leave an
> empty gen after isolation. Usually that's fine because the next
> eviction will still reclaim them. But in the window before the next
> eviction, new file folios could be added to the oldest gen and get
> reclaimed too early. That looks like a real problem.
>
> This may be trivial since MGLRU itself can suffer the same problem
> when the oldest gen is just too short, which is a much more common
> case (we can solve this short oldest gen issue later).
>
> - Having try_to_inc_min_seq both before and after isolate_folios
> seems like the best choice here, and it matches the benchmark
> results above, although the difference is very close to the noise level.
>
> Well, I only tested one case; the cover letter described a larger
> matrix. All of them are still good with this series, and I'm not
> 100% sure how this particular change affects them; I guess it's
> still trivial.
>
> The try_to_inc_min_seq call should be cheap enough since it's
> called only once per batch of 64 folios, and it only reads
> a few lists in the non-inc path.
>
> What do you think about just calling it twice here?
Sounds reasonable to me.
I'm not sure whether we need to split the above change into a new
patch with that explanation added, as this patch mainly focuses on
optimizing the number of folios being scanned.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation
2026-03-28 19:52 ` [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
2026-03-30 1:57 ` Chen Ridong
2026-03-30 7:59 ` Baolin Wang
@ 2026-04-01 0:00 ` Barry Song
2 siblings, 0 replies; 44+ messages in thread
From: Barry Song @ 2026-04-01 0:00 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Sun, Mar 29, 2026 at 3:52 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The current variable names aren't helpful. Make the variable names more
> meaningful.
>
> Only naming change, no behavior change.
>
> Suggested-by: Barry Song <baohua@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
LGTM,
Reviewed-by: Barry Song <baohua@kernel.org>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 03/12] mm/mglru: relocate the LRU scan batch limit to callers
2026-03-30 8:14 ` Baolin Wang
@ 2026-04-01 0:20 ` Barry Song
0 siblings, 0 replies; 44+ messages in thread
From: Barry Song @ 2026-04-01 0:20 UTC (permalink / raw)
To: Baolin Wang
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, David Stevens,
Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng
On Mon, Mar 30, 2026 at 4:15 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Same as active / inactive LRU, MGLRU isolates and scans folios in
> > batches. The batch split is done hidden deep in the helper, which
> > makes the code harder to follow. The helper's arguments are also
> > confusing since callers usually request more folios than the batch
> > size, so the helper almost never processes the full requested amount.
> >
> > Move the batch splitting into the top loop to make it cleaner; there
> > should be no behavior change.
> >
> > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
>
> Some nits as follows, otherwise LGTM.
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
With the same nits addressed,
Reviewed-by: Barry Song <baohua@kernel.org>
>
> > mm/vmscan.c | 16 +++++++++-------
> > 1 file changed, 9 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index f336f89a2de6..963362523782 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4695,10 +4695,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > int scanned = 0;
> > int isolated = 0;
> > int skipped = 0;
> > - int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
> > - int remaining = scan_batch;
> > + unsigned long remaining = nr_to_scan;
> > struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >
> > + VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
> > VM_WARN_ON_ONCE(!list_empty(list));
> >
> > if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
> > @@ -4751,7 +4751,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > mod_lruvec_state(lruvec, item, isolated);
> > mod_lruvec_state(lruvec, PGREFILL, sorted);
> > mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
> > - trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
> > + trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
> > scanned, skipped, isolated,
> > type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> > if (type == LRU_GEN_FILE)
> > @@ -4987,7 +4987,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> >
> > static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > {
> > - long nr_to_scan;
> > + long nr_batch, nr_to_scan;
>
> Nit: Since evict_folios() expects an unsigned long, why not define
> 'unsigned long nr_batch'?
I guess the confusion comes from nr_to_scan being a long
rather than an unsigned long. This is the only place where
nr_to_scan is defined as a long.
I think we should clean up get_nr_to_scan(). Right now, it
clearly returns more than it should, uses -1 to indicate
something else, and also calls try_to_inc_max_seq(), which
is not part of nr_to_scan.
That might be better addressed in a separate patch.
Thanks
Barry
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-31 9:18 ` Kairui Song
@ 2026-04-01 2:52 ` Baolin Wang
2026-04-01 4:57 ` Kairui Song
2026-04-02 0:11 ` Barry Song
1 sibling, 1 reply; 44+ messages in thread
From: Baolin Wang @ 2026-04-01 2:52 UTC (permalink / raw)
To: Kairui Song
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On 3/31/26 5:18 PM, Kairui Song wrote:
> On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
>>
>>
>> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> The current handling of dirty writeback folios is not working well for
>>> file page heavy workloads: dirty folios are protected and moved to the
>>> next gen upon isolation, instead of getting throttled or reactivated
>>> upon pageout (shrink_folio_list).
>>>
>>> This might help to reduce the LRU lock contention slightly, but as a
>>> result, the ping-pong effect of folios between the head and tail of the
>>> last two gens is serious, as the shrinker will run into protected dirty
>>> writeback folios more frequently than with activation. The dirty flush
>>> wakeup condition is also much more passive compared to active/inactive
>>> LRU. The active/inactive LRU wakes the flusher if one batch of folios
>>> passed to shrink_folio_list is unevictable due to being under writeback,
>>> but MGLRU instead has to check this after the whole reclaim loop is
>>> done, and then compare the isolation protection count against the total
>>> reclaim count.
>>>
>>> And we previously saw OOM problems with it too, which were fixed, but
>>> the fix is still not perfect [1].
>>>
>>> So instead, just drop the special handling for dirty writeback and
>>> re-activate such folios like the active / inactive LRU does. Also move
>>> the dirty flush wakeup check to right after shrink_folio_list. This
>>> should improve both throttling and performance.
>>>
>>> A test with YCSB workloadb showed a major performance improvement:
>>>
>>> Before this series:
>>> Throughput(ops/sec): 61642.78008938203
>>> AverageLatency(us): 507.11127774145166
>>> pgpgin 158190589
>>> pgpgout 5880616
>>> workingset_refault 7262988
>>>
>>> After this commit:
>>> Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
>>> AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
>>> pgpgin 101871227 (-35.6%, lower is better)
>>> pgpgout 5770028
>>> workingset_refault 3418186 (-52.9%, lower is better)
>>>
>>> The refault rate is ~50% lower, and throughput is ~30% higher, which
>>> is a huge gain. We also observed significant performance gains for
>>> other real-world workloads.
>>>
>>> We were concerned that the dirty flush could cause more wear on SSDs:
>>> that should not be a problem here, since the wakeup condition is that
>>> the dirty folios have been pushed to the tail of the LRU, which
>>> indicates that memory pressure is so high that writeback is blocking
>>> the workload already.
>>>
>>> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
>>> Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>> mm/vmscan.c | 57 ++++++++++++++++-----------------------------------------
>>> 1 file changed, 16 insertions(+), 41 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 8de5c8d5849e..17b5318fad39 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>>> int tier_idx)
>>> {
>>> bool success;
>>> - bool dirty, writeback;
>>> int gen = folio_lru_gen(folio);
>>> int type = folio_is_file_lru(folio);
>>> int zone = folio_zonenum(folio);
>>> @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>>> return true;
>>> }
>>> - dirty = folio_test_dirty(folio);
>>> - writeback = folio_test_writeback(folio);
>>> - if (type == LRU_GEN_FILE && dirty) {
>>> - sc->nr.file_taken += delta;
>>> - if (!writeback)
>>> - sc->nr.unqueued_dirty += delta;
>>> - }
>>> -
>>> - /* waiting for writeback */
>>> - if (writeback || (type == LRU_GEN_FILE && dirty)) {
>>> - gen = folio_inc_gen(lruvec, folio, true);
>>> - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
>>> - return true;
>>> - }
>>
>> I'm a bit concerned about the handling of dirty folios.
>>
>> In the original logic, if we encounter a dirty folio, we increment its
>> generation counter by 1 and move it to the *second oldest generation*.
>>
>> However, with your patch, shrink_folio_list() will activate the dirty folio
>> by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
>> will put the dirty folio back into the MGLRU list.
>>
>> But because the folio_test_active() is true for this dirty folio, the dirty
>> folio will now be placed into the *second youngest generation* (see
>> lru_gen_folio_seq()).
>
> Yeah, and that's exactly what we want. Otherwise, these folios will
> stay in the oldest gen, and following scans will keep seeing them and hence
Not the oldest gen; instead, they will be moved into the second oldest
gen, right?
if (writeback || (type == LRU_GEN_FILE && dirty)) {
gen = folio_inc_gen(lruvec, folio, true);
list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
}
> keep bouncing these folios again and again to a younger gen since
> they are not reclaimable.
>
> The writeback callback (folio_rotate_reclaimable) will move them
> back to tail once they are actually reclaimable. So we are not
> losing any ability to reclaim them. Am I missing anything?
Right.
>> As a result, during the next eviction, these dirty folios won't be scanned
>> again (because they are in the second youngest generation). Wouldn't this
>> lead to a situation where the flusher cannot be woken up in time, making OOM
>> more likely?
>
> No? The flusher is already woken up by the time they are seen for the
> first time. If we see these folios again very soon, the LRU is
> congested, and one following patch handles the congested case too by
> throttling (which was completely missing previously). And now we
Yes, throttling is what we expect.
My concern is that if all dirty folios are requeued into the *second
youngest generation*, it might lead to the throttling mechanism in
shrink_folio_list() becoming ineffective (because these dirty folios are
no longer scanned again), resulting in a failure to throttle reclamation
and leaving no reclaimable folios to scan, potentially causing premature
OOM.
Specifically, if the reclaimer scans a memcg's MGLRU for the first time,
all dirty folios are moved into the *second youngest generation*, and the
*oldest generation* will become empty and be removed by
try_to_inc_min_seq(), leaving only 3 generations.
Then on the next scan, we cannot find any file folios to scan, and if
the writeback of the memcg’s dirty folios has not yet completed, this
can lead to a premature OOM.
If, as in the original logic, these dirty folios are scanned by
shrink_folio_list() and moved into the *second oldest generation*,
then when the *oldest generation* becomes empty and is removed, the
reclaimer can still continue scanning the dirty folios (the former
second oldest generation becomes the oldest generation), thereby
continuing to trigger shrink_folio_list()’s writeback throttling and
avoiding a premature OOM.
Am I overthinking this?
> are actually a bit more proactive about waking up the flusher,
> since the wakeup hook is moved inside the loop instead of after
> the whole loop is finished.
>
> These two behavior changes above basically just unify MGLRU to do
> what the classical LRU has been doing for years, and the result looks
> really good.
One difference is that, for the classical LRU, if the inactive list is low,
we will run shrink_active_list() to refill the inactive list.
But for MGLRU, after your changes, we might not perform aging (e.g.,
DEF_PRIORITY will skip aging), which could make shrink_folio_list()’s
throttling less effective than expected, as I mentioned above.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-04-01 2:52 ` Baolin Wang
@ 2026-04-01 4:57 ` Kairui Song
0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-04-01 4:57 UTC (permalink / raw)
To: Baolin Wang
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On Wed, Apr 01, 2026 at 10:52:54AM +0800, Baolin Wang wrote:
>
>
> On 3/31/26 5:18 PM, Kairui Song wrote:
> > On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > The current handling of dirty writeback folios is not working well for
> > > > file page heavy workloads: dirty folios are protected and moved to the
> > > > next gen upon isolation instead of getting throttled or reactivated upon
> > > > pageout (shrink_folio_list).
> > > >
> > > > This might help to reduce the LRU lock contention slightly, but as a
> > > > result, the ping-pong effect of folios between the head and tail of the
> > > > last two gens is serious, as the shrinker will run into protected dirty
> > > > writeback folios more frequently compared to activation. The dirty flush
> > > > wakeup condition is also much more passive compared to the
> > > > active/inactive LRU. The active/inactive LRU wakes the flusher if one
> > > > batch of folios passed to shrink_folio_list is unreclaimable due to being
> > > > under writeback, but MGLRU instead has to check this after the whole
> > > > reclaim loop is done, and then compare the isolation protection number
> > > > against the total reclaim number.
> > > >
> > > > And we previously saw OOM problems with it, too, which were fixed but
> > > > the fix was still not perfect [1].
> > > >
> > > > So instead, drop the special handling for dirty writeback and simply
> > > > re-activate such folios like the active/inactive LRU does. Also move the
> > > > dirty flush wakeup check right after shrink_folio_list. This should
> > > > improve both throttling and performance.
> > > >
> > > > Test with YCSB workloadb showed a major performance improvement:
> > > >
> > > > Before this series:
> > > > Throughput(ops/sec): 61642.78008938203
> > > > AverageLatency(us): 507.11127774145166
> > > > pgpgin 158190589
> > > > pgpgout 5880616
> > > > workingset_refault 7262988
> > > >
> > > > After this commit:
> > > > Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
> > > > AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
> > > > pgpgin 101871227 (-35.6%, lower is better)
> > > > pgpgout 5770028
> > > > workingset_refault 3418186 (-52.9%, lower is better)
> > > >
> > > > The refault rate is ~50% lower, and throughput is ~30% higher, which
> > > > is a huge gain. We also observed significant performance gain for
> > > > other real-world workloads.
> > > >
> > > > We were concerned that the dirty flush could cause more wear for SSDs:
> > > > that should not be a problem here, since the wakeup condition is when
> > > > the dirty folios have been pushed to the tail of the LRU, which indicates
> > > > that memory pressure is so high that writeback is blocking the workload
> > > > already.
> > > >
> > > > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > > > Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > ---
> > > > mm/vmscan.c | 57 ++++++++++++++++-----------------------------------------
> > > > 1 file changed, 16 insertions(+), 41 deletions(-)
> > > >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 8de5c8d5849e..17b5318fad39 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > > int tier_idx)
> > > > {
> > > > bool success;
> > > > - bool dirty, writeback;
> > > > int gen = folio_lru_gen(folio);
> > > > int type = folio_is_file_lru(folio);
> > > > int zone = folio_zonenum(folio);
> > > > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > > return true;
> > > > }
> > > > - dirty = folio_test_dirty(folio);
> > > > - writeback = folio_test_writeback(folio);
> > > > - if (type == LRU_GEN_FILE && dirty) {
> > > > - sc->nr.file_taken += delta;
> > > > - if (!writeback)
> > > > - sc->nr.unqueued_dirty += delta;
> > > > - }
> > > > -
> > > > - /* waiting for writeback */
> > > > - if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > > > - gen = folio_inc_gen(lruvec, folio, true);
> > > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > > - return true;
> > > > - }
> > >
> > > I'm a bit concerned about the handling of dirty folios.
> > >
> > > In the original logic, if we encounter a dirty folio, we increment its
> > > generation counter by 1 and move it to the *second oldest generation*.
> > >
> > > However, with your patch, shrink_folio_list() will activate the dirty folio
> > > by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
> > > will put the dirty folio back into the MGLRU list.
> > >
> > > But because the folio_test_active() is true for this dirty folio, the dirty
> > > folio will now be placed into the *second youngest generation* (see
> > > lru_gen_folio_seq()).
> >
> > Yeah, and that's exactly what we want. Otherwise, these folios will
> > stay at the oldest gen, and the following scan will keep seeing them and hence
>
> Not the oldest gen, instead, they will be moved into the second oldest gen,
> right?
>
> if (writeback || (type == LRU_GEN_FILE && dirty)) {
> gen = folio_inc_gen(lruvec, folio, true);
> list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> return true;
> }
Right, it is still similar though: the scanner will see these folios
again very soon as the oldest gen is drained.
> > > As a result, during the next eviction, these dirty folios won't be scanned
> > > again (because they are in the second youngest generation). Wouldn't this
> > > lead to a situation where the flusher cannot be woken up in time, making OOM
> > > more likely?
> >
> > No? The flusher is already woken up by the time they are seen for the
> > first time. If we see these folios again very soon, the LRU is
> > congested, and one following patch handles the congested case too by
> > throttling (which was completely missing previously). And now we
>
> Yes, throttling is what we expect.
>
> My concern is that if all dirty folios are requeued into the *second
> youngest generation*, it might lead to the throttling mechanism in
> shrink_folio_list() becoming ineffective (because these dirty folios are no
> longer scanned again), resulting in a failure to throttle reclamation and
> leaving no reclaimable folios to scan, potentially causing premature OOM.
They are scanned again just fine when older gens are drained.

MGLRU uses a PID controller and protection, so it might seem harder for
promoted folios to get demoted to the tail - but we are not activating
them to the head either; the second youngest gen is not that far away
from the tail.

The classic LRU will simply move these pages to the head of the active
list, so it takes a whole scan iteration of the entire lruvec before
seeing these folios again, and it doesn't throttle unless there is
really no way to make progress.
> Specifically, if the reclaimer scans a memcg's MGLRU for the first time,
> all dirty folios are moved into the *second youngest generation*, and the
> *oldest generation* will become empty and be removed by
> try_to_inc_min_seq(), leaving only 3 generations.
>
> Then on the next scan, we cannot find any file folios to scan, and if the
> writeback of the memcg’s dirty folios has not yet completed, this can lead
> to a premature OOM.
Let's walk through this concretely. Assume gen 4 is the youngest and
gen 1 the oldest. Dirty folios are activated to gen 3 (second youngest).
Then gen 1 is drained and removed. Gen 2 becomes the new oldest, and it
is still evictable.

If we are so unlucky that gen 2 is empty or unevictable, anon reclaim
is still available. And if anon is unevictable (no swap, swap full,
or getting recycled), then file eviction proceeds - MGLRU's forced
aging is performed as the anon gens are drained.

Gen 3's content (the requeued dirty folios) is reached after the old
gen 2 is dropped, by which point the flusher could have been running
for two full generation-drain cycles and finished. We are all good.
Overall I think this issue seems trivial considering the chance and
time window of reclaim rotation vs aging, and the worst we get here is
a bit more anon reclaim. The anon / file balance and swappiness issue
when the gen gap is >= 2 is worth a separate fix.
>
> If, as in the original logic, these dirty folios are scanned by
> shrink_folio_list() and moved into the *second oldest generation*, then
> when the *oldest generation* becomes empty and is removed, the reclaimer
> can still continue scanning the dirty folios (the former second oldest
> generation becomes the oldest generation), thereby continuing to trigger
> shrink_folio_list()’s writeback throttling and avoiding a premature OOM.
Moving them to gen 2 (second oldest) blocks reclaim of gen 2 and starts
throttling early, while gen 2 is very likely reclaimable with clean
folios. The classic LRU will scan the whole LRU before starting to
throttle, to avoid exactly that.

I even hesitated over moving these folios to the youngest gen here.
It might be fine, as the youngest gen in theory should be the hottest,
so skipping it might not be a bad idea.
>
> Am I overthinking this?
We lived without throttling or proper dirty writeback handling for
years (and the benchmark represents a lot of real workloads).
Things are getting much better, so I think we are fine :)

I have been testing this new design on servers and on my Android phone,
and so far everything looks good.
> > are actually a bit more proactive about waking up the flusher,
> > since the wakeup hook is moved inside the loop instead of after
> > the whole loop is finished.
> >
> > These two behavior changes above basically just unify MGLRU to do
> > what the classical LRU has been doing for years, and the result looks
> > really good.
>
> One difference is that, for the classical LRU, if the inactive list is low,
> we will run shrink_active_list() to refill the inactive list.
>
> But for MGLRU, after your changes, we might not perform aging (e.g.,
> DEF_PRIORITY will skip aging), which could make shrink_folio_list()’s
> throttling less effective than expected, as I mentioned above.
That refill doesn't change the order of folios, it just shifts the
LRU as a whole. So essentially it needs to scan the whole LRU before
throttling. I think we might even still be a bit too aggressive, since
gen 4 is not touched before throttling starts, but gen 4 being
protected seems sane, so the whole picture looks alright.

DEF_PRIORITY gets escalated easily if the scan fails to satisfy the
reclaimer's need.
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-03-28 19:52 ` [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
2026-03-31 9:24 ` Baolin Wang
@ 2026-04-01 5:01 ` Leno Hou
2026-04-02 2:39 ` Shakeel Butt
2 siblings, 0 replies; 44+ messages in thread
From: Leno Hou @ 2026-04-01 5:01 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently MGLRU and non-MGLRU handle the reclaim statistics and
> writeback very differently, especially throttling.
> Basically, MGLRU just ignored the throttling part.
>
> Let's unify this part: use a helper to deduplicate the code
> so both setups share the same behavior. Also remove the
> folio_clear_reclaim() call in isolate_folio(), which was actively
> invalidating the congestion control. PG_reclaim is now handled by
> shrink_folio_list(); keeping it in isolate_folio() is not helpful.
>
> Tested using the following bash reproducer:
>
> echo "Setup a slow device using dm delay"
> dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> LOOP=$(losetup --show -f /var/tmp/backing)
> mkfs.ext4 -q $LOOP
> echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> dmsetup create slow_dev
> mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
> echo "Start writeback pressure"
> sync && echo 3 > /proc/sys/vm/drop_caches
> mkdir /sys/fs/cgroup/test_wb
> echo 128M > /sys/fs/cgroup/test_wb/memory.max
> (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
> echo "Clean up"
> echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> dmsetup resume slow_dev
> umount -l /mnt/slow && sync
> dmsetup remove slow_dev
>
Hi Kairui,
I have tested this patch series on both stable v6.1.163 and 7.0.0-rc5
regarding the writeback throttling issue you described, on an arm64 platform.
Test Results:
1. Kernel 6.1.163: I was unable to reproduce the OOM issue with the
provided test script. This is expected as the MGLRU writeback handling
might behave differently or the issue is less prominent in this specific
stable branch.
2. Kernel 7.0.0-rc5: I successfully reproduced the OOM issue using your
bash script. The dd process gets OOM killed shortly after starting the
writeback pressure.
Verification:
After applying your patch, I re-ran the test on 7.0.0-rc5:
1. The congestion control/throttling is now working as expected.
2. The OOM issue is resolved, and the dd process completes successfully
without being killed.
Tested-by: Leno Hou <lenohou@gmail.com>
> Before this commit, `dd` will get OOM killed immediately if
> MGLRU is enabled. Classic LRU is fine.
>
> After this commit, congestion control is now effective, with no more
> spinning on the LRU or premature OOMs.
>
> Stress tests on other workloads also look good.
>
> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/vmscan.c | 93 +++++++++++++++++++++++++++----------------------------------
> 1 file changed, 41 insertions(+), 52 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 1783da54ada1..83c8fdf8fdc4 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
> return !(current->flags & PF_LOCAL_THROTTLE);
> }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> + struct pglist_data *pgdat,
> + struct scan_control *sc,
> + struct reclaim_stat *stat)
> +{
> + /*
> + * If dirty folios are scanned that are not queued for IO, it
> + * implies that flushers are not doing their job. This can
> + * happen when memory pressure pushes dirty folios to the end of
> + * the LRU before the dirty limits are breached and the dirty
> + * data has expired. It can also happen when the proportion of
> + * dirty folios grows not through writes but through memory
> + * pressure reclaiming all the clean cache. And in some cases,
> + * the flushers simply cannot keep up with the allocation
> + * rate. Nudge the flusher threads in case they are asleep.
> + */
> + if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> + wakeup_flusher_threads(WB_REASON_VMSCAN);
> + /*
> + * For cgroupv1 dirty throttling is achieved by waking up
> + * the kernel flusher here and later waiting on folios
> + * which are in writeback to finish (see shrink_folio_list()).
> + *
> + * Flusher may not be able to issue writeback quickly
> + * enough for cgroupv1 writeback throttling to work
> + * on a large system.
> + */
> + if (!writeback_throttling_sane(sc))
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> + }
> +
> + sc->nr.dirty += stat->nr_dirty;
> + sc->nr.congested += stat->nr_congested;
> + sc->nr.writeback += stat->nr_writeback;
> + sc->nr.immediate += stat->nr_immediate;
> + sc->nr.taken += nr_taken;
> +}
> +
> /*
> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> * of reclaimed pages
> @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> lruvec_lock_irq(lruvec);
> lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
> nr_scanned - nr_reclaimed);
> -
> - /*
> - * If dirty folios are scanned that are not queued for IO, it
> - * implies that flushers are not doing their job. This can
> - * happen when memory pressure pushes dirty folios to the end of
> - * the LRU before the dirty limits are breached and the dirty
> - * data has expired. It can also happen when the proportion of
> - * dirty folios grows not through writes but through memory
> - * pressure reclaiming all the clean cache. And in some cases,
> - * the flushers simply cannot keep up with the allocation
> - * rate. Nudge the flusher threads in case they are asleep.
> - */
> - if (stat.nr_unqueued_dirty == nr_taken) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - *
> - * Flusher may not be able to issue writeback quickly
> - * enough for cgroupv1 writeback throttling to work
> - * on a large system.
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> - sc->nr.dirty += stat.nr_dirty;
> - sc->nr.congested += stat.nr_congested;
> - sc->nr.writeback += stat.nr_writeback;
> - sc->nr.immediate += stat.nr_immediate;
> - sc->nr.taken += nr_taken;
> -
> + handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> nr_scanned, nr_reclaimed, &stat, sc->priority, file);
> return nr_reclaimed;
> @@ -4651,9 +4657,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
> if (!folio_test_referenced(folio))
> set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
>
> - /* for shrink_folio_list() */
> - folio_clear_reclaim(folio);
> -
> success = lru_gen_del_folio(lruvec, folio, true);
> VM_WARN_ON_ONCE_FOLIO(!success, folio);
>
> @@ -4833,26 +4836,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> retry:
> reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> sc->nr_reclaimed += reclaimed;
> + handle_reclaim_writeback(isolated, pgdat, sc, &stat);
> trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
> type_scanned, reclaimed, &stat, sc->priority,
> type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> - /*
> - * If too many file cache in the coldest generation can't be evicted
> - * due to being dirty, wake up the flusher.
> - */
> - if (stat.nr_unqueued_dirty == isolated) {
> - wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
> - /*
> - * For cgroupv1 dirty throttling is achieved by waking up
> - * the kernel flusher here and later waiting on folios
> - * which are in writeback to finish (see shrink_folio_list()).
> - */
> - if (!writeback_throttling_sane(sc))
> - reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> - }
> -
> list_for_each_entry_safe_reverse(folio, next, &list, lru) {
> DEFINE_MIN_SEQ(lruvec);
>
> @@ -4895,6 +4883,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
> if (!list_empty(&list)) {
> skip_retry = true;
> + isolated = 0;
> goto retry;
> }
>
>
--
Best regards,
Leno Hou
* Re: [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
` (11 preceding siblings ...)
2026-03-28 19:52 ` [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
@ 2026-04-01 5:18 ` Leno Hou
2026-04-01 7:36 ` Kairui Song
12 siblings, 1 reply; 44+ messages in thread
From: Leno Hou @ 2026-04-01 5:18 UTC (permalink / raw)
To: kasong, linux-mm
Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
Chen Ridong, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> The aging OOM is a bit tricky, a specific reproducer can be used to
> simulate what we encountered in production environment [4]: Spawns
> multiple workers that keep reading the given file using mmap, and pauses
> for 120ms after one file read batch. It also spawns another set of
> workers that keep allocating and freeing a given size of anonymous memory.
> The total memory size exceeds the memory limit (eg. 44G anon + 8G file,
> which is 52G vs 48G memcg limit).
>
> - MGLRU disabled:
> Finished 128 iterations.
>
> - MGLRU enabled:
> OOM with following info after about ~10-20 iterations:
> [ 154.365634] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> [ 154.366456] memory: usage 50331648kB, limit 50331648kB, failcnt 354207
> [ 154.378941] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> [ 154.379408] Memory cgroup stats for /demo:
> [ 154.379544] anon 44352327680
> [ 154.380079] file 7187271680
>
> OOM occurs despite there being still evictable file folios.
>
> - MGLRU enabled after this series:
> Finished 128 iterations.
Hi Kairui,
I've tested on v6.1.163 and was unable to reproduce the OOM issue with your
test script [4]. Could you share the kernel version of your environment
or more information?
Link:
https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure
[4]
--
Best Regards,
Leno Hou
* Re: [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling
2026-04-01 5:18 ` [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Leno Hou
@ 2026-04-01 7:36 ` Kairui Song
0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-04-01 7:36 UTC (permalink / raw)
To: Leno Hou
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
David Stevens, Chen Ridong, Yafang Shao, Yu Zhao, Zicheng Wang,
Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
linux-kernel, Qi Zheng, Baolin Wang
On Wed, Apr 01, 2026 at 01:18:16PM +0800, Leno Hou wrote:
> On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > The aging OOM is a bit tricky, a specific reproducer can be used to
> > simulate what we encountered in production environment [4]: Spawns
> > multiple workers that keep reading the given file using mmap, and pauses
> > for 120ms after one file read batch. It also spawns another set of
> > workers that keep allocating and freeing a given size of anonymous memory.
> > The total memory size exceeds the memory limit (eg. 44G anon + 8G file,
> > which is 52G vs 48G memcg limit).
> >
> > - MGLRU disabled:
> > Finished 128 iterations.
> >
> > - MGLRU enabled:
> > OOM with following info after about ~10-20 iterations:
> > [ 154.365634] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> > [ 154.366456] memory: usage 50331648kB, limit 50331648kB, failcnt 354207
> > [ 154.378941] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> > [ 154.379408] Memory cgroup stats for /demo:
> > [ 154.379544] anon 44352327680
> > [ 154.380079] file 7187271680
> >
> > OOM occurs despite there being still evictable file folios.
> >
> > - MGLRU enabled after this series:
> > Finished 128 iterations.
>
> Hi Kairui,
Hi Leno,
>
> I've tested on v6.1.163 and unable to reproduce the OOM issue by your test
> script [4], Could you point the kernel version in your environment or more
> information?
>
Thanks for testing!
Right, this one is very tricky to trigger; I struggled a lot with it
and took many attempts to construct a reproducer. I later changed the
setup to a 16G memcg for easier reproduction, the idea is still the same:
- Mount a ramdisk (/dev/pmem0) at /mnt/ramdisk:
mkfs.xfs -f /dev/pmem0; mount /dev/pmem0 /mnt/ramdisk/
- Setup a 16g memcg
mkdir -p /sys/fs/cgroup/demo
echo 16G > /sys/fs/cgroup/demo/memory.max
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
echo $PPID > /sys/fs/cgroup/demo/cgroup.procs
echo $BASHPID > /sys/fs/cgroup/demo/cgroup.procs
- Then run the reproducer:
file_anon_mix_pressure /mnt/ramdisk/test.img 14g 8g 96 96 120000
The parameters depend on your system config. My system is a
48c96t machine:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
BIOS Model name: Intel(R) Xeon(R) Platinum 8255C CPU @ 2.50GHz
Stepping: 7
CPU MHz: 3100.021
CPU max MHz: 2501.0000
CPU min MHz: 1000.0000
BogoMIPS: 5000.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts vnmi pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
$ free -m
total used free shared buff/cache available
Mem: 62132 9553 49022 18 4172 52579
Swap: 0 0 0
And it gets the OOM without this series:
[ 17.537545] XFS (pmem0): Ending clean mount
[ 49.329042] hrtimer: interrupt took 13930 ns
[ 49.823993] file_anon_mix_p (3832): drop_caches: 3
[ 62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 62.624892] CPU: 95 UID: 0 PID: 4875 Comm: file_anon_mix_p Kdump: loaded Not tainted 7.0.0-rc5.orig-gb822cd37c749 #292 PREEMPT(full)·
[ 62.624897] Hardware name: Red Hat KVM/RHEL-AV, BIOS 0.0.0 02/06/2015
[ 62.624899] Call Trace:
[ 62.624902] <TASK>
[ 62.624905] dump_stack_lvl+0x4a/0x70
[ 62.624912] dump_header+0x43/0x1b3
[ 62.624918] oom_kill_process.cold+0x8/0x85
[ 62.624922] out_of_memory+0xee/0x280
[ 62.624927] mem_cgroup_out_of_memory+0xbc/0xd0
[ 62.624933] try_charge_memcg+0x3c1/0x5d0
[ 62.624936] charge_memcg+0x4a/0xb0
[ 62.624939] __mem_cgroup_charge+0x28/0x80
[ 62.624942] alloc_anon_folio+0x1d1/0x3d0
[ 62.624947] do_anonymous_page+0x19d/0x550
[ 62.624950] ? pte_offset_map_rw_nolock+0x1b/0x80
[ 62.624954] __handle_mm_fault+0x346/0x6d0
[ 62.624956] ? __schedule+0x29c/0x5b0
[ 62.624968] handle_mm_fault+0xe8/0x2d0
[ 62.624971] do_user_addr_fault+0x204/0x660
[ 62.624977] exc_page_fault+0x67/0x170
[ 62.624979] asm_exc_page_fault+0x22/0x30
[ 62.624982] RIP: 0033:0x401451
[ 62.624985] Code: 00 00 00 c3 0f 1f 44 00 00 48 83 7f 10 00 74 23 31 c0 0f 1f 80 00 00 00 00 48 8b 57 18 48 01 c2 48 03 57 08 48 05 00 10 00 00 <c6> 02 00 48 3b 47 10 72 e6 c7 47 20 01 00 00 00 31 c0 c3 90 66 66
[ 62.624987] RSP: 002b:00007f3ec53a5e68 EFLAGS: 00010206
[ 62.624989] RAX: 000000000731d000 RBX: 00007f3ec53a66c0 RCX: 00007f4271ca02d6
[ 62.624991] RDX: 00007f425cefd000 RSI: 00007f3ec53a6fb0 RDI: 000000000a2f1c28
[ 62.624992] RBP: 00007f3ec53a5f30 R08: 0000000000000000 R09: 0000000000000021
[ 62.624993] R10: 0000000000000008 R11: 0000000000000246 R12: 00007f3ec53a66c0
[ 62.624995] R13: 00007ffe83436d80 R14: 00007f3ec53a6ce4 R15: 00007ffe83436e87
[ 62.624998] </TASK>
[ 62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
[ 62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
[ 62.640823] Memory cgroup stats for /demo:
[ 62.641017] anon 10604879872
[ 62.641941] file 6574858240
[ 62.642259] kernel 0
[ 62.642443] kernel_stack 0
[ 62.642674] pagetables 0
[ 62.642889] sec_pagetables 0
[ 62.643318] percpu 0
[ 62.643545] sock 0
[ 62.643782] vmalloc 0
[ 62.643987] shmem 0
[ 62.644244] zswap 0
[ 62.644425] zswapped 0
[ 62.644666] zswap_incomp 0
[ 62.644917] file_mapped 6574698496
[ 62.645344] file_dirty 0
[ 62.645835] file_writeback 0
[ 62.646707] swapcached 0
[ 62.647430] anon_thp 0
[ 62.648204] file_thp 0
[ 62.648895] shmem_thp 0
[ 62.649737] inactive_anon 10597609472
[ 62.650675] active_anon 7270400
[ 62.651549] inactive_file 6367440896
[ 62.652430] active_file 179376128
[ 62.653318] unevictable 0
[ 62.653976] slab_reclaimable 0
[ 62.654664] slab_unreclaimable 0
[ 62.655625] slab 0
[ 62.656418] workingset_refault_anon 0
[ 62.656816] workingset_refault_file 1120215
[ 62.657293] workingset_activate_anon 0
[ 62.657667] workingset_activate_file 45850
[ 62.658167] workingset_restore_anon 0
[ 62.658562] workingset_restore_file 45850
[ 62.658981] workingset_nodereclaim 0
[ 62.659417] pgdemote_kswapd 0
[ 62.659715] pgdemote_direct 0
[ 62.660102] pgdemote_khugepaged 0
[ 62.660434] pgdemote_proactive 0
[ 62.660730] pgsteal_kswapd 0
[ 62.661015] pgsteal_direct 1612151
[ 62.662669] pgscan_khugepaged 0
[ 62.662990] pgscan_proactive 0
[ 62.663393] pgrefill 4536757
[ 62.663706] pgpromote_success 0
[ 62.664115] pgscan 3867681
[ 62.664397] pgsteal 1612151
[ 62.664691] pswpin 0
[ 62.664925] pswpout 0
[ 62.665266] pgfault 35906959
[ 62.665564] pgmajfault 95947
[ 62.665867] pgactivate 3693439
[ 62.666261] pgdeactivate 0
[ 62.666492] pglazyfree 34
[ 62.666728] pglazyfreed 0
[ 62.666990] swpin_zero 0
[ 62.667365] swpout_zero 0
[ 62.667664] zswpin 0
[ 62.667910] zswpout 0
[ 62.668235] zswpwb 0
[ 62.668472] thp_fault_alloc 0
[ 62.668790] thp_collapse_alloc 0
[ 62.669211] thp_swpout 0
[ 62.669469] thp_swpout_fallback 0
[ 62.669762] numa_pages_migrated 0
[ 62.670177] numa_pte_updates 0
[ 62.670470] numa_hint_faults 0
[ 62.670774] Memory cgroup min protection 0kB -- low protection 0kB
[ 62.670776] Tasks state (memory values in pages):
[ 62.672213] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 62.673379] [ 3364] 0 3364 1794 900 73 827 0 57344 0 0 spawn-cgroup.sh
[ 62.674519] [ 3266] 0 3266 72782 2891 576 2315 0 110592 0 0 fish
[ 62.675663] [ 3591] 0 3591 55883 2979 625 2354 0 110592 0 0 fish
[ 62.676546] [ 3832] 0 3832 3867588 2588259 2587769 490 0 21630976 0 0 file_anon_mix_p
[ 62.677691] [ 3962] 0 3962 2098020 1237009 281 1236728 0 16855040 0 0 file_anon_mix_p
[ 62.678950] [ 3963] 0 3963 2098020 1236990 281 1236709 0 16855040 0 0 file_anon_mix_p
[ 62.680233] [ 3964] 0 3964 2098020 1236985 281 1236704 0 16855040 0 0 file_anon_mix_p
[ 62.681374] [ 3965] 0 3965 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.682501] [ 3966] 0 3966 2098020 1237015 281 1236734 0 16855040 0 0 file_anon_mix_p
[ 62.683637] [ 3967] 0 3967 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.684812] [ 3968] 0 3968 2098020 1237015 281 1236734 0 16855040 0 0 file_anon_mix_p
[ 62.685883] [ 3969] 0 3969 2098020 1236967 281 1236686 0 16855040 0 0 file_anon_mix_p
[ 62.686988] [ 3970] 0 3970 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.688168] [ 3971] 0 3971 2098020 1236993 281 1236712 0 16855040 0 0 file_anon_mix_p
[ 62.689402] [ 3972] 0 3972 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.690621] [ 3973] 0 3973 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.691839] [ 3974] 0 3974 2098020 1237011 281 1236730 0 16855040 0 0 file_anon_mix_p
[ 62.693550] [ 3975] 0 3975 2098020 1237016 281 1236735 0 16855040 0 0 file_anon_mix_p
[ 62.695292] [ 3976] 0 3976 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.696997] [ 3977] 0 3977 2098020 1237014 281 1236733 0 16855040 0 0 file_anon_mix_p
[ 62.698734] [ 3978] 0 3978 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.700415] [ 3979] 0 3979 2098020 1236992 281 1236711 0 16855040 0 0 file_anon_mix_p
[ 62.702153] [ 3980] 0 3980 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.703859] [ 3981] 0 3981 2098020 1236919 281 1236638 0 16855040 0 0 file_anon_mix_p
[ 62.705597] [ 3982] 0 3982 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.707329] [ 3983] 0 3983 2098020 1236277 281 1235996 0 16855040 0 0 file_anon_mix_p
[ 62.709056] [ 3984] 0 3984 2098020 1236952 281 1236671 0 16855040 0 0 file_anon_mix_p
[ 62.710732] [ 3985] 0 3985 2098020 1236948 281 1236667 0 16855040 0 0 file_anon_mix_p
[ 62.712482] [ 3986] 0 3986 2098020 1237014 281 1236733 0 16855040 0 0 file_anon_mix_p
[ 62.714184] [ 3987] 0 3987 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.715930] [ 3988] 0 3988 2098020 1237015 281 1236734 0 16855040 0 0 file_anon_mix_p
[ 62.717543] [ 3989] 0 3989 2098020 1237015 281 1236734 0 16855040 0 0 file_anon_mix_p
[ 62.719129] [ 3990] 0 3990 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.720723] [ 3991] 0 3991 2098020 1237011 281 1236730 0 16855040 0 0 file_anon_mix_p
[ 62.722356] [ 3992] 0 3992 2098020 1236945 281 1236664 0 16855040 0 0 file_anon_mix_p
[ 62.723893] [ 3993] 0 3993 2098020 1237017 281 1236736 0 16855040 0 0 file_anon_mix_p
[ 62.725413] [ 3994] 0 3994 2098020 1236982 281 1236701 0 16855040 0 0 file_anon_mix_p
[ 62.727108] [ 3995] 0 3995 2098020 1237012 281 1236731 0 16855040 0 0 file_anon_mix_p
[ 62.728701] [ 3996] 0 3996 2098020 1236990 281 1236709 0 16855040 0 0 file_anon_mix_p
.. snip ..
The testing kernel commit is latest mm-new:
$ git log --oneline
b822cd37c749 (HEAD) mm/mglru: improve reclaim loop and dirty folio handling
# This is an empty commit, to hold my cover letter.
54c9d0359b18 selftests-mm-add-merge-test-for-partial-msealed-range-fix
# This is mm-new, see below.
fc127b77592e selftests/mm: add merge test for partial msealed range
ff02b14f414c mm/vmalloc: use dedicated unbound workqueue for vmap purge/drain
$ git log 54c9d0359b18
commit 54c9d0359b180b34070aa7ff8d9428fa3db8acbb (akpm/mm-new)
Author: Andrew Morton <akpm@linux-foundation.org>
Date: Mon Mar 30 17:12:35 2026 -0700
selftests-mm-add-merge-test-for-partial-msealed-range-fix
Note: you may see fish in the OOM task list; that's the shell I use
instead of bash, and the memcg spawn is wrapped by spawn-cgroup.sh.
This is irrelevant to the issue, mentioned just to avoid confusion.
Reproducer log:
.. snip ..
[phase3] Starting 96 anonymous pressure threads (14336 MB x 128 rounds)...
[pressure] Round 1/128: faulting 14336 MB across 96 threads...
[pressure] Round 1/128 complete.
[pressure] Round 2/128: faulting 14336 MB across 96 threads...
[pressure] Round 2/128 complete.
[pressure] Round 3/128: faulting 14336 MB across 96 threads...
[pressure] Round 3/128 complete.
.. snip ...
[pressure] Round 17/128 complete.
[pressure] Round 18/128: faulting 14336 MB across 96 threads...
[pressure] Round 18/128 complete.
[pressure] Round 19/128: faulting 14336 MB across 96 threads...
fish: Job 1, './file_anon_mix_pressure /mnt/r…' terminated by signal SIGKILL (Forced quit)
OOM doesn't occur with MGLRU disabled or after this series;
all 128 rounds finish just fine.
Unfortunately I haven't found an easy and generic way to
reproduce this, as the time window is extremely short: if
another reclaim thread keeps getting rejected because
should_run_aging returns true, while a racing thread is
doing the aging but hasn't finished, MGLRU might OOM when
it shouldn't. This series greatly reduces that window, but
in very rare cases, and in theory, we may still see OOM due
to the forced protection of MIN_NR_GENS. That can be fixed later.
We have seen some very rare OOM issues with several services.
It took me a long time to figure out what is actually wrong here,
since the race window is extremely tiny and hard to trigger.
This reproducer is currently the best I can provide to simulate
that. It's not 100% accurate or stable, but close enough.
You may need to adjust the parameters to reproduce the issue,
and the storage has to be fast for the reproducer to work.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-28 19:52 ` [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-03-29 8:21 ` Kairui Song
2026-03-31 8:42 ` Baolin Wang
@ 2026-04-01 23:37 ` Shakeel Butt
2026-04-02 11:44 ` Kairui Song
2 siblings, 1 reply; 44+ messages in thread
From: Shakeel Butt @ 2026-04-01 23:37 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Sun, Mar 29, 2026 at 03:52:34AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> The current handling of dirty writeback folios is not working well for
> file page heavy workloads: Dirty folios are protected and move to next
> gen upon isolation of getting throttled or reactivation upon pageout
> (shrink_folio_list).
>
> This might help to reduce the LRU lock contention slightly, but as a
> result, the ping-pong effect of folios between head and tail of last two
> gens is serious as the shrinker will run into protected dirty writeback
> folios more frequently compared to activation. The dirty flush wakeup
> condition is also much more passive compared to active/inactive LRU.
> Active / inactive LRU wakes the flusher if one batch of folios passed to
> shrink_folio_list is unevictable due to under writeback, but MGLRU
> instead has to check this after the whole reclaim loop is done, and then
> count the isolation protection number compared to the total reclaim
> number.
I was just ranting about this on Baolin's patch and thanks for unifying them.
>
> And we previously saw OOM problems with it, too, which were fixed but
> still not perfect [1].
>
> So instead, just drop the special handling for dirty writeback, just
> re-activate it like active / inactive LRU. And also move the dirty flush
> wake up check right after shrink_folio_list. This should improve both
> throttling and performance.
Please divide this patch into two separate ones: one for moving the flusher
wakeup (& v1 throttling) within evict_folios(), and a second for the above
dirty writeback heuristic.
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-03-31 9:18 ` Kairui Song
2026-04-01 2:52 ` Baolin Wang
@ 2026-04-02 0:11 ` Barry Song
1 sibling, 0 replies; 44+ messages in thread
From: Barry Song @ 2026-04-02 0:11 UTC (permalink / raw)
To: Kairui Song
Cc: Baolin Wang, kasong, linux-mm, Andrew Morton, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Johannes Weiner, David Hildenbrand,
Michal Hocko, Qi Zheng, Shakeel Butt, Lorenzo Stoakes,
David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
Vernon Yang, linux-kernel, Qi Zheng
On Tue, Mar 31, 2026 at 5:18 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Mar 31, 2026 at 04:42:59PM +0800, Baolin Wang wrote:
> >
> >
> > On 3/29/26 3:52 AM, Kairui Song via B4 Relay wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > The current handling of dirty writeback folios is not working well for
> > > file page heavy workloads: Dirty folios are protected and move to next
> > > gen upon isolation of getting throttled or reactivation upon pageout
> > > (shrink_folio_list).
> > >
> > > This might help to reduce the LRU lock contention slightly, but as a
> > > result, the ping-pong effect of folios between head and tail of last two
> > > gens is serious as the shrinker will run into protected dirty writeback
> > > folios more frequently compared to activation. The dirty flush wakeup
> > > condition is also much more passive compared to active/inactive LRU.
> > > Active / inactive LRU wakes the flusher if one batch of folios passed to
> > > shrink_folio_list is unevictable due to under writeback, but MGLRU
> > > instead has to check this after the whole reclaim loop is done, and then
> > > count the isolation protection number compared to the total reclaim
> > > number.
> > >
> > > And we previously saw OOM problems with it, too, which were fixed but
> > > still not perfect [1].
> > >
> > > So instead, just drop the special handling for dirty writeback, just
> > > re-activate it like active / inactive LRU. And also move the dirty flush
> > > wake up check right after shrink_folio_list. This should improve both
> > > throttling and performance.
> > >
> > > Test with YCSB workloadb showed a major performance improvement:
> > >
> > > Before this series:
> > > Throughput(ops/sec): 61642.78008938203
> > > AverageLatency(us): 507.11127774145166
> > > pgpgin 158190589
> > > pgpgout 5880616
> > > workingset_refault 7262988
> > >
> > > After this commit:
> > > Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
> > > AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
> > > pgpgin 101871227 (-35.6%, lower is better)
> > > pgpgout 5770028
> > > workingset_refault 3418186 (-52.9%, lower is better)
> > >
> > > The refault rate is ~50% lower, and throughput is ~30% higher, which
> > > is a huge gain. We also observed significant performance gain for
> > > other real-world workloads.
> > >
> > > We were concerned that the dirty flush could cause more wear for SSD:
> > > that should not be the problem here, since the wakeup condition is when
> > > the dirty folios have been pushed to the tail of LRU, which indicates
> > > that memory pressure is so high that writeback is blocking the workload
> > > already.
> > >
> > > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > > Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/vmscan.c | 57 ++++++++++++++++-----------------------------------------
> > > 1 file changed, 16 insertions(+), 41 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 8de5c8d5849e..17b5318fad39 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4583,7 +4583,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > int tier_idx)
> > > {
> > > bool success;
> > > - bool dirty, writeback;
> > > int gen = folio_lru_gen(folio);
> > > int type = folio_is_file_lru(folio);
> > > int zone = folio_zonenum(folio);
> > > @@ -4633,21 +4632,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> > > return true;
> > > }
> > > - dirty = folio_test_dirty(folio);
> > > - writeback = folio_test_writeback(folio);
> > > - if (type == LRU_GEN_FILE && dirty) {
> > > - sc->nr.file_taken += delta;
> > > - if (!writeback)
> > > - sc->nr.unqueued_dirty += delta;
> > > - }
> > > -
> > > - /* waiting for writeback */
> > > - if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > > - gen = folio_inc_gen(lruvec, folio, true);
> > > - list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > > - return true;
> > > - }
> >
> > I'm a bit concerned about the handling of dirty folios.
> >
> > In the original logic, if we encounter a dirty folio, we increment its
> > generation counter by 1 and move it to the *second oldest generation*.
> >
> > However, with your patch, shrink_folio_list() will activate the dirty folio
> > by calling folio_set_active(). Then, evict_folios() -> move_folios_to_lru()
> > will put the dirty folio back into the MGLRU list.
> >
> > But because the folio_test_active() is true for this dirty folio, the dirty
> > folio will now be placed into the *second youngest generation* (see
> > lru_gen_folio_seq()).
>
> Yeah, and that's exactly what we want. Or else, these folios will
> stay at oldest gen, following scan will keep seeing them and hence
> keep bouncing these folios again and again to a younger gen since
> they are not reclaimable.
>
> The writeback callback (folio_rotate_reclaimable) will move them
> back to tail once they are actually reclaimable. So we are not
> losing any ability to reclaim them. Am I missing anything?
>
This makes sense to me. As long as folio_rotate_reclaimable()
exists, we can move those folios back to the tail once they are
clean and ready for reclaim.
This reminds me of Ridong's patch, which tried to emulate MGLRU's
behavior by 'rotating' folios whose IO completed during isolation,
and thus missed folio_rotate_reclaimable() in the active/inactive
LRUs[1]. Not sure if that patch has managed to land since v7.
    /* retry folios that may have missed folio_rotate_reclaimable() */
    if (!skip_retry && !folio_test_active(folio) &&
        !folio_mapped(folio) && !folio_test_dirty(folio) &&
        !folio_test_writeback(folio)) {
        list_move(&folio->lru, &clean);
        continue;
    }
[1] https://lore.kernel.org/linux-mm/20250111091504.1363075-1-chenridong@huaweicloud.com/
Best Regards
Barry
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-03-28 19:52 ` [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
2026-03-31 9:24 ` Baolin Wang
2026-04-01 5:01 ` Leno Hou
@ 2026-04-02 2:39 ` Shakeel Butt
2026-04-02 2:56 ` Kairui Song
2 siblings, 1 reply; 44+ messages in thread
From: Shakeel Butt @ 2026-04-02 2:39 UTC (permalink / raw)
To: kasong
Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Sun, Mar 29, 2026 at 03:52:38AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Currently MGLRU and non-MGLRU handle the reclaim statistic and
> writeback handling very differently, especially throttling.
> Basically MGLRU just ignored the throttling part.
>
> Let's just unify this part, use a helper to deduplicate the code
> so both setups will share the same behavior. Also remove the
> folio_clear_reclaim in isolate_folio which was actively invalidating
> the congestion control. PG_reclaim is now handled by shrink_folio_list,
> keeping it in isolate_folio is not helpful.
>
> Test using following reproducer using bash:
>
> echo "Setup a slow device using dm delay"
> dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> LOOP=$(losetup --show -f /var/tmp/backing)
> mkfs.ext4 -q $LOOP
> echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> dmsetup create slow_dev
> mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
> echo "Start writeback pressure"
> sync && echo 3 > /proc/sys/vm/drop_caches
> mkdir /sys/fs/cgroup/test_wb
> echo 128M > /sys/fs/cgroup/test_wb/memory.max
> (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
> echo "Clean up"
> echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> dmsetup resume slow_dev
> umount -l /mnt/slow && sync
> dmsetup remove slow_dev
>
> Before this commit, `dd` will get OOM killed immediately if
> MGLRU is enabled. Classic LRU is fine.
>
> After this commit, congestion control is now effective and no more
What do you mean by congestion control here?
> spin on LRU or premature OOM.
>
> Stress test on other workloads also looking good.
>
> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
There is still differences for global and kswapd reclaim in the shrink_node()
like kswapd throttling and congestion state management and throttling. Any plan
to unify them?
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-04-02 2:39 ` Shakeel Butt
@ 2026-04-02 2:56 ` Kairui Song
2026-04-02 3:17 ` Shakeel Butt
0 siblings, 1 reply; 44+ messages in thread
From: Kairui Song @ 2026-04-02 2:56 UTC (permalink / raw)
To: Shakeel Butt
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Wed, Apr 01, 2026 at 07:39:03PM +0800, Shakeel Butt wrote:
> On Sun, Mar 29, 2026 at 03:52:38AM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently MGLRU and non-MGLRU handle the reclaim statistic and
> > writeback handling very differently, especially throttling.
> > Basically MGLRU just ignored the throttling part.
> >
> > Let's just unify this part, use a helper to deduplicate the code
> > so both setups will share the same behavior. Also remove the
> > folio_clear_reclaim in isolate_folio which was actively invalidating
> > the congestion control. PG_reclaim is now handled by shrink_folio_list,
> > keeping it in isolate_folio is not helpful.
> >
> > Test using following reproducer using bash:
> >
> > echo "Setup a slow device using dm delay"
> > dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> > LOOP=$(losetup --show -f /var/tmp/backing)
> > mkfs.ext4 -q $LOOP
> > echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> > dmsetup create slow_dev
> > mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
> >
> > echo "Start writeback pressure"
> > sync && echo 3 > /proc/sys/vm/drop_caches
> > mkdir /sys/fs/cgroup/test_wb
> > echo 128M > /sys/fs/cgroup/test_wb/memory.max
> > (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> > dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
> >
> > echo "Clean up"
> > echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> > dmsetup resume slow_dev
> > umount -l /mnt/slow && sync
> > dmsetup remove slow_dev
> >
> > Before this commit, `dd` will get OOM killed immediately if
> > MGLRU is enabled. Classic LRU is fine.
> >
> > After this commit, congestion control is now effective and no more
>
> What do you mean by congestion control here?
The particular case demonstrated here is VMSCAN_THROTTLE_CONGESTED, so
I described it as "congestion control". Maybe I'll just say throttling to
avoid confusion, since it's not limited to that.
>
> > spin on LRU or premature OOM.
> >
> > Stress test on other workloads also looking good.
> >
> > Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> There is still differences for global and kswapd reclaim in the shrink_node()
> like kswapd throttling and congestion state management and throttling. Any plan
> to unify them?
Of course. Let's fix it step by step; this series is pretty long already.
I originally planned to put this patch in a later series, but as Ridong
pointed out, leaving these counters updated but unused looks really
ugly. And this fix is clean and easy to understand, I think.
* Re: [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling
2026-04-02 2:56 ` Kairui Song
@ 2026-04-02 3:17 ` Shakeel Butt
0 siblings, 0 replies; 44+ messages in thread
From: Shakeel Butt @ 2026-04-02 3:17 UTC (permalink / raw)
To: Kairui Song
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Thu, Apr 02, 2026 at 10:56:07AM +0800, Kairui Song wrote:
> On Wed, Apr 01, 2026 at 07:39:03PM +0800, Shakeel Butt wrote:
> > On Sun, Mar 29, 2026 at 03:52:38AM +0800, Kairui Song via B4 Relay wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
[...]
> > >
> > > After this commit, congestion control is now effective and no more
> >
> > What do you mean by congestion control here?
>
> The particular case demonstrated here is VMSCAN_THROTTLE_CONGESTED, so
> I described it as "congestion control". Maybe I'll just say throttling to
> avoid confusion, since it's not limited to that.
Yes, use throttling or VMSCAN_THROTTLE_CONGESTED directly; "congestion
control" gives a networking vibe.
>
> >
> > > spin on LRU or premature OOM.
> > >
> > > Stress test on other workloads also looking good.
> > >
> > > Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> >
> > There is still differences for global and kswapd reclaim in the shrink_node()
> > like kswapd throttling and congestion state management and throttling. Any plan
> > to unify them?
>
> Of course. Let's fix it step by step,
Fine with me but better to be clear about the destination.
> this series is pretty long already.
> I originally planned to put this patch in a later series, but as Ridong
> pointed out, leaving these counters updated but unused looks really
> ugly. And this fix is clean and easy to understand, I think.
* Re: [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling
2026-04-01 23:37 ` Shakeel Butt
@ 2026-04-02 11:44 ` Kairui Song
0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-04-02 11:44 UTC (permalink / raw)
To: Shakeel Butt
Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
Qi Zheng, Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong,
Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
Baolin Wang
On Wed, Apr 01, 2026 at 04:37:14PM +0800, Shakeel Butt wrote:
> On Sun, Mar 29, 2026 at 03:52:34AM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current handling of dirty writeback folios is not working well for
> > file page heavy workloads: Dirty folios are protected and move to next
> > gen upon isolation of getting throttled or reactivation upon pageout
> > (shrink_folio_list).
> >
> > This might help to reduce the LRU lock contention slightly, but as a
> > result, the ping-pong effect of folios between head and tail of last two
> > gens is serious as the shrinker will run into protected dirty writeback
> > folios more frequently compared to activation. The dirty flush wakeup
> > condition is also much more passive compared to active/inactive LRU.
> > Active / inactive LRU wakes the flusher if one batch of folios passed to
> > shrink_folio_list is unevictable due to under writeback, but MGLRU
> > instead has to check this after the whole reclaim loop is done, and then
> > count the isolation protection number compared to the total reclaim
> > number.
>
> I was just ranting about this on Baolin's patch and thanks for unifying them.
>
> >
> > And we previously saw OOM problems with it, too, which were fixed but
> > still not perfect [1].
> >
> > So instead, just drop the special handling for dirty writeback, just
> > re-activate it like active / inactive LRU. And also move the dirty flush
> > wake up check right after shrink_folio_list. This should improve both
> > throttling and performance.
>
> Please divide this patch into two separate ones. One for moving the flusher
> waker (& v1 throttling) within evict_folios() and second the above heuristic of
> dirty writeback.
OK, but throttling is not handled by this commit; it's handled by the last
commit. And using the common routine in shrink_folio_list to activate the
folio is supposed to be done before moving the flusher wakeup and
throttling, as I observed some inefficient reclaim, or overly aggressive /
passive behavior, if we don't do that first. We would run into these folios
again and again very frequently, and shrink_folio_list also has better
dirty / writeback detection.
I tested these two changes separately again in case I remembered it
wrongly, using the MongoDB YCSB case:
Before this series or this commit (the results are similar):
Throughput(ops/sec), 63414.891930455
Applying only the part of this commit that removes folio_inc_gen and
uses shrink_folio_list to activate the folio:
Throughput(ops/sec), 68580.83394294075
Skipping the folio_inc_gen part but applying the rest:
Throughput(ops/sec), 61614.29451632779
With the two fixes together (this commit applied fully):
Throughput(ops/sec), 80857.08510208207
And the whole series:
Throughput(ops/sec), 79760.71784646061
The test is a bit noisy, but after the whole series the throttling
already seems to slow down the workload slightly. That's still acceptable
IMO; this is also why activating the folios here is a good idea, or we
would run into problematic throttling.
I think this can be further improved later. As I observed previously with
the LFU-like rework I mentioned, it helps promote folios to a younger gen
more proactively and gives even better performance:
https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/
For now I can split this into two in V3: first a commit to use the common
routine for activating the folio, then one to move the flusher wakeup.
end of thread, other threads:[~2026-04-02 11:45 UTC | newest]
Thread overview: 44+ messages
2026-03-28 19:52 [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 01/12] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 02/12] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
2026-03-30 1:57 ` Chen Ridong
2026-03-30 7:59 ` Baolin Wang
2026-04-01 0:00 ` Barry Song
2026-03-28 19:52 ` [PATCH v2 03/12] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
2026-03-30 8:14 ` Baolin Wang
2026-04-01 0:20 ` Barry Song
2026-03-28 19:52 ` [PATCH v2 04/12] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
2026-03-29 6:47 ` Kairui Song
2026-03-28 19:52 ` [PATCH v2 05/12] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
2026-03-31 8:04 ` Baolin Wang
2026-03-31 9:01 ` Kairui Song
2026-03-31 9:52 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 06/12] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
2026-03-31 8:08 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 07/12] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 08/12] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-03-29 8:21 ` Kairui Song
2026-03-29 8:46 ` Kairui Song
2026-03-31 8:42 ` Baolin Wang
2026-03-31 9:18 ` Kairui Song
2026-04-01 2:52 ` Baolin Wang
2026-04-01 4:57 ` Kairui Song
2026-04-02 0:11 ` Barry Song
2026-04-01 23:37 ` Shakeel Butt
2026-04-02 11:44 ` Kairui Song
2026-03-28 19:52 ` [PATCH v2 09/12] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
2026-03-28 19:52 ` [PATCH v2 10/12] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
2026-03-31 8:49 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 11/12] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
2026-03-31 8:51 ` Baolin Wang
2026-03-28 19:52 ` [PATCH v2 12/12] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
2026-03-31 9:24 ` Baolin Wang
2026-03-31 9:29 ` Kairui Song
2026-03-31 9:36 ` Baolin Wang
2026-03-31 9:40 ` Kairui Song
2026-04-01 5:01 ` Leno Hou
2026-04-02 2:39 ` Shakeel Butt
2026-04-02 2:56 ` Kairui Song
2026-04-02 3:17 ` Shakeel Butt
2026-04-01 5:18 ` [PATCH v2 00/12] mm/mglru: improve reclaim loop and dirty folio handling Leno Hou
2026-04-01 7:36 ` Kairui Song