public inbox for linux-mm@kvack.org
* [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling
@ 2026-04-02 18:53 Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
                   ` (14 more replies)
  0 siblings, 15 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

This series is based on mm-new.

This series cleans up and slightly improves MGLRU's reclaim loop and
dirty writeback handling. As a result, we see up to a ~30% throughput
increase in some workloads like MongoDB with YCSB, and a huge decrease
in file refaults, with no swap involved. Other common benchmarks show
no regression, the LOC is reduced, and there are fewer unexpected
OOMs, too.

Some of the problems were found in our production environment, and
others were mostly exposed while stress testing during the development
of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
the code base and fixes several performance issues, preparing for
further work.

MGLRU's reclaim loop is a bit complex, and hence these problems are
closely related to each other. The aging, scan number calculation, and
reclaim loop are coupled together, and the dirty folio handling logic
differs from the rest, making the reclaim loop hard to follow and the
dirty flush ineffective.

This series cleans up and addresses these issues by introducing a scan
budget: the number of folios to scan is calculated once at the
beginning of the loop, and aging is decoupled from the reclaim
calculation helpers. The dirty flush logic is then moved inside the
reclaim loop so it can kick in more effectively.
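
In rough terms, the restructured loop (patches 4-7) ends up shaped
like the following condensed sketch. This is not the verbatim final
code: some declarations and checks (e.g. mem_cgroup_below_min) are
omitted, and patch 7 further relaxes the break after aging:

static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
	bool need_rotate = false;
	int swappiness = get_swappiness(lruvec, sc);
	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
	/* the scan budget is calculated only once, up front */
	long delta, nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);

	while (nr_to_scan > 0) {
		DEFINE_MAX_SEQ(lruvec);

		/* aging is explicit now, decoupled from the budget calculation */
		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
				need_rotate = true;
			break;
		}

		/* evict in small batches, consuming the budget */
		delta = evict_folios(min(nr_to_scan, MIN_LRU_BATCH), lruvec,
				     sc, swappiness);
		if (!delta || should_abort_scan(lruvec, sc))
			break;

		nr_to_scan -= delta;
		cond_resched();
	}

	return need_rotate;
}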

Test results: All tests are done on a 48c96t machine with 2 NUMA nodes
and 128G of memory, using NVMe as storage.

MongoDB
=======
Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
threads:32), which does 95% reads and 5% updates to generate mixed read
and dirty writeback load. MongoDB is set up in a 10G cgroup using Docker,
and the WiredTiger cache size is set to 4.5G, using NVMe as storage.

Not using SWAP.

Before:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
pgpgout 5413332
workingset_refault_anon 0
workingset_refault_file 34522071

After:
Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
pgpgin 111093923                       (-30.3%, lower is better)
pgpgout 5437456
workingset_refault_anon 0
workingset_refault_file 19566366       (-43.3%, lower is better)

We can see a significant performance improvement after this series.
The test is done on NVMe, and the performance gap would be even larger
on slow devices such as HDDs or network storage. We observed over a
100% gain for some workloads with slow IO.

Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 machine with 2
NUMA nodes and 128G of memory, using 256G of ZRAM as swap and spawning
32 memcgs and 64 workers:

Before:
Total requests:            79915
Per-worker 95% CI (mean):  [1233.9, 1263.5]
Per-worker stdev:          59.2
Jain's fairness:           0.997795 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     26859   33.61%   33.61%
[1,2)s      7818    9.78%   43.39%
[2,4)s      5532    6.92%   50.31%
[4,8)s     39706   49.69%  100.00%

After:
Total requests:            81382
Per-worker 95% CI (mean):  [1241.9, 1301.3]
Per-worker stdev:          118.8
Jain's fairness:           0.991480 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     26696   32.80%   32.80%
[1,2)s      8745   10.75%   43.55%
[2,4)s      6865    8.44%   51.98%
[4,8)s     39076   48.02%  100.00%

Reclaim is still fair and effective, and the total request count looks
slightly better.

OOM issue with aging and throttling
===================================
The throttling OOM issue can be easily reproduced using dd and a
cgroup limit as demonstrated in patch 14, and it is fixed by this series.

The aging OOM is a bit tricky; a specific reproducer can be used to
simulate what we encountered in our production environment [4]:
it spawns multiple workers that keep reading the given file using mmap,
pausing for 120ms after each file read batch. It also spawns another
set of workers that keep allocating and freeing a given amount of
anonymous memory. The total memory size exceeds the memory limit
(e.g. 14G anon + 8G file, which is 22G vs a 16G memcg limit).
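
The actual reproducer lives in [4]; the following is a minimal,
illustrative sketch of the same idea, run inside the memcg under test
(paths, sizes, and worker counts are made up for demonstration, and
error handling is omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* file readers: mmap and read the whole file, pausing 120ms per batch */
static void file_worker(const char *path)
{
	struct stat st;
	int fd = open(path, O_RDONLY);
	volatile char sum = 0;

	fstat(fd, &st);
	for (;;) {
		char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);

		for (off_t i = 0; i < st.st_size; i += 4096)
			sum += p[i];	/* fault in every page */
		munmap(p, st.st_size);
		usleep(120 * 1000);	/* pause 120ms after one batch */
	}
}

/* anon workers: repeatedly allocate, touch, and free anonymous memory */
static void anon_worker(size_t size)
{
	for (;;) {
		char *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		memset(p, 1, size);
		munmap(p, size);
	}
}

int main(void)
{
	int i;

	/* e.g. readers on an 8G file plus 7 x 2G = 14G of anon churn */
	for (i = 0; i < 8; i++)
		if (!fork())
			file_worker("/mnt/test/file.img");
	for (i = 0; i < 7; i++)
		if (!fork())
			anon_worker(2UL << 30);
	pause();
	return 0;
}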

- MGLRU disabled:
  Finished 128 iterations.

- MGLRU enabled:
  OOM with the following info after ~10-20 iterations:
    [   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
    [   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [   62.640823] Memory cgroup stats for /demo:
    [   62.641017] anon 10604879872
    [   62.641941] file 6574858240

  OOM occurs despite there still being evictable file folios.

- MGLRU enabled after this series:
  Finished 128 iterations.

Worth noting, there is another OOM-related issue reported against v1 of
this series, which has been retested and looks OK now [5].

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap, with the following test command:

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
  --tables=48 --table-size=2000000 --threads=48 --time=600 run

Before:            17260.781429 tps
After this series: 17266.842857 tps

MySQL is heavy on anon folios, and also involves file folios and
writeback, and it still looks good. Only noise-level changes, no
regression.

FIO:
====
Testing with the following command, where /mnt/ramdisk is a
64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
6 test runs each:

fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
  --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
  --rw=randread --norandommap --time_based \
  --ramp_time=1m --runtime=5m --group_reporting

Before:            9196.481429 MB/s
After this series: 9256.105000 MB/s

Again, only noise-level changes: no regression, or slightly better.

Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg,
using make -j96 and defconfig, measuring system time, 12 test runs each.

Before:            2589.63s
After this series: 2543.58s

Again, only noise-level changes: no regression, or very slightly better.

Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]

Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v3:
- Don't force scanning at least SWAP_CLUSTER_MAX pages on each reclaim
  loop. If the LRU is too small, adjust the number accordingly. Now the
  multi-cgroup scan balance looks even better for tiny cgroups:
  https://lore.kernel.org/linux-mm/aciejkdIHyXPNS9Y@KASONG-MC4/
- Add one patch to remove the swap constraint check in isolate_folio. In
  theory, it's fine, and both stress test and performance test didn't
  show any issue:
  https://lore.kernel.org/linux-mm/CAMgjq7C8TCsK99p85i3QzGCwgkXscTfFB6XCUTWQOcuqwHQa2Q@mail.gmail.com/
- I reran most tests and all results seem identical, so most data is
  kept, and intermediate test results are dropped. I also ran tests on
  most patches individually without finding any problem, but the series
  is getting long, and posting all of that would make it harder to read
  and is unnecessary.
- Split previous patch 8 into two patches as suggested [ Shakeel Butt ];
  some test results were also collected to support the design:
  https://lore.kernel.org/linux-mm/ac44BVOvOm8lhVvj@KASONG-MC4/#t
  I kept Axel's Reviewed-by since the code is identical.
- Call try_to_inc_min_seq twice to avoid stale empty gens and drop
  its return value [ Baolin Wang ].
- Move a few lines of code between patches to where they fit better;
  the final result is identical [ Baolin Wang ].
- Collect Tested-by and update the test setup [ Leno Hou ].
- Collect Reviewed-by tags.
- Update a few commit messages [ Shakeel Butt ].
- Link to v2: https://patch.msgid.link/20260329-mglru-reclaim-v2-0-b53a3678513c@tencent.com

Changes in v2:
- Rebase on top of mm-new which includes Cgroup V1 fix from
  [ Baolin Wang ].
- Added dirty throttling OOM fix as patch 12, as [ Chen Ridong ]'s
  review suggested that we shouldn't leave the counter and reclaim
  feedback in shrink_folio_list untracked in this series.
- Add a minimal scan number limit of SWAP_CLUSTER_MAX in patch
  "restructure the reclaim loop"; the change is trivial but might
  help avoid livelock for tiny cgroups.
- Redo the tests; most results are basically identical to before, but
  redone just in case, since the series also solves the throttling
  issue now, as discussed with the CachyOS reporters.
- Add a separate patch for variable renaming as suggested by [ Barry
  Song ]. No feature change.
- Improve several comment and code issues [ Axel Rasmussen ].
- Remove no longer needed variable [ Axel Rasmussen ].
- Collect Reviewed-by tags.
- Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com

---
Kairui Song (14):
      mm/mglru: consolidate common code for retrieving evictable size
      mm/mglru: rename variables related to aging and rotation
      mm/mglru: relocate the LRU scan batch limit to callers
      mm/mglru: restructure the reclaim loop
      mm/mglru: scan and count the exact number of folios
      mm/mglru: use a smaller batch for reclaim
      mm/mglru: don't abort scan immediately right after aging
      mm/mglru: remove redundant swap constrained check upon isolation
      mm/mglru: use the common routine for dirty/writeback reactivation
      mm/mglru: simplify and improve dirty writeback handling
      mm/mglru: remove no longer used reclaim argument for folio protection
      mm/vmscan: remove sc->file_taken
      mm/vmscan: remove sc->unqueued_dirty
      mm/vmscan: unify writeback reclaim statistic and throttling

 mm/vmscan.c | 332 ++++++++++++++++++++++++++----------------------------------
 1 file changed, 143 insertions(+), 189 deletions(-)
---
base-commit: c17461ca3e91a3fe705685a23ad7edb58d4ee768
change-id: 20260314-mglru-reclaim-1c9d45ac57f6

Best regards,
--  
Kairui Song <kasong@tencent.com>





* [PATCH v3 01/14] mm/mglru: consolidate common code for retrieving evictable size
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-03  3:16   ` Kairui Song
  2026-04-02 18:53 ` [PATCH v3 02/14] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Merge commonly used code for counting evictable folios in a lruvec.

No behavior change.

Return unsigned long instead of long, as suggested [ Axel Rasmussen ].

Acked-by: Yuanchu Xie <yuanchu@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5a8c8fcccbfc..adc07501a137 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4084,27 +4084,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
 	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
 }
 
-static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+static unsigned long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
 {
 	int gen, type, zone;
-	unsigned long total = 0;
-	int swappiness = get_swappiness(lruvec, sc);
+	unsigned long seq, total = 0;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
 	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
 		for (seq = min_seq[type]; seq <= max_seq; seq++) {
 			gen = lru_gen_from_seq(seq);
-
 			for (zone = 0; zone < MAX_NR_ZONES; zone++)
 				total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
 		}
 	}
 
+	return total;
+}
+
+static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+{
+	unsigned long total;
+	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	total = lruvec_evictable_size(lruvec, swappiness);
+
 	/* whether the size is big enough to be helpful */
 	return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
 }
@@ -4909,9 +4915,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 			     int swappiness, unsigned long *nr_to_scan)
 {
-	int gen, type, zone;
-	unsigned long size = 0;
-	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	DEFINE_MIN_SEQ(lruvec);
 
 	*nr_to_scan = 0;
@@ -4919,18 +4922,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
 
-	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
-		for (seq = min_seq[type]; seq <= max_seq; seq++) {
-			gen = lru_gen_from_seq(seq);
-
-			for (zone = 0; zone < MAX_NR_ZONES; zone++)
-				size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
-		}
-	}
-
-	*nr_to_scan = size;
+	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }

-- 
2.53.0





* [PATCH v3 02/14] mm/mglru: rename variables related to aging and rotation
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 03/14] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

The current variable names aren't helpful. Make them more meaningful.

Only naming change, no behavior change.

Suggested-by: Barry Song <baohua@kernel.org>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index adc07501a137..f336f89a2de6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4934,7 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
  */
 static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 {
-	bool success;
+	bool need_aging;
 	unsigned long nr_to_scan;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
@@ -4942,7 +4942,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
 	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
 		return -1;
 
-	success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
 
 	/* try to scrape all its memory if this memcg was deleted */
 	if (nr_to_scan && !mem_cgroup_online(memcg))
@@ -4951,7 +4951,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
 	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
 
 	/* try to get away with not aging at the default priority */
-	if (!success || sc->priority == DEF_PRIORITY)
+	if (!need_aging || sc->priority == DEF_PRIORITY)
 		return nr_to_scan >> sc->priority;
 
 	/* stop scanning this lruvec as it's low on cold folios */
@@ -5040,7 +5040,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 
 static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 {
-	bool success;
+	bool need_rotate;
 	unsigned long scanned = sc->nr_scanned;
 	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5058,7 +5058,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 		memcg_memory_event(memcg, MEMCG_LOW);
 	}
 
-	success = try_to_shrink_lruvec(lruvec, sc);
+	need_rotate = try_to_shrink_lruvec(lruvec, sc);
 
 	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
 
@@ -5068,10 +5068,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 
 	flush_reclaim_state(sc);
 
-	if (success && mem_cgroup_online(memcg))
+	if (need_rotate && mem_cgroup_online(memcg))
 		return MEMCG_LRU_YOUNG;
 
-	if (!success && lruvec_is_sizable(lruvec, sc))
+	if (!need_rotate && lruvec_is_sizable(lruvec, sc))
 		return 0;
 
 	/* one retry if offlined or too small */

-- 
2.53.0





* [PATCH v3 03/14] mm/mglru: relocate the LRU scan batch limit to callers
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 02/14] mm/mglru: rename variables related to aging and rotation Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 04/14] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Like the active / inactive LRU, MGLRU isolates and scans folios in
batches.  The batch splitting is hidden deep in a helper, which makes
the code harder to follow.  The helper's arguments are also confusing,
since callers usually request more folios than the batch size, so the
helper almost never processes the full requested amount.

Move the batch splitting into the top-level loop to make this cleaner;
there should be no behavior change.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f336f89a2de6..963362523782 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4695,10 +4695,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int scanned = 0;
 	int isolated = 0;
 	int skipped = 0;
-	int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
-	int remaining = scan_batch;
+	unsigned long remaining = nr_to_scan;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 
+	VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
 	VM_WARN_ON_ONCE(!list_empty(list));
 
 	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
@@ -4751,7 +4751,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	mod_lruvec_state(lruvec, item, isolated);
 	mod_lruvec_state(lruvec, PGREFILL, sorted);
 	mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
-	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
+	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				scanned, skipped, isolated,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 	if (type == LRU_GEN_FILE)
@@ -4987,7 +4987,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
-	long nr_to_scan;
+	long nr_batch, nr_to_scan;
 	unsigned long scanned = 0;
 	int swappiness = get_swappiness(lruvec, sc);
 
@@ -4998,7 +4998,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (nr_to_scan <= 0)
 			break;
 
-		delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
+		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;
 
@@ -5623,6 +5624,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
 static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
 			int swappiness, unsigned long nr_to_reclaim)
 {
+	int nr_batch;
 	DEFINE_MAX_SEQ(lruvec);
 
 	if (seq + MIN_NR_GENS > max_seq)
@@ -5639,8 +5641,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
 		if (sc->nr_reclaimed >= nr_to_reclaim)
 			return 0;
 
-		if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
-				  swappiness))
+		nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
+		if (!evict_folios(nr_batch, lruvec, sc, swappiness))
 			return 0;
 
 		cond_resched();

-- 
2.53.0





* [PATCH v3 04/14] mm/mglru: restructure the reclaim loop
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (2 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 03/14] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-03  4:44   ` Kairui Song
  2026-04-02 18:53 ` [PATCH v3 05/14] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

The current loop recalculates the scan number on each iteration. The
number of folios to scan is based on the LRU length, with some unclear
behaviors: e.g. the scan number is only shifted by the reclaim priority
when aging is not needed or when at the default priority, and the
number calculation is coupled with aging and rotation.

Adjust and simplify it, and decouple aging and rotation: calculate the
scan number only once at the beginning of the reclaim, always respect
the reclaim priority, and make the aging and rotation more explicit.

This slightly changes how aging and offline memcg reclaim work:
Previously, aging was always skipped at DEF_PRIORITY even when
eviction was impossible. Now, aging is always triggered when it
is necessary to make progress. The old behavior may waste a reclaim
iteration only to escalate the priority, potentially causing
over-reclaim of slab and breaking the reclaim balance in multi-cgroup
setups.

Similarly for offline memcgs: previously, an offline memcg wouldn't be
aged unless it had no evictable folios at all. Now, we might age it if
it has only 3 generations and the reclaim priority is less than
DEF_PRIORITY, which should be fine. On one hand, an offline memcg
might still hold long-term folios; in fact, a long-existing offline
memcg must be pinned by some long-term folios like shmem. These folios
might be in use by other memcgs, so aging them as with an ordinary
memcg seems correct. Besides, aging enables further reclaim of an
offlined memcg, which will certainly happen if we keep shrinking it.
And offline memcgs might soon no longer be an issue with reparenting.

Overall, the memcg LRU rotation, as described in mmzone.h,
remains the same.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 74 +++++++++++++++++++++++++++++++++----------------------------
 1 file changed, 40 insertions(+), 34 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 963362523782..93ffb3d98fed 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4913,49 +4913,44 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 }
 
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
-			     int swappiness, unsigned long *nr_to_scan)
+			     struct scan_control *sc, int swappiness)
 {
 	DEFINE_MIN_SEQ(lruvec);
 
-	*nr_to_scan = 0;
 	/* have to run aging, since eviction is not possible anymore */
 	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
 
-	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
+	/* try to get away with not aging at the default priority */
+	if (sc->priority == DEF_PRIORITY)
+		return false;
+
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
 
-/*
- * For future optimizations:
- * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
- *    reclaim.
- */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+			   struct mem_cgroup *memcg, int swappiness)
 {
-	bool need_aging;
-	unsigned long nr_to_scan;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
-	DEFINE_MAX_SEQ(lruvec);
-
-	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
-		return -1;
-
-	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+	unsigned long evictable, nr_to_scan;
 
+	evictable = lruvec_evictable_size(lruvec, swappiness);
+	nr_to_scan = evictable;
 	/* try to scrape all its memory if this memcg was deleted */
-	if (nr_to_scan && !mem_cgroup_online(memcg))
+	if (!mem_cgroup_online(memcg))
 		return nr_to_scan;
 
 	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
 
-	/* try to get away with not aging at the default priority */
-	if (!need_aging || sc->priority == DEF_PRIORITY)
-		return nr_to_scan >> sc->priority;
+	/*
+	 * Always respect scan priority, minimally target some folios
+	 * to keep reclaim moving forwards.
+	 */
+	nr_to_scan >>= sc->priority;
+	if (!nr_to_scan)
+		nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
 
-	/* stop scanning this lruvec as it's low on cold folios */
-	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
+	return nr_to_scan;
 }
 
 static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -4985,31 +4980,43 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 	return true;
 }
 
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ *    reclaim.
+ */
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
+	bool need_rotate = false;
 	long nr_batch, nr_to_scan;
-	unsigned long scanned = 0;
 	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 
-	while (true) {
+	nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
+	while (nr_to_scan > 0) {
 		int delta;
+		DEFINE_MAX_SEQ(lruvec);
 
-		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
-		if (nr_to_scan <= 0)
+		if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
+			need_rotate = true;
 			break;
+		}
+
+		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
+			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+				need_rotate = true;
+			break;
+		}
 
 		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
 		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;
 
-		scanned += delta;
-		if (scanned >= nr_to_scan)
-			break;
-
 		if (should_abort_scan(lruvec, sc))
 			break;
 
+		nr_to_scan -= delta;
 		cond_resched();
 	}
 
@@ -5035,8 +5042,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 	}
 
-	/* whether this lruvec should be rotated */
-	return nr_to_scan < 0;
+	return need_rotate;
 }
 
 static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)

-- 
2.53.0





* [PATCH v3 05/14] mm/mglru: scan and count the exact number of folios
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (3 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 04/14] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Make the scan helpers return the exact number of folios being scanned
or isolated. Since the reclaim loop now has a natural scan budget that
controls the scan progress, returning the scan number and consuming
the budget makes the scan more accurate and easier to follow.

The number of folios scanned in each iteration is always larger than
zero, unless the reclaim must stop for a forced aging, so there is no
longer any need for special handling when no progress is made:

- `return isolated || !remaining ? scanned : 0` in scan_folios: both
  the function and its caller now just return the exact scan count,
  which, combined with the scan budget introduced in the previous
  commit, avoids livelock and under-scanning.

- `scanned += try_to_inc_min_seq` in evict_folios: adding a bool to
  the scan count was confusing and is no longer needed, as the scan
  number should never be zero as long as there are still evictable
  gens. We may encounter an empty old gen that returns a 0 scan count;
  to avoid that, do a try_to_inc_min_seq before isolation, which has
  little to no overhead in most cases.

- `evictable_min_seq + MIN_NR_GENS > max_seq` guard in evict_folios:
  the per-type get_nr_gens == MIN_NR_GENS check in scan_folios
  naturally yields a 0 scan count when only two gens remain, which
  breaks the loop.

Also change try_to_inc_min_seq to return void, as its return value is
no longer used by any caller. Move the call before isolate_folios so
that any empty gens created by external folio freeing are flushed, and
add another call after isolate_folios to also flush empty gens that
isolation itself may create.

The scan still stops if there are only two gens left, as the scan
number will be zero; this behavior is the same as before. This forced
gen protection may be removed or softened later to improve reclaim a
bit more.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 60 ++++++++++++++++++++++++++++++------------------------------
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 93ffb3d98fed..643f9fc10214 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3878,10 +3878,9 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 	return true;
 }
 
-static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
+static void try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 {
 	int gen, type, zone;
-	bool success = false;
 	bool seq_inc_flag = false;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	DEFINE_MIN_SEQ(lruvec);
@@ -3907,11 +3906,10 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 
 	/*
 	 * If min_seq[type] of both anonymous and file is not increased,
-	 * we can directly return false to avoid unnecessary checking
-	 * overhead later.
+	 * return here to avoid unnecessary checking overhead later.
 	 */
 	if (!seq_inc_flag)
-		return success;
+		return;
 
 	/* see the comment on lru_gen_folio */
 	if (swappiness && swappiness <= MAX_SWAPPINESS) {
@@ -3929,10 +3927,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, int swappiness)
 
 		reset_ctrl_pos(lruvec, type, true);
 		WRITE_ONCE(lrugen->min_seq[type], min_seq[type]);
-		success = true;
 	}
-
-	return success;
 }
 
 static bool inc_max_seq(struct lruvec *lruvec, unsigned long seq, int swappiness)
@@ -4686,7 +4681,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 
 static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		       struct scan_control *sc, int type, int tier,
-		       struct list_head *list)
+		       struct list_head *list, int *isolatedp)
 {
 	int i;
 	int gen;
@@ -4756,11 +4751,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 	if (type == LRU_GEN_FILE)
 		sc->nr.file_taken += isolated;
-	/*
-	 * There might not be eligible folios due to reclaim_idx. Check the
-	 * remaining to prevent livelock if it's not making progress.
-	 */
-	return isolated || !remaining ? scanned : 0;
+
+	*isolatedp = isolated;
+	return scanned;
 }
 
 static int get_tier_idx(struct lruvec *lruvec, int type)
@@ -4804,33 +4797,36 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
 
 static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			  struct scan_control *sc, int swappiness,
-			  int *type_scanned, struct list_head *list)
+			  struct list_head *list, int *isolated,
+			  int *isolate_type, int *isolate_scanned)
 {
 	int i;
+	int scanned = 0;
 	int type = get_type_to_scan(lruvec, swappiness);
 
 	for_each_evictable_type(i, swappiness) {
-		int scanned;
+		int type_scan;
 		int tier = get_tier_idx(lruvec, type);
 
-		*type_scanned = type;
+		type_scan = scan_folios(nr_to_scan, lruvec, sc,
+					type, tier, list, isolated);
 
-		scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
-		if (scanned)
-			return scanned;
+		scanned += type_scan;
+		if (*isolated) {
+			*isolate_type = type;
+			*isolate_scanned = type_scan;
+			break;
+		}
 
 		type = !type;
 	}
 
-	return 0;
+	return scanned;
 }
 
 static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			struct scan_control *sc, int swappiness)
 {
-	int type;
-	int scanned;
-	int reclaimed;
 	LIST_HEAD(list);
 	LIST_HEAD(clean);
 	struct folio *folio;
@@ -4838,19 +4834,23 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	enum node_stat_item item;
 	struct reclaim_stat stat;
 	struct lru_gen_mm_walk *walk;
+	int scanned, reclaimed;
+	int isolated = 0, type, type_scanned;
 	bool skip_retry = false;
-	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
 	lruvec_lock_irq(lruvec);
 
-	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
+	/* In case folio deletion left empty old gens, flush them */
+	try_to_inc_min_seq(lruvec, swappiness);
 
-	scanned += try_to_inc_min_seq(lruvec, swappiness);
+	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness,
+				 &list, &isolated, &type, &type_scanned);
 
-	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
-		scanned = 0;
+	/* Isolation might create empty gen, flush them */
+	if (scanned)
+		try_to_inc_min_seq(lruvec, swappiness);
 
 	lruvec_unlock_irq(lruvec);
 
@@ -4861,7 +4861,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
-			scanned, reclaimed, &stat, sc->priority,
+			type_scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {

-- 
2.53.0





* [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (4 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 05/14] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-03  7:50   ` Barry Song
  2026-04-02 18:53 ` [PATCH v3 07/14] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

With a fixed number to reclaim calculated at the beginning, making
each eviction batch smaller should reduce lock contention and avoid
over-aggressive reclaim of folios, as the loop aborts earlier once the
number of folios to be reclaimed is reached.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 643f9fc10214..9c28afb0219c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			break;
 		}
 
-		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+		nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
 		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;

-- 
2.53.0





* [PATCH v3 07/14] mm/mglru: don't abort scan immediately right after aging
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (5 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 08/14] mm/mglru: remove redundant swap constrained check upon isolation Kairui Song via B4 Relay
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Right now, if eviction triggers aging, the reclaimer will abort. This is
not the optimal strategy for several reasons.

Aborting the reclaim early wastes a reclaim cycle when under pressure,
and with concurrent reclaim, if the LRU is undergoing aging, all
concurrent reclaimers might fail. And if the aging has just finished,
new cold folios exposed by the aging are not reclaimed until the next
reclaim iteration.

What's more, the current aging trigger is quite lenient: having 3 gens
with a reclaim priority lower than the default will trigger aging and
block reclaiming from that memcg. This easily wastes reclaim retry
cycles. And in the worst case, if the reclaim is making slow progress
and all following attempts fail due to being blocked by aging, it
triggers an unexpected early OOM.

And if an lruvec requires aging, that doesn't mean it's hot. Instead,
the lruvec could have been idle for quite a while, and hence it might
contain lots of cold folios to be reclaimed.

While it's helpful to rotate the memcg LRU after aging for global
reclaim, since global reclaim fairness is coupled with that rotation
in shrink_many, memcg reclaim fairness is instead handled by the
cgroup iteration in shrink_node_memcgs. So, for memcg-level pressure,
this abort is not the key part of keeping fairness. And in most cases,
there is no need to age, and fairness must be achieved by upper-level
reclaim control anyway.

So instead, just keep the scan going until one whole batch of folios
fails to be isolated (evict_folios returning 0) or enough folios have
been scanned. Only abort after one batch for global reclaim, so when
there are fewer memcgs, progress is still made, and the fairness
mechanism described above still works fine.

And in most cases, the one extra batch attempt for global reclaim may
be just enough to satisfy what the reclaimer needs, hence improving
global reclaim performance by reducing reclaim retry cycles.

Rotation still happens after the reclaim is done, which still follows
the comment in mmzone.h. And fairness still looks good.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c28afb0219c..b3371877762a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4987,7 +4987,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
  */
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
-	bool need_rotate = false;
+	bool need_rotate = false, should_age = false;
 	long nr_batch, nr_to_scan;
 	int swappiness = get_swappiness(lruvec, sc);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5005,7 +5005,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
 			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
 				need_rotate = true;
-			break;
+			should_age = true;
 		}
 
 		nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
@@ -5016,6 +5016,10 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (should_abort_scan(lruvec, sc))
 			break;
 
+		/* For cgroup reclaim, fairness is handled by iterator, not rotation */
+		if (root_reclaim(sc) && should_age)
+			break;
+
 		nr_to_scan -= delta;
 		cond_resched();
 	}

-- 
2.53.0





* [PATCH v3 08/14] mm/mglru: remove redundant swap constrained check upon isolation
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (6 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 07/14] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 09/14] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Remove the swap-constrained early reject check upon isolation. This
check is a micro-optimization that rejects folios early when swap IO
is not allowed. But it is redundant and overly broad, since
shrink_folio_list() already handles all these cases with proper
granularity.

Notably, this check wrongly rejected lazyfree folios, and it doesn't
cover all rejection cases. shrink_folio_list() uses may_enter_fs(),
which distinguishes non-SWP_FS_OPS devices from filesystem-backed
swap, and does all the checks after the folio is locked, so flags like
the swap cache flag are stable.

This check also covers dirty file folios, which are not a problem now,
since sort_folio() already bumps dirty file folios to the next
generation, but it causes trouble for unifying the dirty folio
writeback handling.

And there should be no performance impact from removing it. We may
have lost a micro-optimization, but this unblocks lazyfree reclaim for
NOIO contexts, which is not a common case in the first place.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b3371877762a..9f4512a4d35f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4650,12 +4650,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 {
 	bool success;
 
-	/* swap constrained */
-	if (!(sc->gfp_mask & __GFP_IO) &&
-	    (folio_test_dirty(folio) ||
-	     (folio_test_anon(folio) && !folio_test_swapcache(folio))))
-		return false;
-
 	/* raced with release_pages() */
 	if (!folio_try_get(folio))
 		return false;

-- 
2.53.0





* [PATCH v3 09/14] mm/mglru: use the common routine for dirty/writeback reactivation
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (7 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 08/14] mm/mglru: remove redundant swap constrained check upon isolation Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-03  5:00   ` Kairui Song
  2026-04-02 18:53 ` [PATCH v3 10/14] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, MGLRU moves dirty and writeback folios to the second oldest
gen instead of reactivating them like the classical LRU does. This
might help reduce LRU contention, as it skips the isolation. But as a
result, we see these folios at the LRU tail more frequently, leading
to inefficient reclaim.

Besides, the dirty / writeback check after isolation in
shrink_folio_list is more accurate and covers more cases. So instead,
just drop the special handling for dirty and writeback folios, use the
common routine, and reactivate them like the classical LRU does.

This should in theory improve the scan efficiency. These folios will
be rotated back to the LRU tail once writeback is done, so there is no
risk of hotness inversion. And now each reclaim loop will have a
higher success rate. This also prepares for unifying the writeback and
throttling mechanism with the classical LRU: keeping these folios far
from the tail means detecting the tail batch follows a similar pattern
to the classical LRU.

The micro-optimization that avoids LRU contention by skipping the
isolation is gone, which should be fine. Compared to the IO and
writeback cost, the isolation overhead is trivial.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9f4512a4d35f..2a36cf937061 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4578,7 +4578,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		       int tier_idx)
 {
 	bool success;
-	bool dirty, writeback;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
 	int zone = folio_zonenum(folio);
@@ -4628,21 +4627,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		return true;
 	}
 
-	dirty = folio_test_dirty(folio);
-	writeback = folio_test_writeback(folio);
-	if (type == LRU_GEN_FILE && dirty) {
-		sc->nr.file_taken += delta;
-		if (!writeback)
-			sc->nr.unqueued_dirty += delta;
-	}
-
-	/* waiting for writeback */
-	if (writeback || (type == LRU_GEN_FILE && dirty)) {
-		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
-		return true;
-	}
-
 	return false;
 }
 
@@ -4664,9 +4648,6 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 	if (!folio_test_referenced(folio))
 		set_mask_bits(&folio->flags.f, LRU_REFS_MASK, 0);
 
-	/* for shrink_folio_list() */
-	folio_clear_reclaim(folio);
-
 	success = lru_gen_del_folio(lruvec, folio, true);
 	VM_WARN_ON_ONCE_FOLIO(!success, folio);
 

-- 
2.53.0





* [PATCH v3 10/14] mm/mglru: simplify and improve dirty writeback handling
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (8 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 09/14] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 11/14] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Right now, the flusher wakeup mechanism for MGLRU is less responsive
and less likely to trigger compared to the classical LRU. The
classical LRU wakes the flusher if one batch of folios passed to
shrink_folio_list is unevictable due to being under writeback. MGLRU
instead checks and handles this after the whole reclaim loop is done.

We previously even saw OOM problems due to the passive flusher, which
were fixed but are still not perfect [1].

We have just unified the dirty folio counting and reactivation
routine, so now move the dirty flush into the loop, right after
shrink_folio_list. This improves performance a lot for workloads
involving heavy writeback, and prepares for throttling too.

Test with YCSB workloadb showed a major performance improvement:

Before this series:
Throughput(ops/sec): 62485.02962831822
AverageLatency(us): 500.9746963330107
pgpgin 159347462
workingset_refault_file 34522071

After this commit:
Throughput(ops/sec): 80857.08510208207
AverageLatency(us): 386.653262968934
pgpgin 112233121
workingset_refault_file 19516246

The performance is a lot better, with significantly lower refaults. We
also observed similar or higher performance gains for other real-world
workloads.

We were concerned that the dirty flush could cause more wear on SSDs:
that should not be a problem here, since the wakeup condition is dirty
folios having been pushed to the tail of the LRU, which indicates that
memory pressure is so high that writeback is already blocking the
workload.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 41 ++++++++++++++++-------------------------
 1 file changed, 16 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2a36cf937061..bd2bf45826de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4724,8 +4724,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				scanned, skipped, isolated,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-	if (type == LRU_GEN_FILE)
-		sc->nr.file_taken += isolated;
 
 	*isolatedp = isolated;
 	return scanned;
@@ -4833,12 +4831,27 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		return scanned;
 retry:
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			type_scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
+	/*
+	 * If too many file cache in the coldest generation can't be evicted
+	 * due to being dirty, wake up the flusher.
+	 */
+	if (stat.nr_unqueued_dirty == isolated) {
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+
+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}
+
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
 		DEFINE_MIN_SEQ(lruvec);
 
@@ -4999,28 +5012,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		cond_resched();
 	}
 
-	/*
-	 * If too many file cache in the coldest generation can't be evicted
-	 * due to being dirty, wake up the flusher.
-	 */
-	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken) {
-		struct pglist_data *pgdat = lruvec_pgdat(lruvec);
-
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-
-		/*
-		 * For cgroupv1 dirty throttling is achieved by waking up
-		 * the kernel flusher here and later waiting on folios
-		 * which are in writeback to finish (see shrink_folio_list()).
-		 *
-		 * Flusher may not be able to issue writeback quickly
-		 * enough for cgroupv1 writeback throttling to work
-		 * on a large system.
-		 */
-		if (!writeback_throttling_sane(sc))
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
-	}
-
 	return need_rotate;
 }
 

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 11/14] mm/mglru: remove no longer used reclaim argument for folio protection
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (9 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 10/14] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 12/14] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
                   ` (3 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Now dirty folios under reclaim are handled after isolation, not
before, since dirty reactivation must take the folio off the LRU
first, and that helps unify the dirty handling logic.

So the 'reclaiming' argument is no longer needed; just remove it.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd2bf45826de..9bd0a3b94855 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3220,7 +3220,7 @@ static int folio_update_gen(struct folio *folio, int gen)
 }
 
 /* protect pages accessed multiple times through file descriptors */
-static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclaiming)
+static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio)
 {
 	int type = folio_is_file_lru(folio);
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
@@ -3239,9 +3239,6 @@ static int folio_inc_gen(struct lruvec *lruvec, struct folio *folio, bool reclai
 
 		new_flags = old_flags & ~(LRU_GEN_MASK | LRU_REFS_FLAGS);
 		new_flags |= (new_gen + 1UL) << LRU_GEN_PGOFF;
-		/* for folio_end_writeback() */
-		if (reclaiming)
-			new_flags |= BIT(PG_reclaim);
 	} while (!try_cmpxchg(&folio->flags.f, &old_flags, new_flags));
 
 	lru_gen_update_size(lruvec, folio, old_gen, new_gen);
@@ -3855,7 +3852,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, int swappiness)
 			VM_WARN_ON_ONCE_FOLIO(folio_is_file_lru(folio) != type, folio);
 			VM_WARN_ON_ONCE_FOLIO(folio_zonenum(folio) != zone, folio);
 
-			new_gen = folio_inc_gen(lruvec, folio, false);
+			new_gen = folio_inc_gen(lruvec, folio);
 			list_move_tail(&folio->lru, &lrugen->folios[new_gen][type][zone]);
 
 			/* don't count the workingset being lazily promoted */
@@ -4607,7 +4604,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 
 	/* protected */
 	if (tier > tier_idx || refs + workingset == BIT(LRU_REFS_WIDTH) + 1) {
-		gen = folio_inc_gen(lruvec, folio, false);
+		gen = folio_inc_gen(lruvec, folio);
 		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
 
 		/* don't count the workingset being lazily promoted */
@@ -4622,7 +4619,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 
 	/* ineligible */
 	if (zone > sc->reclaim_idx) {
-		gen = folio_inc_gen(lruvec, folio, false);
+		gen = folio_inc_gen(lruvec, folio);
 		list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
 		return true;
 	}

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 12/14] mm/vmscan: remove sc->file_taken
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (10 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 11/14] mm/mglru: remove no longer used reclaim argument for folio protection Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 13/14] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

No one is using it now, just remove it.

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9bd0a3b94855..e4f27fd22422 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -173,7 +173,6 @@ struct scan_control {
 		unsigned int congested;
 		unsigned int writeback;
 		unsigned int immediate;
-		unsigned int file_taken;
 		unsigned int taken;
 	} nr;
 
@@ -2040,8 +2039,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	sc->nr.writeback += stat.nr_writeback;
 	sc->nr.immediate += stat.nr_immediate;
 	sc->nr.taken += nr_taken;
-	if (file)
-		sc->nr.file_taken += nr_taken;
 
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			nr_scanned, nr_reclaimed, &stat, sc->priority, file);

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 13/14] mm/vmscan: remove sc->unqueued_dirty
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (11 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 12/14] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-02 18:53 ` [PATCH v3 14/14] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
  2026-04-03 21:26 ` [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Axel Rasmussen
  14 siblings, 0 replies; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

No one is using it now, just remove it.

Suggested-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e4f27fd22422..9120d914445e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -169,7 +169,6 @@ struct scan_control {
 
 	struct {
 		unsigned int dirty;
-		unsigned int unqueued_dirty;
 		unsigned int congested;
 		unsigned int writeback;
 		unsigned int immediate;
@@ -2035,7 +2034,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 
 	sc->nr.dirty += stat.nr_dirty;
 	sc->nr.congested += stat.nr_congested;
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr.writeback += stat.nr_writeback;
 	sc->nr.immediate += stat.nr_immediate;
 	sc->nr.taken += nr_taken;

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 14/14] mm/vmscan: unify writeback reclaim statistic and throttling
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (12 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 13/14] mm/vmscan: remove sc->unqueued_dirty Kairui Song via B4 Relay
@ 2026-04-02 18:53 ` Kairui Song via B4 Relay
  2026-04-03 21:15   ` Axel Rasmussen
  2026-04-03 21:26 ` [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Axel Rasmussen
  14 siblings, 1 reply; 24+ messages in thread
From: Kairui Song via B4 Relay @ 2026-04-02 18:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, MGLRU and the classical LRU handle reclaim statistics and
writeback very differently, especially throttling. Basically, MGLRU
just ignores the throttling part.

Let's unify this part: use a helper to deduplicate the code so both
setups share the same behavior.

Tested with the following bash reproducer:

  echo "Setup a slow device using dm delay"
  dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
  LOOP=$(losetup --show -f /var/tmp/backing)
  mkfs.ext4 -q $LOOP
  echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
      dmsetup create slow_dev
  mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow

  echo "Start writeback pressure"
  sync && echo 3 > /proc/sys/vm/drop_caches
  mkdir /sys/fs/cgroup/test_wb
  echo 128M > /sys/fs/cgroup/test_wb/memory.max
  (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
      dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)

  echo "Clean up"
  echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
  dmsetup resume slow_dev
  umount -l /mnt/slow && sync
  dmsetup remove slow_dev

Before this commit, `dd` will get OOM killed immediately if
MGLRU is enabled; the classic LRU is fine.

After this commit, throttling is effective: no more spinning on the
LRU or premature OOM. Stress tests on other workloads also look good.

Global throttling is not covered here yet; we will fix that separately
later.

Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
Tested-by: Leno Hou <lenohou@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
 1 file changed, 41 insertions(+), 49 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9120d914445e..a7b3e5b6676b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
 	return !(current->flags & PF_LOCAL_THROTTLE);
 }
 
+static void handle_reclaim_writeback(unsigned long nr_taken,
+				     struct pglist_data *pgdat,
+				     struct scan_control *sc,
+				     struct reclaim_stat *stat)
+{
+	/*
+	 * If dirty folios are scanned that are not queued for IO, it
+	 * implies that flushers are not doing their job. This can
+	 * happen when memory pressure pushes dirty folios to the end of
+	 * the LRU before the dirty limits are breached and the dirty
+	 * data has expired. It can also happen when the proportion of
+	 * dirty folios grows not through writes but through memory
+	 * pressure reclaiming all the clean cache. And in some cases,
+	 * the flushers simply cannot keep up with the allocation
+	 * rate. Nudge the flusher threads in case they are asleep.
+	 */
+	if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+		/*
+		 * For cgroupv1 dirty throttling is achieved by waking up
+		 * the kernel flusher here and later waiting on folios
+		 * which are in writeback to finish (see shrink_folio_list()).
+		 *
+		 * Flusher may not be able to issue writeback quickly
+		 * enough for cgroupv1 writeback throttling to work
+		 * on a large system.
+		 */
+		if (!writeback_throttling_sane(sc))
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
+	}
+
+	sc->nr.dirty += stat->nr_dirty;
+	sc->nr.congested += stat->nr_congested;
+	sc->nr.writeback += stat->nr_writeback;
+	sc->nr.immediate += stat->nr_immediate;
+	sc->nr.taken += nr_taken;
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
@@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	lruvec_lock_irq(lruvec);
 	lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
 					nr_scanned - nr_reclaimed);
-
-	/*
-	 * If dirty folios are scanned that are not queued for IO, it
-	 * implies that flushers are not doing their job. This can
-	 * happen when memory pressure pushes dirty folios to the end of
-	 * the LRU before the dirty limits are breached and the dirty
-	 * data has expired. It can also happen when the proportion of
-	 * dirty folios grows not through writes but through memory
-	 * pressure reclaiming all the clean cache. And in some cases,
-	 * the flushers simply cannot keep up with the allocation
-	 * rate. Nudge the flusher threads in case they are asleep.
-	 */
-	if (stat.nr_unqueued_dirty == nr_taken) {
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-		/*
-		 * For cgroupv1 dirty throttling is achieved by waking up
-		 * the kernel flusher here and later waiting on folios
-		 * which are in writeback to finish (see shrink_folio_list()).
-		 *
-		 * Flusher may not be able to issue writeback quickly
-		 * enough for cgroupv1 writeback throttling to work
-		 * on a large system.
-		 */
-		if (!writeback_throttling_sane(sc))
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
-	}
-
-	sc->nr.dirty += stat.nr_dirty;
-	sc->nr.congested += stat.nr_congested;
-	sc->nr.writeback += stat.nr_writeback;
-	sc->nr.immediate += stat.nr_immediate;
-	sc->nr.taken += nr_taken;
-
+	handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			nr_scanned, nr_reclaimed, &stat, sc->priority, file);
 	return nr_reclaimed;
@@ -4824,26 +4830,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 retry:
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
 	sc->nr_reclaimed += reclaimed;
+	handle_reclaim_writeback(isolated, pgdat, sc, &stat);
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			type_scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
-	/*
-	 * If too many file cache in the coldest generation can't be evicted
-	 * due to being dirty, wake up the flusher.
-	 */
-	if (stat.nr_unqueued_dirty == isolated) {
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-
-		/*
-		 * For cgroupv1 dirty throttling is achieved by waking up
-		 * the kernel flusher here and later waiting on folios
-		 * which are in writeback to finish (see shrink_folio_list()).
-		 */
-		if (!writeback_throttling_sane(sc))
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
-	}
-
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
 		DEFINE_MIN_SEQ(lruvec);
 
@@ -4886,6 +4877,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	if (!list_empty(&list)) {
 		skip_retry = true;
+		isolated = 0;
 		goto retry;
 	}
 

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 01/14] mm/mglru: consolidate common code for retrieving evictable size
  2026-04-02 18:53 ` [PATCH v3 01/14] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
@ 2026-04-03  3:16   ` Kairui Song
  0 siblings, 0 replies; 24+ messages in thread
From: Kairui Song @ 2026-04-03  3:16 UTC (permalink / raw)
  To: linux-mm
  Cc: Kairui Song, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang

Forwarding the sashiko review in case anyone has concerns about it:

https://sashiko.dev/#/patchset/20260403-mglru-reclaim-v3-0-a285efd6ff91%40tencent.com?part=1

> commit e0119c3e147f68096f59fb96e1e69e83ecdfce40
> Author: Kairui Song <kasong@tencent.com>
> mm/mglru: consolidate common code for retrieving evictable size
> This commit merges commonly used code for counting evictable folios in a
> lruvec into a new helper function lruvec_evictable_size.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 5a8c8fcccbfc9..adc07501a1375 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [ ... ]
> > @@ -4909,9 +4915,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> >  			     int swappiness, unsigned long *nr_to_scan)
> >  {
> > -	int gen, type, zone;
> > -	unsigned long size = 0;
> > -	struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >  	DEFINE_MIN_SEQ(lruvec);
> >  
> >  	*nr_to_scan = 0;
> > @@ -4919,18 +4922,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> >  	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
> >  		return true;
> >  
> > -	for_each_evictable_type(type, swappiness) {
> > -		unsigned long seq;
> > -
> > -		for (seq = min_seq[type]; seq <= max_seq; seq++) {
> > -			gen = lru_gen_from_seq(seq);
> > -
> > -			for (zone = 0; zone < MAX_NR_ZONES; zone++)
> > -				size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
> > -		}
> > -	}
> > -
> > -	*nr_to_scan = size;
> > +	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> The commit message notes there is no behavior change, but does delegating
> to lruvec_evictable_size() introduce a subtle state inconsistency?
> Previously, should_run_aging() evaluated the aging conditions and iterated
> the generations to calculate the evictable size using a single snapshot of
> max_seq (passed as an argument) and min_seq (read locally via
> DEFINE_MIN_SEQ).
> The new lruvec_evictable_size() helper re-reads both sequence numbers
> internally via its own DEFINE_MAX_SEQ and DEFINE_MIN_SEQ macros. If the
> sequence numbers advance concurrently, the calculated size might be based
> on a different snapshot than the one used just lines above to determine if
> aging should run.
> Could this lead to unintended behavioral changes during concurrent
> execution?

This is totally fine.

Right, lruvec_evictable_size re-reads max_seq, and that's really
trivial; I don't think anyone will observe any difference in any way.
The time window is so tiny, and even if a different value is read, the
updated value is more accurate for getting the LRU size.
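
For reference, a sketch of what the consolidated helper presumably
looks like, reconstructed from the loop removed in the hunk quoted
above (the actual helper in the patch may differ slightly):

  static unsigned long lruvec_evictable_size(struct lruvec *lruvec,
                                             int swappiness)
  {
          int gen, type, zone;
          unsigned long size = 0;
          struct lru_gen_folio *lrugen = &lruvec->lrugen;
          DEFINE_MIN_SEQ(lruvec);
          DEFINE_MAX_SEQ(lruvec);

          for_each_evictable_type(type, swappiness) {
                  unsigned long seq;

                  for (seq = min_seq[type]; seq <= max_seq; seq++) {
                          gen = lru_gen_from_seq(seq);

                          for (zone = 0; zone < MAX_NR_ZONES; zone++)
                                  size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
                  }
          }

          return size;
  }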


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 04/14] mm/mglru: restructure the reclaim loop
  2026-04-02 18:53 ` [PATCH v3 04/14] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-04-03  4:44   ` Kairui Song
  0 siblings, 0 replies; 24+ messages in thread
From: Kairui Song @ 2026-04-03  4:44 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, kasong, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang

Forwarding the sashiko review:

https://sashiko.dev/#/patchset/20260403-mglru-reclaim-v3-0-a285efd6ff91%40tencent.com?part=4

> commit 89b3058e01113eb027062254e73446d398af330a
> Author: Kairui Song <kasong@tencent.com>
> mm/mglru: restructure the reclaim loop
> This commit restructures the MGLRU reclaim loop by calculating the scan
> number once at the beginning of reclaim and decoupling aging from rotation.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 963362523782a..93ffb3d98fedf 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> [ ... ]
> > -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> > +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> > +			   struct mem_cgroup *memcg, int swappiness)
> >  {
> > -	bool need_aging;
> > -	unsigned long nr_to_scan;
> > -	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > -	DEFINE_MAX_SEQ(lruvec);
> > -
> > -	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> > -		return -1;
> > -
> > -	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> > +	unsigned long evictable, nr_to_scan;
> >  
> > +	evictable = lruvec_evictable_size(lruvec, swappiness);
> > +	nr_to_scan = evictable;
> [ ... ]
> >  	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> >  
> > -	/* try to get away with not aging at the default priority */
> > -	if (!need_aging || sc->priority == DEF_PRIORITY)
> > -		return nr_to_scan >> sc->priority;
> > +	/*
> > +	 * Always respect scan priority, minimally target some folios
> > +	 * to keep reclaim moving forwards.
> > +	 */
> > +	nr_to_scan >>= sc->priority;
> > +	if (!nr_to_scan)
> > +		nr_to_scan = min(evictable, SWAP_CLUSTER_MAX);
> When falling back to a minimum scan size, does this bypass the memory.low
> protection applied earlier by apply_proportional_protection()?
> Since the fallback uses the original unprotected evictable value rather
> than the protected nr_to_scan calculated prior to the priority shift, it
> appears protected memcgs will evaluate to min(evictable, 32) = 32
> and be forcefully scanned, bypassing the protection.
> Could this also cause an inversion based on memcg size?
> For example, an unprotected memcg with 4,000 folios shifts to 0 at
> DEF_PRIORITY (4000 >> 12) and falls back to 32 folios, whereas a larger
> memcg with 4,096 folios shifts to 1 (4096 >> 12) and scans only 1 folio.

That's really trivial, I think.

The low protection calculation already has a "max(scan, SWAP_CLUSTER_MAX)"
check. If it's a concern here, the
nr_to_scan = min(evictable, SWAP_CLUSTER_MAX) fallback can be dropped.
Actually, in a previous version of this series there was no such check,
and the concern then was that always applying the priority shift could
lead to inefficient reclaim of small cgroups. Having the min check here
seems safer and makes the reclaim loop a bit more like what it was.

In theory and in practice there isn't much difference, since only very
tiny cgroups are slightly affected, and in most cases the reclaimer
already starts by requesting SWAP_CLUSTER_MAX folios to be reclaimed
(nr_to_reclaim).

Maybe some chore cleanup can be done for a perfectly accurate reclaim
budget starting from the reclaim entry point. Not sure if that would be
a practical thing to do, though.
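
To make the scale of the inversion concrete, a tiny userspace
illustration of the arithmetic in the example above (DEF_PRIORITY is
12; plain demo code, not the kernel implementation):

  #include <stdio.h>

  #define DEF_PRIORITY          12
  #define SWAP_CLUSTER_MAX      32UL

  static unsigned long scan_target(unsigned long evictable, int priority)
  {
          unsigned long nr = evictable >> priority;

          /* fall back to a minimal batch when the shift hits zero */
          if (!nr)
                  nr = evictable < SWAP_CLUSTER_MAX ? evictable : SWAP_CLUSTER_MAX;
          return nr;
  }

  int main(void)
  {
          /* 4000 >> 12 == 0: falls back to scanning 32 folios */
          printf("%lu\n", scan_target(4000, DEF_PRIORITY));
          /* 4096 >> 12 == 1: the larger memcg scans just 1 folio */
          printf("%lu\n", scan_target(4096, DEF_PRIORITY));
          return 0;
  }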

> > -	/* stop scanning this lruvec as it's low on cold folios */
> > -	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
> > +	return nr_to_scan;
> >  }
> [ ... ]
> > +static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >  {
> > +	bool need_rotate = false;
> >  	long nr_batch, nr_to_scan;
> > -	unsigned long scanned = 0;
> >  	int swappiness = get_swappiness(lruvec, sc);
> > +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >  
> > -	while (true) {
> > +	nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
> > +	while (nr_to_scan > 0) {
> >  		int delta;
> > +		DEFINE_MAX_SEQ(lruvec);
> >  
> > -		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
> > -		if (nr_to_scan <= 0)
> > +		if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
> > +			need_rotate = true;
> >  			break;
> > +		}
> > +
> > +		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
> > +			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> > +				need_rotate = true;
> > +			break;
> > +		}
> If an lruvec has zero evictable folios in its oldest generation,
> get_nr_to_scan() will return 0, and this loop will be completely bypassed.

No, get_nr_to_scan won't return zero if aging can make this LRU
evictable. And if even aging won't help, there is no point in
reclaiming it.

> Because the memcg rotation check (mem_cgroup_below_min) and the MGLRU aging
> logic (should_run_aging) are now entirely inside this loop, will this leave
> an lruvec permanently stalled?

There won't be a stall. In the worst case, empty memcgs fill the whole
random bucket, and it just takes a few more iterations to find a
reclaimable memcg.

But on second thought, if get_nr_to_scan returns 0 on the first try,
try_to_shrink_lruvec should indeed just return true to rotate the
unevictable LRU and speed up the following reclaim. The old behavior
is a bit fuzzy about this and doesn't make the best choice either:
rotation is decided by whether an aging attempt succeeds, but an
unevictable LRU doesn't always trigger aging.

This part can be improved easily while at it; it should be a nice
micro optimization. I would do that if V3 is sent.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 09/14] mm/mglru: use the common routine for dirty/writeback reactivation
  2026-04-02 18:53 ` [PATCH v3 09/14] mm/mglru: use the common routine for dirty/writeback reactivation Kairui Song via B4 Relay
@ 2026-04-03  5:00   ` Kairui Song
  0 siblings, 0 replies; 24+ messages in thread
From: Kairui Song @ 2026-04-03  5:00 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang

Forwarding the sashiko review; it doesn't seem to be a real issue, but
just in case:

> commit e6160f371a1c25f6b49b0a6639582d0636d9905b
> Author: Kairui Song <kasong@tencent.com>
> mm/mglru: use the common routine for dirty/writeback reactivation
> This commit drops MGLRU-specific handling for dirty and writeback folios,
> instead allowing the common shrink_folio_list() routine to process and
> reactivate them like the classical LRU.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 9f4512a4d35fc..2a36cf937061b 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4628,21 +4627,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >  		return true;
> >  	}
> >  
> > -	dirty = folio_test_dirty(folio);
> > -	writeback = folio_test_writeback(folio);
> > -	if (type == LRU_GEN_FILE && dirty) {
> > -		sc->nr.file_taken += delta;
> > -		if (!writeback)
> > -			sc->nr.unqueued_dirty += delta;
> > -	}
> > -
> > -	/* waiting for writeback */
> > -	if (writeback || (type == LRU_GEN_FILE && dirty)) {
> > -		gen = folio_inc_gen(lruvec, folio, true);
> > -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> > -		return true;
> > -	}
> Does removing this explicit promotion and instead falling back to 
> shrink_folio_list() cause these folios to retain stale tier bits?
> When dirty and writeback folios are processed by shrink_folio_list(),
> they frequently hit the activate_locked path, which sets PG_active on
> the folio.
> When these folios are returned in a list to evict_folios(), it tries to
> clear stale tier bits by checking if the folio is from the oldest
> generation:
>     if (lru_gen_folio_seq(lruvec, folio, false) == min_seq[type])
> Because PG_active is set, lru_gen_folio_seq() calculates a younger
> generation, so this check evaluates to false and the tier bits
> (LRU_REFS_FLAGS) are not cleared.
> Later, the folios are added back to the LRU via move_folios_to_lru() and
> lru_gen_add_folio(). While lru_gen_add_folio() clears PG_active, it
> intentionally does not clear LRU_REFS_FLAGS.
> Could this sequence leave the folios in a younger generation with stale
> PG_referenced and LRU_REFS_MASK bits, artificially inflating their access
> counts?

Actually, the new behavior is better, I believe: clearing the ref bits
for dirty folios makes no sense. We even previously had a local commit
to disable resetting the ref bits in folio_inc_gen unless the folio is
being protected by the ref bits (MGLRU's PID).

The access count contributes to the PID protection, refault tracking,
and things like PSI; leaving these counters untouched should help them
track the folio hotness info better upon the real reclaim once
writeback is done.

I'd better mention this in the commit message indeed.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim
  2026-04-02 18:53 ` [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
@ 2026-04-03  7:50   ` Barry Song
  2026-04-03  9:09     ` Kairui Song
  0 siblings, 1 reply; 24+ messages in thread
From: Barry Song @ 2026-04-03  7:50 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
	Baolin Wang

On Fri, Apr 3, 2026 at 2:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> With a fixed number to reclaim calculated at the beginning, making each
> following step smaller should reduce the lock contention and avoid
> over-aggressive reclaim of folios, as it will abort earlier when the
> number of folios to be reclaimed is reached.
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 643f9fc10214..9c28afb0219c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                         break;
>                 }
>
> -               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);

I’m fine with the smaller batch size, but I wonder if
MIN_LRU_BATCH is too small.

Just curious whether we are calling get_nr_to_scan() more frequently
before we can abort the while (true) loop when reclamation is not
making good progress.

Assuming get_nr_to_scan() also has a cost, I'm not sure whether a
value between MIN_LRU_BATCH and MAX_LRU_BATCH would be better.

Thanks
Barry


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim
  2026-04-03  7:50   ` Barry Song
@ 2026-04-03  9:09     ` Kairui Song
  2026-04-03  9:25       ` Barry Song
  0 siblings, 1 reply; 24+ messages in thread
From: Kairui Song @ 2026-04-03  9:09 UTC (permalink / raw)
  To: Barry Song
  Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Shakeel Butt, Lorenzo Stoakes, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang

On Fri, Apr 03, 2026 at 03:50:37PM +0800, Barry Song wrote:
> On Fri, Apr 3, 2026 at 2:53 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > With a fixed number to reclaim calculated at the beginning, making each
> > following step smaller should reduce the lock contention and avoid
> > over-aggressive reclaim of folios, as it will abort earlier when the
> > number of folios to be reclaimed is reached.
> >
> > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/vmscan.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 643f9fc10214..9c28afb0219c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                         break;
> >                 }
> >
> > -               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> > +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
> 
> I’m fine with the smaller batch size, but I wonder if
> MIN_LRU_BATCH is too small.

Thanks for the review, Barry!

It's quite a reasonable value, I think; for comparison, the classical
LRU's batch size is SWAP_CLUSTER_MAX (32), even smaller than
MIN_LRU_BATCH (64).

I ran many different benchmarks on this, which can be found in the
V2 / V1 cover letters (the cover letter was getting too long, so I
didn't include these results in V3, but I did retest). The new value
looked good everywhere from large servers to small VMs.

It's also a much more reasonable value for batched throttling and
dirty writeback, IMO.

> 
> Just curious if we are calling get_nr_to_scan() more frequently
> before we can abort the while (true) loop if reclamation
> is not making good progress.
> 
> Assume get_nr_to_scan() also has a cost. Not sure if a
> value between MIN_LRU_BATCH and MAX_LRU_BATCH
> would be better.

We are actually calling that less frequently: in a previous commit it
was moved out of the loop to act as a budget control. That's also
where using a smaller batch starts to make more sense.
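
In other words, the flow is now roughly (a simplified sketch pieced
together from the hunks in patches 4 and 6; the aging and abort checks
are elided, and the exact arguments differ):

  /* compute the scan budget once, then consume it in small batches */
  nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
  while (nr_to_scan > 0) {
          nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
          delta = evict_folios(nr_batch, lruvec, ...);
          nr_to_scan -= delta;
  }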

The overhead of the other function calls also seems trivial.

I also wonder if we can unify or remove some SWAP_CLUSTER_MAX usages;
that value might no longer be suitable in many places.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 06/14] mm/mglru: use a smaller batch for reclaim
  2026-04-03  9:09     ` Kairui Song
@ 2026-04-03  9:25       ` Barry Song
  0 siblings, 0 replies; 24+ messages in thread
From: Barry Song @ 2026-04-03  9:25 UTC (permalink / raw)
  To: Kairui Song
  Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Shakeel Butt, Lorenzo Stoakes, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Qi Zheng, Baolin Wang

On Fri, Apr 3, 2026 at 5:09 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Apr 03, 2026 at 03:50:37PM +0800, Barry Song wrote:
> > On Fri, Apr 3, 2026 at 2:53 AM Kairui Song via B4 Relay
> > <devnull+kasong.tencent.com@kernel.org> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > With a fixed number to reclaim calculated at the beginning, making each
> > > following step smaller should reduce the lock contention and avoid
> > > over-aggressive reclaim of folios, as it will abort earlier when the
> > > number of folios to be reclaimed is reached.
> > >
> > > Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
> > > Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
> > > Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > >  mm/vmscan.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 643f9fc10214..9c28afb0219c 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -5008,7 +5008,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> > >                         break;
> > >                 }
> > >
> > > -               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> > > +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
> >
> > I’m fine with the smaller batch size, but I wonder if
> > MIN_LRU_BATCH is too small.
>
> Thanks for the review, Barry!
>
> It's quite reasonable value I think, for comparison classical LRU's
> batch size is SWAP_CLUSTER_MAX (32), even smaller than
> MIN_LRU_BATCH (64).
>
> I ran many different benchmarks on this which can be found in
> V2 / V1's cover letter (it getting too long so I didn't include these
> results in V3 but I did retest). The new value looked good from large
> server to small VMs.
>
> It's also a much more reasonable value for batch throttling and dirty
> writeback IMO.
>
> >
> > Just curious if we are calling get_nr_to_scan() more frequently
> > before we can abort the while (true) loop if reclamation
> > is not making good progress.
> >
> > Assume get_nr_to_scan() also has a cost. Not sure if a
> > value between MIN_LRU_BATCH and MAX_LRU_BATCH
> > would be better.
>
> We are calling that less frequently actually, in a previous
> commit it was moved out of the loop to act like a budget
> control. That's also where using a smaller batch start
> to makes more sense.

Sorry, I missed your earlier change moving get_nr_to_scan()
out of the loop [1]. It makes a lot of sense to me now.

It might be easier to review if moving it out and decreasing the batch
size were put in the same patch. Anyway, I understand the story now,
thanks!

[1] https://lore.kernel.org/linux-mm/20260403-mglru-reclaim-v3-4-a285efd6ff91@tencent.com/

>
> The overhead of other function calls also seems trivial.
>
> I also wonder if we can unify or remove some
> SWAP_CLUSTER_MAX usage, that value might be no longer
> suitable in many places.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 14/14] mm/vmscan: unify writeback reclaim statistic and throttling
  2026-04-02 18:53 ` [PATCH v3 14/14] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
@ 2026-04-03 21:15   ` Axel Rasmussen
  2026-04-04 18:36     ` Kairui Song
  0 siblings, 1 reply; 24+ messages in thread
From: Axel Rasmussen @ 2026-04-03 21:15 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
	Baolin Wang

On Thu, Apr 2, 2026 at 11:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently MGLRU and non-MGLRU handle the reclaim statistic and
> writeback handling very differently, especially throttling.
> Basically MGLRU just ignored the throttling part.
>
> Let's just unify this part, use a helper to deduplicate the code
> so both setups will share the same behavior.
>
> Test using following reproducer using bash:
>
>   echo "Setup a slow device using dm delay"
>   dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
>   LOOP=$(losetup --show -f /var/tmp/backing)
>   mkfs.ext4 -q $LOOP
>   echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
>       dmsetup create slow_dev
>   mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
>
>   echo "Start writeback pressure"
>   sync && echo 3 > /proc/sys/vm/drop_caches
>   mkdir /sys/fs/cgroup/test_wb
>   echo 128M > /sys/fs/cgroup/test_wb/memory.max
>   (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
>       dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
>
>   echo "Clean up"
>   echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
>   dmsetup resume slow_dev
>   umount -l /mnt/slow && sync
>   dmsetup remove slow_dev
>
> Before this commit, `dd` will get OOM killed immediately if
> MGLRU is enabled. Classic LRU is fine.
>
> After this commit, throttling is now effective and no more spin on
> LRU or premature OOM. Stress test on other workloads also looking good.
>
> Global throttling is not here yet, we will fix that separately later.

If I understand correctly, I think this fixes the regression report
[1] from a long time ago that was never fully resolved?

[1]: https://lore.kernel.org/lkml/ZeC-u7GRSptoVqia@chrisdown.name/

We investigated at that time, but I don't feel we reached a consensus
on how to solve it. I think we got a bit bogged down trying to
"completely solve writeback throttling" rather than just making an
incremental improvement that fixed that particular case.

>
> Suggested-by: Chen Ridong <chenridong@huaweicloud.com>
> Tested-by: Leno Hou <lenohou@gmail.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 90 ++++++++++++++++++++++++++++---------------------------------
>  1 file changed, 41 insertions(+), 49 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9120d914445e..a7b3e5b6676b 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1942,6 +1942,44 @@ static int current_may_throttle(void)
>         return !(current->flags & PF_LOCAL_THROTTLE);
>  }
>
> +static void handle_reclaim_writeback(unsigned long nr_taken,
> +                                    struct pglist_data *pgdat,
> +                                    struct scan_control *sc,
> +                                    struct reclaim_stat *stat)
> +{
> +       /*
> +        * If dirty folios are scanned that are not queued for IO, it
> +        * implies that flushers are not doing their job. This can
> +        * happen when memory pressure pushes dirty folios to the end of
> +        * the LRU before the dirty limits are breached and the dirty
> +        * data has expired. It can also happen when the proportion of
> +        * dirty folios grows not through writes but through memory
> +        * pressure reclaiming all the clean cache. And in some cases,
> +        * the flushers simply cannot keep up with the allocation
> +        * rate. Nudge the flusher threads in case they are asleep.
> +        */
> +       if (stat->nr_unqueued_dirty == nr_taken && nr_taken) {
> +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> +               /*
> +                * For cgroupv1 dirty throttling is achieved by waking up
> +                * the kernel flusher here and later waiting on folios
> +                * which are in writeback to finish (see shrink_folio_list()).
> +                *
> +                * Flusher may not be able to issue writeback quickly
> +                * enough for cgroupv1 writeback throttling to work
> +                * on a large system.
> +                */
> +               if (!writeback_throttling_sane(sc))
> +                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> +       }
> +
> +       sc->nr.dirty += stat->nr_dirty;
> +       sc->nr.congested += stat->nr_congested;
> +       sc->nr.writeback += stat->nr_writeback;
> +       sc->nr.immediate += stat->nr_immediate;
> +       sc->nr.taken += nr_taken;
> +}
> +
>  /*
>   * shrink_inactive_list() is a helper for shrink_node().  It returns the number
>   * of reclaimed pages
> @@ -2005,39 +2043,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>         lruvec_lock_irq(lruvec);
>         lru_note_cost_unlock_irq(lruvec, file, stat.nr_pageout,
>                                         nr_scanned - nr_reclaimed);
> -
> -       /*
> -        * If dirty folios are scanned that are not queued for IO, it
> -        * implies that flushers are not doing their job. This can
> -        * happen when memory pressure pushes dirty folios to the end of
> -        * the LRU before the dirty limits are breached and the dirty
> -        * data has expired. It can also happen when the proportion of
> -        * dirty folios grows not through writes but through memory
> -        * pressure reclaiming all the clean cache. And in some cases,
> -        * the flushers simply cannot keep up with the allocation
> -        * rate. Nudge the flusher threads in case they are asleep.
> -        */
> -       if (stat.nr_unqueued_dirty == nr_taken) {
> -               wakeup_flusher_threads(WB_REASON_VMSCAN);
> -               /*
> -                * For cgroupv1 dirty throttling is achieved by waking up
> -                * the kernel flusher here and later waiting on folios
> -                * which are in writeback to finish (see shrink_folio_list()).
> -                *
> -                * Flusher may not be able to issue writeback quickly
> -                * enough for cgroupv1 writeback throttling to work
> -                * on a large system.
> -                */
> -               if (!writeback_throttling_sane(sc))
> -                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> -       }
> -
> -       sc->nr.dirty += stat.nr_dirty;
> -       sc->nr.congested += stat.nr_congested;
> -       sc->nr.writeback += stat.nr_writeback;
> -       sc->nr.immediate += stat.nr_immediate;
> -       sc->nr.taken += nr_taken;
> -
> +       handle_reclaim_writeback(nr_taken, pgdat, sc, &stat);
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                         nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>         return nr_reclaimed;
> @@ -4824,26 +4830,11 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  retry:
>         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
>         sc->nr_reclaimed += reclaimed;
> +       handle_reclaim_writeback(isolated, pgdat, sc, &stat);
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                         type_scanned, reclaimed, &stat, sc->priority,
>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> -       /*
> -        * If too many file cache in the coldest generation can't be evicted
> -        * due to being dirty, wake up the flusher.
> -        */
> -       if (stat.nr_unqueued_dirty == isolated) {
> -               wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
> -               /*
> -                * For cgroupv1 dirty throttling is achieved by waking up
> -                * the kernel flusher here and later waiting on folios
> -                * which are in writeback to finish (see shrink_folio_list()).
> -                */
> -               if (!writeback_throttling_sane(sc))
> -                       reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
> -       }
> -
>         list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>                 DEFINE_MIN_SEQ(lruvec);
>
> @@ -4886,6 +4877,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>         if (!list_empty(&list)) {
>                 skip_retry = true;
> +               isolated = 0;
>                 goto retry;
>         }
>
>
> --
> 2.53.0
>
>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling
  2026-04-02 18:53 [PATCH v3 00/14] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (13 preceding siblings ...)
  2026-04-02 18:53 ` [PATCH v3 14/14] mm/vmscan: unify writeback reclaim statistic and throttling Kairui Song via B4 Relay
@ 2026-04-03 21:26 ` Axel Rasmussen
  14 siblings, 0 replies; 24+ messages in thread
From: Axel Rasmussen @ 2026-04-03 21:26 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
	Baolin Wang

On Thu, Apr 2, 2026 at 11:53 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> This series is based on mm-new.
>
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty writeback handling. As a result, we can see an up to ~30% increase
> in some workloads like MongoDB with YCSB and a huge decrease in file
> refault, no swap involved. Other common benchmarks have no regression,
> and LOC is reduced, with less unexpected OOM, too.
>
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress testing during the development
> of the LSM/MM/BPF topic on improving MGLRU [1]. This series cleans up
> the code base and fixes several performance issues, preparing for
> further work.
>
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective.
>
> This series slightly cleans up and improves these issues using a scan
> budget by calculating the number of folios to scan at the beginning of
> the loop, and decouples aging from the reclaim calculation helpers.
> Then, move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
>
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and a 128G memory machine using NVME as storage.
>
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.
>
> Not using SWAP.
>
> Before:
> Throughput(ops/sec): 62485.02962831822
> AverageLatency(us): 500.9746963330107
> pgpgin 159347462
> pgpgout 5413332
> workingset_refault_anon 0
> workingset_refault_file 34522071
>
> After:
> Throughput(ops/sec): 79760.71784646061 (+27.6%, higher is better)
> AverageLatency(us): 391.25169970043726 (-21.9%, lower is better)
> pgpgin 111093923                       (-30.3%, lower is better)
> pgpgout 5437456
> workingset_refault_anon 0
> workingset_refault_file 19566366       (-43.3%, lower is better)
>
> We can see a significant performance improvement after this series.
> The test is done on NVME and the performance gap would be even larger
> for slow devices, such as HDD or network storage. We observed over
> 100% gain for some workloads with slow IO.
>
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
>
> Before:
> Total requests:            79915
> Per-worker 95% CI (mean):  [1233.9, 1263.5]
> Per-worker stdev:          59.2
> Jain's fairness:           0.997795 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     26859   33.61%   33.61%
> [1,2)s      7818    9.78%   43.39%
> [2,4)s      5532    6.92%   50.31%
> [4,8)s     39706   49.69%  100.00%
>
> After:
> Total requests:            81382
> Per-worker 95% CI (mean):  [1241.9, 1301.3]
> Per-worker stdev:          118.8
> Jain's fairness:           0.991480 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     26696   32.80%   32.80%
> [1,2)s      8745   10.75%   43.55%
> [2,4)s      6865    8.44%   51.98%
> [4,8)s     39076   48.02%  100.00%
>
> Reclaim is still fair and effective, total requests number seems
> slightly better.
>
> OOM issue with aging and throttling
> ===================================
> For the throttling OOM issue, it can be easily reproduced using dd and
> cgroup limit as demonstrated in patch 14, and fixed by this series.
>
> The aging OOM is a bit tricky, a specific reproducer can be used to
> simulate what we encountered in production environment [4]:
> Spawns multiple workers that keep reading the given file using mmap,
> and pauses for 120ms after one file read batch. It also spawns another
> set of workers that keep allocating and freeing a given size of
> anonymous memory. The total memory size exceeds the memory limit
> (eg. 14G anon + 8G file, which is 22G vs a 16G memcg limit).
>
> - MGLRU disabled:
>   Finished 128 iterations.
>
> - MGLRU enabled:
>   OOM with following info after about ~10-20 iterations:
>     [   62.624130] file_anon_mix_p invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>     [   62.624999] memory: usage 16777216kB, limit 16777216kB, failcnt 24460
>     [   62.640200] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>     [   62.640823] Memory cgroup stats for /demo:
>     [   62.641017] anon 10604879872
>     [   62.641941] file 6574858240
>
>   OOM occurs despite there being still evictable file folios.
>
> - MGLRU enabled after this series:
>   Finished 128 iterations.
>
> Worth noting there is another OOM related issue reported in V1 of
> this series, which is tested and looking OK now [5].
>
> MySQL:
> ======
>
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
>   --tables=48 --table-size=2000000 --threads=48 --time=600 run
>
> Before:            17260.781429 tps
> After this series: 17266.842857 tps
>
> MySQL is anon folios heavy, involves writeback and file and still
> looking good. Seems only noise level changes, no regression.
>
> FIO:
> ====
> Testing with the following command, where /mnt/ramdisk is a
> 64G EXT4 ramdisk, each test file is 3G, in a 10G memcg,
> 6 test run each:
>
> fio --directory=/mnt/ramdisk --filename_format='test.$jobnum.img' \
>   --name=cached --numjobs=16 --size=3072M --buffered=1 --ioengine=mmap \
>   --rw=randread --norandommap --time_based \
>   --ramp_time=1m --runtime=5m --group_reporting
>
> Before:            9196.481429 MB/s
> After this series: 9256.105000 MB/s
>
> Also seem only noise level changes and no regression or slightly better.
>
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 12 test run each.
>
> Before:            2589.63s
> After this series: 2543.58s
>
> Also seem only noise level changes, no regression or very slightly better.
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
> Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
> Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
> Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
> Link: https://lore.kernel.org/linux-mm/acgNCzRDVmSbXrOE@KASONG-MC4/ [5]
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> Changes in v3:
> - Don't force scan at least SWAP_CLUSTER_MAX pages for each reclaim
>   loop. If the LRU is too small, adjust it accordingly. Now the
>   multi-cgroup scan balance looked even better for tiny cgroups:
>   https://lore.kernel.org/linux-mm/aciejkdIHyXPNS9Y@KASONG-MC4/
> - Add one patch to remove the swap constraint check in isolate_folio. In
>   theory, it's fine, and both stress test and performance test didn't
>   show any issue:
>   https://lore.kernel.org/linux-mm/CAMgjq7C8TCsK99p85i3QzGCwgkXscTfFB6XCUTWQOcuqwHQa2Q@mail.gmail.com/
> - I reran most tests, all seem identical, so most data is kept.
>   Intermediate test results are dropped. I ran tests on most patches
>   individually, and there is no problem, but the series is getting too
>   long, and posting them makes it harder to read and unnecessary.
> - Split previously patch 8 into two patches as suggested [ Shakeel Butt ],
>   also some test result is collected to support the design:
>   https://lore.kernel.org/linux-mm/ac44BVOvOm8lhVvj@KASONG-MC4/#t
>   I kept Axel's review-by since the code is identical.
> - Call try_to_inc_min_seq twice to avoid a stale empty gen, and drop
>   its return value [ Baolin Wang ].
> - Move a few lines of code between patches to where they fit better;
>   the final result is identical [ Baolin Wang ].
> - Collect Tested-by and update the test setup [ Leno Hou ].
> - Collect Reviewed-by tags.
> - Update a few commit messages [ Shakeel Butt ].
> - Link to v2: https://patch.msgid.link/20260329-mglru-reclaim-v2-0-b53a3678513c@tencent.com
>
> Changes in v2:
> - Rebase on top of mm-new, which includes the cgroup v1 fix from
>   [ Baolin Wang ].
> - Add the dirty throttling OOM fix as patch 12, since
>   [ Chen Ridong ]'s review suggested we shouldn't leave the counters
>   and reclaim feedback in shrink_folio_list untracked in this series.
> - Add a minimum scan count of SWAP_CLUSTER_MAX in patch
>   "restructure the reclaim loop"; the change is trivial but might
>   help avoid livelock for tiny cgroups.
> - Redo the tests; most results are basically identical to before, but
>   rerun just in case, since the series now also solves the throttling
>   issue discussed in reports from CachyOS.
> - Add a separate patch for variable renaming as suggested by [ Barry
>   Song ]. No feature change.
> - Fix several comment and code issues [ Axel Rasmussen ].
> - Remove a no-longer-needed variable [ Axel Rasmussen ].
> - Collect Reviewed-by tags.
> - Link to v1: https://lore.kernel.org/r/20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com
>
> ---
> Kairui Song (14):
>       mm/mglru: consolidate common code for retrieving evictable size
>       mm/mglru: rename variables related to aging and rotation
>       mm/mglru: relocate the LRU scan batch limit to callers
>       mm/mglru: restructure the reclaim loop
>       mm/mglru: scan and count the exact number of folios
>       mm/mglru: use a smaller batch for reclaim
>       mm/mglru: don't abort scan immediately right after aging
>       mm/mglru: remove redundant swap constrained check upon isolation
>       mm/mglru: use the common routine for dirty/writeback reactivation
>       mm/mglru: simplify and improve dirty writeback handling
>       mm/mglru: remove no longer used reclaim argument for folio protection
>       mm/vmscan: remove sc->file_taken
>       mm/vmscan: remove sc->unqueued_dirty
>       mm/vmscan: unify writeback reclaim statistic and throttling

I read through all of the v3 patches; to me they look ready to go. I
don't see any of the remaining small optimizations as a reason not to
merge at this point; there can always be some small follow-up work. :)
Feel free to add this to any of the patches that don't already have it:

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

>
>  mm/vmscan.c | 332 ++++++++++++++++++++++++++----------------------------------
>  1 file changed, 143 insertions(+), 189 deletions(-)
> ---
> base-commit: c17461ca3e91a3fe705685a23ad7edb58d4ee768
> change-id: 20260314-mglru-reclaim-1c9d45ac57f6
>
> Best regards,
> --
> Kairui Song <kasong@tencent.com>
>
>



* Re: [PATCH v3 14/14] mm/vmscan: unify writeback reclaim statistic and throttling
  2026-04-03 21:15   ` Axel Rasmussen
@ 2026-04-04 18:36     ` Kairui Song
  0 siblings, 0 replies; 24+ messages in thread
From: Kairui Song @ 2026-04-04 18:36 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel, Qi Zheng,
	Baolin Wang

On Sat, Apr 4, 2026 at 5:16 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Thu, Apr 2, 2026 at 11:53 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently MGLRU and non-MGLRU handle reclaim statistics and
> > writeback very differently, especially throttling. Basically, MGLRU
> > just ignores the throttling part.
> >
> > Let's unify this: use a helper to deduplicate the code so both
> > setups share the same behavior.
> > Tested with the following bash reproducer:
> >
> >   echo "Setup a slow device using dm delay"
> >   dd if=/dev/zero of=/var/tmp/backing bs=1M count=2048
> >   LOOP=$(losetup --show -f /var/tmp/backing)
> >   mkfs.ext4 -q $LOOP
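> >   # dm-delay table: <dev> <offset> <read_delay> [<dev> <offset> <write_delay>];
> >   # here reads pass through (0 ms) and writes are delayed by 1000 ms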
> >   echo "0 $(blockdev --getsz $LOOP) delay $LOOP 0 0 $LOOP 0 1000" | \
> >       dmsetup create slow_dev
> >   mkdir -p /mnt/slow && mount /dev/mapper/slow_dev /mnt/slow
> >
> >   echo "Start writeback pressure"
> >   sync && echo 3 > /proc/sys/vm/drop_caches
> >   mkdir /sys/fs/cgroup/test_wb
> >   echo 128M > /sys/fs/cgroup/test_wb/memory.max
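> >   # the subshell joins the memcg before running dd, so dd's page
> >   # cache is charged to (and limited by) test_wb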
> >   (echo $BASHPID > /sys/fs/cgroup/test_wb/cgroup.procs && \
> >       dd if=/dev/zero of=/mnt/slow/testfile bs=1M count=192)
> >
> >   echo "Clean up"
> >   echo "0 $(blockdev --getsz $LOOP) error" | dmsetup load slow_dev
> >   dmsetup resume slow_dev
> >   umount -l /mnt/slow && sync
> >   dmsetup remove slow_dev
> >
> > Before this commit, `dd` gets OOM-killed immediately if MGLRU is
> > enabled; the classic LRU is fine.
> >
> > After this commit, throttling is effective: no more spinning on the
> > LRU and no premature OOM. Stress tests on other workloads also look
> > good.
> >
> > Global throttling is not covered here yet; we will fix that
> > separately later.
>
> If I understand correctly, I think this fixes this regression report
> [1] from a long time ago that was never fully resolved?
>
> [1]: https://lore.kernel.org/lkml/ZeC-u7GRSptoVqia@chrisdown.name/
>
> We investigated at that time, but I don't feel we got to a consensus
> on how to solve it. I think we got a bit bogged down trying to
> "completely solve writeback throttling" rather than just doing some
> incremental improvement which fixed that particular case.
>

Hello Axel!

Yes, we also observed that problem. I had almost forgotten about that
report, thanks for the link! No worries: for the majority of users, I
think the problem was already fixed a year ago.

I previously asked Jingxiang to help fix it by waking up the flusher.
In that discussion, the data showed that the flusher was not waking up
at all, and Yafang reported that reverting 14aa8b2d5c2e fixed it. So
Jingxiang's fix seemed to work well at that time:
https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/

AFAIK there have been no more reports of premature OOM on the mailing
list since then, but we later found that that fix isn't enough for
some particular and rare setups (for example, I used dm-delay in the
test script above to simulate slow IO). Usually reclaim can keep up:
it's rare for the LRU to be full of writeback folios, there are always
clean folios to drop, and waking up the flusher is good enough. But
under extreme pressure or with very slow devices, the LRU can get
congested with writeback folios. It's hard to apply reasonable
throttling or improve the dirty flush without a bit more refactoring
first, and that's not the only cgroup OOM problem we encountered.

With this series, I think the known problems mentioned above are all
covered in a clean way.
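
To give a concrete picture, the unified path is roughly shaped like
the following (a simplified sketch reusing existing mm/vmscan.c
symbols, not the literal patch):

  /*
   * Sketch only: both the classic LRU and MGLRU reclaim paths feed
   * their per-batch reclaim_stat into one common helper, so writeback
   * accounting and throttling behave the same for both.
   */
  static void account_and_throttle(pg_data_t *pgdat,
                                   struct scan_control *sc,
                                   struct reclaim_stat *stat,
                                   unsigned long nr_taken)
  {
          sc->nr.dirty += stat->nr_dirty;
          sc->nr.congested += stat->nr_congested;
          sc->nr.writeback += stat->nr_writeback;

          /*
           * Everything isolated in this batch is dirty but not yet
           * queued for writeback: the flusher is lagging, so wake it
           * up and wait instead of spinning on the LRU until OOM.
           */
          if (nr_taken && stat->nr_unqueued_dirty == nr_taken) {
                  wakeup_flusher_threads(WB_REASON_VMSCAN);
                  reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
          }
  }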

Global pressure and throttling are still not handled; that's an even
rarer problem, since the LRU getting congested with writeback globally
already seems like a really bad situation to me. It can also be fixed
separately later.


