public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed
* [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
@ 2026-03-17 19:08 Kairui Song via B4 Relay
  2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
                   ` (8 more replies)
  0 siblings, 9 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

This series cleans up and slightly improves MGLRU's reclaim loop and
dirty flush logic. As a result, we see a reduction of up to ~50% in file
faults and a 30% increase in MongoDB throughput with YCSB and no swap
involved; other common benchmarks show no regression, LOC is reduced,
and there are fewer unexpected OOMs in our production environment.

Some of the problems were found in our production environment, and
others were mostly exposed while stress testing the LFU-like design
proposed in the LSF/MM/BPF topic this year [1]. This series has no
direct relationship to that topic, but it cleans up the code base and
fixes several strange behaviors that made the test results of the
LFU-like design worse than expected.

MGLRU's reclaim loop is a bit complex, so these problems are related
to each other. The aging, scan count calculation, and reclaim loop are
coupled together, and the dirty folio handling logic is quite
different, making the reclaim loop hard to follow and the dirty flush
ineffective too.

This series slightly cleans up and improves the reclaim loop by using a
scan budget: the number of folios to scan is calculated once at the
beginning of the loop, and aging is decoupled from the reclaim
calculation helpers. The dirty flush logic is then moved inside the
reclaim loop so it can kick in more effectively. Together, these
changes improve MGLRU reclaim in several ways.

Test results: All tests were done on a 48c96t NUMA machine with 2
nodes and 128G of memory, using NVMe as storage.

MongoDB
=======
Running YCSB workloadb [2] (recordcount: 20000000, operationcount:
6000000, threads: 32), which does 95% reads and 5% updates to generate
mixed reads and dirty writeback. MongoDB is set up in a 10G cgroup
using Docker, with the WiredTiger cache size set to 4.5G, using NVMe
as storage.

No swap is used.

Median of 3 test runs; results are stable.

Before:
Throughput(ops/sec): 61642.78008938203
AverageLatency(us):  507.11127774145166
pgpgin 158190589
pgpgout 5880616
workingset_refault 7262988

After:
Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
pgpgin 101871227                        (-35.6%, lower is better)
pgpgout 5770028
workingset_refault 3418186              (-52.9%, lower is better)

We can see a significant performance improvement after this series for
file-cache-heavy workloads like this one. The test was done on NVMe;
the performance gap would be even larger on slow devices, and we
observed a >100% gain for some other workloads running on HDDs.

Chrome & Node.js [3]
====================
Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine with
2 nodes and 128G of memory, using 256G of ZRAM as swap and spawning 64
workers across 32 memcgs:

Before:
Total requests:            77920
Per-worker 95% CI (mean):  [1199.9, 1235.1]
Per-worker stdev:          70.5
Jain's fairness:           0.996706 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     25649   32.92%   32.92%
[1,2)s      7759    9.96%   42.87%
[2,4)s      5156    6.62%   49.49%
[4,8)s     39356   50.51%  100.00%

After:
Total requests:            79564
Per-worker 95% CI (mean):  [1224.2, 1262.2]
Per-worker stdev:          76.1
Jain's fairness:           0.996328 (1.0 = perfectly fair)
Latency:
Bucket     Count      Pct    Cumul
[0,1)s     25485   32.03%   32.03%
[1,2)s      8661   10.89%   42.92%
[2,4)s      6268    7.88%   50.79%
[4,8)s     39150   49.21%  100.00%

The results look essentially identical: reclaim is still fair and
effective, and the total request count is slightly better.

OOM issue [4]
=============
Testing with a specific reproducer [4] to simulate what we encountered
in the production environment. Still using the same test machine, but
one node is used as a pmem ramdisk following the steps in the
reproducer; no swap is used.

This reproducer spawns multiple workers that keep reading the given
file using mmap and pause for 120ms after each file read batch. It
also spawns another set of workers that keep allocating and freeing a
given amount of anonymous memory. The total memory size exceeds the
memory limit (e.g. 44G anon + 8G file, which is 52G vs a 48G memcg limit).
But by evicting the file cache, the workload should hold just fine,
especially given that the file worker pauses after every batch, allowing
other workers to catch up.

- MGLRU disabled:
  Finished 128 iterations.

- MGLRU enabled:
  Hung or OOMed with the following info after about 10-20 iterations:

    [  357.332946] file_anon_mix_pressure invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    ... <snip> ...
    [  357.333827] memory: usage 50331648kB, limit 50331648kB, failcnt 90907
    [  357.347728] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [  357.348192] Memory cgroup stats for /demo:
    [  357.348314] anon 46724382720
    [  357.348963] file 4160753664

  OOM occurs even though there are still evictable file folios.

- MGLRU enabled after this series:
  Finished 128 iterations.

With aging blocking reclaim, the OOM is much more likely to occur.
This issue is mostly fixed by patch 6 and the result is much better,
but this series is still only the first step toward improving file
folio reclaim for MGLRU, as there are still cases where file folios
can't be effectively reclaimed.

MySQL:
======

Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
ZRAM as swap, and the test command:

sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
  --tables=48 --table-size=2000000 --threads=96 --time=600 run

Before:         22343.701667 tps
After patch 4:  22327.325000 tps
After patch 5:  22373.224000 tps
After patch 6:  22321.174000 tps
After patch 7:  22625.961667 tps (+1.26%, higher is better)

MySQL is anon-folio heavy but still looks good. Only noise-level
changes, no regression.

FIO:
====
Testing with the following command, where /mnt is an EXT4 ramdisk, 6
test runs each in a 10G memcg:

fio -name=cached --numjobs=16 --filename=/mnt/test.img --buffered=1 \
  --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
  --iodepth_batch_complete=32 --rw=randread \
  --random_distribution=zipf:1.2 --norandommap --time_based \
  --ramp_time=1m --runtime=10m --group_reporting

Before:        32039.56 MB/s
After patch 3: 32751.50 MB/s
After patch 4: 32703.03 MB/s
After patch 5: 33395.52 MB/s
After patch 6: 32031.51 MB/s
After patch 7: 32534.29 MB/s

Also only noise-level changes and no regression.

Build kernel:
=============
Kernel build test using ZRAM as swap, on top of tmpfs, in a 3G memcg
using make -j96 and defconfig, measuring system time, 8 test runs each.

Before:        2881.41s
After patch 3: 2894.09s
After patch 4: 2846.73s
After patch 5: 2847.91s
After patch 6: 2835.17s
After patch 7: 2842.90s

Also only noise-level changes: no regression, or very slightly better.

Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]

Signed-off-by: Kairui Song <kasong@tencent.com>
---
Kairui Song (8):
      mm/mglru: consolidate common code for retrieving evictable size
      mm/mglru: relocate the LRU scan batch limit to callers
      mm/mglru: restructure the reclaim loop
      mm/mglru: scan and count the exact number of folios
      mm/mglru: use a smaller batch for reclaim
      mm/mglru: don't abort scan immediately right after aging
      mm/mglru: simplify and improve dirty writeback handling
      mm/vmscan: remove sc->file_taken

 mm/vmscan.c | 191 ++++++++++++++++++++++++++----------------------------------
 1 file changed, 81 insertions(+), 110 deletions(-)
---
base-commit: dffde584d8054e88e597e3f28de04c7f5d191a67
change-id: 20260314-mglru-reclaim-1c9d45ac57f6

Best regards,
-- 
Kairui Song <kasong@tencent.com>




^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/8] mm/mglru: consolidate common code for retrieving evictable size
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
@ 2026-03-17 19:08 ` Kairui Song via B4 Relay
  2026-03-17 19:55   ` Yuanchu Xie
                     ` (3 more replies)
  2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
                   ` (7 subsequent siblings)
  8 siblings, 4 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Merge commonly used code for counting evictable folios in a lruvec.

No behavior change.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 42 +++++++++++++++++-------------------------
 1 file changed, 17 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 33287ba4a500..d7fc7f1fe06d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4078,27 +4078,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
 	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
 }
 
-static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+static long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
 {
 	int gen, type, zone;
-	unsigned long total = 0;
-	int swappiness = get_swappiness(lruvec, sc);
+	unsigned long seq, total = 0;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
 	DEFINE_MIN_SEQ(lruvec);
 
 	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
 		for (seq = min_seq[type]; seq <= max_seq; seq++) {
 			gen = lru_gen_from_seq(seq);
-
 			for (zone = 0; zone < MAX_NR_ZONES; zone++)
 				total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
 		}
 	}
 
+	return total;
+}
+
+static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
+{
+	unsigned long total;
+	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
+
+	total = lruvec_evictable_size(lruvec, swappiness);
+
 	/* whether the size is big enough to be helpful */
 	return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
 }
@@ -4921,9 +4927,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 			     int swappiness, unsigned long *nr_to_scan)
 {
-	int gen, type, zone;
-	unsigned long size = 0;
-	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	DEFINE_MIN_SEQ(lruvec);
 
 	*nr_to_scan = 0;
@@ -4931,18 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
 	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
 
-	for_each_evictable_type(type, swappiness) {
-		unsigned long seq;
-
-		for (seq = min_seq[type]; seq <= max_seq; seq++) {
-			gen = lru_gen_from_seq(seq);
-
-			for (zone = 0; zone < MAX_NR_ZONES; zone++)
-				size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
-		}
-	}
-
-	*nr_to_scan = size;
+	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
@@ -4954,7 +4946,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
  */
 static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
 {
-	bool success;
+	bool need_aging;
 	unsigned long nr_to_scan;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	DEFINE_MAX_SEQ(lruvec);
@@ -4962,7 +4954,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
 	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
 		return -1;
 
-	success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
+	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
 
 	/* try to scrape all its memory if this memcg was deleted */
 	if (nr_to_scan && !mem_cgroup_online(memcg))
@@ -4971,7 +4963,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
 	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
 
 	/* try to get away with not aging at the default priority */
-	if (!success || sc->priority == DEF_PRIORITY)
+	if (!need_aging || sc->priority == DEF_PRIORITY)
 		return nr_to_scan >> sc->priority;
 
 	/* stop scanning this lruvec as it's low on cold folios */

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
  2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evictable size Kairui Song via B4 Relay
@ 2026-03-17 19:08 ` Kairui Song via B4 Relay
  2026-03-19  2:00   ` Chen Ridong
                     ` (2 more replies)
  2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
                   ` (6 subsequent siblings)
  8 siblings, 3 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

As with the active/inactive LRU, MGLRU isolates and scans folios in
batches.  The batch split is hidden deep in the helper, which makes
the code harder to follow.  The helper's arguments are also confusing,
since callers usually request more folios than the batch size, so the
helper almost never processes the full requested amount.

Move the batch splitting into the top-level loop to make this cleaner;
there should be no behavior change.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d7fc7f1fe06d..d48074f9bd87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4689,10 +4689,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int scanned = 0;
 	int isolated = 0;
 	int skipped = 0;
-	int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
-	int remaining = scan_batch;
+	unsigned long remaining = nr_to_scan;
 	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 
+	VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
 	VM_WARN_ON_ONCE(!list_empty(list));
 
 	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
@@ -4745,7 +4745,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	mod_lruvec_state(lruvec, item, isolated);
 	mod_lruvec_state(lruvec, PGREFILL, sorted);
 	mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
-	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
+	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				scanned, skipped, isolated,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 	if (type == LRU_GEN_FILE)
@@ -4827,7 +4827,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 		*type_scanned = type;
 
-		scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
+		scanned = scan_folios(nr_to_scan, lruvec, sc,
+				      type, tier, list);
 		if (scanned)
 			return scanned;
 
@@ -4999,7 +5000,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
-	long nr_to_scan;
+	long nr_batch, nr_to_scan;
 	unsigned long scanned = 0;
 	int swappiness = get_swappiness(lruvec, sc);
 
@@ -5010,7 +5011,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (nr_to_scan <= 0)
 			break;
 
-		delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
+		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;
 
@@ -5615,6 +5617,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
 static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
 			int swappiness, unsigned long nr_to_reclaim)
 {
+	int nr_batch;
 	DEFINE_MAX_SEQ(lruvec);
 
 	if (seq + MIN_NR_GENS > max_seq)
@@ -5631,8 +5634,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
 		if (sc->nr_reclaimed >= nr_to_reclaim)
 			return 0;
 
-		if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
-				  swappiness))
+		nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
+		if (!evict_folios(nr_batch, lruvec, sc, swappiness))
 			return 0;
 
 		cond_resched();

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 3/8] mm/mglru: restructure the reclaim loop
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
  2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
  2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
@ 2026-03-17 19:08 ` Kairui Song via B4 Relay
  2026-03-20 20:09   ` Axel Rasmussen
                     ` (2 more replies)
  2026-03-17 19:09 ` [PATCH 4/8] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
                   ` (5 subsequent siblings)
  8 siblings, 3 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The current loop recalculates the scan count on each iteration. The
number of folios to scan is based on the LRU length, with some unclear
behaviors; e.g., it only shifts the scan count by the reclaim priority
at the default priority, and it couples the count calculation with
aging and rotation.

Simplify it and decouple aging and rotation: calculate the scan count
once at the beginning of the reclaim, always respect the reclaim
priority, and make the aging and rotation more explicit.

This slightly changes how offline memcg aging works: previously, an
offline memcg wouldn't be aged unless it had no evictable folios. Now,
we might age it if it has only 3 generations and the reclaim priority
is below DEF_PRIORITY, which should be fine. On one hand, an offline
memcg might still hold long-term folios; in fact, a long-existing
offline memcg must be pinned by some long-term folios like shmem.
These folios might be used by other memcgs, so aging them like an
ordinary memcg doesn't seem wrong. Besides, aging enables further
reclaim of an offlined memcg, which will certainly happen if we keep
shrinking it. And offline memcgs might soon no longer be an issue once
reparenting is fully in place.

Overall, the memcg LRU rotation, as described in mmzone.h,
remains the same.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 74 ++++++++++++++++++++++++++++++-------------------------------
 1 file changed, 36 insertions(+), 38 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index d48074f9bd87..ed5b5f8dd3c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4926,49 +4926,35 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 }
 
 static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
-			     int swappiness, unsigned long *nr_to_scan)
+			     struct scan_control *sc, int swappiness)
 {
 	DEFINE_MIN_SEQ(lruvec);
 
-	*nr_to_scan = 0;
 	/* have to run aging, since eviction is not possible anymore */
 	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
 		return true;
 
-	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
+	/* try to get away with not aging at the default priority */
+	if (sc->priority == DEF_PRIORITY)
+		return false;
+
 	/* better to run aging even though eviction is still possible */
 	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
 }
 
-/*
- * For future optimizations:
- * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
- *    reclaim.
- */
-static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
+static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
+			   struct mem_cgroup *memcg, int swappiness)
 {
-	bool need_aging;
 	unsigned long nr_to_scan;
-	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
-	DEFINE_MAX_SEQ(lruvec);
-
-	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
-		return -1;
-
-	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
 
+	nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
 	/* try to scrape all its memory if this memcg was deleted */
-	if (nr_to_scan && !mem_cgroup_online(memcg))
+	if (!mem_cgroup_online(memcg))
 		return nr_to_scan;
 
 	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
-
-	/* try to get away with not aging at the default priority */
-	if (!need_aging || sc->priority == DEF_PRIORITY)
-		return nr_to_scan >> sc->priority;
-
-	/* stop scanning this lruvec as it's low on cold folios */
-	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
+	/* always respect scan priority */
+	return nr_to_scan >> sc->priority;
 }
 
 static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
@@ -4998,31 +4984,43 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
 	return true;
 }
 
+/*
+ * For future optimizations:
+ * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
+ *    reclaim.
+ */
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
+	bool need_rotate = false;
 	long nr_batch, nr_to_scan;
-	unsigned long scanned = 0;
 	int swappiness = get_swappiness(lruvec, sc);
+	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 
-	while (true) {
+	nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
+	while (nr_to_scan > 0) {
 		int delta;
+		DEFINE_MAX_SEQ(lruvec);
 
-		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
-		if (nr_to_scan <= 0)
+		if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
+			need_rotate = true;
 			break;
+		}
+
+		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
+			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+				need_rotate = true;
+			break;
+		}
 
 		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
 		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;
 
-		scanned += delta;
-		if (scanned >= nr_to_scan)
-			break;
-
 		if (should_abort_scan(lruvec, sc))
 			break;
 
+		nr_to_scan -= delta;
 		cond_resched();
 	}
 
@@ -5034,12 +5032,12 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		wakeup_flusher_threads(WB_REASON_VMSCAN);
 
 	/* whether this lruvec should be rotated */
-	return nr_to_scan < 0;
+	return need_rotate;
 }
 
 static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 {
-	bool success;
+	bool need_rotate;
 	unsigned long scanned = sc->nr_scanned;
 	unsigned long reclaimed = sc->nr_reclaimed;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5057,7 +5055,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 		memcg_memory_event(memcg, MEMCG_LOW);
 	}
 
-	success = try_to_shrink_lruvec(lruvec, sc);
+	need_rotate = try_to_shrink_lruvec(lruvec, sc);
 
 	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
 
@@ -5067,10 +5065,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
 
 	flush_reclaim_state(sc);
 
-	if (success && mem_cgroup_online(memcg))
+	if (need_rotate && mem_cgroup_online(memcg))
 		return MEMCG_LRU_YOUNG;
 
-	if (!success && lruvec_is_sizable(lruvec, sc))
+	if (!need_rotate && lruvec_is_sizable(lruvec, sc))
 		return 0;
 
 	/* one retry if offlined or too small */

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 4/8] mm/mglru: scan and count the exact number of folios
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (2 preceding siblings ...)
  2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-03-17 19:09 ` Kairui Song via B4 Relay
  2026-03-20 20:57   ` Axel Rasmussen
  2026-03-17 19:09 ` [PATCH 5/8] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Make the scan helpers return the exact number of folios scanned or
isolated. This should make the scan more accurate and easier to
follow.

There is no longer any need for special handling when no progress is
made. The old livelock prevention `(return isolated || !remaining ?
scanned : 0)` is replaced by the natural scan budget exhaustion in
try_to_shrink_lruvec, and sort_folio moves ineligible folios to newer
generations.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ed5b5f8dd3c7..4f4548ff3a17 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4680,7 +4680,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
 
 static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		       struct scan_control *sc, int type, int tier,
-		       struct list_head *list)
+		       struct list_head *list, int *isolatedp)
 {
 	int i;
 	int gen;
@@ -4750,11 +4750,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 	if (type == LRU_GEN_FILE)
 		sc->nr.file_taken += isolated;
-	/*
-	 * There might not be eligible folios due to reclaim_idx. Check the
-	 * remaining to prevent livelock if it's not making progress.
-	 */
-	return isolated || !remaining ? scanned : 0;
+
+	*isolatedp = isolated;
+	return scanned;
 }
 
 static int get_tier_idx(struct lruvec *lruvec, int type)
@@ -4819,23 +4817,24 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			  int *type_scanned, struct list_head *list)
 {
 	int i;
+	int scanned = 0;
+	int isolated = 0;
 	int type = get_type_to_scan(lruvec, swappiness);
 
 	for_each_evictable_type(i, swappiness) {
-		int scanned;
 		int tier = get_tier_idx(lruvec, type);
 
 		*type_scanned = type;
 
-		scanned = scan_folios(nr_to_scan, lruvec, sc,
-				      type, tier, list);
-		if (scanned)
+		scanned += scan_folios(nr_to_scan, lruvec, sc,
+				      type, tier, list, &isolated);
+		if (isolated)
 			return scanned;
 
 		type = !type;
 	}
 
-	return 0;
+	return scanned;
 }
 
 static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
@@ -4852,7 +4851,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct reclaim_stat stat;
 	struct lru_gen_mm_walk *walk;
 	bool skip_retry = false;
-	struct lru_gen_folio *lrugen = &lruvec->lrugen;
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
@@ -4860,10 +4858,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
 
-	scanned += try_to_inc_min_seq(lruvec, swappiness);
-
-	if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
-		scanned = 0;
+	try_to_inc_min_seq(lruvec, swappiness);
 
 	lruvec_unlock_irq(lruvec);
 

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 5/8] mm/mglru: use a smaller batch for reclaim
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (3 preceding siblings ...)
  2026-03-17 19:09 ` [PATCH 4/8] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
@ 2026-03-17 19:09 ` Kairui Song via B4 Relay
  2026-03-20 20:58   ` Axel Rasmussen
  2026-03-24  7:51   ` Chen Ridong
  2026-03-17 19:09 ` [PATCH 6/8] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

With a fixed reclaim target calculated at the beginning, making each
following step smaller should reduce lock contention and avoid
over-aggressive reclaim of folios, as the loop aborts earlier once the
number of folios to be reclaimed is reached.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f4548ff3a17..2ff1609ff4de 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5007,7 +5007,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 			break;
 		}
 
-		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
+		nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
 		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
 		if (!delta)
 			break;

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 6/8] mm/mglru: don't abort scan immediately right after aging
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (4 preceding siblings ...)
  2026-03-17 19:09 ` [PATCH 5/8] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
@ 2026-03-17 19:09 ` Kairui Song via B4 Relay
  2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Right now, if eviction triggers aging, the reclaimer will abort. This is
not the optimal strategy for several reasons.

Aborting the reclaim early wastes a reclaim cycle when under pressure,
and for concurrent reclaim, if the LRU is undergoing aging, all
concurrent reclaimers might fail. And if the aging has just finished,
the new cold folios it exposed are not reclaimed until the next reclaim
iteration.

What's more, the current aging trigger is quite lenient: having only 3
gens with a reclaim priority lower than the default will trigger aging
and block reclaiming from that memcg. This easily wastes reclaim retry
cycles. In the worst case, if reclaim is making slow progress and all
following attempts fail because they are blocked by aging, it triggers
an unexpected early OOM.

Besides, a lruvec requiring aging doesn't mean it's hot. On the
contrary, the lruvec could have been idle for quite a while, and hence
might contain lots of cold folios to be reclaimed.

While it's helpful to rotate the memcg LRU after aging for global
reclaim, since global reclaim fairness is coupled with the rotation in
shrink_many, memcg fairness is instead handled by the cgroup iteration
in shrink_node_memcgs. So, for memcg level pressure, this abort is not
the key part of keeping the fairness. In most cases there is no need to
age at all, and fairness has to be achieved by upper-level reclaim
control anyway.

So instead, just keep scanning unless a whole batch of folios failed to
be isolated or enough folios have been scanned, both of which are
indicated by evict_folios returning 0. Only abort for global reclaim
after one batch, so that when there are fewer memcgs, progress is still
made and the fairness mechanism described above still works fine.

In most cases, that one extra batch attempt for global reclaim is just
enough to satisfy what the reclaimer needs, improving global reclaim
performance by reducing reclaim retry cycles.

Rotation still happens after the reclaim is done, which still follows
the comment in mmzone.h. And fairness still looks good.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2ff1609ff4de..b26959d90850 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4986,7 +4986,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
  */
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
-	bool need_rotate = false;
+	bool need_rotate = false, should_age = false;
 	long nr_batch, nr_to_scan;
 	int swappiness = get_swappiness(lruvec, sc);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
@@ -5004,7 +5004,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
 			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
 				need_rotate = true;
-			break;
+			should_age = true;
 		}
 
 		nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
@@ -5015,6 +5015,10 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		if (should_abort_scan(lruvec, sc))
 			break;
 
+		/* Cgroup reclaim fairness not guarded by rotate */
+		if (root_reclaim(sc) && should_age)
+			break;
+
 		nr_to_scan -= delta;
 		cond_resched();
 	}

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (5 preceding siblings ...)
  2026-03-17 19:09 ` [PATCH 6/8] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
@ 2026-03-17 19:09 ` Kairui Song via B4 Relay
  2026-03-20 21:18   ` Axel Rasmussen
                     ` (2 more replies)
  2026-03-17 19:09 ` [PATCH 8/8] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
  2026-03-25  4:49 ` [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Eric Naim
  8 siblings, 3 replies; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The current handling of dirty writeback folios does not work well for
file page heavy workloads: dirty folios are protected and moved to the
next gen upon isolation instead of getting throttled or reactivated
upon pageout (shrink_folio_list).

This might help to reduce LRU lock contention slightly, but as a
result, the ping-pong effect of folios between the head and tail of the
last two gens is serious, as the shrinker runs into protected dirty
writeback folios far more frequently than activation would cause. The
dirty flush wakeup condition is also much more passive compared to the
active / inactive LRU: the active / inactive LRU wakes the flusher if
one whole batch of folios passed to shrink_folio_list is unevictable
due to being under writeback, but MGLRU instead has to check this after
the whole reclaim loop is done, and then compare the isolation
protection count against the total reclaim count.

And we previously saw OOM problems with it, too, which were fixed but
still not perfect [1].

So instead, just drop the special handling for dirty writeback folios
and re-activate them like the active / inactive LRU does. Also move the
dirty flush wakeup check to right after shrink_folio_list. This should
improve both throttling and performance.

Test with YCSB workloadb showed a major performance improvement:

Before this series:
Throughput(ops/sec): 61642.78008938203
AverageLatency(us):  507.11127774145166
pgpgin 158190589
pgpgout 5880616
workingset_refault 7262988

After this commit:
Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
pgpgin 101871227                        (-35.6%, lower is better)
pgpgout 5770028
workingset_refault 3418186              (-52.9%, lower is better)

The refault rate is 50% lower, and throughput is 30% higher, which is a
huge gain. We also observed significant performance gain for other
real-world workloads.

We were concerned that the dirty flush could cause more wear for SSDs,
but that should not be a problem here, since the wakeup condition is
that the dirty folios have been pushed to the tail of the LRU, which
indicates memory pressure is already so high that writeback is blocking
the workload.

Signed-off-by: Kairui Song <kasong@tencent.com>
Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
---
 mm/vmscan.c | 44 +++++++++++++-------------------------------
 1 file changed, 13 insertions(+), 31 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b26959d90850..e11d0f1a8b68 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4577,7 +4577,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		       int tier_idx)
 {
 	bool success;
-	bool dirty, writeback;
 	int gen = folio_lru_gen(folio);
 	int type = folio_is_file_lru(folio);
 	int zone = folio_zonenum(folio);
@@ -4627,21 +4626,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
 		return true;
 	}
 
-	dirty = folio_test_dirty(folio);
-	writeback = folio_test_writeback(folio);
-	if (type == LRU_GEN_FILE && dirty) {
-		sc->nr.file_taken += delta;
-		if (!writeback)
-			sc->nr.unqueued_dirty += delta;
-	}
-
-	/* waiting for writeback */
-	if (writeback || (type == LRU_GEN_FILE && dirty)) {
-		gen = folio_inc_gen(lruvec, folio, true);
-		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
-		return true;
-	}
-
 	return false;
 }
 
@@ -4748,8 +4732,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
 				scanned, skipped, isolated,
 				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
-	if (type == LRU_GEN_FILE)
-		sc->nr.file_taken += isolated;
 
 	*isolatedp = isolated;
 	return scanned;
@@ -4814,11 +4796,11 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
 
 static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 			  struct scan_control *sc, int swappiness,
-			  int *type_scanned, struct list_head *list)
+			  int *type_scanned,
+			  struct list_head *list, int *isolated)
 {
 	int i;
 	int scanned = 0;
-	int isolated = 0;
 	int type = get_type_to_scan(lruvec, swappiness);
 
 	for_each_evictable_type(i, swappiness) {
@@ -4827,8 +4809,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		*type_scanned = type;
 
 		scanned += scan_folios(nr_to_scan, lruvec, sc,
-				      type, tier, list, &isolated);
-		if (isolated)
+				      type, tier, list, isolated);
+		if (*isolated)
 			return scanned;
 
 		type = !type;
@@ -4843,6 +4825,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	int type;
 	int scanned;
 	int reclaimed;
+	int isolated = 0;
 	LIST_HEAD(list);
 	LIST_HEAD(clean);
 	struct folio *folio;
@@ -4856,7 +4839,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 
 	lruvec_lock_irq(lruvec);
 
-	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
+	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list, &isolated);
 
 	try_to_inc_min_seq(lruvec, swappiness);
 
@@ -4866,12 +4849,18 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 		return scanned;
 retry:
 	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
-	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			scanned, reclaimed, &stat, sc->priority,
 			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
 
+	/*
+	 * If too many file cache in the coldest generation can't be evicted
+	 * due to being dirty, wake up the flusher.
+	 */
+	if (stat.nr_unqueued_dirty == isolated)
+		wakeup_flusher_threads(WB_REASON_VMSCAN);
+
 	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
 		DEFINE_MIN_SEQ(lruvec);
 
@@ -5023,13 +5012,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		cond_resched();
 	}
 
-	/*
-	 * If too many file cache in the coldest generation can't be evicted
-	 * due to being dirty, wake up the flusher.
-	 */
-	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
-		wakeup_flusher_threads(WB_REASON_VMSCAN);
-
 	/* whether this lruvec should be rotated */
 	return need_rotate;
 }

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH 8/8] mm/vmscan: remove sc->file_taken
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (6 preceding siblings ...)
  2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
@ 2026-03-17 19:09 ` Kairui Song via B4 Relay
  2026-03-20 21:19   ` Axel Rasmussen
  2026-03-25  4:49 ` [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Eric Naim
  8 siblings, 1 reply; 44+ messages in thread
From: Kairui Song via B4 Relay @ 2026-03-17 19:09 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

No one is using it now, just remove it.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/vmscan.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e11d0f1a8b68..b95c9fc17edf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -174,7 +174,6 @@ struct scan_control {
 		unsigned int congested;
 		unsigned int writeback;
 		unsigned int immediate;
-		unsigned int file_taken;
 		unsigned int taken;
 	} nr;
 
@@ -2041,8 +2040,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	sc->nr.writeback += stat.nr_writeback;
 	sc->nr.immediate += stat.nr_immediate;
 	sc->nr.taken += nr_taken;
-	if (file)
-		sc->nr.file_taken += nr_taken;
 
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
 			nr_scanned, nr_reclaimed, &stat, sc->priority, file);

-- 
2.53.0




^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size
  2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
@ 2026-03-17 19:55   ` Yuanchu Xie
  2026-03-18  9:42   ` Barry Song
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 44+ messages in thread
From: Yuanchu Xie @ 2026-03-17 19:55 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

Hi Kairui,

On Tue, Mar 17, 2026 at 2:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Merge commonly used code for counting evictable folios in a lruvec.
>
> No behavior change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 42 +++++++++++++++++-------------------------
>  1 file changed, 17 insertions(+), 25 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..d7fc7f1fe06d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4078,27 +4078,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
>         sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
>  }
>
> -static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> +static long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
>  {
>         int gen, type, zone;
> -       unsigned long total = 0;
> -       int swappiness = get_swappiness(lruvec, sc);
> +       unsigned long seq, total = 0;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> -       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MAX_SEQ(lruvec);
>         DEFINE_MIN_SEQ(lruvec);
>
>         for_each_evictable_type(type, swappiness) {
> -               unsigned long seq;
> -
>                 for (seq = min_seq[type]; seq <= max_seq; seq++) {
>                         gen = lru_gen_from_seq(seq);
> -
>                         for (zone = 0; zone < MAX_NR_ZONES; zone++)
>                                 total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
>                 }
>         }
>
> +       return total;
> +}
> +
> +static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +       unsigned long total;
> +       int swappiness = get_swappiness(lruvec, sc);
> +       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> +       total = lruvec_evictable_size(lruvec, swappiness);
> +
>         /* whether the size is big enough to be helpful */
>         return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
>  }
> @@ -4921,9 +4927,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>                              int swappiness, unsigned long *nr_to_scan)
>  {
> -       int gen, type, zone;
> -       unsigned long size = 0;
> -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         DEFINE_MIN_SEQ(lruvec);
>
>         *nr_to_scan = 0;
> @@ -4931,18 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>         if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
>                 return true;
>
> -       for_each_evictable_type(type, swappiness) {
> -               unsigned long seq;
> -
> -               for (seq = min_seq[type]; seq <= max_seq; seq++) {
> -                       gen = lru_gen_from_seq(seq);
> -
> -                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
> -                               size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
> -               }
> -       }
> -
> -       *nr_to_scan = size;
> +       *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
>         /* better to run aging even though eviction is still possible */
>         return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
>  }
> @@ -4954,7 +4946,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>   */
>  static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
>  {
> -       bool success;
> +       bool need_aging;
>         unsigned long nr_to_scan;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MAX_SEQ(lruvec);
> @@ -4962,7 +4954,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
>         if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
>                 return -1;
>
> -       success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> +       need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
>
>         /* try to scrape all its memory if this memcg was deleted */
>         if (nr_to_scan && !mem_cgroup_online(memcg))
> @@ -4971,7 +4963,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
>         nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
>
>         /* try to get away with not aging at the default priority */
> -       if (!success || sc->priority == DEF_PRIORITY)
> +       if (!need_aging || sc->priority == DEF_PRIORITY)
>                 return nr_to_scan >> sc->priority;
>
>         /* stop scanning this lruvec as it's low on cold folios */
>
> --
> 2.53.0
>
>

Yep, the cleanup makes sense.

Acked-by: Yuanchu Xie <yuanchu@google.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size
  2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
  2026-03-17 19:55   ` Yuanchu Xie
@ 2026-03-18  9:42   ` Barry Song
  2026-03-18  9:57     ` Kairui Song
  2026-03-19  1:40   ` Chen Ridong
  2026-03-26  6:25   ` Baolin Wang
  3 siblings, 1 reply; 44+ messages in thread
From: Barry Song @ 2026-03-18  9:42 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Wed, Mar 18, 2026 at 3:11 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Merge commonly used code for counting evictable folios in a lruvec.
>
> No behavior change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Reviewed-by: Barry Song <baohua@kernel.org>

[...]
>  static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
>  {
> -       bool success;
> +       bool need_aging;

Nice! Many times, I’ve been in the process of submitting a patch
to rename this `success`, as its current name is completely
unreadable and unclear in meaning.

Another `success` also needs some cleanup.
I mean this one:

static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
{
        bool success;
         ...
         success = try_to_shrink_lruvec(lruvec, sc);
}

yet:

static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
{
        ...
        /* whether this lruvec should be rotated */
        return nr_to_scan < 0;
}

I really can't see the connection between "should be rotated"
and "success".

>         unsigned long nr_to_scan;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         DEFINE_MAX_SEQ(lruvec);
> @@ -4962,7 +4954,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
>         if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
>                 return -1;
>
> -       success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> +       need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
>
>         /* try to scrape all its memory if this memcg was deleted */
>         if (nr_to_scan && !mem_cgroup_online(memcg))
> @@ -4971,7 +4963,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
>         nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
>
>         /* try to get away with not aging at the default priority */
> -       if (!success || sc->priority == DEF_PRIORITY)
> +       if (!need_aging || sc->priority == DEF_PRIORITY)
>                 return nr_to_scan >> sc->priority;
>
>         /* stop scanning this lruvec as it's low on cold folios */
>
Thanks
Barry


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size
  2026-03-18  9:42   ` Barry Song
@ 2026-03-18  9:57     ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-18  9:57 UTC (permalink / raw)
  To: Barry Song
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Wed, Mar 18, 2026 at 5:47 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Mar 18, 2026 at 3:11 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Merge commonly used code for counting evictable folios in a lruvec.
> >
> > No behavior change.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Reviewed-by: Barry Song <baohua@kernel.org>
>

Hi Barry,

Thanks for the review.

> [...]
> >  static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> >  {
> > -       bool success;
> > +       bool need_aging;
>
> Nice! Many times, I’ve been in the process of submitting a patch
> to rename this `success`, as its current name is completely
> unreadable and unclear in meaning.
>
> Another `success`also needs some cleanup.
> I mean this one:

Good suggestion. Perhaps I'd better split it into a standalone patch
with your Suggested-by; I will include such a patch in V2.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size
  2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
  2026-03-17 19:55   ` Yuanchu Xie
  2026-03-18  9:42   ` Barry Song
@ 2026-03-19  1:40   ` Chen Ridong
  2026-03-20 19:51     ` Axel Rasmussen
  2026-03-26  6:25   ` Baolin Wang
  3 siblings, 1 reply; 44+ messages in thread
From: Chen Ridong @ 2026-03-19  1:40 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel



On 2026/3/18 3:08, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Merge commonly used code for counting evictable folios in a lruvec.
> 
> No behavior change.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>

> ---
>  mm/vmscan.c | 42 +++++++++++++++++-------------------------
>  1 file changed, 17 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 33287ba4a500..d7fc7f1fe06d 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4078,27 +4078,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
>  	sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
>  }
>  
> -static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> +static long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
>  {
>  	int gen, type, zone;
> -	unsigned long total = 0;
> -	int swappiness = get_swappiness(lruvec, sc);
> +	unsigned long seq, total = 0;
>  	struct lru_gen_folio *lrugen = &lruvec->lrugen;
> -	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>  	DEFINE_MAX_SEQ(lruvec);
>  	DEFINE_MIN_SEQ(lruvec);
>  
>  	for_each_evictable_type(type, swappiness) {
> -		unsigned long seq;
> -
>  		for (seq = min_seq[type]; seq <= max_seq; seq++) {
>  			gen = lru_gen_from_seq(seq);
> -
>  			for (zone = 0; zone < MAX_NR_ZONES; zone++)
>  				total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
>  		}
>  	}
>  
> +	return total;
> +}
> +
> +static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> +{
> +	unsigned long total;
> +	int swappiness = get_swappiness(lruvec, sc);
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> +
> +	total = lruvec_evictable_size(lruvec, swappiness);
> +
>  	/* whether the size is big enough to be helpful */
>  	return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
>  }
> @@ -4921,9 +4927,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>  			     int swappiness, unsigned long *nr_to_scan)
>  {
> -	int gen, type, zone;
> -	unsigned long size = 0;
> -	struct lru_gen_folio *lrugen = &lruvec->lrugen;
>  	DEFINE_MIN_SEQ(lruvec);
>  
>  	*nr_to_scan = 0;
> @@ -4931,18 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>  	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
>  		return true;
>  
> -	for_each_evictable_type(type, swappiness) {
> -		unsigned long seq;
> -
> -		for (seq = min_seq[type]; seq <= max_seq; seq++) {
> -			gen = lru_gen_from_seq(seq);
> -
> -			for (zone = 0; zone < MAX_NR_ZONES; zone++)
> -				size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
> -		}
> -	}
> -
> -	*nr_to_scan = size;
> +	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
>  	/* better to run aging even though eviction is still possible */
>  	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
>  }
> @@ -4954,7 +4946,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
>   */
>  static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
>  {
> -	bool success;
> +	bool need_aging;

I have suffered a lot because of this name. Thank you.

>  	unsigned long nr_to_scan;
>  	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>  	DEFINE_MAX_SEQ(lruvec);
> @@ -4962,7 +4954,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
>  	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
>  		return -1;
>  
> -	success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> +	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
>  
>  	/* try to scrape all its memory if this memcg was deleted */
>  	if (nr_to_scan && !mem_cgroup_online(memcg))
> @@ -4971,7 +4963,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
>  	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
>  
>  	/* try to get away with not aging at the default priority */
> -	if (!success || sc->priority == DEF_PRIORITY)
> +	if (!need_aging || sc->priority == DEF_PRIORITY)
>  		return nr_to_scan >> sc->priority;
>  
>  	/* stop scanning this lruvec as it's low on cold folios */
> 

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers
  2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
@ 2026-03-19  2:00   ` Chen Ridong
  2026-03-19  4:12     ` Kairui Song
  2026-03-20 21:00   ` Axel Rasmussen
  2026-03-22  8:14   ` Barry Song
  2 siblings, 1 reply; 44+ messages in thread
From: Chen Ridong @ 2026-03-19  2:00 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel



On 2026/3/18 3:08, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Same as active / inactive LRU, MGLRU isolates and scans folios in
> batches.  The batch split is done hidden deep in the helper, which
> makes the code harder to follow.  The helper's arguments are also
> confusing since callers usually request more folios than the batch
> size, so the helper almost never processes the full requested amount.
> 
> Move the batch splitting into the top loop to make it cleaner, there
> should be no behavior change.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

I prefer to keep it as is.

If we move min(nr_to_scan, MAX_LRU_BATCH) out of scan_folios, callers
(potentially many functions in the future) would need to handle this logic
themselves, which seems unnecessary. The scan_folios helper should remain cohesive.

Thanks.

> ---
>  mm/vmscan.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d7fc7f1fe06d..d48074f9bd87 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4689,10 +4689,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	int scanned = 0;
>  	int isolated = 0;
>  	int skipped = 0;
> -	int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
> -	int remaining = scan_batch;
> +	unsigned long remaining = nr_to_scan;
>  	struct lru_gen_folio *lrugen = &lruvec->lrugen;
>  
> +	VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
>  	VM_WARN_ON_ONCE(!list_empty(list));
>  
>  	if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
> @@ -4745,7 +4745,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	mod_lruvec_state(lruvec, item, isolated);
>  	mod_lruvec_state(lruvec, PGREFILL, sorted);
>  	mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
> -	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
> +	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>  				scanned, skipped, isolated,
>  				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>  	if (type == LRU_GEN_FILE)
> @@ -4827,7 +4827,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  
>  		*type_scanned = type;
>  
> -		scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
> +		scanned = scan_folios(nr_to_scan, lruvec, sc,
> +				      type, tier, list);
>  		if (scanned)
>  			return scanned;
>  
> @@ -4999,7 +5000,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
>  
>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -	long nr_to_scan;
> +	long nr_batch, nr_to_scan;
>  	unsigned long scanned = 0;
>  	int swappiness = get_swappiness(lruvec, sc);
>  
> @@ -5010,7 +5011,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  		if (nr_to_scan <= 0)
>  			break;
>  
> -		delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
> +		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> +		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>  		if (!delta)
>  			break;
>  
> @@ -5615,6 +5617,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
>  static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
>  			int swappiness, unsigned long nr_to_reclaim)
>  {
> +	int nr_batch;
>  	DEFINE_MAX_SEQ(lruvec);
>  
>  	if (seq + MIN_NR_GENS > max_seq)
> @@ -5631,8 +5634,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
>  		if (sc->nr_reclaimed >= nr_to_reclaim)
>  			return 0;
>  
> -		if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
> -				  swappiness))
> +		nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
> +		if (!evict_folios(nr_batch, lruvec, sc, swappiness))
>  			return 0;
>  
>  		cond_resched();
> 

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers
  2026-03-19  2:00   ` Chen Ridong
@ 2026-03-19  4:12     ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-19  4:12 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Thu, Mar 19, 2026 at 10:00 AM Chen Ridong <chenridong@huaweicloud.com> wrote:
> On 2026/3/18 3:08, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Same as active / inactive LRU, MGLRU isolates and scans folios in
> > batches.  The batch split is hidden deep in the helper, which
> > makes the code harder to follow.  The helper's arguments are also
> > confusing since callers usually request more folios than the batch
> > size, so the helper almost never processes the full requested amount.
> >
> > Move the batch splitting into the top loop to make it cleaner; there
> > should be no behavior change.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> I prefer to keep it as is.
>
> If we move min(nr_to_scan, MAX_LRU_BATCH) out of scan_folios, callers
> (potentially many functions in the future) would need to handle this logic
> themselves, which seems unnecessary. The scan_folios helper should remain cohesive.
>

Hi Ridong,

This patch is mostly for later use, and there are currently only two
callers: one from the default reclaim loop, and one from the manual
reclaim interface (not memory.reclaim; I mean MGLRU's command
interface).

In the default reclaim loop, later we want to control the exact number
of folios being scanned for each iteration, or at least use a smaller
batch value. For the manual reclaim interface, using a large batch
seems more reasonable.

So I think this patch is needed. I can merge it into a later patch,
but keeping it separate seems cleaner.



* Re: [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size
  2026-03-19  1:40   ` Chen Ridong
@ 2026-03-20 19:51     ` Axel Rasmussen
  2026-03-22 16:10       ` Kairui Song
  0 siblings, 1 reply; 44+ messages in thread
From: Axel Rasmussen @ 2026-03-20 19:51 UTC (permalink / raw)
  To: Chen Ridong
  Cc: kasong, linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

For what it's worth, I applied the full series and ran it through some
basic functional testing; I didn't see any bugs or regressions from
that.

Unfortunately, the best signal would be actually deploying it under
some real serving workloads, but the latency for me to do that + get
results is like order(weeks) and I suspect you don't want to wait that
long. :)

This particular commit looks good besides one minor nitpick:

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

On Wed, Mar 18, 2026 at 7:19 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>
>
>
> On 2026/3/18 3:08, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Merge commonly used code for counting evictable folios in a lruvec.
> >
> > No behavior change.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>
>
> > ---
> >  mm/vmscan.c | 42 +++++++++++++++++-------------------------
> >  1 file changed, 17 insertions(+), 25 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 33287ba4a500..d7fc7f1fe06d 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4078,27 +4078,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
> >       sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
> >  }
> >
> > -static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> > +static long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
> >  {

Since `total` is unsigned long, should this function likewise return
`unsigned long`? It seems ideal to avoid conversions unless there's a
good reason to do so.

> >       int gen, type, zone;
> > -     unsigned long total = 0;
> > -     int swappiness = get_swappiness(lruvec, sc);
> > +     unsigned long seq, total = 0;
> >       struct lru_gen_folio *lrugen = &lruvec->lrugen;
> > -     struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >       DEFINE_MAX_SEQ(lruvec);
> >       DEFINE_MIN_SEQ(lruvec);
> >
> >       for_each_evictable_type(type, swappiness) {
> > -             unsigned long seq;
> > -
> >               for (seq = min_seq[type]; seq <= max_seq; seq++) {
> >                       gen = lru_gen_from_seq(seq);
> > -
> >                       for (zone = 0; zone < MAX_NR_ZONES; zone++)
> >                               total += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
> >               }
> >       }
> >
> > +     return total;
> > +}
> > +
> > +static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> > +{
> > +     unsigned long total;
> > +     int swappiness = get_swappiness(lruvec, sc);
> > +     struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > +
> > +     total = lruvec_evictable_size(lruvec, swappiness);
> > +
> >       /* whether the size is big enough to be helpful */
> >       return mem_cgroup_online(memcg) ? (total >> sc->priority) : total;
> >  }
> > @@ -4921,9 +4927,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> >                            int swappiness, unsigned long *nr_to_scan)
> >  {
> > -     int gen, type, zone;
> > -     unsigned long size = 0;
> > -     struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >       DEFINE_MIN_SEQ(lruvec);
> >
> >       *nr_to_scan = 0;
> > @@ -4931,18 +4934,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> >       if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
> >               return true;
> >
> > -     for_each_evictable_type(type, swappiness) {
> > -             unsigned long seq;
> > -
> > -             for (seq = min_seq[type]; seq <= max_seq; seq++) {
> > -                     gen = lru_gen_from_seq(seq);
> > -
> > -                     for (zone = 0; zone < MAX_NR_ZONES; zone++)
> > -                             size += max(READ_ONCE(lrugen->nr_pages[gen][type][zone]), 0L);
> > -             }
> > -     }
> > -
> > -     *nr_to_scan = size;
> > +     *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> >       /* better to run aging even though eviction is still possible */
> >       return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
> >  }
> > @@ -4954,7 +4946,7 @@ static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> >   */
> >  static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> >  {
> > -     bool success;
> > +     bool need_aging;
>
> I have suffered a lot because of this name. Thank you.
>
> >       unsigned long nr_to_scan;
> >       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >       DEFINE_MAX_SEQ(lruvec);
> > @@ -4962,7 +4954,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
> >       if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> >               return -1;
> >
> > -     success = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> > +     need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> >
> >       /* try to scrape all its memory if this memcg was deleted */
> >       if (nr_to_scan && !mem_cgroup_online(memcg))
> > @@ -4971,7 +4963,7 @@ static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int s
> >       nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> >
> >       /* try to get away with not aging at the default priority */
> > -     if (!success || sc->priority == DEF_PRIORITY)
> > +     if (!need_aging || sc->priority == DEF_PRIORITY)
> >               return nr_to_scan >> sc->priority;
> >
> >       /* stop scanning this lruvec as it's low on cold folios */
> >
>
> --
> Best regards,
> Ridong
>



* Re: [PATCH 3/8] mm/mglru: restructure the reclaim loop
  2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
@ 2026-03-20 20:09   ` Axel Rasmussen
  2026-03-22 16:11     ` Kairui Song
  2026-03-24  6:41   ` Chen Ridong
  2026-03-26  7:31   ` Baolin Wang
  2 siblings, 1 reply; 44+ messages in thread
From: Axel Rasmussen @ 2026-03-20 20:09 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

This looks like a reasonable refactor to me. The new code is
more straightforward to reason about, and I don't see anything this
breaks (either by inspection or with basic functional testing).

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The current loop will calculate the scan number on each iteration. The
> number of folios to scan is based on the LRU length, with some unclear
> behaviors, e.g., it only shifts the scan number by reclaim priority at the
> default priority, and it couples the number calculation with aging and
> rotation.
>
> Adjust and simplify it, and decouple aging from rotation: calculate the
> scan number just once at the beginning of the reclaim, always respect the
> reclaim priority, and make the aging and rotation more explicit.
>
> This slightly changes how offline memcg aging works: previously, offline
> memcg wouldn't be aged unless it didn't have any evictable folios. Now,
> we might age it if it has only 3 generations and the reclaim priority is
> less than DEF_PRIORITY, which should be fine. On one hand, offline memcg
> might still hold long-term folios, and in fact, a long-existing offline
> memcg must be pinned by some long-term folios like shmem. These folios
> might be used by other memcgs, so aging them like an ordinary memcg doesn't
> seem wrong. Besides, aging enables further reclaim of an offlined
> memcg, which will certainly happen if we keep shrinking it. And offline
> memcg might soon no longer be an issue once reparenting is fully ready.
>
> Overall, the memcg LRU rotation, as described in mmzone.h,
> remains the same.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 74 ++++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 36 insertions(+), 38 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d48074f9bd87..ed5b5f8dd3c7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4926,49 +4926,35 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  }
>
>  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> -                            int swappiness, unsigned long *nr_to_scan)
> +                            struct scan_control *sc, int swappiness)
>  {
>         DEFINE_MIN_SEQ(lruvec);
>
> -       *nr_to_scan = 0;
>         /* have to run aging, since eviction is not possible anymore */
>         if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
>                 return true;
>
> -       *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> +       /* try to get away with not aging at the default priority */
> +       if (sc->priority == DEF_PRIORITY)
> +               return false;
> +
>         /* better to run aging even though eviction is still possible */
>         return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
>  }
>
> -/*
> - * For future optimizations:
> - * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> - *    reclaim.
> - */
> -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> +                          struct mem_cgroup *memcg, int swappiness)
>  {
> -       bool need_aging;
>         unsigned long nr_to_scan;
> -       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> -       DEFINE_MAX_SEQ(lruvec);
> -
> -       if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> -               return -1;
> -
> -       need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
>
> +       nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
>         /* try to scrape all its memory if this memcg was deleted */
> -       if (nr_to_scan && !mem_cgroup_online(memcg))
> +       if (!mem_cgroup_online(memcg))
>                 return nr_to_scan;
>
>         nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> -
> -       /* try to get away with not aging at the default priority */
> -       if (!need_aging || sc->priority == DEF_PRIORITY)
> -               return nr_to_scan >> sc->priority;
> -
> -       /* stop scanning this lruvec as it's low on cold folios */
> -       return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
> +       /* always respect scan priority */
> +       return nr_to_scan >> sc->priority;
>  }
>
>  static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> @@ -4998,31 +4984,43 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
>         return true;
>  }
>
> +/*
> + * For future optimizations:
> + * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> + *    reclaim.
> + */
>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> +       bool need_rotate = false;
>         long nr_batch, nr_to_scan;
> -       unsigned long scanned = 0;
>         int swappiness = get_swappiness(lruvec, sc);
> +       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>
> -       while (true) {
> +       nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
> +       while (nr_to_scan > 0) {
>                 int delta;
> +               DEFINE_MAX_SEQ(lruvec);
>
> -               nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
> -               if (nr_to_scan <= 0)
> +               if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
> +                       need_rotate = true;
>                         break;
> +               }
> +
> +               if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
> +                       if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> +                               need_rotate = true;
> +                       break;
> +               }
>
>                 nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
>                 delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>                 if (!delta)
>                         break;
>
> -               scanned += delta;
> -               if (scanned >= nr_to_scan)
> -                       break;
> -
>                 if (should_abort_scan(lruvec, sc))
>                         break;
>
> +               nr_to_scan -= delta;
>                 cond_resched();
>         }
>
> @@ -5034,12 +5032,12 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 wakeup_flusher_threads(WB_REASON_VMSCAN);
>
>         /* whether this lruvec should be rotated */

It's a nitpick, but with the variable rename, this comment isn't doing
as much good now. :)

> -       return nr_to_scan < 0;
> +       return need_rotate;
>  }
>
>  static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -       bool success;
> +       bool need_rotate;
>         unsigned long scanned = sc->nr_scanned;
>         unsigned long reclaimed = sc->nr_reclaimed;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> @@ -5057,7 +5055,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>                 memcg_memory_event(memcg, MEMCG_LOW);
>         }
>
> -       success = try_to_shrink_lruvec(lruvec, sc);
> +       need_rotate = try_to_shrink_lruvec(lruvec, sc);
>
>         shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>
> @@ -5067,10 +5065,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>
>         flush_reclaim_state(sc);
>
> -       if (success && mem_cgroup_online(memcg))
> +       if (need_rotate && mem_cgroup_online(memcg))
>                 return MEMCG_LRU_YOUNG;
>
> -       if (!success && lruvec_is_sizable(lruvec, sc))
> +       if (!need_rotate && lruvec_is_sizable(lruvec, sc))
>                 return 0;
>
>         /* one retry if offlined or too small */
>
> --
> 2.53.0
>
>



* Re: [PATCH 4/8] mm/mglru: scan and count the exact number of folios
  2026-03-17 19:09 ` [PATCH 4/8] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
@ 2026-03-20 20:57   ` Axel Rasmussen
  2026-03-22 16:20     ` Kairui Song
  0 siblings, 1 reply; 44+ messages in thread
From: Axel Rasmussen @ 2026-03-20 20:57 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Make the scan helpers return the exact number of folios being scanned
> or isolated. This should make the scan more accurate and easier to
> follow.
>
> Now there is no longer any need for special handling when no progress
> is made. The old livelock prevention `(return isolated ||
> !remaining ? scanned : 0)` is replaced by the natural scan budget
> exhaustion in try_to_shrink_lruvec, and sort_folio moves ineligible
> folios to newer generations.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 27 +++++++++++----------------
>  1 file changed, 11 insertions(+), 16 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index ed5b5f8dd3c7..4f4548ff3a17 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4680,7 +4680,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
>
>  static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>                        struct scan_control *sc, int type, int tier,
> -                      struct list_head *list)
> +                      struct list_head *list, int *isolatedp)
>  {
>         int i;
>         int gen;
> @@ -4750,11 +4750,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>         if (type == LRU_GEN_FILE)
>                 sc->nr.file_taken += isolated;
> -       /*
> -        * There might not be eligible folios due to reclaim_idx. Check the
> -        * remaining to prevent livelock if it's not making progress.
> -        */
> -       return isolated || !remaining ? scanned : 0;
> +
> +       *isolatedp = isolated;
> +       return scanned;
>  }
>
>  static int get_tier_idx(struct lruvec *lruvec, int type)
> @@ -4819,23 +4817,24 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>                           int *type_scanned, struct list_head *list)
>  {
>         int i;
> +       int scanned = 0;
> +       int isolated = 0;
>         int type = get_type_to_scan(lruvec, swappiness);
>
>         for_each_evictable_type(i, swappiness) {
> -               int scanned;
>                 int tier = get_tier_idx(lruvec, type);
>
>                 *type_scanned = type;

I think this is problematic: now `isolate_folios` can scan a nonzero
amount of more than one type of memory. Then the caller (`evict_folios`)
calls `trace_mm_vmscan_lru_shrink_inactive` with the total scanned
amount but only the last type we scanned (potentially misattributing
part of the scan). Not a "functional" issue, but it could mean confusing
data for anyone watching the tracepoint.


>
> -               scanned = scan_folios(nr_to_scan, lruvec, sc,
> -                                     type, tier, list);
> -               if (scanned)
> +               scanned += scan_folios(nr_to_scan, lruvec, sc,
> +                                     type, tier, list, &isolated);
> +               if (isolated)
>                         return scanned;
>
>                 type = !type;
>         }
>
> -       return 0;
> +       return scanned;
>  }
>
>  static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> @@ -4852,7 +4851,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>         struct reclaim_stat stat;
>         struct lru_gen_mm_walk *walk;
>         bool skip_retry = false;
> -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>
> @@ -4860,10 +4858,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>         scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
>
> -       scanned += try_to_inc_min_seq(lruvec, swappiness);
> -
> -       if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
> -               scanned = 0;
> +       try_to_inc_min_seq(lruvec, swappiness);

IIUC, this change is what introduces the issue patch 6 is trying to
resolve. Is it worth squashing patch 6 into this one, so we don't
have this non-ideal intermediate state?

>
>         lruvec_unlock_irq(lruvec);
>
>
> --
> 2.53.0
>
>



* Re: [PATCH 5/8] mm/mglru: use a smaller batch for reclaim
  2026-03-17 19:09 ` [PATCH 5/8] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
@ 2026-03-20 20:58   ` Axel Rasmussen
  2026-03-24  7:51   ` Chen Ridong
  1 sibling, 0 replies; 44+ messages in thread
From: Axel Rasmussen @ 2026-03-20 20:58 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> With a fixed number to reclaim calculated at the beginning, making each
> following step smaller should reduce lock contention and avoid
> over-aggressive reclaim of folios, as the loop will abort earlier once
> the target number of folios has been reclaimed.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f4548ff3a17..2ff1609ff4de 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5007,7 +5007,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                         break;
>                 }
>
> -               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> +               nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
>                 delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>                 if (!delta)
>                         break;
>
> --
> 2.53.0
>
>



* Re: [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers
  2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
  2026-03-19  2:00   ` Chen Ridong
@ 2026-03-20 21:00   ` Axel Rasmussen
  2026-03-22  8:14   ` Barry Song
  2 siblings, 0 replies; 44+ messages in thread
From: Axel Rasmussen @ 2026-03-20 21:00 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Same as active / inactive LRU, MGLRU isolates and scans folios in
> batches.  The batch split is hidden deep in the helper, which
> makes the code harder to follow.  The helper's arguments are also
> confusing since callers usually request more folios than the batch
> size, so the helper almost never processes the full requested amount.
>
> Move the batch splitting into the top loop to make it cleaner; there
> should be no behavior change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

Regarding Chen's concern: I see that patch 5 makes use of this refactor,
for example.

I don't have a super strong opinion on keeping this separate here vs.
squashing into patch 5. I slightly prefer keeping this
no-functional-change part separate; then patch 5 becomes very easy to
review.

> ---
>  mm/vmscan.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d7fc7f1fe06d..d48074f9bd87 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4689,10 +4689,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>         int scanned = 0;
>         int isolated = 0;
>         int skipped = 0;
> -       int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
> -       int remaining = scan_batch;
> +       unsigned long remaining = nr_to_scan;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>
> +       VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
>         VM_WARN_ON_ONCE(!list_empty(list));
>
>         if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
> @@ -4745,7 +4745,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>         mod_lruvec_state(lruvec, item, isolated);
>         mod_lruvec_state(lruvec, PGREFILL, sorted);
>         mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
> -       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
> +       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>                                 scanned, skipped, isolated,
>                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>         if (type == LRU_GEN_FILE)
> @@ -4827,7 +4827,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>                 *type_scanned = type;
>
> -               scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
> +               scanned = scan_folios(nr_to_scan, lruvec, sc,
> +                                     type, tier, list);
>                 if (scanned)
>                         return scanned;
>
> @@ -4999,7 +5000,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
>
>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -       long nr_to_scan;
> +       long nr_batch, nr_to_scan;
>         unsigned long scanned = 0;
>         int swappiness = get_swappiness(lruvec, sc);
>
> @@ -5010,7 +5011,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 if (nr_to_scan <= 0)
>                         break;
>
> -               delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
> +               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> +               delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>                 if (!delta)
>                         break;
>
> @@ -5615,6 +5617,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
>  static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
>                         int swappiness, unsigned long nr_to_reclaim)
>  {
> +       int nr_batch;
>         DEFINE_MAX_SEQ(lruvec);
>
>         if (seq + MIN_NR_GENS > max_seq)
> @@ -5631,8 +5634,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
>                 if (sc->nr_reclaimed >= nr_to_reclaim)
>                         return 0;
>
> -               if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
> -                                 swappiness))
> +               nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);
> +               if (!evict_folios(nr_batch, lruvec, sc, swappiness))
>                         return 0;
>
>                 cond_resched();
>
> --
> 2.53.0
>
>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling
  2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
@ 2026-03-20 21:18   ` Axel Rasmussen
  2026-03-22 16:22     ` Kairui Song
  2026-03-24  8:57   ` Chen Ridong
  2026-03-26  7:56   ` Baolin Wang
  2 siblings, 1 reply; 44+ messages in thread
From: Axel Rasmussen @ 2026-03-20 21:18 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The current handling of dirty writeback folios is not working well for
> file page heavy workloads: dirty folios are protected and moved to the
> next gen upon isolation, instead of getting throttled or reactivated
> upon pageout (shrink_folio_list).
>
> This might help to reduce the LRU lock contention slightly, but as a
> result, the ping-pong effect of folios between head and tail of last two
> gens is severe, as the shrinker will run into protected dirty writeback
> folios more frequently compared to activation. The dirty flush wakeup
> condition is also much more passive compared to active/inactive LRU.
> Active / inactive LRU wakes the flusher if one batch of folios passed to
> shrink_folio_list is unevictable due to under writeback, but MGLRU
> instead has to check this after the whole reclaim loop is done, and then
> count the isolation protection number compared to the total reclaim
> number.
>
> And we previously saw OOM problems with it, too, which were fixed but
> still not perfect [1].
>
> So instead, drop the special handling for dirty writeback and simply
> re-activate such folios like the active / inactive LRU does. Also move the
> dirty flush wakeup check right after shrink_folio_list. This should improve both
> throttling and performance.
>
> Test with YCSB workloadb showed a major performance improvement:
>
> Before this series:
> Throughput(ops/sec): 61642.78008938203
> AverageLatency(us):  507.11127774145166
> pgpgin 158190589
> pgpgout 5880616
> workingset_refault 7262988
>
> After this commit:
> Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
> AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
> pgpgin 101871227                        (-35.6%, lower is better)
> pgpgout 5770028
> workingset_refault 3418186              (-52.9%, lower is better)
>
> The refault rate is 50% lower, and throughput is 30% higher, which is a
> huge gain. We also observed significant performance gains for other
> real-world workloads.
>
> We were concerned that the dirty flush could cause more wear on SSDs:
> that should not be a problem here, since the wakeup condition is when
> the dirty folios have been pushed to the tail of LRU, which indicates
> that memory pressure is so high that writeback is blocking the workload
> already.

This looks reasonable to me overall. I unfortunately don't have a fast
way of reproducing the results under production workloads. At least
under basic functional testing, this seems to work as advertised.

Besides one small clean-up:

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> ---
>  mm/vmscan.c | 44 +++++++++++++-------------------------------
>  1 file changed, 13 insertions(+), 31 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b26959d90850..e11d0f1a8b68 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4577,7 +4577,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                        int tier_idx)
>  {
>         bool success;
> -       bool dirty, writeback;
>         int gen = folio_lru_gen(folio);
>         int type = folio_is_file_lru(folio);
>         int zone = folio_zonenum(folio);
> @@ -4627,21 +4626,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>                 return true;
>         }
>
> -       dirty = folio_test_dirty(folio);
> -       writeback = folio_test_writeback(folio);
> -       if (type == LRU_GEN_FILE && dirty) {
> -               sc->nr.file_taken += delta;
> -               if (!writeback)
> -                       sc->nr.unqueued_dirty += delta;

A grep says that after this commit, nobody is left *reading* from
`unqueued_dirty`, so can we remove that field and the couple of
remaining places that modify it?

I mean in `struct scan_control`; we do still use this field in `struct
reclaim_stat`.

> -       }
> -
> -       /* waiting for writeback */
> -       if (writeback || (type == LRU_GEN_FILE && dirty)) {
> -               gen = folio_inc_gen(lruvec, folio, true);
> -               list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> -               return true;
> -       }
> -
>         return false;
>  }
>
> @@ -4748,8 +4732,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>         trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>                                 scanned, skipped, isolated,
>                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -       if (type == LRU_GEN_FILE)
> -               sc->nr.file_taken += isolated;
>
>         *isolatedp = isolated;
>         return scanned;
> @@ -4814,11 +4796,11 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
>
>  static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>                           struct scan_control *sc, int swappiness,
> -                         int *type_scanned, struct list_head *list)
> +                         int *type_scanned,
> +                         struct list_head *list, int *isolated)
>  {
>         int i;
>         int scanned = 0;
> -       int isolated = 0;
>         int type = get_type_to_scan(lruvec, swappiness);
>
>         for_each_evictable_type(i, swappiness) {
> @@ -4827,8 +4809,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>                 *type_scanned = type;
>
>                 scanned += scan_folios(nr_to_scan, lruvec, sc,
> -                                     type, tier, list, &isolated);
> -               if (isolated)
> +                                     type, tier, list, isolated);
> +               if (*isolated)
>                         return scanned;
>
>                 type = !type;
> @@ -4843,6 +4825,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>         int type;
>         int scanned;
>         int reclaimed;
> +       int isolated = 0;
>         LIST_HEAD(list);
>         LIST_HEAD(clean);
>         struct folio *folio;
> @@ -4856,7 +4839,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>         lruvec_lock_irq(lruvec);
>
> -       scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
> +       scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list, &isolated);
>
>         try_to_inc_min_seq(lruvec, swappiness);
>
> @@ -4866,12 +4849,18 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>                 return scanned;
>  retry:
>         reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> -       sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>         sc->nr_reclaimed += reclaimed;
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                         scanned, reclaimed, &stat, sc->priority,
>                         type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>
> +       /*
> +        * If too many file cache in the coldest generation can't be evicted
> +        * due to being dirty, wake up the flusher.
> +        */
> +       if (stat.nr_unqueued_dirty == isolated)
> +               wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
>         list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>                 DEFINE_MIN_SEQ(lruvec);
>
> @@ -5023,13 +5012,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 cond_resched();
>         }
>
> -       /*
> -        * If too many file cache in the coldest generation can't be evicted
> -        * due to being dirty, wake up the flusher.
> -        */
> -       if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> -               wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
>         /* whether this lruvec should be rotated */
>         return need_rotate;
>  }
>
> --
> 2.53.0
>
>



* Re: [PATCH 8/8] mm/vmscan: remove sc->file_taken
  2026-03-17 19:09 ` [PATCH 8/8] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
@ 2026-03-20 21:19   ` Axel Rasmussen
  0 siblings, 0 replies; 44+ messages in thread
From: Axel Rasmussen @ 2026-03-20 21:19 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No one is using it now, just remove it.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

> ---
>  mm/vmscan.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e11d0f1a8b68..b95c9fc17edf 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -174,7 +174,6 @@ struct scan_control {
>                 unsigned int congested;
>                 unsigned int writeback;
>                 unsigned int immediate;
> -               unsigned int file_taken;
>                 unsigned int taken;
>         } nr;
>
> @@ -2041,8 +2040,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
>         sc->nr.writeback += stat.nr_writeback;
>         sc->nr.immediate += stat.nr_immediate;
>         sc->nr.taken += nr_taken;
> -       if (file)
> -               sc->nr.file_taken += nr_taken;
>
>         trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>                         nr_scanned, nr_reclaimed, &stat, sc->priority, file);
>
> --
> 2.53.0
>
>



* Re: [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers
  2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
  2026-03-19  2:00   ` Chen Ridong
  2026-03-20 21:00   ` Axel Rasmussen
@ 2026-03-22  8:14   ` Barry Song
  2026-03-24  6:05     ` Kairui Song
  2 siblings, 1 reply; 44+ messages in thread
From: Barry Song @ 2026-03-22  8:14 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, David Stevens, Chen Ridong,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Wed, Mar 18, 2026 at 3:11 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Same as active / inactive LRU, MGLRU isolates and scans folios in
> batches.  The batch split is hidden deep in the helper, which
> makes the code harder to follow.  The helper's arguments are also
> confusing since callers usually request more folios than the batch
> size, so the helper almost never processes the full requested amount.
>
> Move the batch splitting into the top loop to make it cleaner; there
> should be no behavior change.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d7fc7f1fe06d..d48074f9bd87 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4689,10 +4689,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>         int scanned = 0;
>         int isolated = 0;
>         int skipped = 0;
> -       int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
> -       int remaining = scan_batch;
> +       unsigned long remaining = nr_to_scan;
>         struct lru_gen_folio *lrugen = &lruvec->lrugen;
>
> +       VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
>         VM_WARN_ON_ONCE(!list_empty(list));
>
>         if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
> @@ -4745,7 +4745,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>         mod_lruvec_state(lruvec, item, isolated);
>         mod_lruvec_state(lruvec, PGREFILL, sorted);
>         mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
> -       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
> +       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>                                 scanned, skipped, isolated,
>                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>         if (type == LRU_GEN_FILE)
> @@ -4827,7 +4827,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>
>                 *type_scanned = type;
>
> -               scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
> +               scanned = scan_folios(nr_to_scan, lruvec, sc,
> +                                     type, tier, list);

Do we need to change this?

>                 if (scanned)
>                         return scanned;
>
> @@ -4999,7 +5000,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
>
>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -       long nr_to_scan;
> +       long nr_batch, nr_to_scan;
>         unsigned long scanned = 0;
>         int swappiness = get_swappiness(lruvec, sc);
>
> @@ -5010,7 +5011,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>                 if (nr_to_scan <= 0)
>                         break;
>
> -               delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
> +               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);

I wonder if we should modify get_nr_to_scan() to return
a maximum of MAX_LRU_BATCH?

> +               delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>                 if (!delta)
>                         break;
>
> @@ -5615,6 +5617,7 @@ static int run_aging(struct lruvec *lruvec, unsigned long seq,
>  static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_control *sc,
>                         int swappiness, unsigned long nr_to_reclaim)
>  {
> +       int nr_batch;
>         DEFINE_MAX_SEQ(lruvec);
>
>         if (seq + MIN_NR_GENS > max_seq)
> @@ -5631,8 +5634,8 @@ static int run_eviction(struct lruvec *lruvec, unsigned long seq, struct scan_co
>                 if (sc->nr_reclaimed >= nr_to_reclaim)
>                         return 0;
>
> -               if (!evict_folios(nr_to_reclaim - sc->nr_reclaimed, lruvec, sc,
> -                                 swappiness))
> +               nr_batch = min(nr_to_reclaim - sc->nr_reclaimed, MAX_LRU_BATCH);

Looks good to me.

> +               if (!evict_folios(nr_batch, lruvec, sc, swappiness))
>                         return 0;
>
>                 cond_resched();
>
> --
> 2.53.0

Thanks
Barry



* Re: [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size
  2026-03-20 19:51     ` Axel Rasmussen
@ 2026-03-22 16:10       ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-22 16:10 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: Chen Ridong, linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Sat, Mar 21, 2026 at 3:53 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> For what it's worth, I applied the full series and ran it through some
> basic functional testing, I didn't see any bugs or regressions from
> that.
>
> Unfortunately, the best signal would be actually deploying it under
> some real serving workloads, but the latency for me to do that + get
> results is like order(weeks) and I suspect you don't want to wait that
> long. :)
>
> This particular commit looks good besides one minor nitpick:
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>

Thanks! I usually run the series through a roughly two-day stress
test matrix, and similar code has been deployed to our production
fleet for months. I just didn't find enough time to push these upstream.
Glad to see we are making progress on improving MGLRU upstream.

> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 33287ba4a500..d7fc7f1fe06d 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -4078,27 +4078,33 @@ static void set_initial_priority(struct pglist_data *pgdat, struct scan_control
> > >       sc->priority = clamp(priority, DEF_PRIORITY / 2, DEF_PRIORITY);
> > >  }
> > >
> > > -static bool lruvec_is_sizable(struct lruvec *lruvec, struct scan_control *sc)
> > > +static long lruvec_evictable_size(struct lruvec *lruvec, int swappiness)
> > >  {
>
> Since `total` is unsigned long, should this function likewise return
> `unsigned long`? It seems ideal to avoid conversions unless there's a
> good reason to do so.

Ah, nice catch, that would avoid an implicit cast. I think it has no
effect, but I will update it in V2.



* Re: [PATCH 3/8] mm/mglru: restructure the reclaim loop
  2026-03-20 20:09   ` Axel Rasmussen
@ 2026-03-22 16:11     ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-22 16:11 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Sat, Mar 21, 2026 at 4:10 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> This looks like a reasonable refactor to me. To me the new code is
> more straightforward to reason about, and I don't see anything this
> breaks (either by inspection or with basic functional testing).
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
>
> On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current loop will calculate the scan number on each iteration. The
> > number of folios to scan is based on the LRU length, with some unclear
> > behaviors, e.g., it only shifts the scan number by reclaim priority at the
> > default priority, and it couples the number calculation with aging and
> > rotation.
> >
> > Adjust and simplify it, and decouple aging and rotation: calculate the
> > scan number once at the beginning of the reclaim, always respect the
> > reclaim priority, and make the aging and rotation more explicit.
> >
> > This slightly changes how offline memcg aging works: previously, offline
> > memcg wouldn't be aged unless it didn't have any evictable folios. Now,
> > we might age it if it has only 3 generations and the reclaim priority is
> > less than DEF_PRIORITY, which should be fine. On one hand, offline memcg
> > might still hold long-term folios, and in fact, a long-existing offline
> > memcg must be pinned by some long-term folios like shmem. These folios
> > might be used by other memcg, so aging them as ordinary memcg doesn't
> > seem wrong. And besides, aging enables further reclaim of an offlined
> > memcg, which will certainly happen if we keep shrinking it. And offline
> > memcg might soon no longer be an issue once reparenting is fully ready.
> >
> > Overall, the memcg LRU rotation, as described in mmzone.h,
> > remains the same.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/vmscan.c | 74 ++++++++++++++++++++++++++++++-------------------------------
> >  1 file changed, 36 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d48074f9bd87..ed5b5f8dd3c7 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4926,49 +4926,35 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >  }
> >
> >  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> > -                            int swappiness, unsigned long *nr_to_scan)
> > +                            struct scan_control *sc, int swappiness)
> >  {
> >         DEFINE_MIN_SEQ(lruvec);
> >
> > -       *nr_to_scan = 0;
> >         /* have to run aging, since eviction is not possible anymore */
> >         if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
> >                 return true;
> >
> > -       *nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> > +       /* try to get away with not aging at the default priority */
> > +       if (sc->priority == DEF_PRIORITY)
> > +               return false;
> > +
> >         /* better to run aging even though eviction is still possible */
> >         return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
> >  }
> >
> > -/*
> > - * For future optimizations:
> > - * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> > - *    reclaim.
> > - */
> > -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> > +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> > +                          struct mem_cgroup *memcg, int swappiness)
> >  {
> > -       bool need_aging;
> >         unsigned long nr_to_scan;
> > -       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> > -       DEFINE_MAX_SEQ(lruvec);
> > -
> > -       if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> > -               return -1;
> > -
> > -       need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
> >
> > +       nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> >         /* try to scrape all its memory if this memcg was deleted */
> > -       if (nr_to_scan && !mem_cgroup_online(memcg))
> > +       if (!mem_cgroup_online(memcg))
> >                 return nr_to_scan;
> >
> >         nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> > -
> > -       /* try to get away with not aging at the default priority */
> > -       if (!need_aging || sc->priority == DEF_PRIORITY)
> > -               return nr_to_scan >> sc->priority;
> > -
> > -       /* stop scanning this lruvec as it's low on cold folios */
> > -       return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
> > +       /* always respect scan priority */
> > +       return nr_to_scan >> sc->priority;
> >  }
> >
> >  static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> > @@ -4998,31 +4984,43 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> >         return true;
> >  }
> >
> > +/*
> > + * For future optimizations:
> > + * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> > + *    reclaim.
> > + */
> >  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >  {
> > +       bool need_rotate = false;
> >         long nr_batch, nr_to_scan;
> > -       unsigned long scanned = 0;
> >         int swappiness = get_swappiness(lruvec, sc);
> > +       struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >
> > -       while (true) {
> > +       nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
> > +       while (nr_to_scan > 0) {
> >                 int delta;
> > +               DEFINE_MAX_SEQ(lruvec);
> >
> > -               nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
> > -               if (nr_to_scan <= 0)
> > +               if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
> > +                       need_rotate = true;
> >                         break;
> > +               }
> > +
> > +               if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
> > +                       if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> > +                               need_rotate = true;
> > +                       break;
> > +               }
> >
> >                 nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> >                 delta = evict_folios(nr_batch, lruvec, sc, swappiness);
> >                 if (!delta)
> >                         break;
> >
> > -               scanned += delta;
> > -               if (scanned >= nr_to_scan)
> > -                       break;
> > -
> >                 if (should_abort_scan(lruvec, sc))
> >                         break;
> >
> > +               nr_to_scan -= delta;
> >                 cond_resched();
> >         }
> >
> > @@ -5034,12 +5032,12 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                 wakeup_flusher_threads(WB_REASON_VMSCAN);
> >
> >         /* whether this lruvec should be rotated */
>
> It's a nitpick, but with the variable rename, this comment isn't doing
> is much good now. :)

Thanks for suggesting, this can be simplified.



* Re: [PATCH 4/8] mm/mglru: scan and count the exact number of folios
  2026-03-20 20:57   ` Axel Rasmussen
@ 2026-03-22 16:20     ` Kairui Song
  2026-03-24  7:22       ` Chen Ridong
  0 siblings, 1 reply; 44+ messages in thread
From: Kairui Song @ 2026-03-22 16:20 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Sat, Mar 21, 2026 at 4:59 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Make the scan helpers return the exact number of folios being scanned
> > or isolated. This should make the scan more accurate and easier to
> > follow.
> >
> > Now there is no more need for special handling when there is no
> > progress made. The old livelock prevention `(return isolated ||
> > !remaining ? scanned : 0)` is replaced by the natural scan budget
> > exhaustion in try_to_shrink_lruvec, and sort_folio moves ineligible
> > folios to newer generations.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/vmscan.c | 27 +++++++++++----------------
> >  1 file changed, 11 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index ed5b5f8dd3c7..4f4548ff3a17 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4680,7 +4680,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
> >
> >  static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >                        struct scan_control *sc, int type, int tier,
> > -                      struct list_head *list)
> > +                      struct list_head *list, int *isolatedp)
> >  {
> >         int i;
> >         int gen;
> > @@ -4750,11 +4750,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >         if (type == LRU_GEN_FILE)
> >                 sc->nr.file_taken += isolated;
> > -       /*
> > -        * There might not be eligible folios due to reclaim_idx. Check the
> > -        * remaining to prevent livelock if it's not making progress.
> > -        */
> > -       return isolated || !remaining ? scanned : 0;
> > +
> > +       *isolatedp = isolated;
> > +       return scanned;
> >  }
> >
> >  static int get_tier_idx(struct lruvec *lruvec, int type)
> > @@ -4819,23 +4817,24 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >                           int *type_scanned, struct list_head *list)
> >  {
> >         int i;
> > +       int scanned = 0;
> > +       int isolated = 0;
> >         int type = get_type_to_scan(lruvec, swappiness);
> >
> >         for_each_evictable_type(i, swappiness) {
> > -               int scanned;
> >                 int tier = get_tier_idx(lruvec, type);
> >
> >                 *type_scanned = type;
>
> I think this is problematic: now `isolate_folios` can scan a nonzero
> amount of more than one type of memory. Then the caller (`evict_folios`) calls
> `trace_mm_vmscan_lru_shrink_inactive` with the total scanned amount,
> with only the last type we scanned (misattributing part of the scan,
> potentially). Not a "functional" issue, but it could mean confusing
> data for anyone watching the tracepoint.

Thanks! Nice catch. I'll introduce another variable for the tracepoint;
then it should be fine.

>
>
> >
> > -               scanned = scan_folios(nr_to_scan, lruvec, sc,
> > -                                     type, tier, list);
> > -               if (scanned)
> > +               scanned += scan_folios(nr_to_scan, lruvec, sc,
> > +                                     type, tier, list, &isolated);
> > +               if (isolated)
> >                         return scanned;
> >
> >                 type = !type;
> >         }
> >
> > -       return 0;
> > +       return scanned;
> >  }
> >
> >  static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> > @@ -4852,7 +4851,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >         struct reclaim_stat stat;
> >         struct lru_gen_mm_walk *walk;
> >         bool skip_retry = false;
> > -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >
> > @@ -4860,10 +4858,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >
> >         scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
> >
> > -       scanned += try_to_inc_min_seq(lruvec, swappiness);
> > -
> > -       if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
> > -               scanned = 0;
> > +       try_to_inc_min_seq(lruvec, swappiness);
>
> IIUC, this change is what introduces the issue patch 6 is trying to
> resolve. Is it worth squashing patch 6 in to this one, so we don't
> have this non-ideal intermediate state?

Well, it's not: patch 6 fixes an existing problem; see the cover
letter about the OOM issue.

This part of the change just cleans up the loop code. It looks really
strange to me that increasing min_seq is counted as scanning one
folio. Aborting the scan when there are only 2 gens kind of makes
sense, but this doesn't seem like the right place for it. These odd
livelock-avoidance bits can be dropped since we now have an exact
count of folios being scanned. I'll add more words in the commit
message.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling
  2026-03-20 21:18   ` Axel Rasmussen
@ 2026-03-22 16:22     ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-22 16:22 UTC (permalink / raw)
  To: Axel Rasmussen
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Chen Ridong, Leno Hou,
	Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Sat, Mar 21, 2026 at 5:24 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current handling of dirty writeback folios is not working well for
> > file-page-heavy workloads: dirty folios are protected and moved to the
> > next gen upon isolation, then get throttled or reactivated upon pageout
> > (shrink_folio_list).
> >
> > This might help to reduce the LRU lock contention slightly, but as a
> > result, the ping-pong effect of folios between head and tail of last two
> > gens is serious as the shrinker will run into protected dirty writeback
> > folios more frequently compared to activation. The dirty flush wakeup
> > condition is also much more passive compared to active/inactive LRU.
> > Active / inactive LRU wakes the flusher if one batch of folios passed to
> > shrink_folio_list is unevictable due to being under writeback, but MGLRU
> > instead has to check this after the whole reclaim loop is done, and then
> > compare the isolation-protection count against the total reclaim
> > number.
> >
> > And we previously saw OOM problems with it, too, which were fixed but
> > still not perfect [1].
> >
> > So instead, just drop the special handling for dirty writeback and
> > re-activate such folios like the active / inactive LRU does. Also move
> > the dirty flush wake-up check right after shrink_folio_list. This should
> > improve both throttling and performance.
> >
> > Test with YCSB workloadb showed a major performance improvement:
> >
> > Before this series:
> > Throughput(ops/sec): 61642.78008938203
> > AverageLatency(us):  507.11127774145166
> > pgpgin 158190589
> > pgpgout 5880616
> > workingset_refault 7262988
> >
> > After this commit:
> > Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
> > AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
> > pgpgin 101871227                        (-35.6%, lower is better)
> > pgpgout 5770028
> > workingset_refault 3418186              (-52.9%, lower is better)
> >
> > The refault rate is 50% lower, and throughput is 30% higher, which is a
> > huge gain. We also observed significant performance gain for other
> > real-world workloads.
> >
> > We were concerned that the dirty flush could cause more wear for SSD:
> > that should not be the problem here, since the wakeup condition is when
> > the dirty folios have been pushed to the tail of LRU, which indicates
> > that memory pressure is so high that writeback is blocking the workload
> > already.
>
> This looks reasonable to me overall. I unfortunately don't have a fast
> way of reproducing the results under production workloads. At least
> under basic functional testing, this seems to work as advertised.
>
> Besides one small clean-up:
>
> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
>
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> > ---
> >  mm/vmscan.c | 44 +++++++++++++-------------------------------
> >  1 file changed, 13 insertions(+), 31 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index b26959d90850..e11d0f1a8b68 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4577,7 +4577,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                        int tier_idx)
> >  {
> >         bool success;
> > -       bool dirty, writeback;
> >         int gen = folio_lru_gen(folio);
> >         int type = folio_is_file_lru(folio);
> >         int zone = folio_zonenum(folio);
> > @@ -4627,21 +4626,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
> >                 return true;
> >         }
> >
> > -       dirty = folio_test_dirty(folio);
> > -       writeback = folio_test_writeback(folio);
> > -       if (type == LRU_GEN_FILE && dirty) {
> > -               sc->nr.file_taken += delta;
> > -               if (!writeback)
> > -                       sc->nr.unqueued_dirty += delta;
>
> A grep says that after this commit, nobody is left *reading* from
> `unqueued_dirty`, so can we remove that field and the couple of
> remaining places that modify it?
>
> In `struct scan_control` I mean, we do still use this field in `struct
> reclaim_stat`.
>

Thanks for the review! Yeah, I cleaned up one of the unused variables
in the next patch, but there is still one left :)


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers
  2026-03-22  8:14   ` Barry Song
@ 2026-03-24  6:05     ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-24  6:05 UTC (permalink / raw)
  To: Barry Song
  Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Shakeel Butt, Lorenzo Stoakes, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel

On Sun, Mar 22, 2026 at 04:14:31PM +0800, Barry Song wrote:
> On Wed, Mar 18, 2026 at 3:11 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Same as active / inactive LRU, MGLRU isolates and scans folios in
> > batches.  The batch split is done hidden deep in the helper, which
> > makes the code harder to follow.  The helper's arguments are also
> > confusing since callers usually request more folios than the batch
> > size, so the helper almost never processes the full requested amount.
> >
> > Move the batch splitting into the top loop to make it cleaner, there
> > should be no behavior change.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/vmscan.c | 19 +++++++++++--------
> >  1 file changed, 11 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d7fc7f1fe06d..d48074f9bd87 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4689,10 +4689,10 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >         int scanned = 0;
> >         int isolated = 0;
> >         int skipped = 0;
> > -       int scan_batch = min(nr_to_scan, MAX_LRU_BATCH);
> > -       int remaining = scan_batch;
> > +       unsigned long remaining = nr_to_scan;
> >         struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >
> > +       VM_WARN_ON_ONCE(nr_to_scan > MAX_LRU_BATCH);
> >         VM_WARN_ON_ONCE(!list_empty(list));
> >
> >         if (get_nr_gens(lruvec, type) == MIN_NR_GENS)
> > @@ -4745,7 +4745,7 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >         mod_lruvec_state(lruvec, item, isolated);
> >         mod_lruvec_state(lruvec, PGREFILL, sorted);
> >         mod_lruvec_state(lruvec, PGSCAN_ANON + type, isolated);
> > -       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, scan_batch,
> > +       trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
> >                                 scanned, skipped, isolated,
> >                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> >         if (type == LRU_GEN_FILE)
> > @@ -4827,7 +4827,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >
> >                 *type_scanned = type;
> >
> > -               scanned = scan_folios(nr_to_scan, lruvec, sc, type, tier, list);
> > +               scanned = scan_folios(nr_to_scan, lruvec, sc,
> > +                                     type, tier, list);
> 
> Do we need to change this?

That's an irrelevant line-wrap change; I'll drop it, thanks!
> 
> >                 if (scanned)
> >                         return scanned;
> >
> > @@ -4999,7 +5000,7 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> >
> >  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >  {
> > -       long nr_to_scan;
> > +       long nr_batch, nr_to_scan;
> >         unsigned long scanned = 0;
> >         int swappiness = get_swappiness(lruvec, sc);
> >
> > @@ -5010,7 +5011,8 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
> >                 if (nr_to_scan <= 0)
> >                         break;
> >
> > -               delta = evict_folios(nr_to_scan, lruvec, sc, swappiness);
> > +               nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> 
> I wonder if we should modify get_nr_to_scan() to return
> a maximum of MAX_LRU_BATCH?

We'll change that in a later commit to let each iteration use a smaller batch.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/8] mm/mglru: restructure the reclaim loop
  2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
  2026-03-20 20:09   ` Axel Rasmussen
@ 2026-03-24  6:41   ` Chen Ridong
  2026-03-26  7:31   ` Baolin Wang
  2 siblings, 0 replies; 44+ messages in thread
From: Chen Ridong @ 2026-03-24  6:41 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel



On 2026/3/18 3:08, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current loop will calculate the scan number on each iteration. The
> number of folios to scan is based on the LRU length, with some unclear
> behaviors, eg, it only shifts the scan number by reclaim priority at the
> default priority, and it couples the number calculation with aging and
> rotation.
> 
> Adjust, simplify it, and decouple aging and rotation. Just calculate the
> scan number for once at the beginning of the reclaim, always respect the
> reclaim priority, and make the aging and rotation more explicit.
> 
> This slightly changes how offline memcg aging works: previously, offline
> memcg wouldn't be aged unless it didn't have any evictable folios. Now,
> we might age it if it has only 3 generations and the reclaim priority is
> less than DEF_PRIORITY, which should be fine. On one hand, offline memcg
> might still hold long-term folios, and in fact, a long-existing offline
> memcg must be pinned by some long-term folios like shmem. These folios
> might be used by other memcg, so aging them as ordinary memcg doesn't
> seem wrong. And besides, aging enables further reclaim of an offlined
> memcg, which will certainly happen if we keep shrinking it. And offline
> memcg might soon be no longer an issue once reparenting is all ready.
> 
> Overall, the memcg LRU rotation, as described in mmzone.h,
> remains the same.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 74 ++++++++++++++++++++++++++++++-------------------------------
>  1 file changed, 36 insertions(+), 38 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d48074f9bd87..ed5b5f8dd3c7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4926,49 +4926,35 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  }
>  
>  static bool should_run_aging(struct lruvec *lruvec, unsigned long max_seq,
> -			     int swappiness, unsigned long *nr_to_scan)
> +			     struct scan_control *sc, int swappiness)
>  {
>  	DEFINE_MIN_SEQ(lruvec);
>  
> -	*nr_to_scan = 0;
>  	/* have to run aging, since eviction is not possible anymore */
>  	if (evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS > max_seq)
>  		return true;
>  
> -	*nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
> +	/* try to get away with not aging at the default priority */
> +	if (sc->priority == DEF_PRIORITY)
> +		return false;
> +
>  	/* better to run aging even though eviction is still possible */
>  	return evictable_min_seq(min_seq, swappiness) + MIN_NR_GENS == max_seq;
>  }
>  
> -/*
> - * For future optimizations:
> - * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> - *    reclaim.
> - */
> -static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc, int swappiness)
> +static long get_nr_to_scan(struct lruvec *lruvec, struct scan_control *sc,
> +			   struct mem_cgroup *memcg, int swappiness)
>  {
> -	bool need_aging;
>  	unsigned long nr_to_scan;
> -	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> -	DEFINE_MAX_SEQ(lruvec);
> -
> -	if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg))
> -		return -1;
> -
> -	need_aging = should_run_aging(lruvec, max_seq, swappiness, &nr_to_scan);
>  
> +	nr_to_scan = lruvec_evictable_size(lruvec, swappiness);
>  	/* try to scrape all its memory if this memcg was deleted */
> -	if (nr_to_scan && !mem_cgroup_online(memcg))
> +	if (!mem_cgroup_online(memcg))
>  		return nr_to_scan;
>  
>  	nr_to_scan = apply_proportional_protection(memcg, sc, nr_to_scan);
> -
> -	/* try to get away with not aging at the default priority */
> -	if (!need_aging || sc->priority == DEF_PRIORITY)
> -		return nr_to_scan >> sc->priority;
> -
> -	/* stop scanning this lruvec as it's low on cold folios */
> -	return try_to_inc_max_seq(lruvec, max_seq, swappiness, false) ? -1 : 0;
> +	/* always respect scan priority */
> +	return nr_to_scan >> sc->priority;
>  }
>  
>  static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
> @@ -4998,31 +4984,43 @@ static bool should_abort_scan(struct lruvec *lruvec, struct scan_control *sc)
>  	return true;
>  }
>  
> +/*
> + * For future optimizations:
> + * 1. Defer try_to_inc_max_seq() to workqueues to reduce latency for memcg
> + *    reclaim.
> + */
>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> +	bool need_rotate = false;
>  	long nr_batch, nr_to_scan;
> -	unsigned long scanned = 0;
>  	int swappiness = get_swappiness(lruvec, sc);
> +	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>  
> -	while (true) {
> +	nr_to_scan = get_nr_to_scan(lruvec, sc, memcg, swappiness);
> +	while (nr_to_scan > 0) {
>  		int delta;
> +		DEFINE_MAX_SEQ(lruvec);
>  
> -		nr_to_scan = get_nr_to_scan(lruvec, sc, swappiness);
> -		if (nr_to_scan <= 0)
> +		if (mem_cgroup_below_min(sc->target_mem_cgroup, memcg)) {
> +			need_rotate = true;
>  			break;
> +		}
> +
> +		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
> +			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> +				need_rotate = true;
> +			break;
> +		}
>  
>  		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
>  		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>  		if (!delta)
>  			break;
>  
> -		scanned += delta;
> -		if (scanned >= nr_to_scan)
> -			break;
> -
>  		if (should_abort_scan(lruvec, sc))
>  			break;
>  
> +		nr_to_scan -= delta;
>  		cond_resched();
>  	}
>  
> @@ -5034,12 +5032,12 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  		wakeup_flusher_threads(WB_REASON_VMSCAN);
>  
>  	/* whether this lruvec should be rotated */
> -	return nr_to_scan < 0;
> +	return need_rotate;
>  }
>  
>  static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>  {
> -	bool success;
> +	bool need_rotate;
>  	unsigned long scanned = sc->nr_scanned;
>  	unsigned long reclaimed = sc->nr_reclaimed;
>  	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> @@ -5057,7 +5055,7 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>  		memcg_memory_event(memcg, MEMCG_LOW);
>  	}
>  
> -	success = try_to_shrink_lruvec(lruvec, sc);
> +	need_rotate = try_to_shrink_lruvec(lruvec, sc);
>  
>  	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
>  
> @@ -5067,10 +5065,10 @@ static int shrink_one(struct lruvec *lruvec, struct scan_control *sc)
>  
>  	flush_reclaim_state(sc);
>  
> -	if (success && mem_cgroup_online(memcg))
> +	if (need_rotate && mem_cgroup_online(memcg))
>  		return MEMCG_LRU_YOUNG;
>  
> -	if (!success && lruvec_is_sizable(lruvec, sc))
> +	if (!need_rotate && lruvec_is_sizable(lruvec, sc))
>  		return 0;
>  
>  	/* one retry if offlined or too small */
> 

Maybe this renaming could be combined with the renaming in patch 1/7
into a separate patch, which would be much clearer. Other than that,
the patch looks good to me.

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/8] mm/mglru: scan and count the exact number of folios
  2026-03-22 16:20     ` Kairui Song
@ 2026-03-24  7:22       ` Chen Ridong
  2026-03-24  8:05         ` Kairui Song
  0 siblings, 1 reply; 44+ messages in thread
From: Chen Ridong @ 2026-03-24  7:22 UTC (permalink / raw)
  To: Kairui Song, Axel Rasmussen
  Cc: linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu, Johannes Weiner,
	David Hildenbrand, Michal Hocko, Qi Zheng, Shakeel Butt,
	Lorenzo Stoakes, Barry Song, David Stevens, Leno Hou, Yafang Shao,
	Yu Zhao, Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
	Vernon Yang, linux-kernel



On 2026/3/23 0:20, Kairui Song wrote:
> On Sat, Mar 21, 2026 at 4:59 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>>
>> On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
>> <devnull+kasong.tencent.com@kernel.org> wrote:
>>>
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Make the scan helpers return the exact number of folios being scanned
>>> or isolated. This should make the scan more accurate and easier to
>>> follow.
>>>
>>> Now there is no more need for special handling when there is no
>>> progress made. The old livelock prevention `(return isolated ||
>>> !remaining ? scanned : 0)` is replaced by the natural scan budget
>>> exhaustion in try_to_shrink_lruvec, and sort_folio moves ineligible
>>> folios to newer generations.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>>  mm/vmscan.c | 27 +++++++++++----------------
>>>  1 file changed, 11 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index ed5b5f8dd3c7..4f4548ff3a17 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -4680,7 +4680,7 @@ static bool isolate_folio(struct lruvec *lruvec, struct folio *folio, struct sca
>>>
>>>  static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>                        struct scan_control *sc, int type, int tier,
>>> -                      struct list_head *list)
>>> +                      struct list_head *list, int *isolatedp)
>>>  {
>>>         int i;
>>>         int gen;
>>> @@ -4750,11 +4750,9 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>                                 type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>>>         if (type == LRU_GEN_FILE)
>>>                 sc->nr.file_taken += isolated;
>>> -       /*
>>> -        * There might not be eligible folios due to reclaim_idx. Check the
>>> -        * remaining to prevent livelock if it's not making progress.
>>> -        */
>>> -       return isolated || !remaining ? scanned : 0;
>>> +
>>> +       *isolatedp = isolated;
>>> +       return scanned;
>>>  }
>>>
>>>  static int get_tier_idx(struct lruvec *lruvec, int type)
>>> @@ -4819,23 +4817,24 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>                           int *type_scanned, struct list_head *list)
>>>  {
>>>         int i;
>>> +       int scanned = 0;
>>> +       int isolated = 0;
>>>         int type = get_type_to_scan(lruvec, swappiness);
>>>
>>>         for_each_evictable_type(i, swappiness) {
>>> -               int scanned;
>>>                 int tier = get_tier_idx(lruvec, type);
>>>
>>>                 *type_scanned = type;
>>
>> I think this is problematic, now `isolate_folios` can scan a nonzero
>> amount of > 1 type of memory. Then the caller (`evict_folios`) calls
>> `trace_mm_vmscan_lru_shrink_inactive` with the total scanned amount,
>> with only the last type we scanned (misattributing part of the scan,
>> potentially). Not a "functional" issue, but it could mean confusing
>> data for anyone watching the tracepoint.
> 
> Thanks! Nice catch, I'll introduce another variable for the tracepoint
> then it should be fine.
> 
>>
>>
>>>
>>> -               scanned = scan_folios(nr_to_scan, lruvec, sc,
>>> -                                     type, tier, list);
>>> -               if (scanned)
>>> +               scanned += scan_folios(nr_to_scan, lruvec, sc,
>>> +                                     type, tier, list, &isolated);
>>> +               if (isolated)
>>>                         return scanned;
>>>
>>>                 type = !type;
>>>         }
>>>
>>> -       return 0;
>>> +       return scanned;
>>>  }
>>>
>>>  static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>> @@ -4852,7 +4851,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>         struct reclaim_stat stat;
>>>         struct lru_gen_mm_walk *walk;
>>>         bool skip_retry = false;
>>> -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
>>>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>
>>> @@ -4860,10 +4858,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>
>>>         scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
>>>
>>> -       scanned += try_to_inc_min_seq(lruvec, swappiness);
>>> -
>>> -       if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
>>> -               scanned = 0;
>>> +       try_to_inc_min_seq(lruvec, swappiness);
>>
>> IIUC, this change is what introduces the issue patch 6 is trying to
>> resolve. Is it worth squashing patch 6 in to this one, so we don't
>> have this non-ideal intermediate state?
> 
> Well it's not, patch 6 is fixing an existing problem, see the cover
> letter about the OOM issue.
> 
> This part of changing is just cleanup the loop code. It looks really
> strange to me that increasing min_seq is considered as scanning one
> folio. Aborting the scan if there is only 2 gen kind of make sense but
> this doesn't seems the right place. These strange parts to avoid
> livelock can be dropped since we have an exact count of folios being
> scanned now. I'll add more words in the commit message.

This change confused me too.

IIUC, this change looks conceptually tied to patch 3. The following
change means that evict_folios should not be invoked if aging is
needed, so the check can be dropped there, right?


```
 static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 {
...
+		if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
+			if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
+				need_rotate = true;
+			break;
+		}
```

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/8] mm/mglru: use a smaller batch for reclaim
  2026-03-17 19:09 ` [PATCH 5/8] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
  2026-03-20 20:58   ` Axel Rasmussen
@ 2026-03-24  7:51   ` Chen Ridong
  1 sibling, 0 replies; 44+ messages in thread
From: Chen Ridong @ 2026-03-24  7:51 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel



On 2026/3/18 3:09, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> With a fixed number to reclaim calculated at the beginning, making each
> following step smaller should reduce the lock contention and avoid
> over-aggressive reclaim of folios, as it will abort earlier when the
> number of folios to be reclaimed is reached.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f4548ff3a17..2ff1609ff4de 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -5007,7 +5007,7 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  			break;
>  		}
>  
> -		nr_batch = min(nr_to_scan, MAX_LRU_BATCH);
> +		nr_batch = min(nr_to_scan, MIN_LRU_BATCH);
>  		delta = evict_folios(nr_batch, lruvec, sc, swappiness);
>  		if (!delta)
>  			break;
> 

LGTM.

Reviewed-by: Chen Ridong <chenridong@huaweicloud.com>

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/8] mm/mglru: scan and count the exact number of folios
  2026-03-24  7:22       ` Chen Ridong
@ 2026-03-24  8:05         ` Kairui Song
  2026-03-24  9:10           ` Chen Ridong
  0 siblings, 1 reply; 44+ messages in thread
From: Kairui Song @ 2026-03-24  8:05 UTC (permalink / raw)
  To: Chen Ridong
  Cc: Axel Rasmussen, linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 24, 2026 at 3:22 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> On 2026/3/23 0:20, Kairui Song wrote:
> > On Sat, Mar 21, 2026 at 4:59 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >>
> >> On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
> >> <devnull+kasong.tencent.com@kernel.org> wrote:
> >>>
> >>> From: Kairui Song <kasong@tencent.com>
> >>>
> >>> Make the scan helpers return the exact number of folios being scanned
> >>> or isolated. This should make the scan more accurate and easier to
> >>> follow.
> >>>
> >>> Now there is no more need for special handling when there is no
> >>> progress made. The old livelock prevention `(return isolated ||
> >>> !remaining ? scanned : 0)` is replaced by the natural scan budget
> >>> exhaustion in try_to_shrink_lruvec, and sort_folio moves ineligible
> >>> folios to newer generations.
> >>>
> >>> Signed-off-by: Kairui Song <kasong@tencent.com>

...

> >>>  static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >>> @@ -4852,7 +4851,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >>>         struct reclaim_stat stat;
> >>>         struct lru_gen_mm_walk *walk;
> >>>         bool skip_retry = false;
> >>> -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
> >>>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
> >>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
> >>>
> >>> @@ -4860,10 +4858,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
> >>>
> >>>         scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
> >>>
> >>> -       scanned += try_to_inc_min_seq(lruvec, swappiness);
> >>> -
> >>> -       if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
> >>> -               scanned = 0;
> >>> +       try_to_inc_min_seq(lruvec, swappiness);
> >>
> >> IIUC, this change is what introduces the issue patch 6 is trying to
> >> resolve. Is it worth squashing patch 6 in to this one, so we don't
> >> have this non-ideal intermediate state?
> >
> > Well it's not, patch 6 is fixing an existing problem, see the cover
> > letter about the OOM issue.
> >
> > This part of changing is just cleanup the loop code. It looks really
> > strange to me that increasing min_seq is considered as scanning one
> > folio. Aborting the scan if there is only 2 gen kind of make sense but
> > this doesn't seems the right place. These strange parts to avoid
> > livelock can be dropped since we have an exact count of folios being
> > scanned now. I'll add more words in the commit message.
>
> This change confused me too.
>
> IIUC, this change looks conceptually tied to patch 3. The following change means
> that evict_folios should not be invoked if aging is needed. So the judge can be
> dropped there, right?
>
>
> ```
>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  {
> ...
> +               if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
> +                       if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
> +                               need_rotate = true;
> +                       break;
> +               }
> ```
>

Hi Ridong,

Ahh yes, as you pointed out, the explicit should_run_aging check kind
of guards evict_folios. But that's not everything: previously,
isolate_folios could return 0 if no folio was isolated. Now it always
returns the number of folios scanned, unless there are only two gens
left and the forced protection is hit, which also means the check here
can be dropped.

But not invoking evict_folios when aging is needed is existing
behavior; patch 3 didn't change it, just made it cleaner and easier to
follow.

Now the exact folio scan count combines well with the scan budget
introduced in the previous commit.

And I just noticed it might be even better to move try_to_inc_min_seq
before isolate_folios, to avoid an empty gen blocking isolate_folios.
Usually this won't be an issue, since calling try_to_inc_min_seq after
isolate_folios also ensures reclaim won't generate any problematic
empty gen, but folio removal through things like freeing could
introduce one or two empty gens.

The forced gen protection may cause other problems, but that's
unrelated to this commit and should be improved later.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling
  2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
  2026-03-20 21:18   ` Axel Rasmussen
@ 2026-03-24  8:57   ` Chen Ridong
  2026-03-24 11:09     ` Kairui Song
  2026-03-26  7:56   ` Baolin Wang
  2 siblings, 1 reply; 44+ messages in thread
From: Chen Ridong @ 2026-03-24  8:57 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel



On 2026/3/18 3:09, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current handling of dirty writeback folios is not working well for
> file-page-heavy workloads: dirty folios are protected and moved to the
> next gen upon isolation, then get throttled or reactivated upon pageout
> (shrink_folio_list).
> 
> This might help to reduce the LRU lock contention slightly, but as a
> result, the ping-pong effect of folios between head and tail of last two
> gens is serious as the shrinker will run into protected dirty writeback
> folios more frequently compared to activation. The dirty flush wakeup
> condition is also much more passive compared to active/inactive LRU.
> Active / inactve LRU wakes the flusher if one batch of folios passed to
> shrink_folio_list is unevictable due to under writeback, but MGLRU
> instead has to check this after the whole reclaim loop is done, and then
> count the isolation protection number compared to the total reclaim
> number.
> 
> And we previously saw OOM problems with it, too, which were fixed but
> still not perfect [1].
> 
> So instead, just drop the special handling for dirty writeback, just
> re-activate it like active / inactive LRU. And also move the dirty flush
> wake up check right after shrink_folio_list. This should improve both
> throttling and performance.
> 
> Test with YCSB workloadb showed a major performance improvement:
> 
> Before this series:
> Throughput(ops/sec): 61642.78008938203
> AverageLatency(us):  507.11127774145166
> pgpgin 158190589
> pgpgout 5880616
> workingset_refault 7262988
> 
> After this commit:
> Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
> AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
> pgpgin 101871227                        (-35.6%, lower is better)
> pgpgout 5770028
> workingset_refault 3418186              (-52.9%, lower is better)
> 
> The refault rate is 50% lower, and throughput is 30% higher, which is a
> huge gain. We also observed significant performance gain for other
> real-world workloads.
> 
> We were concerned that the dirty flush could cause more wear for SSD:
> that should not be the problem here, since the wakeup condition is when
> the dirty folios have been pushed to the tail of LRU, which indicates
> that memory pressure is so high that writeback is blocking the workload
> already.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> ---
>  mm/vmscan.c | 44 +++++++++++++-------------------------------
>  1 file changed, 13 insertions(+), 31 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b26959d90850..e11d0f1a8b68 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4577,7 +4577,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>  		       int tier_idx)
>  {
>  	bool success;
> -	bool dirty, writeback;
>  	int gen = folio_lru_gen(folio);
>  	int type = folio_is_file_lru(folio);
>  	int zone = folio_zonenum(folio);
> @@ -4627,21 +4626,6 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
>  		return true;
>  	}
>  
> -	dirty = folio_test_dirty(folio);
> -	writeback = folio_test_writeback(folio);
> -	if (type == LRU_GEN_FILE && dirty) {
> -		sc->nr.file_taken += delta;
> -		if (!writeback)
> -			sc->nr.unqueued_dirty += delta;
> -	}
> -
> -	/* waiting for writeback */
> -	if (writeback || (type == LRU_GEN_FILE && dirty)) {
> -		gen = folio_inc_gen(lruvec, folio, true);
> -		list_move(&folio->lru, &lrugen->folios[gen][type][zone]);
> -		return true;
> -	}
> -
>  	return false;
>  }
>  
> @@ -4748,8 +4732,6 @@ static int scan_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan,
>  				scanned, skipped, isolated,
>  				type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
> -	if (type == LRU_GEN_FILE)
> -		sc->nr.file_taken += isolated;
>  
>  	*isolatedp = isolated;
>  	return scanned;
> @@ -4814,11 +4796,11 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness)
>  
>  static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  			  struct scan_control *sc, int swappiness,
> -			  int *type_scanned, struct list_head *list)
> +			  int *type_scanned,
> +			  struct list_head *list, int *isolated)
>  {
>  	int i;
>  	int scanned = 0;
> -	int isolated = 0;
>  	int type = get_type_to_scan(lruvec, swappiness);
>  
>  	for_each_evictable_type(i, swappiness) {
> @@ -4827,8 +4809,8 @@ static int isolate_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  		*type_scanned = type;
>  
>  		scanned += scan_folios(nr_to_scan, lruvec, sc,
> -				      type, tier, list, &isolated);
> -		if (isolated)
> +				      type, tier, list, isolated);
> +		if (*isolated)
>  			return scanned;
>  
>  		type = !type;
> @@ -4843,6 +4825,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  	int type;
>  	int scanned;
>  	int reclaimed;
> +	int isolated = 0;
>  	LIST_HEAD(list);
>  	LIST_HEAD(clean);
>  	struct folio *folio;
> @@ -4856,7 +4839,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  
>  	lruvec_lock_irq(lruvec);
>  
> -	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
> +	scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list, &isolated);
>  
>  	try_to_inc_min_seq(lruvec, swappiness);
>  
> @@ -4866,12 +4849,18 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>  		return scanned;
>  retry:
>  	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
> -	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
>  	sc->nr_reclaimed += reclaimed;
>  	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
>  			scanned, reclaimed, &stat, sc->priority,
>  			type ? LRU_INACTIVE_FILE : LRU_INACTIVE_ANON);
>  
> +	/*
> +	 * If too many file cache in the coldest generation can't be evicted
> +	 * due to being dirty, wake up the flusher.
> +	 */
> +	if (stat.nr_unqueued_dirty == isolated)
> +		wakeup_flusher_threads(WB_REASON_VMSCAN);
> +
>  	list_for_each_entry_safe_reverse(folio, next, &list, lru) {
>  		DEFINE_MIN_SEQ(lruvec);
>  
> @@ -5023,13 +5012,6 @@ static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  		cond_resched();
>  	}
>  
> -	/*
> -	 * If too many file cache in the coldest generation can't be evicted
> -	 * due to being dirty, wake up the flusher.
> -	 */
> -	if (sc->nr.unqueued_dirty && sc->nr.unqueued_dirty == sc->nr.file_taken)
> -		wakeup_flusher_threads(WB_REASON_VMSCAN);
> -
>  	/* whether this lruvec should be rotated */
>  	return need_rotate;
>  }
> 

I may be missing something, but I think this change moves dirty/writeback
folios into `shrink_folio_list()` without moving the corresponding reclaim
feedback as well.

Before this patch, MGLRU mostly filtered dirty/writeback folios in
`sort_folio()`. After this patch they can be isolated and processed by
`shrink_folio_list()`, but the new code seems to only keep the flusher wakeup
and no longer feeds the resulting state back into `sc->nr.*` (`dirty`,
`congested`, `writeback`, `immediate`, `taken`).

Those counters are consumed later by reclaim/throttling logic, so shouldn't
MGLRU update them here too, similar to the classic inactive-LRU path?

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/8] mm/mglru: scan and count the exact number of folios
  2026-03-24  8:05         ` Kairui Song
@ 2026-03-24  9:10           ` Chen Ridong
  2026-03-24  9:29             ` Kairui Song
  0 siblings, 1 reply; 44+ messages in thread
From: Chen Ridong @ 2026-03-24  9:10 UTC (permalink / raw)
  To: Kairui Song
  Cc: Axel Rasmussen, linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel



On 2026/3/24 16:05, Kairui Song wrote:
> On Tue, Mar 24, 2026 at 3:22 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
>> On 2026/3/23 0:20, Kairui Song wrote:
>>> On Sat, Mar 21, 2026 at 4:59 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>>>>
>>>> On Tue, Mar 17, 2026 at 12:11 PM Kairui Song via B4 Relay
>>>> <devnull+kasong.tencent.com@kernel.org> wrote:
>>>>>
>>>>> From: Kairui Song <kasong@tencent.com>
>>>>>
>>>>> Make the scan helpers return the exact number of folios being scanned
>>>>> or isolated. This should make the scan more accurate and easier to
>>>>> follow.
>>>>>
>>>>> Now there is no more need for special handling when there is no
>>>>> progress made. The old livelock prevention `(return isolated ||
>>>>> !remaining ? scanned : 0)` is replaced by the natural scan budget
>>>>> exhaustion in try_to_shrink_lruvec, and sort_folio moves ineligible
>>>>> folios to newer generations.
>>>>>
>>>>> Signed-off-by: Kairui Song <kasong@tencent.com>
> 
> ...
> 
>>>>>  static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>>> @@ -4852,7 +4851,6 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>>>         struct reclaim_stat stat;
>>>>>         struct lru_gen_mm_walk *walk;
>>>>>         bool skip_retry = false;
>>>>> -       struct lru_gen_folio *lrugen = &lruvec->lrugen;
>>>>>         struct mem_cgroup *memcg = lruvec_memcg(lruvec);
>>>>>         struct pglist_data *pgdat = lruvec_pgdat(lruvec);
>>>>>
>>>>> @@ -4860,10 +4858,7 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
>>>>>
>>>>>         scanned = isolate_folios(nr_to_scan, lruvec, sc, swappiness, &type, &list);
>>>>>
>>>>> -       scanned += try_to_inc_min_seq(lruvec, swappiness);
>>>>> -
>>>>> -       if (evictable_min_seq(lrugen->min_seq, swappiness) + MIN_NR_GENS > lrugen->max_seq)
>>>>> -               scanned = 0;
>>>>> +       try_to_inc_min_seq(lruvec, swappiness);
>>>>
>>>> IIUC, this change is what introduces the issue patch 6 is trying to
>>>> resolve. Is it worth squashing patch 6 in to this one, so we don't
>>>> have this non-ideal intermediate state?
>>>
>>> Well it's not, patch 6 is fixing an existing problem, see the cover
>>> letter about the OOM issue.
>>>
>>> This part of changing is just cleanup the loop code. It looks really
>>> strange to me that increasing min_seq is considered as scanning one
>>> folio. Aborting the scan if there is only 2 gen kind of make sense but
>>> this doesn't seems the right place. These strange parts to avoid
>>> livelock can be dropped since we have an exact count of folios being
>>> scanned now. I'll add more words in the commit message.
>>
>> This change confused me too.
>>
>> IIUC, this change looks conceptually tied to patch 3. The following change means
>> that evict_folios should not be invoked if aging is needed. So the judge can be
>> dropped there, right?
>>
>>
>> ```
>>  static bool try_to_shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>  {
>> ...
>> +               if (should_run_aging(lruvec, max_seq, sc, swappiness)) {
>> +                       if (try_to_inc_max_seq(lruvec, max_seq, swappiness, false))
>> +                               need_rotate = true;
>> +                       break;
>> +               }
>> ```
>>
> 
> Hi Ridong,
> 
> Ahh yes, as you pointed out, the explicit should_run_aging kind of
> guards the evict_folio. That's not everything, besides, previously
> isolate_folios may return 0 if there is no folio isolated. But now it
> always return the number of folios being scanned, unless there are
> only two genes left and hit the force protection, which also makes the
> judge here can be dropped.
> 
> But not invoking evict_folios if aging is needed is an existing
> behavior, that commit (patch 3) didn't change it, just made it cleaner
> so we can see it well.
> 

Thanks for the explanation.

Would it be better to combine this change with patch 3, rather than adding to
the commit message?

-- 
Best regards,
Ridong



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/8] mm/mglru: scan and count the exact number of folios
  2026-03-24  9:10           ` Chen Ridong
@ 2026-03-24  9:29             ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-24  9:29 UTC (permalink / raw)
  To: Chen Ridong
  Cc: Axel Rasmussen, linux-mm, Andrew Morton, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 24, 2026 at 5:10 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> On 2026/3/24 16:05, Kairui Song wrote:
> > Hi Ridong,
> >
> > Ahh yes, as you pointed out, the explicit should_run_aging kind of
> > guards the evict_folio. That's not everything, besides, previously
> > isolate_folios may return 0 if there is no folio isolated. But now it
> > always return the number of folios being scanned, unless there are
> > only two genes left and hit the force protection, which also makes the
> > judge here can be dropped.
> >
> > But not invoking evict_folios if aging is needed is an existing
> > behavior, that commit (patch 3) didn't change it, just made it cleaner
> > so we can see it well.
> >
>
> Thanks for the explanation.
>
> Would it be better to combine this change with patch 3, rather than adding to
> the commit message?
>
> --
> Best regards,
> Ridong
>

Hi Ridong, thanks for the suggestion.

Patch 3 is already a bit complex, I think, so I split this out as a
separate patch to make review easier. I can try merging them in V2 if
they still look confusing.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling
  2026-03-24  8:57   ` Chen Ridong
@ 2026-03-24 11:09     ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-24 11:09 UTC (permalink / raw)
  To: Chen Ridong
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang, Kalesh Singh,
	Suren Baghdasaryan, Chris Li, Vernon Yang, linux-kernel

On Tue, Mar 24, 2026 at 5:10 PM Chen Ridong <chenridong@huaweicloud.com> wrote:
> On 2026/3/18 3:09, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current handling of dirty writeback folios is not working well for
> > file page heavy workloads: Dirty folios are protected and move to next
> > gen upon isolation of getting throttled or reactivated upon pageout
> > (shrink_folio_list).
> >
> > This might help to reduce the LRU lock contention slightly, but as a
> > result, the ping-pong effect of folios between head and tail of last two
> > gens is serious as the shrinker will run into protected dirty writeback
> > folios more frequently compared to activation. The dirty flush wakeup
> > condition is also much more passive compared to active/inactive LRU.
> > Active / inactve LRU wakes the flusher if one batch of folios passed to
> > shrink_folio_list is unevictable due to under writeback, but MGLRU
> > instead has to check this after the whole reclaim loop is done, and then
> > count the isolation protection number compared to the total reclaim
> > number.
> >
> > And we previously saw OOM problems with it, too, which were fixed but
> > still not perfect [1].
> >
> > So instead, just drop the special handling for dirty writeback, just
> > re-activate it like active / inactive LRU. And also move the dirty flush
> > wake up check right after shrink_folio_list. This should improve both
> > throttling and performance.
> >
> > Test with YCSB workloadb showed a major performance improvement:
> >
> > Before this series:
> > Throughput(ops/sec): 61642.78008938203
> > AverageLatency(us):  507.11127774145166
> > pgpgin 158190589
> > pgpgout 5880616
> > workingset_refault 7262988
> >
> > After this commit:
> > Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
> > AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
> > pgpgin 101871227                        (-35.6%, lower is better)
> > pgpgout 5770028
> > workingset_refault 3418186              (-52.9%, lower is better)
> >
> > The refault rate is 50% lower, and throughput is 30% higher, which is a
> > huge gain. We also observed significant performance gain for other
> > real-world workloads.
> >
> > We were concerned that the dirty flush could cause more wear for SSD:
> > that should not be the problem here, since the wakeup condition is when
> > the dirty folios have been pushed to the tail of LRU, which indicates
> > that memory pressure is so high that writeback is blocking the workload
> > already.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Link: https://lore.kernel.org/linux-mm/20241026115714.1437435-1-jingxiangzeng.cas@gmail.com/ [1]
> > ---
> >  mm/vmscan.c | 44 +++++++++++++-------------------------------
> >  1 file changed, 13 insertions(+), 31 deletions(-)
> >

...

>
> I may be missing something, but I think this change moves dirty/writeback
> folios into `shrink_folio_list()` without moving the corresponding reclaim
> feedback as well.
>
> Before this patch, MGLRU mostly filtered dirty/writeback folios in
> `sort_folio()`. After this patch they can be isolated and processed by
> `shrink_folio_list()`, but the new code seems to only keep the flusher wakeup
> and no longer feeds the resulting state back into `sc->nr.*` (`dirty`,
> `congested`, `writeback`, `immediate`, `taken`).
>
> Those counters are consumed later by reclaim/throttling logic, so shouldn't
> MGLRU update them here too, similar to the classic inactive-LRU path?
>

Yeah, how about making better use of them in a separate patch? MGLRU
has pretty much ignored these counters so far and never populated some
of the sc->nr.* fields. So this isn't an issue introduced by this
patch; it could be an existing issue, if it's a valid issue at all.

This patch only changed sc->nr.unqueued_dirty/file_taken; combined
with the tweaks to dirty handling, the result is pretty good.

How about a separate patch after cleaning up the counters? The next
patch will remove the unused ones, and changes like throttling can
then be tested and reviewed separately.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
  2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
                   ` (7 preceding siblings ...)
  2026-03-17 19:09 ` [PATCH 8/8] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
@ 2026-03-25  4:49 ` Eric Naim
  2026-03-25  5:47   ` Kairui Song
  8 siblings, 1 reply; 44+ messages in thread
From: Eric Naim @ 2026-03-25  4:49 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel

Hi Kairui,

On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty flush logic. As a result, we can see an up to ~50% reduce of file
> faults and 30% increase in MongoDB throughput with YCSB and no swap
> involved, other common benchmarks have no regression, and LOC is
> reduced, with less unexpected OOM in our production environment.
> 
> Some of the problems were found in our production environment, and
> others are mostly exposed while stress testing the LFU-like design as
> proposed in the LSM/MM/BPF topic this year [1]. This series has no
> direct relationship to that topic, but it cleans up the code base and
> fixes several strange behaviors that make the test result of the
> LFU-like design not as good as expected.
> 
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective too.
> 
> This series slightly cleans up and improves the reclaim loop using a
> scan budget by calculating the number of folios to scan at the beginning
> of the loop, and decouples aging from the reclaim calculation helpers
> Then move the dirty flush logic inside the reclaim loop so it can kick
> in more effectively. These issues are somehow related, and this series
> handles them and improves MGLRU reclaim in many ways.
> 
> Test results: All tests are done on a 48c96t NUMA machine with 2 nodes
> and 128G memory machine using NVME as storage.
> 
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.
> 
> Not using SWAP.
> 
> Median of 3 test run, results are stable.
> 
> Before:
> Throughput(ops/sec): 61642.78008938203
> AverageLatency(us):  507.11127774145166
> pgpgin 158190589
> pgpgout 5880616
> workingset_refault 7262988
> 
> After:
> Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
> AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
> pgpgin 101871227                        (-35.6%, lower is better)
> pgpgout 5770028
> workingset_refault 3418186              (-52.9%, lower is better)
> 
> We can see a significant performance improvement after this series for
> file cache heavy workloads like this. The test is done on NVME and the
> performance gap would be even larger for slow devices, we observed
> over 100% gain for some other workloads running on HDD devices.
> 
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
> 
> Before:
> Total requests:            77920
> Per-worker 95% CI (mean):  [1199.9, 1235.1]
> Per-worker stdev:          70.5
> Jain's fairness:           0.996706 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     25649   32.92%   32.92%
> [1,2)s      7759    9.96%   42.87%
> [2,4)s      5156    6.62%   49.49%
> [4,8)s     39356   50.51%  100.00%
> 
> After:
> Total requests:            79564
> Per-worker 95% CI (mean):  [1224.2, 1262.2]
> Per-worker stdev:          76.1
> Jain's fairness:           0.996328 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     25485   32.03%   32.03%
> [1,2)s      8661   10.89%   42.92%
> [2,4)s      6268    7.88%   50.79%
> [4,8)s     39150   49.21%  100.00%
> 
> Seems identical, reclaim is still fair and effective, total requests
> number seems slightly better.
> 
> OOM issue [4]
> =============
> Testing with a specific reproducer [4] to simulate what we encounterd in
> production environment. Still using the same test machine but one node
> is used as pmem ramdisk following steps in the reproducer, no SWAP used.
> 
> This reproducer spawns multiple workers that keep reading the given file
> using mmap, and pauses for 120ms after one file read batch. It also
> spawns another set of workers that keep allocating and freeing a
> given size of anonymous memory. The total memory size exceeds the
> memory limit (eg. 44G anon + 8G file, which is 52G vs 48G memcg limit).
> But by evicting the file cache, the workload should hold just fine,
> especially given that the file worker pauses after every batch, allowing
> other workers to catch up.
> 
> - MGLRU disabled:
>   Finished 128 iterations.
> 
> - MGLRU enabled:
>   Hung or OOM with following info after about ~10-20 iterations:
> 
>     [  357.332946] file_anon_mix_pressure invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>     ... <snip> ...
>     [  357.333827] memory: usage 50331648kB, limit 50331648kB, failcnt 90907
>     [  357.347728] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>     [  357.348192] Memory cgroup stats for /demo:
>     [  357.348314] anon 46724382720
>     [  357.348963] file 4160753664
> 
>   OOM occurs despite there is still evictable file folios.
> 
> - MGLRU enabled after this series:
>   Finished 128 iterations.
> 
> With aging blocking reclaim, the OOM will be much more likely to occur.
> This issue is mostly fixed by patch 6 and result is much better, but
> this series is still only the first step to improve file folio reclaim
> for MGLRU, as there are still cases where file folios can't be
> effectively reclaimed.
> 
> MySQL:
> ======
> 
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
> 
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
>   --tables=48 --table-size=2000000 --threads=96 --time=600 run
> 
> Before:         22343.701667 tps
> After patch 4:  22327.325000 tps
> After patch 5:  22373.224000 tps
> After patch 6:  22321.174000 tps
> After patch 7:  22625.961667 tps (+1.26%, higher is better)
> 
> MySQL is anon folios heavy but still looking good. Seems only noise level
> changes, no regression.
> 
> FIO:
> ====
> Testing with the following command, where /mnt is an EXT4 ramdisk, 6
> test runs each in a 10G memcg:
> 
> fio -name=cached --numjobs=16 --filename=/mnt/test.img --buffered=1 \
>   --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
>   --iodepth_batch_complete=32 --rw=randread \
>   --random_distribution=zipf:1.2 --norandommap --time_based \
>   --ramp_time=1m --runtime=10m --group_reporting
> 
> Before:        32039.56 MB/s
> After patch 3: 32751.50 MB/s
> After patch 4: 32703.03 MB/s
> After patch 5: 33395.52 MB/s
> After patch 6: 32031.51 MB/s
> After patch 7: 32534.29 MB/s
> 
> Also seem only noise level changes and no regression.
> 
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 8 test run each.
> 
> Before:        2881.41s
> After patch 3: 2894.09s
> After patch 4: 2846.73s
> After patch 5: 2847.91s
> After patch 6: 2835.17s
> After patch 7: 2842.90s
> 
> Also seem only noise level changes, no regression or very slightly better.
> 
> Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
> Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
> Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
> Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

I applied this patch set on top of 7.0-rc5 and noticed the system locking up when running the test below.

fallocate -l 5G 5G
while true; do tail /dev/zero; done
while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done

After reading [1], I suspect this was because the system was using zram as swap, and indeed the lockup does not occur if zram is disabled. Is there anything that I (CachyOS) can do to help debug this regression, if it is considered one? According to [1], zram as swap seems to be unsupported upstream. (The user who tested this wasn't able to get a good kernel trace; the only thing left was a trace of the OOM killer firing.)

[1] https://chrisdown.name/2026/03/24/zswap-vs-zram-when-to-use-what.html

-- 
Regards,
  Eric


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
  2026-03-25  4:49 ` [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Eric Naim
@ 2026-03-25  5:47   ` Kairui Song
  2026-03-25  9:26     ` Eric Naim
  0 siblings, 1 reply; 44+ messages in thread
From: Kairui Song @ 2026-03-25  5:47 UTC (permalink / raw)
  To: Eric Naim
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel

On Wed, Mar 25, 2026 at 1:04 PM Eric Naim <dnaim@cachyos.org> wrote:
>
> Hi Kairui,
>
> On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> > This series cleans up and slightly improves MGLRU's reclaim loop and
> > dirty flush logic. As a result, we can see an up to ~50% reduce of file
> > faults and 30% increase in MongoDB throughput with YCSB and no swap
> > involved, other common benchmarks have no regression, and LOC is
> > reduced, with less unexpected OOM in our production environment.
> >

...

> > Before:        2881.41s
> > After patch 3: 2894.09s
> > After patch 4: 2846.73s
> > After patch 5: 2847.91s
> > After patch 6: 2835.17s
> > After patch 7: 2842.90s
> >
> > Also seem only noise level changes, no regression or very slightly better.
> >
> > Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
> > Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
> > Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
> > Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> I applied this patch set to 7.0-rc5 and noticed the system locking up when performing the below test.
>
> fallocate -l 5G 5G
> while true; do tail /dev/zero; done
> while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done
>
> After reading [1], I suspect that this was because the system was using zram as swap, and yes if zram is disabled then the lock up does not occur.

Hi Eric,

Thanks for the report. I was about to send V2, but after seeing your
report I'll try to reproduce the issue first.

So far I haven't noticed any regression. Is this an issue caused by
this series, or an existing one? I don't have much context on how you
ran the test. BTW, the calculation in the patch "mm/mglru: restructure
the reclaim loop" needs a floor for small machines, i.e.
max(nr_to_scan, SWAP_CLUSTER_MAX); not sure if that's related, but
I'll add it in V2.

And about the test you posted:
while true; do tail /dev/zero; done

I believe this will just consume all memory with zero pages and then
get OOM killed; that's exactly what the test is meant to do. I'm not
sure what you mean by lockup, since you mentioned the OOM kill: did
the whole system actually hang, or was just the desktop dead?

I just ran that with and without ZRAM on two machines and my laptop;
everything looks good here with this series.

> zram as swap seems to be unsupported by upstream.

That's simply not true; distros like Fedora even enable ZRAM as swap
by default:
https://fedoraproject.org/wiki/Changes/SwapOnZRAM

And systemd has widely used ZRAM swap support:
https://github.com/systemd/zram-generator

Android also uses it, and we use ZRAM by default across our fleet,
which runs fine.

> the user that tested this wasn't able to get a
> good kernel trace, the only thing left was
> a trace of the OOM killer firing.

No worries, that's fine. Just send me the OOM trace or log; the more
detailed the context, the better.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
  2026-03-25  5:47   ` Kairui Song
@ 2026-03-25  9:26     ` Eric Naim
  2026-03-25  9:47       ` Kairui Song
  0 siblings, 1 reply; 44+ messages in thread
From: Eric Naim @ 2026-03-25  9:26 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel

On 3/25/26 1:47 PM, Kairui Song wrote:
> On Wed, Mar 25, 2026 at 1:04 PM Eric Naim <dnaim@cachyos.org> wrote:
>>
>> Hi Kairui,
>>
>> On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
>>> This series cleans up and slightly improves MGLRU's reclaim loop and
>>> dirty flush logic. As a result, we can see an up to ~50% reduce of file
>>> faults and 30% increase in MongoDB throughput with YCSB and no swap
>>> involved, other common benchmarks have no regression, and LOC is
>>> reduced, with less unexpected OOM in our production environment.
>>>
> 
> ...
> 
>>
>> I applied this patch set to 7.0-rc5 and noticed the system locking up when performing the below test.
>>
>> fallocate -l 5G 5G
>> while true; do tail /dev/zero; done
>> while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done
>>
>> After reading [1], I suspect that this was because the system was using zram as swap, and yes if zram is disabled then the lock up does not occur.
> 
> Hi Eric,
> 
> Thanks for the report. I was about to send V2, but after seeing your
> report I'll try to reproduce your issue first.
> 
> So far I haven't noticed any regression. Is this an issue caused by
> this series, or an existing one? I don't have any context on how you
> are running the test. BTW, the calculation in the patch "mm/mglru:
> restructure the reclaim loop" needs a lower bound of
> "max(nr_to_scan, SWAP_CLUSTER_MAX)" for small machines; not sure if
> that's related, but I will add it in V2.
> 

As of this writing, I have some new information that makes this a bit more confusing. The kernel that doesn't have the issue was patched with [1] as a means of protecting the working set (similar to lru_gen_min_ttl_ms).

So this time on an unpatched kernel, the system still freezes but quickly recovers itself after about 2 seconds. With this patchset applied, the system freezes but it doesn't quickly recover (if at all).

Curiously, I had the user test again but this time with lru_gen_min_ttl_ms = 100. With this set, the system doesn't freeze at all with or without this patchset.

> And about the test you posted:
> while true; do tail /dev/zero; done
> 
> I believe this will just consume all memory with zero pages and then
> get OOM killed; that's exactly what the test is meant to do. I'm not
> sure what you mean by lockup, since you mentioned an OOM kill. Did the
> system actually hang, or was just the desktop dead?

The system actually hung. They needed a hard reset to recover the system. (Pure speculation: given a few minutes, the system would likely have recovered itself, as this seems to be a common scenario.)

> 
> I just ran that with or without ZRAM on two machines and my laptop,
> everything looks good here with this series.
> 
>> zram as swap seems to be unsupported by upstream.
> 
> That's simply not true; other distros like Fedora even have ZRAM as
> swap by default:
> https://fedoraproject.org/wiki/Changes/SwapOnZRAM
> 
> And systemd has widely used ZRAM swap support:
> https://github.com/systemd/zram-generator
> 
> Android also uses it, and we use ZRAM by default in our fleet, which
> runs fine.
> 
>> the user that tested this wasn't able to get a
>> good kernel trace, the only thing left was
>> a trace of the OOM killer firing.
> 
> No worries, that's fine. Just send me the OOM trace or log; the more
> detailed context I get, the better.

Mar 25 08:24:22 osiris kernel: Call Trace:
Mar 25 08:24:22 osiris kernel:  <TASK>
Mar 25 08:24:22 osiris kernel:  dump_stack_lvl+0x61/0x80
Mar 25 08:24:22 osiris kernel:  dump_header+0x4a/0x160
Mar 25 08:24:22 osiris kernel:  oom_kill_process+0x18f/0x1f0
Mar 25 08:24:22 osiris kernel:  out_of_memory+0x4ab/0x5c0
Mar 25 08:24:22 osiris kernel:  __alloc_pages_slowpath+0x9ac/0x1060
Mar 25 08:24:22 osiris kernel:  __alloc_frozen_pages_noprof+0x29a/0x320
Mar 25 08:24:22 osiris kernel:  alloc_pages_mpol+0x107/0x1b0
Mar 25 08:24:22 osiris kernel:  folio_alloc_noprof+0x85/0xb0
Mar 25 08:24:22 osiris kernel:  __filemap_get_folio_mpol+0x1ff/0x4c0
Mar 25 08:24:22 osiris kernel:  filemap_fault+0x3e3/0x6e0
Mar 25 08:24:22 osiris kernel:  __do_fault+0x46/0x140
Mar 25 08:24:22 osiris kernel:  do_pte_missing+0x154/0xea0
Mar 25 08:24:22 osiris kernel:  ? __pte_offset_map+0x1d/0xd0
Mar 25 08:24:22 osiris kernel:  handle_mm_fault+0x89c/0x1280
Mar 25 08:24:22 osiris kernel:  do_user_addr_fault+0x23b/0x720
Mar 25 08:24:22 osiris kernel:  exc_page_fault+0x75/0xe0
Mar 25 08:24:22 osiris kernel:  asm_exc_page_fault+0x26/0x30
Mar 25 08:24:22 osiris kernel: RIP: 0033:0x7fec4beb43c0
Mar 25 08:24:22 osiris kernel: Code: Unable to access opcode bytes at 0x7fec4beb4396.
Mar 25 08:24:22 osiris kernel: RSP: 002b:00007ffcb348d698 EFLAGS: 00010293
Mar 25 08:24:22 osiris kernel: RAX: 00000000c70f6907 RBX: 00007ffcb348d8d0 RCX: 00007fec4bb1604d
Mar 25 08:24:22 osiris kernel: RDX: c6a4a7935bd1e995 RSI: 4fb7dae88ad99bfb RDI: 000055ee77cc8150
Mar 25 08:24:22 osiris kernel: RBP: 00007ffcb348dd60 R08: 000055ee77cc8158 R09: 000000000000000c
Mar 25 08:24:22 osiris kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
Mar 25 08:24:22 osiris kernel: R13: 000055ee77cc8150 R14: 0000000000000064 R15: 431bde82d7b634db
Mar 25 08:24:22 osiris kernel:  </TASK>

Here's the call trace that was recovered. Some mm related settings that we set in our kernel in case its useful:

vm.compact_unevictable_allowed = 0
vm.compaction_proactiveness = 0
vm.page-cluster = 0
vm.swappiness = 150 
vm.vfs_cache_pressure = 50
vm.dirty_bytes = 268435456
vm.dirty_background_bytes = 67108864
vm.dirty_writeback_centisecs = 1500
vm.watermark_boost_factor = 0

/sys/kernel/mm/transparent_hugepage/defrag = defer+madvise

[1] https://github.com/firelzrd/le9uo/

-- 
Regards,
  Eric


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
  2026-03-25  9:26     ` Eric Naim
@ 2026-03-25  9:47       ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-25  9:47 UTC (permalink / raw)
  To: Eric Naim
  Cc: linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel

On Wed, Mar 25, 2026 at 5:27 PM Eric Naim <dnaim@cachyos.org> wrote:
>
> On 3/25/26 1:47 PM, Kairui Song wrote:
> > On Wed, Mar 25, 2026 at 1:04 PM Eric Naim <dnaim@cachyos.org> wrote:
> >>
> >> Hi Kairui,
> >>
> >> On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> >>> This series cleans up and slightly improves MGLRU's reclaim loop and
> >>> dirty flush logic. As a result, we can see an up to ~50% reduce of file
> >>> faults and 30% increase in MongoDB throughput with YCSB and no swap
> >>> involved, other common benchmarks have no regression, and LOC is
> >>> reduced, with less unexpected OOM in our production environment.
> >>>
> >
> > ...
> >
> >>
> >> I applied this patch set to 7.0-rc5 and noticed the system locking up when performing the below test.
> >>
> >> fallocate -l 5G 5G
> >> while true; do tail /dev/zero; done
> >> while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done
> >>
> >> After reading [1], I suspect that this was because the system was using zram as swap, and yes if zram is disabled then the lock up does not occur.
> >
> > Hi Eric,
> >
> > Thanks for the report. I was about to send V2, but after seeing your
> > report I'll try to reproduce your issue first.
> >
> > So far I haven't noticed any regression. Is this an issue caused by
> > this series, or an existing one? I don't have any context on how you
> > are running the test. BTW, the calculation in the patch "mm/mglru:
> > restructure the reclaim loop" needs a lower bound of
> > "max(nr_to_scan, SWAP_CLUSTER_MAX)" for small machines; not sure if
> > that's related, but I will add it in V2.
> >
>
> As of this writing, I have some new information that makes this a bit more confusing. The kernel that doesn't have the issue was patched with [1] as a means of protecting the working set (similar to lru_gen_min_ttl_ms).
>
> So this time on an unpatched kernel, the system still freezes but quickly recovers itself after about 2 seconds. With this patchset applied, the system freezes but it doesn't quickly recover (if at all).
>
> Curiously, I had the user test again but this time with lru_gen_min_ttl_ms = 100. With this set, the system doesn't freeze at all with or without this patchset.

Ah, thanks, that makes sense now. The downstream patch you mentioned
limits the reclaim of file pages to avoid thrashing, and your test
case exhausts memory on purpose, which forces the kernel to reclaim
all reclaimable folios, including page cache.

A thrashing page cache easily causes desktop hangs; using the TTL is
an effective way to avoid thrashing and trigger OOM early. That's why
the problem is gone with lru_gen_min_ttl_ms = 100 or le9.

> > And about the test you posted:
> > while true; do tail /dev/zero; done
> >
> > I believe this will just consume all memory with zero pages and then
> > get OOM killed; that's exactly what the test is meant to do. I'm not
> > sure what you mean by lockup, since you mentioned an OOM kill. Did the
> > system actually hang, or was just the desktop dead?
>
> The system actually hung. They needed a hard reset to recover the system. (Pure speculation: given a few minutes, the system would likely have recovered itself, as this seems to be a common scenario.)

Yeah I believe so.

Thrashing prevention is why MGLRU's TTL was introduced, so I do
suggest using it. It can be further improved, too.

I'll keep that in mind, try to create some test cases covering your
scenario too, and make some adjustments.

BTW, how does the kernel behave in your case with MGLRU disabled?


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size
  2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
                     ` (2 preceding siblings ...)
  2026-03-19  1:40   ` Chen Ridong
@ 2026-03-26  6:25   ` Baolin Wang
  3 siblings, 0 replies; 44+ messages in thread
From: Baolin Wang @ 2026-03-26  6:25 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel



On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Merge commonly used code for counting evictable folios in a lruvec.
> 
> No behavior change.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

With lruvec_evictable_size() changed to return unsigned long, LGTM.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/8] mm/mglru: restructure the reclaim loop
  2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
  2026-03-20 20:09   ` Axel Rasmussen
  2026-03-24  6:41   ` Chen Ridong
@ 2026-03-26  7:31   ` Baolin Wang
  2026-03-26  8:37     ` Kairui Song
  2 siblings, 1 reply; 44+ messages in thread
From: Baolin Wang @ 2026-03-26  7:31 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel



On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current loop will calculate the scan number on each iteration. The
> number of folios to scan is based on the LRU length, with some unclear
> behaviors; e.g., it only shifts the scan number by reclaim priority at
> the default priority, and it couples the number calculation with aging
> and rotation.
> 
> Adjust, simplify it, and decouple aging and rotation. Just calculate
> the scan number once at the beginning of reclaim, always respect the
> reclaim priority, and make the aging and rotation more explicit.
> 
> This slightly changes how offline memcg aging works: previously, offline
> memcg wouldn't be aged unless it didn't have any evictable folios. Now,
> we might age it if it has only 3 generations and the reclaim priority is
> less than DEF_PRIORITY, which should be fine. On one hand, offline memcg
> might still hold long-term folios, and in fact, a long-existing offline
> memcg must be pinned by some long-term folios like shmem. These folios
> might be used by other memcg, so aging them as ordinary memcg doesn't
> seem wrong. And besides, aging enables further reclaim of an offlined
> memcg, which will certainly happen if we keep shrinking it. And offline
> memcg might soon be no longer an issue once reparenting is fully ready.
> 
> Overall, the memcg LRU rotation, as described in mmzone.h,
> remains the same.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Overall, I really like the code cleanup here, and it makes the code much 
more readable. Thanks for your work.

However, one concern is that you've mixed some functional changes (such 
as the offline memcg aging and shifting the scan number by reclaim 
priority) into these cleanups. This makes the commit difficult to 
review, though I think the functional changes make sense to me.

Can we split this up? That means you can send the cleanups as a separate 
patch first, followed by the functional changes in the following patches 
with an explanation of their impact. Then reviewers can focus on 
discussing the functional changes.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling
  2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
  2026-03-20 21:18   ` Axel Rasmussen
  2026-03-24  8:57   ` Chen Ridong
@ 2026-03-26  7:56   ` Baolin Wang
  2 siblings, 0 replies; 44+ messages in thread
From: Baolin Wang @ 2026-03-26  7:56 UTC (permalink / raw)
  To: kasong, linux-mm
  Cc: Andrew Morton, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Johannes Weiner, David Hildenbrand, Michal Hocko, Qi Zheng,
	Shakeel Butt, Lorenzo Stoakes, Barry Song, David Stevens,
	Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao, Zicheng Wang,
	Kalesh Singh, Suren Baghdasaryan, Chris Li, Vernon Yang,
	linux-kernel



On 3/18/26 3:09 AM, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current handling of dirty writeback folios is not working well for
> file-page-heavy workloads: dirty folios are protected and moved to the
> next gen upon isolation, instead of getting throttled or reactivated
> upon pageout (shrink_folio_list).
> 
> This might help to reduce the LRU lock contention slightly, but as a
> result, the ping-pong effect of folios between the head and tail of the
> last two gens is serious, as the shrinker will run into protected dirty
> writeback folios more frequently compared to activation. The dirty
> flush wakeup condition is also much more passive compared to the
> active/inactive LRU: the active/inactive LRU wakes the flusher if one
> batch of folios passed to shrink_folio_list is unevictable due to being
> under writeback, but MGLRU instead has to check this after the whole
> reclaim loop is done, and then compare the isolation protection count
> against the total reclaim count.
> 
> And we previously saw OOM problems with it, too, which were fixed but
> still not perfect [1].
> 
> So instead, drop the special handling for dirty writeback and just
> re-activate such folios like the active/inactive LRU does. Also move
> the dirty flush wakeup check to right after shrink_folio_list. This
> should improve both throttling and performance.

Makes sense to me. Additionally, I think there is still room to improve 
the writeback-related logic in shrink_folio_list().


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/8] mm/mglru: restructure the reclaim loop
  2026-03-26  7:31   ` Baolin Wang
@ 2026-03-26  8:37     ` Kairui Song
  0 siblings, 0 replies; 44+ messages in thread
From: Kairui Song @ 2026-03-26  8:37 UTC (permalink / raw)
  To: Baolin Wang
  Cc: kasong, linux-mm, Andrew Morton, Axel Rasmussen, Yuanchu Xie,
	Wei Xu, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Shakeel Butt, Lorenzo Stoakes, Barry Song,
	David Stevens, Chen Ridong, Leno Hou, Yafang Shao, Yu Zhao,
	Zicheng Wang, Kalesh Singh, Suren Baghdasaryan, Chris Li,
	Vernon Yang, linux-kernel

On Thu, Mar 26, 2026 at 03:31:43PM +0800, Baolin Wang wrote:
> 
> 
> On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> > 
> > The current loop will calculate the scan number on each iteration. The
> > number of folios to scan is based on the LRU length, with some unclear
> > behaviors; e.g., it only shifts the scan number by reclaim priority at
> > the default priority, and it couples the number calculation with aging
> > and rotation.
> > 
> > Adjust, simplify it, and decouple aging and rotation. Just calculate
> > the scan number once at the beginning of reclaim, always respect the
> > reclaim priority, and make the aging and rotation more explicit.
> > 
> > This slightly changes how offline memcg aging works: previously, offline
> > memcg wouldn't be aged unless it didn't have any evictable folios. Now,
> > we might age it if it has only 3 generations and the reclaim priority is
> > less than DEF_PRIORITY, which should be fine. On one hand, offline memcg
> > might still hold long-term folios, and in fact, a long-existing offline
> > memcg must be pinned by some long-term folios like shmem. These folios
> > might be used by other memcg, so aging them as ordinary memcg doesn't
> > seem wrong. And besides, aging enables further reclaim of an offlined
> > memcg, which will certainly happen if we keep shrinking it. And offline
> > memcg might soon be no longer an issue once reparenting is fully ready.
> > 
> > Overall, the memcg LRU rotation, as described in mmzone.h,
> > remains the same.
> > 
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> 
> Overall, I really like the code cleanup here, and it makes the code much
> more readable. Thanks for your work.

Thanks for the review!

> 
> However, one concern is that you've mixed some functional changes (such as
> the offline memcg aging and shifting the scan number by reclaim priority)
> into these cleanups. This makes the commit difficult to review, though I
> think the functional changes make sense to me.

Right, and that's why I also included very detailed test results for
each step in the cover letter. This part of the code is somewhat
coupled together, so when decoupling it, things may get messier if we
try to keep some of the trivial old behavior.

But I'll give it a try in V2. I'm already doing some stress testing,
but splitting the patch shouldn't affect the final code.


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2026-03-26  8:38 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
2026-03-17 19:55   ` Yuanchu Xie
2026-03-18  9:42   ` Barry Song
2026-03-18  9:57     ` Kairui Song
2026-03-19  1:40   ` Chen Ridong
2026-03-20 19:51     ` Axel Rasmussen
2026-03-22 16:10       ` Kairui Song
2026-03-26  6:25   ` Baolin Wang
2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
2026-03-19  2:00   ` Chen Ridong
2026-03-19  4:12     ` Kairui Song
2026-03-20 21:00   ` Axel Rasmussen
2026-03-22  8:14   ` Barry Song
2026-03-24  6:05     ` Kairui Song
2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
2026-03-20 20:09   ` Axel Rasmussen
2026-03-22 16:11     ` Kairui Song
2026-03-24  6:41   ` Chen Ridong
2026-03-26  7:31   ` Baolin Wang
2026-03-26  8:37     ` Kairui Song
2026-03-17 19:09 ` [PATCH 4/8] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
2026-03-20 20:57   ` Axel Rasmussen
2026-03-22 16:20     ` Kairui Song
2026-03-24  7:22       ` Chen Ridong
2026-03-24  8:05         ` Kairui Song
2026-03-24  9:10           ` Chen Ridong
2026-03-24  9:29             ` Kairui Song
2026-03-17 19:09 ` [PATCH 5/8] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
2026-03-20 20:58   ` Axel Rasmussen
2026-03-24  7:51   ` Chen Ridong
2026-03-17 19:09 ` [PATCH 6/8] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-03-20 21:18   ` Axel Rasmussen
2026-03-22 16:22     ` Kairui Song
2026-03-24  8:57   ` Chen Ridong
2026-03-24 11:09     ` Kairui Song
2026-03-26  7:56   ` Baolin Wang
2026-03-17 19:09 ` [PATCH 8/8] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
2026-03-20 21:19   ` Axel Rasmussen
2026-03-25  4:49 ` [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Eric Naim
2026-03-25  5:47   ` Kairui Song
2026-03-25  9:26     ` Eric Naim
2026-03-25  9:47       ` Kairui Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox