From: Eric Naim <dnaim@cachyos.org>
To: kasong@tencent.com, linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
Johannes Weiner <hannes@cmpxchg.org>,
David Hildenbrand <david@kernel.org>,
Michal Hocko <mhocko@kernel.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Shakeel Butt <shakeel.butt@linux.dev>,
Lorenzo Stoakes <ljs@kernel.org>, Barry Song <baohua@kernel.org>,
David Stevens <stevensd@google.com>,
Chen Ridong <chenridong@huaweicloud.com>,
Leno Hou <lenohou@gmail.com>, Yafang Shao <laoar.shao@gmail.com>,
Yu Zhao <yuzhao@google.com>, Zicheng Wang <wangzicheng@honor.com>,
Kalesh Singh <kaleshsingh@google.com>,
Suren Baghdasaryan <surenb@google.com>,
Chris Li <chrisl@kernel.org>, Vernon Yang <vernon2gm@gmail.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
Date: Wed, 25 Mar 2026 04:49:00 +0000 [thread overview]
Message-ID: <7ab8edd7-381f-4db2-9560-b58718669208@cachyos.org> (raw)
In-Reply-To: <20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com>
Hi Kairui,
On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty flush logic. As a result, we see up to a ~50% reduction in file
> faults and a 30% increase in MongoDB throughput with YCSB, with no
> swap involved; other common benchmarks show no regression, LOC is
> reduced, and we see fewer unexpected OOMs in our production
> environment.
>
> Some of the problems were found in our production environment, and
> others are mostly exposed while stress testing the LFU-like design as
> proposed in the LSM/MM/BPF topic this year [1]. This series has no
> direct relationship to that topic, but it cleans up the code base and
> fixes several strange behaviors that make the test result of the
> LFU-like design not as good as expected.
>
> MGLRU's reclaim loop is a bit complex, and hence these problems are
> somehow related to each other. The aging, scan number calculation, and
> reclaim loop are coupled together, and the dirty folio handling logic is
> quite different, making the reclaim loop hard to follow and the dirty
> flush ineffective too.
>
> This series slightly cleans up and improves the reclaim loop by using
> a scan budget: the number of folios to scan is calculated once at the
> beginning of the loop, and aging is decoupled from the reclaim
> calculation helpers. The dirty flush logic is then moved inside the
> reclaim loop so it can kick in more effectively. Together these
> changes improve MGLRU reclaim in several ways.
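[Editor's note: the scan-budget idea described above can be sketched schematically as below. All names and the budget heuristic are hypothetical illustrations, not the actual kernel code.]

```python
# Schematic sketch of a scan-budget reclaim loop: the number of folios
# to scan is fixed up front, then consumed inside the loop, and dirty
# folios have writeback kicked as they are encountered rather than in
# a separate pass.  Hypothetical names; not the actual MGLRU code.

def reclaim(folios, nr_to_reclaim):
    # Budget is computed once, before scanning starts, so events
    # mid-loop can no longer distort how much gets scanned.
    budget = min(len(folios), 4 * nr_to_reclaim)
    reclaimed, dirty_flushed = 0, 0
    for folio in folios:
        if budget == 0 or reclaimed >= nr_to_reclaim:
            break
        budget -= 1
        if folio["dirty"]:
            dirty_flushed += 1   # kick writeback inside the loop
            folio["dirty"] = False
            continue             # revisit the folio once it is clean
        reclaimed += 1
    return reclaimed, dirty_flushed
```

The point of the sketch is only the shape: the budget is decided before the loop begins, and dirty handling happens inline instead of as a quite different code path.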
>
> Test results: All tests were done on a 48c96t NUMA machine with 2
> nodes and 128G of memory, using NVME as storage.
>
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000 operationcount:6000000,
> threads:32), which does 95% read and 5% update to generate mixed read
> and dirty writeback. MongoDB is set up in a 10G cgroup using Docker, and
> the WiredTiger cache size is set to 4.5G, using NVME as storage.
>
> Not using SWAP.
>
> Median of 3 test runs; results are stable.
>
> Before:
> Throughput(ops/sec): 61642.78008938203
> AverageLatency(us): 507.11127774145166
> pgpgin 158190589
> pgpgout 5880616
> workingset_refault 7262988
>
> After:
> Throughput(ops/sec): 80216.04855744806 (+30.1%, higher is better)
> AverageLatency(us): 388.17633477268913 (-23.5%, lower is better)
> pgpgin 101871227 (-35.6%, lower is better)
> pgpgout 5770028
> workingset_refault 3418186 (-52.9%, lower is better)
>
> We can see a significant performance improvement after this series for
> file-cache-heavy workloads like this. The test was done on NVME, and
> the performance gap would be even larger for slow devices: we observed
> a >100% gain for some other workloads running on HDD devices.
>
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on a x86_64 NUMA machine with 2
> nodes and 128G memory, using 256G ZRAM as swap and spawn 32 memcg 64
> workers:
>
> Before:
> Total requests: 77920
> Per-worker 95% CI (mean): [1199.9, 1235.1]
> Per-worker stdev: 70.5
> Jain's fairness: 0.996706 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 25649 32.92% 32.92%
> [1,2)s 7759 9.96% 42.87%
> [2,4)s 5156 6.62% 49.49%
> [4,8)s 39356 50.51% 100.00%
>
> After:
> Total requests: 79564
> Per-worker 95% CI (mean): [1224.2, 1262.2]
> Per-worker stdev: 76.1
> Jain's fairness: 0.996328 (1.0 = perfectly fair)
> Latency:
> Bucket Count Pct Cumul
> [0,1)s 25485 32.03% 32.03%
> [1,2)s 8661 10.89% 42.92%
> [2,4)s 6268 7.88% 50.79%
> [4,8)s 39150 49.21% 100.00%
>
> The results look nearly identical: reclaim is still fair and
> effective, and the total request count is slightly better.
>
> OOM issue [4]
> =============
> Testing with a specific reproducer [4] to simulate what we encountered
> in our production environment. Still using the same test machine, but
> one node is used as a pmem ramdisk following the steps in the
> reproducer; no swap is used.
>
> This reproducer spawns multiple workers that keep reading the given file
> using mmap, and pauses for 120ms after one file read batch. It also
> spawns another set of workers that keep allocating and freeing a
> given size of anonymous memory. The total memory size exceeds the
> memory limit (e.g. 44G anon + 8G file, which is 52G vs a 48G memcg limit).
> But by evicting the file cache, the workload should hold just fine,
> especially given that the file worker pauses after every batch, allowing
> other workers to catch up.
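[Editor's note: for illustration, here is a heavily scaled-down sketch of that reproducer pattern. This is not the actual reproducer from [4]; sizes, worker counts, and names here are symbolic so it runs anywhere.]

```python
# Scaled-down sketch of the reproducer pattern described above: a file
# worker mmap-reads a file in batches and pauses ~120ms after each
# batch, while an anon worker repeatedly allocates, faults in, and
# frees a chunk of memory.  The real reproducer uses ~8G of file plus
# ~44G of anon against a 48G memcg limit; sizes here are tiny.
import mmap
import os
import tempfile
import threading
import time

def file_worker(path, batches, stop):
    fd = os.open(path, os.O_RDONLY)
    size = os.fstat(fd).st_size
    touched = 0
    with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as m:
        for _ in range(batches):
            if stop.is_set():
                break
            for off in range(0, size, mmap.PAGESIZE):
                touched += m[off]   # touch every page of the file
            time.sleep(0.12)        # pause 120ms after each batch
    os.close(fd)
    return touched

def anon_worker(chunk_size, iterations, stop):
    for _ in range(iterations):
        if stop.is_set():
            break
        buf = bytearray(chunk_size)             # allocate anon memory
        for off in range(0, chunk_size, mmap.PAGESIZE):
            buf[off] = 1                        # fault every page in
        del buf                                 # ... and free it again

if __name__ == "__main__":
    stop = threading.Event()
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"\x01" * (1 << 20))  # 1 MiB stand-in for the 8G file
        f.flush()
        t1 = threading.Thread(target=file_worker, args=(f.name, 3, stop))
        t2 = threading.Thread(target=anon_worker, args=(1 << 20, 3, stop))
        t1.start(); t2.start()
        t1.join(); t2.join()
```

To exercise actual reclaim, the workers would run inside a memcg whose limit is below the combined working-set size, as described above.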
>
> - MGLRU disabled:
> Finished 128 iterations.
>
> - MGLRU enabled:
> Hung or OOM with following info after about ~10-20 iterations:
>
> [ 357.332946] file_anon_mix_pressure invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> ... <snip> ...
> [ 357.333827] memory: usage 50331648kB, limit 50331648kB, failcnt 90907
> [ 357.347728] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
> [ 357.348192] Memory cgroup stats for /demo:
> [ 357.348314] anon 46724382720
> [ 357.348963] file 4160753664
>
> OOM occurs even though there are still evictable file folios.
>
> - MGLRU enabled after this series:
> Finished 128 iterations.
>
> With aging blocking reclaim, the OOM is much more likely to occur.
> This issue is mostly fixed by patch 6 and the result is much better,
> but this series is still only the first step toward improving file
> folio reclaim for MGLRU, as there are still cases where file folios
> can't be effectively reclaimed.
>
> MySQL:
> ======
>
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
>
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
> --tables=48 --table-size=2000000 --threads=96 --time=600 run
>
> Before: 22343.701667 tps
> After patch 4: 22327.325000 tps
> After patch 5: 22373.224000 tps
> After patch 6: 22321.174000 tps
> After patch 7: 22625.961667 tps (+1.26%, higher is better)
>
> MySQL is anon-folio heavy but still looks good. Only noise-level
> changes, no regression.
>
> FIO:
> ====
> Testing with the following command, where /mnt is an EXT4 ramdisk, 6
> test runs each in a 10G memcg:
>
> fio -name=cached --numjobs=16 --filename=/mnt/test.img --buffered=1 \
> --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
> --iodepth_batch_complete=32 --rw=randread \
> --random_distribution=zipf:1.2 --norandommap --time_based \
> --ramp_time=1m --runtime=10m --group_reporting
>
> Before: 32039.56 MB/s
> After patch 3: 32751.50 MB/s
> After patch 4: 32703.03 MB/s
> After patch 5: 33395.52 MB/s
> After patch 6: 32031.51 MB/s
> After patch 7: 32534.29 MB/s
>
> Again, only noise-level changes and no regression.
>
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 8 test run each.
>
> Before: 2881.41s
> After patch 3: 2894.09s
> After patch 4: 2846.73s
> After patch 5: 2847.91s
> After patch 6: 2835.17s
> After patch 7: 2842.90s
>
> Again, only noise-level changes: no regression, or very slightly better.
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
> Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
> Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
> Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
I applied this patch set to 7.0-rc5 and noticed the system locking up
when running the test below:
fallocate -l 5G 5G
while true; do tail /dev/zero; done
while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done
After reading [1], I suspect this was because the system was using zram
as swap, and indeed, with zram disabled the lockup does not occur. Is
there anything I (CachyOS) can do to help debug this regression, if it
is to be considered one? According to [1], zram as swap seems to be
unsupported upstream. (The user who tested this wasn't able to capture
a useful kernel trace; the only thing left was a trace of the OOM
killer firing.)
[1] https://chrisdown.name/2026/03/24/zswap-vs-zram-when-to-use-what.html
--
Regards,
Eric
Thread overview: 44+ messages
2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
2026-03-17 19:55 ` Yuanchu Xie
2026-03-18 9:42 ` Barry Song
2026-03-18 9:57 ` Kairui Song
2026-03-19 1:40 ` Chen Ridong
2026-03-20 19:51 ` Axel Rasmussen
2026-03-22 16:10 ` Kairui Song
2026-03-26 6:25 ` Baolin Wang
2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
2026-03-19 2:00 ` Chen Ridong
2026-03-19 4:12 ` Kairui Song
2026-03-20 21:00 ` Axel Rasmussen
2026-03-22 8:14 ` Barry Song
2026-03-24 6:05 ` Kairui Song
2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
2026-03-20 20:09 ` Axel Rasmussen
2026-03-22 16:11 ` Kairui Song
2026-03-24 6:41 ` Chen Ridong
2026-03-26 7:31 ` Baolin Wang
2026-03-26 8:37 ` Kairui Song
2026-03-17 19:09 ` [PATCH 4/8] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
2026-03-20 20:57 ` Axel Rasmussen
2026-03-22 16:20 ` Kairui Song
2026-03-24 7:22 ` Chen Ridong
2026-03-24 8:05 ` Kairui Song
2026-03-24 9:10 ` Chen Ridong
2026-03-24 9:29 ` Kairui Song
2026-03-17 19:09 ` [PATCH 5/8] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
2026-03-20 20:58 ` Axel Rasmussen
2026-03-24 7:51 ` Chen Ridong
2026-03-17 19:09 ` [PATCH 6/8] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-03-20 21:18 ` Axel Rasmussen
2026-03-22 16:22 ` Kairui Song
2026-03-24 8:57 ` Chen Ridong
2026-03-24 11:09 ` Kairui Song
2026-03-26 7:56 ` Baolin Wang
2026-03-17 19:09 ` [PATCH 8/8] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
2026-03-20 21:19 ` Axel Rasmussen
2026-03-25 4:49 ` Eric Naim [this message]
2026-03-25 5:47 ` [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song
2026-03-25 9:26 ` Eric Naim
2026-03-25 9:47 ` Kairui Song