public inbox for linux-mm@kvack.org
From: Eric Naim <dnaim@cachyos.org>
To: kasong@tencent.com, linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	David Hildenbrand <david@kernel.org>,
	Michal Hocko <mhocko@kernel.org>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Shakeel Butt <shakeel.butt@linux.dev>,
	Lorenzo Stoakes <ljs@kernel.org>, Barry Song <baohua@kernel.org>,
	David Stevens <stevensd@google.com>,
	Chen Ridong <chenridong@huaweicloud.com>,
	Leno Hou <lenohou@gmail.com>, Yafang Shao <laoar.shao@gmail.com>,
	Yu Zhao <yuzhao@google.com>, Zicheng Wang <wangzicheng@honor.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Chris Li <chrisl@kernel.org>, Vernon Yang <vernon2gm@gmail.com>,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
Date: Wed, 25 Mar 2026 04:49:00 +0000	[thread overview]
Message-ID: <7ab8edd7-381f-4db2-9560-b58718669208@cachyos.org> (raw)
In-Reply-To: <20260318-mglru-reclaim-v1-0-2c46f9eb0508@tencent.com>

Hi Kairui,

On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
> This series cleans up and slightly improves MGLRU's reclaim loop and
> dirty flush logic. As a result, we see up to a ~50% reduction in file
> faults and a 30% increase in MongoDB throughput with YCSB and no swap
> involved; other common benchmarks show no regression, LOC is reduced,
> and we see fewer unexpected OOMs in our production environment.
> 
> Some of the problems were found in our production environment, and
> others were mostly exposed while stress-testing the LFU-like design
> proposed in the LSF/MM/BPF topic this year [1]. This series has no
> direct relationship to that topic, but it cleans up the code base and
> fixes several strange behaviors that made the test results of the
> LFU-like design worse than expected.
> 
> MGLRU's reclaim loop is a bit complex, so these problems are related
> to one another. The aging, scan-number calculation, and reclaim loop
> are coupled together, and the dirty folio handling logic differs
> significantly, making the reclaim loop hard to follow and the dirty
> flush ineffective.
> 
> This series cleans up and slightly improves the reclaim loop by
> introducing a scan budget: the number of folios to scan is calculated
> at the beginning of the loop, and aging is decoupled from the reclaim
> calculation helpers. The dirty flush logic is then moved inside the
> reclaim loop so it can kick in more effectively.
> 
> Test results: All tests were done on a 48c96t NUMA machine with 2
> nodes and 128G of memory, using NVMe as storage.
> 
> MongoDB
> =======
> Running YCSB workloadb [2] (recordcount:20000000, operationcount:6000000,
> threads:32), which does 95% reads and 5% updates to generate mixed read
> and dirty writeback traffic. MongoDB is set up in a 10G cgroup using
> Docker, with the WiredTiger cache size set to 4.5G, on NVMe storage.
> 
> Not using SWAP.
> 
> Median of 3 test runs; results are stable.
> 
> Before:
> Throughput(ops/sec): 61642.78008938203
> AverageLatency(us):  507.11127774145166
> pgpgin 158190589
> pgpgout 5880616
> workingset_refault 7262988
> 
> After:
> Throughput(ops/sec): 80216.04855744806  (+30.1%, higher is better)
> AverageLatency(us):  388.17633477268913 (-23.5%, lower is better)
> pgpgin 101871227                        (-35.6%, lower is better)
> pgpgout 5770028
> workingset_refault 3418186              (-52.9%, lower is better)
> 
> We can see a significant performance improvement after this series for
> file-cache-heavy workloads like this one. The test was done on NVMe;
> the performance gap would be even larger for slow devices, and we
> observed a >100% gain for some other workloads running on HDDs.
> 
> Chrome & Node.js [3]
> ====================
> Using Yu Zhao's test script [3], testing on an x86_64 NUMA machine
> with 2 nodes and 128G of memory, using 256G of ZRAM as swap and
> spawning 64 workers across 32 memcgs:
> 
> Before:
> Total requests:            77920
> Per-worker 95% CI (mean):  [1199.9, 1235.1]
> Per-worker stdev:          70.5
> Jain's fairness:           0.996706 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     25649   32.92%   32.92%
> [1,2)s      7759    9.96%   42.87%
> [2,4)s      5156    6.62%   49.49%
> [4,8)s     39356   50.51%  100.00%
> 
> After:
> Total requests:            79564
> Per-worker 95% CI (mean):  [1224.2, 1262.2]
> Per-worker stdev:          76.1
> Jain's fairness:           0.996328 (1.0 = perfectly fair)
> Latency:
> Bucket     Count      Pct    Cumul
> [0,1)s     25485   32.03%   32.03%
> [1,2)s      8661   10.89%   42.92%
> [2,4)s      6268    7.88%   50.79%
> [4,8)s     39150   49.21%  100.00%
> 
> Results look nearly identical: reclaim is still fair and effective,
> and the total request count is slightly better.
> 
> OOM issue [4]
> =============
> Testing with a specific reproducer [4] to simulate what we encountered
> in our production environment. Still using the same test machine, but
> one node is used as a pmem ramdisk following the steps in the
> reproducer; no swap is used.
> 
> This reproducer spawns multiple workers that keep reading the given file
> using mmap, and pauses for 120ms after one file read batch. It also
> spawns another set of workers that keep allocating and freeing a
> given size of anonymous memory. The total memory size exceeds the
> memory limit (e.g. 44G anon + 8G file, i.e. 52G vs a 48G memcg limit).
> But by evicting the file cache, the workload should hold just fine,
> especially given that the file worker pauses after every batch, allowing
> other workers to catch up.
> 
> - MGLRU disabled:
>   Finished 128 iterations.
> 
> - MGLRU enabled:
>   Hung or OOM with following info after about ~10-20 iterations:
> 
>     [  357.332946] file_anon_mix_pressure invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>     ... <snip> ...
>     [  357.333827] memory: usage 50331648kB, limit 50331648kB, failcnt 90907
>     [  357.347728] swap: usage 0kB, limit 9007199254740988kB, failcnt 0
>     [  357.348192] Memory cgroup stats for /demo:
>     [  357.348314] anon 46724382720
>     [  357.348963] file 4160753664
> 
>   OOM occurs despite there still being evictable file folios.
> 
> - MGLRU enabled after this series:
>   Finished 128 iterations.
> 
> With aging blocking reclaim, the OOM is much more likely to occur.
> This issue is mostly fixed by patch 6 and the result is much better,
> but this series is only a first step toward improving file folio
> reclaim for MGLRU, as there are still cases where file folios can't be
> effectively reclaimed.
> 
> MySQL:
> ======
> 
> Testing with innodb_buffer_pool_size=26106127360, in a 2G memcg, using
> ZRAM as swap and test command:
> 
> sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-db=sb \
>   --tables=48 --table-size=2000000 --threads=96 --time=600 run
> 
> Before:         22343.701667 tps
> After patch 4:  22327.325000 tps
> After patch 5:  22373.224000 tps
> After patch 6:  22321.174000 tps
> After patch 7:  22625.961667 tps (+1.26%, higher is better)
> 
> MySQL is anon-folio heavy but still looks good. These appear to be
> only noise-level changes; no regression.
> 
> FIO:
> ====
> Testing with the following command, where /mnt is an EXT4 ramdisk, 6
> test runs each in a 10G memcg:
> 
> fio -name=cached --numjobs=16 --filename=/mnt/test.img --buffered=1 \
>   --ioengine=io_uring --iodepth=128 --iodepth_batch_submit=32 \
>   --iodepth_batch_complete=32 --rw=randread \
>   --random_distribution=zipf:1.2 --norandommap --time_based \
>   --ramp_time=1m --runtime=10m --group_reporting
> 
> Before:        32039.56 MB/s
> After patch 3: 32751.50 MB/s
> After patch 4: 32703.03 MB/s
> After patch 5: 33395.52 MB/s
> After patch 6: 32031.51 MB/s
> After patch 7: 32534.29 MB/s
> 
> Again, only noise-level changes and no regression.
> 
> Build kernel:
> =============
> Build kernel test using ZRAM as swap, on top of tmpfs, in a 3G memcg
> using make -j96 and defconfig, measuring system time, 8 test runs each.
> 
> Before:        2881.41s
> After patch 3: 2894.09s
> After patch 4: 2846.73s
> After patch 5: 2847.91s
> After patch 6: 2835.17s
> After patch 7: 2842.90s
> 
> Again, only noise-level changes; no regression, or very slightly better.
> 
> Link: https://lore.kernel.org/linux-mm/CAMgjq7BoekNjg-Ra3C8M7=8=75su38w=HD782T5E_cxyeCeH_g@mail.gmail.com/ [1]
> Link: https://github.com/brianfrankcooper/YCSB/blob/master/workloads/workloadb [2]
> Link: https://lore.kernel.org/all/20221220214923.1229538-1-yuzhao@google.com/ [3]
> Link: https://github.com/ryncsn/emm-test-project/tree/master/file-anon-mix-pressure [4]
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>

I applied this patch set to 7.0-rc5 and noticed the system locking up when running the test below.

fallocate -l 5G 5G
while true; do tail /dev/zero; done
while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done

After reading [1], I suspected this was because the system was using zram as swap, and indeed the lockup does not occur with zram disabled. Is there anything that I (CachyOS) can do to help debug this regression, if it is even considered one? According to [1], zram as swap seems to be unsupported upstream. (The user who tested this wasn't able to capture a good kernel trace; the only thing left was a trace of the OOM killer firing.)

[1] https://chrisdown.name/2026/03/24/zswap-vs-zram-when-to-use-what.html

-- 
Regards,
  Eric


Thread overview: 44+ messages
2026-03-17 19:08 [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song via B4 Relay
2026-03-17 19:08 ` [PATCH 1/8] mm/mglru: consolidate common code for retrieving evitable size Kairui Song via B4 Relay
2026-03-17 19:55   ` Yuanchu Xie
2026-03-18  9:42   ` Barry Song
2026-03-18  9:57     ` Kairui Song
2026-03-19  1:40   ` Chen Ridong
2026-03-20 19:51     ` Axel Rasmussen
2026-03-22 16:10       ` Kairui Song
2026-03-26  6:25   ` Baolin Wang
2026-03-17 19:08 ` [PATCH 2/8] mm/mglru: relocate the LRU scan batch limit to callers Kairui Song via B4 Relay
2026-03-19  2:00   ` Chen Ridong
2026-03-19  4:12     ` Kairui Song
2026-03-20 21:00   ` Axel Rasmussen
2026-03-22  8:14   ` Barry Song
2026-03-24  6:05     ` Kairui Song
2026-03-17 19:08 ` [PATCH 3/8] mm/mglru: restructure the reclaim loop Kairui Song via B4 Relay
2026-03-20 20:09   ` Axel Rasmussen
2026-03-22 16:11     ` Kairui Song
2026-03-24  6:41   ` Chen Ridong
2026-03-26  7:31   ` Baolin Wang
2026-03-26  8:37     ` Kairui Song
2026-03-17 19:09 ` [PATCH 4/8] mm/mglru: scan and count the exact number of folios Kairui Song via B4 Relay
2026-03-20 20:57   ` Axel Rasmussen
2026-03-22 16:20     ` Kairui Song
2026-03-24  7:22       ` Chen Ridong
2026-03-24  8:05         ` Kairui Song
2026-03-24  9:10           ` Chen Ridong
2026-03-24  9:29             ` Kairui Song
2026-03-17 19:09 ` [PATCH 5/8] mm/mglru: use a smaller batch for reclaim Kairui Song via B4 Relay
2026-03-20 20:58   ` Axel Rasmussen
2026-03-24  7:51   ` Chen Ridong
2026-03-17 19:09 ` [PATCH 6/8] mm/mglru: don't abort scan immediately right after aging Kairui Song via B4 Relay
2026-03-17 19:09 ` [PATCH 7/8] mm/mglru: simplify and improve dirty writeback handling Kairui Song via B4 Relay
2026-03-20 21:18   ` Axel Rasmussen
2026-03-22 16:22     ` Kairui Song
2026-03-24  8:57   ` Chen Ridong
2026-03-24 11:09     ` Kairui Song
2026-03-26  7:56   ` Baolin Wang
2026-03-17 19:09 ` [PATCH 8/8] mm/vmscan: remove sc->file_taken Kairui Song via B4 Relay
2026-03-20 21:19   ` Axel Rasmussen
2026-03-25  4:49 ` Eric Naim [this message]
2026-03-25  5:47   ` [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling Kairui Song
2026-03-25  9:26     ` Eric Naim
2026-03-25  9:47       ` Kairui Song
