From: Kairui Song <ryncsn@gmail.com>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Yu Zhao <yuzhao@google.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.com>, Hugh Dickins <hughd@google.com>,
Nhat Pham <nphamcs@gmail.com>, Yuanchu Xie <yuanchu@google.com>,
Suren Baghdasaryan <surenb@google.com>,
"T . J . Mercier" <tjmercier@google.com>,
Kairui Song <kasong@tencent.com>
Subject: [RFC PATCH 0/4] Refault distance checking for MGLRU
Date: Wed, 26 Jul 2023 02:57:28 +0800
Message-ID: <20230725185733.43929-1-ryncsn@gmail.com>
From: Kairui Song <kasong@tencent.com>
Hi, linux-mm
I noticed MGLRU not working very well on certain workloads, observed
on some instances on heavily stressed machines.

I found this is related to refault distance detection: when the file
page workingset size exceeds total memory, and the access distance of
file pages (how far a page shifts left before it gets activated,
considering the LRU starts from the right) is also larger than total
memory, all file pages get stuck in the oldest generation and are read
in then evicted repeatedly, while few get activated and stay in memory.
This series tries to fix the problem by reworking the refault distance
detection to better fit MGLRU, and to use a unified algorithm for both
MGLRU and the Inactive/Active LRU.
Patch 1/4 reworks the refault distance detection model for the
Inactive/Active LRU.

Patch 2/4 and 3/4 are simplifications and preparation.

Patch 4/4 applies the modified refault distance detection to MGLRU;
a rough sketch of the idea follows below.
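The sketch below is only my shorthand for the direction of patch 4/4,
not the actual diff (generation bookkeeping, tiers and locking are all
simplified away, and gen_size[] is a made-up stand-in for per-generation
page counts): a refaulting page is protected by the youngest run of
generations whose combined size covers its refault distance.

    /* build: cc -std=c99 -o gen gen.c && ./gen */
    #include <stdio.h>

    #define MAX_NR_GENS 4

    /*
     * Walk generations from youngest (0) to oldest, accumulating their
     * sizes, and stop at the first point where the accumulated size
     * covers the page's refault distance.  A distance larger than all
     * generations combined means the page cannot stay resident anyway,
     * so it is left in the oldest generation.
     */
    static int target_gen(unsigned long refault_distance,
    		      const unsigned long gen_size[MAX_NR_GENS])
    {
    	unsigned long covered = 0;

    	for (int gen = 0; gen < MAX_NR_GENS; gen++) {
    		covered += gen_size[gen];
    		if (refault_distance <= covered)
    			return gen;
    	}
    	return MAX_NR_GENS - 1;	/* oldest: not worth protecting */
    }

    int main(void)
    {
    	const unsigned long gen_size[MAX_NR_GENS] = {
    		1000, 1000, 1000, 1000
    	};

    	printf("distance  500 -> gen %d\n", target_gen(500, gen_size));
    	printf("distance 2500 -> gen %d\n", target_gen(2500, gen_size));
    	printf("distance 9000 -> gen %d\n", target_gen(9000, gen_size));
    	return 0;
    }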
The following benchmark shows a 5x improvement.

To simulate the workload, I set up a 3-replica MongoDB cluster using
docker, each replica in a standalone cgroup, configured with 5 GB of
cache and 10 GB of oplog, on a 32 GB VM. The benchmark is done with
https://github.com/apavlo/py-tpcc.git, modified to run only the
STOCK_LEVEL query, to simulate slow queries and get a stable result.
Before the patch (with 10 GB of swap; the result is the same whether
swap is on or not):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 904 seconds
------------------------------------------------------------------
                  Executed        Time (µs)        Rate
  STOCK_LEVEL          503    27150226136.4        0.02 txn/s
------------------------------------------------------------------
  TOTAL                503    27150226136.4        0.02 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 53391
workingset_refault_anon 0
workingset_refault_file 23856735
workingset_activate_anon 0
workingset_activate_file 23845737
workingset_restore_anon 0
workingset_restore_file 18280692
workingset_nodereclaim 1024
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        6752         379          23       24706       24607
Swap:         10239           0       10239
After the patch (with 10 GB of swap on the same disk; similar results
with ZRAM):
$ tpcc.py --config=mongodb.config mongodb --duration=900 --warehouses=500 --clients=30
==================================================================
Execution Results after 903 seconds
------------------------------------------------------------------
                  Executed        Time (µs)        Rate
  STOCK_LEVEL         2575    27094953498.8        0.10 txn/s
------------------------------------------------------------------
  TOTAL               2575    27094953498.8        0.10 txn/s
$ cat /proc/vmstat | grep working
workingset_nodes 78249
workingset_refault_anon 10139
workingset_refault_file 23001863
workingset_activate_anon 7238
workingset_activate_file 6718032
workingset_restore_anon 7432
workingset_restore_file 6719406
workingset_nodereclaim 9747
$ free -m
              total        used        free      shared  buff/cache   available
Mem:          31837        7376         320           3       24140       24014
Swap:         10239        1662        8577
Performance is 5x better than before, and idle anon pages can now get
swapped out as expected. Testing with lower stress also shows an
improvement.
I also ran the memtier/memcached and fio benchmarks, using a setup
similar to the one in commit ac35a4902374 but scaled down to fit my
test environment:
memtier test (with 16G ramdisk as swap and 2G cgroup limit):
memcached -u nobody -m 16384 -s /tmp/memcached.socket -a 0766 \
-t 12 -B binary &
memtier_benchmark -S /tmp/memcached.socket -P memcache_binary -n allkeys \
--key-minimum=1 --key-maximum=24000000 --key-pattern=P:P -c 1 \
-t 12 --ratio 1:0 --pipeline 8 -d 2000 -x 6
fio test (with 16G ramdisk on /mnt and 4G cgroup limit):
fio -name=refault --numjobs=12 --directory=/mnt --size=1024m \
--buffered=1 --ioengine=io_uring --iodepth=128 \
--iodepth_batch_submit=32 --iodepth_batch_complete=32 \
--rw=randread --random_distribution=random --norandommap \
--time_based --ramp_time=5m --runtime=5m --group_reporting
Before this patch:
memcached:
             Ops/sec   Hits/sec  Misses/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency     KB/sec
Best        52832.79       0.00        0.00       1.82042      1.70300      4.54300        6.27100  105641.69
Worst       46613.56       0.00        0.00       2.05686      1.77500      7.80700       11.83900   93206.05
Avg (6x)    51024.85       0.00        0.00       1.88506      1.73500      5.43900        9.47100  102026.64
fio:
read: IOPS=2211k, BW=8637MiB/s (9056MB/s)(2530GiB/300001msec)
After this patch:
memcached:
             Ops/sec  Avg. Latency  p50 Latency  p99 Latency  p99.9 Latency     KB/sec
Best        54218.92       1.76930      1.65500      4.41500        6.27100  108413.34
Worst       47640.13       2.01495      1.74300      7.64700       11.64700   95258.72
Avg (6x)    51408.33       1.86988      1.71900      5.43900        9.34300  102793.42
fio:
read: IOPS=2166k, BW=8462MiB/s (8873MB/s)(2479GiB/300001msec)
memcached looks OK, but there is a 2% performance drop in the fio
test. After some profiling, this turns out to be mainly caused by the
extra atomic operations and new function calls; there seems to be no
drop in LRU accuracy.
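To illustrate where that overhead comes from (a standalone sketch, not
the kernel code: patch 3/4 converts avg_total and avg_refaulted to
atomics, turning plain increments into atomic read-modify-write
operations on the hot path):

    /* build: cc -std=c11 -o ctr ctr.c && ./ctr */
    #include <stdatomic.h>
    #include <stdio.h>

    static unsigned long plain_ctr;	/* needs a lock held to be safe */
    static atomic_ulong atomic_ctr;	/* safe lockless, but each update
    					   is a more expensive RMW */

    int main(void)
    {
    	plain_ctr++;			/* a single add instruction */
    	/* a locked read-modify-write on x86, even when relaxed */
    	atomic_fetch_add_explicit(&atomic_ctr, 1, memory_order_relaxed);

    	printf("%lu %lu\n", plain_ctr,
    	       (unsigned long)atomic_load(&atomic_ctr));
    	return 0;
    }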
I'm sending this as an RFC as I'm not entirely sure this is the right
way to fix the issue, or whether this is a generic issue or should be
considered more of a misconfiguration. Any suggestions on how I should
test it are welcome.
Signed-off-by: Kairui Song <kasong@tencent.com>
Kairui Song (4):
workingset: simplify and use a more intuitive model
workingset: simplify lru_gen_test_recent
lru_gen: convert avg_total and avg_refaulted to atomic
workingset, lru_gen: apply refault-distance based re-activation
include/linux/mmzone.h | 4 +-
include/linux/swap.h | 2 -
mm/swap.c | 1 -
mm/vmscan.c | 18 ++-
mm/workingset.c | 315 ++++++++++++++++++++++-------------------
5 files changed, 179 insertions(+), 161 deletions(-)
--
2.41.0