* [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I)
@ 2025-08-22 19:20 Kairui Song
From: Kairui Song @ 2025-08-22 19:20 UTC
  To: linux-mm
  Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
	Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
	Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
	Zi Yan, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This is the first phase of the bigger series implementing the basic
infrastructure for the Swap Table idea proposed at the LSF/MM/BPF
topic "Integrate swap cache, swap maps with swap allocator" [1].

Phase I contains 9 patches. It introduces the swap table infrastructure
and uses it as the swap cache backend. By doing so, we see performance
gains of up to ~5-20% in throughput, RPS or build time for benchmark and
workload tests. This is based on Chris Li's idea of using cluster-sized
atomic arrays to implement the swap cache, which reduces contention on
swap cache access. The cluster size is much finer-grained than the 64M
address space split, which is removed in this phase I. The series also
unifies and cleans up the swap code base.
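
To give a rough sense of the granularity difference, the arithmetic
below compares the number of swap slots covered by one 64M address space
with the number covered by one cluster. The 4 KiB page size and the
512-slot cluster are assumptions made for illustration only; neither
value is stated in this cover letter, and all names are hypothetical.

```c
#include <assert.h>

/* Assumed values for illustration; none are stated in this cover letter. */
#define ASSUMED_PAGE_SIZE     4096UL         /* 4 KiB base pages */
#define ASSUMED_CLUSTER_SLOTS 512UL          /* swap slots per cluster */
#define AS_SPLIT_BYTES        (64UL << 20)   /* the 64M address space split */

/* Swap slots covered by a region of `bytes`, one slot per page. */
static unsigned long slots_covered(unsigned long bytes, unsigned long page_size)
{
	assert(page_size != 0);
	return bytes / page_size;
}

/* How many clusters fit in the span one 64M address space covered. */
static unsigned long clusters_per_split(void)
{
	return slots_covered(AS_SPLIT_BYTES, ASSUMED_PAGE_SIZE) /
	       ASSUMED_CLUSTER_SLOTS;
}
```

Under these assumptions, one 64M address space spans 16384 slots while
one cluster spans 512, so the same range is divided into 32 independent
units instead of one.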

Each swap cluster dynamically allocates its swap table, which is an
atomic array covering every swap slot in the cluster. It replaces the
XArray-backed swap cache. In phase I, the statically allocated swap_map
still co-exists with the swap table. Memory usage is about the same as
the original on average; a few exceptional test cases show about 1%
higher memory usage. In the following phases of the series, swap_map will
be merged into the swap table without additional memory allocation,
resulting in a net memory reduction compared to the original swap cache.
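
The idea of a per-cluster, dynamically allocated atomic array can be
sketched in userspace C11 as follows. This is only an illustration of
the concept, not the code in this series: every name (struct cluster,
cluster_table_get() and so on), the 512-slot cluster size, and the use
of a compare-and-swap to publish the table are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* All names here are hypothetical; this is userspace C11, not kernel code. */
#define CLUSTER_SLOTS 512	/* assumed number of swap slots per cluster */

struct cluster {
	/* NULL until the first entry is stored in this cluster. */
	_Atomic(atomic_uintptr_t *) table;
};

static void cluster_init(struct cluster *ci)
{
	atomic_init(&ci->table, (atomic_uintptr_t *)NULL);
}

/* Return the cluster's table, allocating it on first use. */
static atomic_uintptr_t *cluster_table_get(struct cluster *ci)
{
	atomic_uintptr_t *t = atomic_load(&ci->table);
	if (t)
		return t;

	atomic_uintptr_t *fresh = malloc(CLUSTER_SLOTS * sizeof(*fresh));
	if (!fresh)
		return NULL;
	/* An entry of 0 means "nothing cached for this slot". */
	for (size_t i = 0; i < CLUSTER_SLOTS; i++)
		atomic_init(&fresh[i], (uintptr_t)0);

	/* Publish the table; if another thread won the race, use theirs. */
	if (atomic_compare_exchange_strong(&ci->table, &t, fresh))
		return fresh;
	free(fresh);
	return t;
}

/* Store an entry (e.g. a pointer value) for slot `off`; 0 on success. */
static int cluster_table_set(struct cluster *ci, unsigned int off,
			     uintptr_t entry)
{
	assert(off < CLUSTER_SLOTS);
	atomic_uintptr_t *t = cluster_table_get(ci);
	if (!t)
		return -1;
	atomic_store(&t[off], entry);
	return 0;
}

/* Look up slot `off` without taking a lock; 0 means nothing cached. */
static uintptr_t cluster_table_lookup(struct cluster *ci, unsigned int off)
{
	assert(off < CLUSTER_SLOTS);
	atomic_uintptr_t *t = atomic_load(&ci->table);
	return t ? atomic_load(&t[off]) : 0;
}
```

In this sketch a cluster that never holds a cached entry never pays for
a table, and a lookup touches only the one cluster's array rather than
a structure shared across a larger range.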

Testing has shown that phase I brings a significant performance
improvement in many practical workloads, on machines ranging from an
8c/1G ARM machine to 48c96t/128G x86_64 servers.

The full picture with a summary can be found at [2]. An older, larger
series of 28 patches was posted at [3].

vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap)
                Before:         After:
System time:    220.86s         160.42s      (-27.36%)
Throughput:     4775.18 MB/s    6381.43 MB/s (+33.63%)
Free latency:   174492 us       132122 us    (-24.28%)

usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
PMEM as swap)
                Before:         After:
System time:    355.23s         295.28s      (-16.87%)
Throughput:     4659.89 MB/s    5765.80 MB/s (+23.73%)
Free latency:   500417 us       477098 us     (-4.66%)

This shows an improvement of more than 20% in most readings.
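
For reference, the percentages in these tables are consistent with a
plain relative change against the "Before" value. A minimal sketch,
where rel_change_pct() is a hypothetical helper and not part of any
test tool used here:

```c
#include <assert.h>

/* Relative change from `before` to `after`, in percent; negative = drop. */
static double rel_change_pct(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For example, rel_change_pct(220.86, 160.42) evaluates to approximately
-27.37, in line with the System time row of the first run.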

Build kernel test:
==================
Building kernel with defconfig on tmpfs with ZSWAP / ZRAM is looking
good. The results below show a test matrix using different memory
pressure and setups. Tests are done with shmem as filesystem and
using the same build config, measuring sys and real time in seconds
(user time is almost identical as expected):

 -j<NR> / Mem  | Sys before / after  | Real before / after
Using 16G ZRAM with memcg limit:
     12 / 256M | 6475 / 6232  -3.75% | 814 / 793   -2.58%
     24 / 384M | 5904 / 5560  -5.82% | 413 / 397   -3.87%
     48 / 768M | 4762 / 4242  -10.9% | 187 / 179   -4.27%
With 64k folio:
     24 / 512M | 4196 / 4062  -3.19% | 325 / 319   -1.84%
     48 / 1G   | 3622 / 3544  -2.15% | 148 / 146   -1.37%
With ZSWAP with 3G memcg (using higher limit due to kmem account):
     48 / 3G   |  605 /  571  -5.61% |  81 /  79   -2.47%

For extremely high global memory pressure, using ZSWAP with 32G NVMEs
in a 48c VM that has 4G memory globally and no memcg limit (system
components take up about 1.5G, so the pressure is high), with make -j48:

Before:  sys time: 2061.72s            real time: 135.61s
After:   sys time: 1990.96s (-3.43%)   real time: 134.03s (-1.16%)

All cases are faster, with no regression even under heavy global
memory pressure.

Redis / Valkey bench:
=====================
The test machine is an ARM64 VM with 1.5G memory, and Redis is set to
use 2.5G memory:

Testing with:
redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get

                no BGSAVE                with BGSAVE
Before:         433015.08 RPS            271421.15 RPS
After:          431537.61 RPS (-0.34%)   290441.79 RPS (+7.0%)

Testing with:
redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get
                no BGSAVE                with BGSAVE
Before:         446339.45 RPS            274845.19 RPS
After:          442697.29 RPS (-0.81%)   293053.59 RPS (+6.6%)

With BGSAVE enabled, most Redis memory will have a swap count > 1 so
swap cache is heavily in use. We can see a >5% performance gain. The
no-BGSAVE case is very slightly slower (<1%) due to the higher memory
pressure from the co-existence of swap_map and swap table. This will be
optimized into a net gain, with up to a 20% gain in the BGSAVE case, in
the following phases.

Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]

Kairui Song (9):
  mm, swap: use unified helper for swap cache look up
  mm, swap: always lock and check the swap cache folio before use
  mm, swap: rename and move some swap cluster definition and helpers
  mm, swap: tidy up swap device and cluster info helpers
  mm/shmem, swap: remove redundant error handling for replacing folio
  mm, swap: use the swap table for the swap cache and switch API
  mm, swap: remove contention workaround for swap cache
  mm, swap: implement dynamic allocation of swap table
  mm, swap: use a single page for swap table when the size fits

 MAINTAINERS          |   1 +
 include/linux/swap.h |  42 ----
 mm/filemap.c         |   2 +-
 mm/huge_memory.c     |  16 +-
 mm/memory-failure.c  |   2 +-
 mm/memory.c          |  30 +--
 mm/migrate.c         |  28 +--
 mm/mincore.c         |   3 +-
 mm/page_io.c         |  12 +-
 mm/shmem.c           |  56 ++----
 mm/swap.h            | 268 +++++++++++++++++++++----
 mm/swap_state.c      | 404 +++++++++++++++++++-------------------
 mm/swap_table.h      | 136 +++++++++++++
 mm/swapfile.c        | 456 ++++++++++++++++++++++++++++---------------
 mm/userfaultfd.c     |   5 +-
 mm/vmscan.c          |  20 +-
 mm/zswap.c           |   9 +-
 17 files changed, 954 insertions(+), 536 deletions(-)
 create mode 100644 mm/swap_table.h

---

I was trying some new tools like b4 for branch management, and it seems
a draft version was sent out by accident, though it seems to have been
rejected. I'm not sure if anyone is seeing a duplicated or malformed
email. If so, please accept my apology and use this series for review,
discussion or merge.

-- 
2.51.0


Thread overview: 94+ messages
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
2025-08-27  2:47   ` Chris Li
2025-08-27  3:50     ` Chris Li
2025-08-27 13:45     ` Kairui Song
2025-08-27  3:52   ` Baoquan He
2025-08-27 13:46     ` Kairui Song
2025-08-28  3:20   ` Baolin Wang
2025-09-01 23:50   ` Barry Song
2025-09-02  6:12     ` Kairui Song
2025-09-02  6:52       ` Chris Li
2025-09-02 10:06   ` David Hildenbrand
2025-09-02 12:32     ` Chris Li
2025-09-02 13:18       ` David Hildenbrand
2025-09-02 16:38     ` Kairui Song
2025-09-02 10:10   ` David Hildenbrand
2025-09-02 17:13     ` Kairui Song
2025-09-03  8:00       ` David Hildenbrand
2025-09-03 17:41   ` Nhat Pham
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
2025-08-27  6:13   ` Chris Li
2025-08-27 13:44     ` Kairui Song
2025-08-30  1:42       ` Chris Li
2025-08-27  7:03   ` Chris Li
2025-08-27 14:35     ` Kairui Song
2025-08-28  3:41       ` Baolin Wang
2025-08-28 18:05         ` Kairui Song
2025-08-30  1:53       ` Chris Li
2025-08-30 15:15         ` Kairui Song
2025-08-30 17:17           ` Chris Li
2025-09-01 18:17         ` Kairui Song
2025-09-01 21:10           ` Chris Li
2025-09-02  5:40   ` Barry Song
2025-09-02 10:18   ` David Hildenbrand
2025-09-02 10:21     ` David Hildenbrand
2025-09-02 12:46     ` Chris Li
2025-09-02 13:27       ` Kairui Song
2025-08-22 19:20 ` [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
2025-08-30  2:31   ` Chris Li
2025-09-02  5:53   ` Barry Song
2025-09-02 10:20   ` David Hildenbrand
2025-09-02 12:50     ` Chris Li
2025-08-22 19:20 ` [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-08-27  3:47   ` Baoquan He
2025-08-27 17:44     ` Chris Li
2025-08-27 23:46       ` Baoquan He
2025-08-30  2:38         ` Chris Li
2025-09-02  6:01       ` Barry Song
2025-09-03  9:28       ` David Hildenbrand
2025-09-02  6:02   ` Barry Song
2025-09-02 13:33   ` David Hildenbrand
2025-09-02 15:03     ` Kairui Song
2025-09-03  8:11       ` David Hildenbrand
2025-08-22 19:20 ` [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
2025-08-25  3:02   ` Baolin Wang
2025-08-25  9:45     ` Kairui Song
2025-08-30  2:41       ` Chris Li
2025-09-03  8:25   ` David Hildenbrand
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
2025-08-30  1:54   ` Baoquan He
2025-08-30  3:40     ` Chris Li
2025-08-30  3:34   ` Chris Li
2025-08-30 16:52     ` Kairui Song
2025-08-31  1:00       ` Chris Li
2025-09-02 11:51         ` Kairui Song
2025-09-02  9:55   ` Barry Song
2025-09-02 11:58     ` Kairui Song
2025-09-02 23:44       ` Barry Song
2025-09-03  2:12         ` Kairui Song
2025-09-03  2:31           ` Barry Song
2025-09-03 11:41   ` David Hildenbrand
2025-09-03 12:54     ` Kairui Song
2025-09-04  9:28       ` David Hildenbrand
2025-08-22 19:20 ` [PATCH 7/9] mm, swap: remove contention workaround for swap cache Kairui Song
2025-08-30  4:07   ` Chris Li
2025-08-30 15:24     ` Kairui Song
2025-08-31 15:54       ` Kairui Song
2025-08-31 20:06         ` Chris Li
2025-08-31 20:04       ` Chris Li
2025-09-02 10:06   ` Barry Song
2025-08-22 19:20 ` [PATCH 8/9] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-08-30  4:17   ` Chris Li
2025-09-02 11:15   ` Barry Song
2025-09-02 13:17     ` Chris Li
2025-09-02 16:57       ` Kairui Song
2025-09-02 23:31       ` Barry Song
2025-09-03  2:13         ` Kairui Song
2025-09-03 12:35         ` Chris Li
2025-09-03 20:52           ` Barry Song
2025-09-04  6:50             ` Chris Li
2025-08-22 19:20 ` [PATCH 9/9] mm, swap: use a single page for swap table when the size fits Kairui Song
2025-08-30  4:23   ` Chris Li
2025-08-26 22:00 ` [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Chris Li
2025-08-30  5:44 ` Chris Li
