[RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

From: Nhat Pham <nphamcs@gmail.com>
To: kasong@tencent.com
Cc: Liam.Howlett@oracle.com, akpm@linux-foundation.org,
	apopple@nvidia.com, axelrasmussen@google.com, baohua@kernel.org,
	baolin.wang@linux.alibaba.com, bhe@redhat.com, byungchul@sk.com,
	cgroups@vger.kernel.org, chengming.zhou@linux.dev,
	chrisl@kernel.org, corbet@lwn.net, david@kernel.org,
	dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org,
	hughd@google.com, jannh@google.com, joshua.hahnjy@gmail.com,
	lance.yang@linux.dev, lenb@kernel.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-pm@vger.kernel.org, lorenzo.stoakes@oracle.com,
	matthew.brost@intel.com, mhocko@suse.com, muchun.song@linux.dev,
	npache@redhat.com, nphamcs@gmail.com, pavel@kernel.org,
	peterx@redhat.com, peterz@infradead.org, pfalcato@suse.de,
	rafael@kernel.org, rakie.kim@sk.com, roman.gushchin@linux.dev,
	rppt@kernel.org, ryan.roberts@arm.com, shakeel.butt@linux.dev,
	shikemeng@huaweicloud.com, surenb@google.com, tglx@kernel.org,
	vbabka@suse.cz, weixugc@google.com, ying.huang@linux.alibaba.com,
	yosry.ahmed@linux.dev, yuanchu@google.com,
	zhengqi.arch@bytedance.com, ziy@nvidia.com, kernel-team@meta.com,
	riel@surriel.com, haowenchao22@gmail.com
Subject: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)
Date: Thu, 28 May 2026 14:29:24 -0700
Message-ID: <20260528212955.1912856-1-nphamcs@gmail.com>

Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].

I manually adapted Kairui's ghost device implementation (from [4])
for my vswap device. I've credited him as Co-developed-by on patch 1
since a substantial portion of the dynamic-cluster infrastructure is
his (I did propose the idea of using an xarray/radix tree for dynamic
swap cluster allocation and management though :P).

From here on out, for simplicity, I will refer to swap table phase IV
as "P4", and the older v6 virtual swap space implementation as "v6".

I. Context and Motivation

Virtual swap decouples PTE swap entries from physical swap backing,
allowing pages to be compressed by zswap without pre-allocating a
physical swap slot. See [1] for a more involved discussion of the
motivation for swap virtualization. In short, a swap virtualization
scheme needs to satisfy 3 requirements, all of which are driven by
real, pressing use cases from many parties using swap:

1. No backend coupling. For instance, a zswap entry should not
   require a physical swap slot to be allocated. This prevents
   wastage of coupled backend resources, and allows zswap to be
   used in systems that do not have enough storage capacity for
   physical swap (without having to resort to silly hacks). The same
   should hold for zero-filled swap pages, and swap cached folios too.

2. Dynamic swap space. The virtualization scheme should not require
   static provisioning, to accommodate dynamic and unpredictable swap
   usage. This massively simplifies operational provisioning, and
   allows the in-memory compression backend to be maximally utilized.
   It also makes sure we do not induce unbounded overhead on unused
   swap capacity.

3. Efficient backend transfer. The virtualization scheme should not
   introduce PTE/rmap walking overhead for backend transfer. This
   is crucial for systems that want to support multiple swap backends
   in a tiering fashion (e.g. zswap -> disk swap).

There are a lot of other future use cases as well - see [1] for more
details.

This series reimplements the virtual swap space concept (see [1])
on top of Kairui Song's swap table infrastructure [2], in accordance
with his proposal in [3]. The proposal's idea is interesting, so I
decided to give it a shot myself. I'm still not 100% sure that this
is bug-proof, but hey, it compiles, and has not crashed in my simple
stress testing :)

The prototype here is feature-complete relative to the swap-table P4
baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
shrinker, memcg charging, and THP swapin all work for
both vswap and direct-physical entries — and satisfies all three
requirements above: no backend coupling (zswap/zero entries hold no
physical slot), dynamic swap space (clusters allocated on demand via
xarray, no static provisioning), and efficient backend transfer
(in-place vtable updates, no PTE/rmap walking).

II. Design

With vswap, pages are assigned virtual swap entries on a ghost device
with no backing storage. These entries are backed by zswap, zero pages,
or (lazily) physical swap slots. Physical backing is allocated only
when needed — on zswap writeback or reclaim writeout, after the rmap
step.

Compared to the standalone v6 implementation [1], which introduces a
24-byte per-entry swap descriptor and its own cluster allocator, this
edition uses the swap_table infrastructure and shares a lot of the
allocator logic. Per-slot metadata is stored in a tag-encoded
virtual_table (atomic_long_t, 8 bytes per slot), and physical clusters
store Pointer-tagged rmap entries in the swap_table for reverse lookup
back to the virtual cluster.

Here are some data layout diagrams:

  Case 1: vswap entry (virtualized)

  PTE                  swap_cluster_info_dynamic
  vswap_entry          +-------------------------+
  (swp_entry_t) ------>| swap_cluster_info (ci)  |
                       | +--------------------+  |
                       | | swap_table         |  |
                       | |   PFN / Shadow     |  |
                       | | memcg_table        |  |
                       | | count,flags,order  |  |
                       | | lock, list         |  |
                       | +--------------------+  |
                       |                         |
                       | virtual_table           |
                       | +--------------------+  |
                       | | NONE               |  |
                       | | PHYS               |  |
                       | | ZERO               |  |
                       | | ZSWAP(entry*)      |  |
                       | | FOLIO(folio*)      |  |
                       | +--------------------+  |
                       +-------------------------+
                              |
                              | PHYS resolves to
                              v
                       PHYSICAL CLUSTER (swap_cluster_info)
                       +--------------------------+
                       | swap_table per-slot:     |
                       |   NULL   - free          |
                       |   PFN    - cached folio  |
                       |   Shadow - swapped out   |
                       |   Pointer- vswap rmap    |
                       |   Bad    - unusable      |
                       |                          |
                       | Vswap-backing slot:      |
                       |   Pointer(C|swp_entry_t) |
                       |     rmap back to vswap   |
                       +--------------------------+

  Case 2: direct-mapped physical entry (no vswap)

  PTE                  PHYSICAL CLUSTER (swap_cluster_info)
  phys_entry           +--------------------------+
  (swp_entry_t) ------>| swap_table per-slot:     |
                       |   NULL   - free          |
                       |   PFN    - cached folio  |
                       |   Shadow - swapped out   |
                       |   Bad    - unusable      |
                       +--------------------------+

struct swap_cluster_info_dynamic {
    struct swap_cluster_info ci;       /* swap_table, lock, etc. */
    unsigned int index;                /* position in xarray */
    struct rcu_head rcu;               /* kfree_rcu deferred free */
    atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
};
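
To make the "position in xarray" field concrete, here is a minimal
user-space sketch of on-demand cluster allocation. It is not code from
the series: a fixed pointer array stands in for the xarray, the cluster
size of 512 slots is an assumption, and every toy_* name is
hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_SLOTS_PER_CLUSTER 512u  /* assumption: stands in for SWAPFILE_CLUSTER */
#define TOY_MAX_CLUSTERS      1024u /* toy bound; an xarray would be sparse */

struct toy_cluster {
    unsigned int index;       /* position in the directory */
    uintptr_t *virtual_table; /* one word per slot (8 bytes on 64-bit) */
};

/* Stand-in for the xarray: NULL means "cluster not allocated yet". */
static struct toy_cluster *toy_directory[TOY_MAX_CLUSTERS];

/* Slot within its cluster for a given virtual swap offset. */
static unsigned int toy_slot(unsigned long offset)
{
    return (unsigned int)(offset % TOY_SLOTS_PER_CLUSTER);
}

/* Look up the cluster covering offset; NULL if it does not exist yet. */
static struct toy_cluster *toy_lookup(unsigned long offset)
{
    unsigned long idx = offset / TOY_SLOTS_PER_CLUSTER;

    return idx < TOY_MAX_CLUSTERS ? toy_directory[idx] : NULL;
}

/* Return the cluster covering offset, allocating it on first use. */
static struct toy_cluster *toy_get_or_alloc(unsigned long offset)
{
    unsigned long idx = offset / TOY_SLOTS_PER_CLUSTER;
    struct toy_cluster *c;

    if (idx >= TOY_MAX_CLUSTERS)
        return NULL;
    if (toy_directory[idx])
        return toy_directory[idx];

    c = calloc(1, sizeof(*c));
    if (!c)
        return NULL;
    /* Zeroed slots correspond to the "free / unbacked" state. */
    c->virtual_table = calloc(TOY_SLOTS_PER_CLUSTER, sizeof(*c->virtual_table));
    if (!c->virtual_table) {
        free(c);
        return NULL;
    }
    c->index = (unsigned int)idx;
    toy_directory[idx] = c;
    return c;
}
```

Nothing is allocated for a range of offsets until toy_get_or_alloc()
first touches it, which is the property the "no static provisioning"
requirement asks for. Locking and RCU-deferred freeing are left out of
the sketch entirely.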

Each vswap cluster (swap_cluster_info_dynamic) extends the classic
swap_cluster_info struct with a virtual_table array that stores the
backend information for each virtual swap entry in the cluster. Each
entry is tag-encoded in the low 3 bits to indicate backend types:

  NONE:   |----- 0000 ------|000|  free / unbacked
  PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
  ZERO:   |----- 0000 ------|010|  zero-filled page
  ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
  FOLIO:  |--- folio* ------|100|  in-memory folio

We still have room for 3 more backend types in the future, e.g. CRAM
(compressed-CXL-as-swap), which is laid out in [10] and [11]. Worst
case, we can add more fields to this extended struct.
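
The encoding above can be sketched in user-space C as follows. Only the
tag values and the 3-bit tag width come from the table; the vt_* helper
names are hypothetical, and the sketch assumes pointers are at least
8-byte aligned so that their low 3 bits are free for the tag.

```c
#include <assert.h>
#include <stdint.h>

/* Tag values as listed in the table above; names are illustrative. */
enum vt_tag {
    VT_NONE  = 0, /* free / unbacked */
    VT_PHYS  = 1, /* on a physical swapfile (payload shifted) */
    VT_ZERO  = 2, /* zero-filled page */
    VT_ZSWAP = 3, /* compressed in zswap (pointer payload) */
    VT_FOLIO = 4, /* in-memory folio (pointer payload) */
};

#define VT_TAG_BITS 3
#define VT_TAG_MASK ((uintptr_t)((1u << VT_TAG_BITS) - 1))

/* Extract the backend type from a slot value. */
static inline enum vt_tag vt_tag(uintptr_t v)
{
    return (enum vt_tag)(v & VT_TAG_MASK);
}

/* PHYS: the physical entry value is shifted up to make room for the tag. */
static inline uintptr_t vt_mk_phys(uintptr_t phys_val)
{
    return (phys_val << VT_TAG_BITS) | VT_PHYS;
}

static inline uintptr_t vt_phys(uintptr_t v)
{
    return v >> VT_TAG_BITS;
}

/* ZERO carries no payload. */
static inline uintptr_t vt_mk_zero(void)
{
    return VT_ZERO;
}

/* ZSWAP / FOLIO: the tag lives in the pointer's low (alignment) bits. */
static inline uintptr_t vt_mk_ptr(void *p, enum vt_tag tag)
{
    assert(((uintptr_t)p & VT_TAG_MASK) == 0); /* needs 8-byte alignment */
    return (uintptr_t)p | (uintptr_t)tag;
}

static inline void *vt_ptr(uintptr_t v)
{
    return (void *)(v & ~VT_TAG_MASK);
}
```

With 3 tag bits there are 8 possible values, 5 of which are used here,
which is where the 3 spare backend types come from.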

Other design points:
- Both vswap entries (Case 1) and directly-mapped physical entries
  (Case 2) coexist as first-class citizens. All the common swap
  code paths (swapout, swapin, swap freeing, swapoff, zswap
  writeback, THP swapin, etc.) work for both. When CONFIG_VSWAP=n,
  the vswap branches compile out and behavior should be identical to
  today's swap-table P4 (at least that is my intention).
- Pointer-tagged swap_table on physical clusters for rmap (physical
  -> virtual) lookup.
- Virtual swap slots not backed by physical swap are not charged to
  memcg swap counters — only physical backing is charged (I made the
  case for this in [7]).
- Careful separation of vswap and physical swap allocation paths and
  structures adds a lot of complexity, but is crucial to make sure
  both paths are efficient and do not conflict with each other (for
  correctness and performance). I do re-use a lot of the allocation
  logic wherever possible though.

  An example of this is the per-cpu cluster caching. I have found that
  caching virtual and physical clusters in the same structure is a
  recipe for bugs and performance regressions :) For instance, zswap
  shrinker will invalidate the cached virtual cluster, and cache its
  physical cluster instead, which will be reverted by the next vswap
  allocation.

And a lot more of these random tidbits off the top of my head. See the
patches for a proof-of-concept implementation.
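
As a rough illustration of what "in-place vtable updates, no PTE/rmap
walking" means, here is a user-space sketch of moving one virtual slot
from zswap backing to physical backing. It is a sketch under
assumptions, not the implementation in the patches: the compare-exchange
is my choice for illustration (the series may well do this under the
cluster lock instead), and toy_writeback() and the TOY_* names are
hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_TAG_BITS  3
#define TOY_TAG_MASK  ((uintptr_t)((1u << TOY_TAG_BITS) - 1))
#define TOY_TAG_PHYS  1u /* tag values follow the table in section II */
#define TOY_TAG_ZSWAP 3u

/*
 * Switch one virtual_table slot from ZSWAP(entry*) to PHYS(phys_val).
 *
 * The PTE keeps holding the same virtual swap entry throughout; only
 * this one slot word changes. The update succeeds only if the slot
 * still holds the zswap entry the caller expects, so a concurrent
 * free or swapin that already changed the slot makes it fail cleanly.
 */
static bool toy_writeback(_Atomic uintptr_t *slot, void *zswap_entry,
                          uintptr_t phys_val)
{
    uintptr_t expected = (uintptr_t)zswap_entry | TOY_TAG_ZSWAP;
    uintptr_t desired  = (phys_val << TOY_TAG_BITS) | TOY_TAG_PHYS;

    return atomic_compare_exchange_strong(slot, &expected, desired);
}
```

The point of the sketch is the cost model: a backend transfer touches
one word in the virtual cluster, regardless of how many PTEs reference
the virtual entry.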

III. Follow-ups:

In no particular order (and most of which can be done as follow-up
patch series rather than shoving everything in the initial landing):

- More thorough stress testing is very much needed.

- Performance benchmarks to make sure I don't accidentally regress
  the vswap-less case, and that performance in the vswap case is
  good. I suspect I will have to port a lot of the optimizations I
  implemented in v6 over here. Some of the inefficiencies are
  inherent in any swap virtualization, and would require the same
  fix (e.g. the MRU cluster caching for faster cluster lookup, see
  [8] and [9]).

- Runtime enable/disable of the vswap device. To be honest, I don't
  know if there is value in this. My preference is that vswap be
  optimized to the point that any overhead is negligible. Failing that,
  maybe we can come up with some simple heuristics that automatically
  decide for users?

  In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
  boot, and CONFIG_VSWAP=n means the vswap device is never created. This
  *might* be enough just on its own.

  Is a runtime knob (sysfs or sysctl) worth the complexity beyond
  these heuristics? I'm not sure yet. Supporting both cases at
  runtime also adds checking overhead, and some of the checks are
  not cheap :)

  Besides, what does swapon/swapoff buy us here? We do not want
  multiple vswap devices - they're identical performance-wise, so we
  will just fragment clusters unnecessarily. We do not care about
  sizing, since the metadata layer is completely dynamic. If we want
  to opt-out of vswap at runtime per-cgroup, maybe swap.tier by
  Youngjun (see [12]) is a better interface than swapon/swapoff?

- Defer per-cluster memcg_table and zeromap allocation on physical
  clusters. A physical swap cluster that backs only vswap entries
  does not really need its memcg_table, but the current design forces
  us to allocate it anyway. This is a waste of memory, and an overhead
  regression compared to my older design in the zswap-only case,
  which Johannes has pointed out multiple times (see [6]). It is one
  of the biggest reasons why I have not been satisfied with this
  approach thus far. It honestly is a bit of a deal-breaker...

  That said, I think I might be able to allocate them on demand, i.e.
  only when the first direct-mapped slot is allocated on that cluster.
  That will give us the best of BOTH worlds, for both the vswap and
  directly-mapped physical swap cases. No promises, but I will try
  (if this approach is good enough for all parties).

- Widen swap_info_struct->max to unsigned long. The vswap device's
  max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
  slots (~16 TiB with 4 KiB pages) to fit in unsigned int. 16 TiB is
  small for vswap, especially as machines keep getting bigger
  memory-wise.

- Supporting 32-bit architectures. I need to do the math carefully.
  But do we want to optimize for these architectures anyway? I think
  the only argument is that if virtual swap turns out to be so good
  that we can get rid of the direct-mapped physical swap case
  entirely, then it has to support 32-bit architectures too. I'm
  willing to have my mind changed though.

- Add some fat design doc (assuming this approach is acceptable to
  folks).

- Samefilled page handling is still doable BTW, if folks think this
  has value :)
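
For the ~16 TiB figure in the swap_info_struct->max item above, the
arithmetic can be checked with a small user-space sketch. The 4 KiB
page size and the cluster size of 512 slots are assumptions (both are
configuration dependent), a 32-bit unsigned int is assumed, and the
toy_* names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT       12u  /* assumption: 4 KiB pages */
#define TOY_SWAPFILE_CLUSTER 512u /* assumption: slots per cluster */

/* Round x down to a multiple of a; a must be a power of two. */
static inline uint64_t toy_align_down(uint64_t x, uint64_t a)
{
    return x & ~(a - 1);
}

/* Largest cluster-aligned slot count that still fits in unsigned int. */
static inline uint64_t toy_vswap_max_slots(void)
{
    return toy_align_down(UINT_MAX, TOY_SWAPFILE_CLUSTER);
}

/* The same limit expressed in bytes of addressable virtual swap. */
static inline uint64_t toy_vswap_max_bytes(void)
{
    return toy_vswap_max_slots() << TOY_PAGE_SHIFT;
}
```

Under these assumptions the limit is 2^32 - 512 slots, i.e.
2^44 - 2^21 bytes, which is 16 TiB minus 2 MiB.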

This is an early RFC. I have only done basic functional testing so
far, and still need to run more thorough stress tests and benchmarks.
That said, I figure I should send this out early to get folks'
feedback before I get myself too deep in this rabbit hole - the
complexity is already mounting...

[1]: https://lore.kernel.org/all/20260505153854.1612033-1-nphamcs@gmail.com/
[2]: https://lore.kernel.org/all/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com/
[3]: https://lwn.net/Articles/1072657/
[4]: https://lore.kernel.org/all/20260220-swap-table-p4-v1-15-104795d19815@tencent.com/
[5]: https://lore.kernel.org/all/aerrps94j70MkgdW@gourry-fedora-PF4VCD3F/
[6]: https://lore.kernel.org/all/aZyFxKGXc8J6PIij@cmpxchg.org/
[7]: https://lore.kernel.org/linux-mm/CAKEwX=P4syV38jAVCWq198r2OHXXc=xA-fx1dk6+qYef6yzxWQ@mail.gmail.com/
[8]: https://lore.kernel.org/all/CAKEwX=NrUhUrAFx+8BYJEfaVKpCm-H9JhBzYSrqOQb-NW7QRug@mail.gmail.com/
[9]: https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
[10]: https://lore.kernel.org/all/aerrps94j70MkgdW@gourry-fedora-PF4VCD3F/
[11]: https://lore.kernel.org/all/afIKxG5mJZE6QgpR@gourry-fedora-PF4VCD3F/
[12]: https://lore.kernel.org/all/20260527062247.3440692-1-youngjun.park@lge.com/

Nhat Pham (5):
  mm, swap: add virtual swap device infrastructure
  mm, swap: support zswap and zeroswap as vswap backends
  mm, swap: support physical swap as a vswap backend
  mm, swap: only charge physical swap entries
  mm, swap: add debugfs counters for vswap

 MAINTAINERS           |    1 +
 include/linux/swap.h  |   71 +++
 include/linux/zswap.h |    3 +
 mm/Kconfig            |   10 +
 mm/internal.h         |   20 +-
 mm/madvise.c          |    2 +-
 mm/memcontrol.c       |  132 ++++-
 mm/memory.c           |   34 +-
 mm/page_io.c          |  195 ++++++--
 mm/swap.h             |   59 ++-
 mm/swap_state.c       |   51 +-
 mm/swap_table.h       |   56 +++
 mm/swapfile.c         | 1096 +++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c           |    5 +-
 mm/vswap.h            |  445 +++++++++++++++++
 mm/zswap.c            |  167 +++++--
 16 files changed, 2108 insertions(+), 239 deletions(-)
 create mode 100644 mm/vswap.h

base-commit: 401c55d4eacd97ffd24a89829655baa43b2b308e
-- 
2.53.0-Meta
