From: Yosry Ahmed <yosry@kernel.org>
To: Nhat Pham <nphamcs@gmail.com>
Cc: kasong@tencent.com, Liam.Howlett@oracle.com,
	akpm@linux-foundation.org,  apopple@nvidia.com,
	axelrasmussen@google.com, baohua@kernel.org,
	 baolin.wang@linux.alibaba.com, bhe@redhat.com, byungchul@sk.com,
	cgroups@vger.kernel.org,  chengming.zhou@linux.dev,
	chrisl@kernel.org, corbet@lwn.net, david@kernel.org,
	 dev.jain@arm.com, gourry@gourry.net, hannes@cmpxchg.org,
	hughd@google.com,  jannh@google.com, joshua.hahnjy@gmail.com,
	lance.yang@linux.dev, lenb@kernel.org,
	 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org,  linux-pm@vger.kernel.org,
	lorenzo.stoakes@oracle.com, matthew.brost@intel.com,
	 mhocko@suse.com, muchun.song@linux.dev, npache@redhat.com,
	pavel@kernel.org,  peterx@redhat.com, peterz@infradead.org,
	pfalcato@suse.de, rafael@kernel.org,  rakie.kim@sk.com,
	roman.gushchin@linux.dev, rppt@kernel.org, ryan.roberts@arm.com,
	 shakeel.butt@linux.dev, shikemeng@huaweicloud.com,
	surenb@google.com, tglx@kernel.org,  vbabka@suse.cz,
	weixugc@google.com, ying.huang@linux.alibaba.com,
	 yosry.ahmed@linux.dev, yuanchu@google.com,
	zhengqi.arch@bytedance.com, ziy@nvidia.com,
	 kernel-team@meta.com, riel@surriel.com, haowenchao22@gmail.com
Subject: Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)
Date: Wed, 3 Jun 2026 01:29:38 +0000
Message-ID: <ah-A2gQ0GPgerXop@google.com>
In-Reply-To: <20260528212955.1912856-1-nphamcs@gmail.com>

> II. Design
> 
> With vswap, pages are assigned virtual swap entries on a ghost device
> with no backing storage. These entries are backed by zswap, zero pages,
> or (lazily) physical swap slots. Physical backing is allocated only
> when needed — on zswap writeback or reclaim writeout, after the rmap
> step.
> 
> Compared to the standalone v6 implementation [1], which introduces a
> 24-byte per-entry swap descriptor and its own cluster allocator, this
> edition uses the swap_table infrastructure and shares a lot of the
> allocator logic. Per-slot metadata is stored in a tag-encoded
> virtual_table
> (atomic_long_t, 8 bytes per slot), and physical clusters store
> Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> the virtual cluster.
> 
> Here are some data layout diagrams:
> 
>   Case 1: vswap entry (virtualized)
> 
>   PTE                  swap_cluster_info_dynamic
>   vswap_entry          +-------------------------+
>   (swp_entry_t) ------>| swap_cluster_info (ci)  |
>                        | +--------------------+  |
>                        | | swap_table         |  |
>                        | |   PFN / Shadow     |  |
>                        | | memcg_table        |  |
>                        | | count,flags,order  |  |
>                        | | lock, list         |  |
>                        | +--------------------+  |
>                        |                         |
>                        | virtual_table           |
>                        | +--------------------+  |
>                        | | NONE               |  |
>                        | | PHYS               |  |
>                        | | ZERO               |  |
>                        | | ZSWAP(entry*)      |  |
>                        | | FOLIO(folio*)      |  |
>                        | +--------------------+  |
>                        +-------------------------+
>                               |
>                               | PHYS resolves to
>                               v
>                        PHYSICAL CLUSTER (swap_cluster_info)
>                        +--------------------------+
>                        | swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Pointer- vswap rmap    |
>                        |   Bad    - unusable      |
>                        |                          |
>                        | Vswap-backing slot:      |
>                        |   Pointer(C|swp_entry_t) |
>                        |     rmap back to vswap   |
>                        +--------------------------+
> 
>   Case 2: direct-mapped physical entry (no vswap)
> 
>   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
>   phys_entry           +--------------------------+
>   (swp_entry_t) ------>| swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Bad    - unusable      |
>                        +--------------------------+
> 
> struct swap_cluster_info_dynamic {
>     struct swap_cluster_info ci;       /* swap_table, lock, etc. */
>     unsigned int index;                /* position in xarray */
>     struct rcu_head rcu;               /* kfree_rcu deferred free */
>     atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
> };
> 
> Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> swap_cluster_info struct with a virtual_table array that stores the
> backend information for each virtual swap entry in the cluster. Each
> entry is tag-encoded in the low 3 bits to indicate backend types:
> 
>   NONE:   |----- 0000 ------|000|  free / unbacked
>   PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
>   ZERO:   |----- 0000 ------|010|  zero-filled page
>   ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
>   FOLIO:  |--- folio* ------|100|  in-memory folio
> 
> We still have room for 3 more future backend types, e.g. CRAM (i.e.
> compressed-CXL-as-swap), which is laid out in [10] and [11]. In the
> worst case, we can add more fields to this extended struct.
> 
> Other design points:
> - Both vswap entries (Case 1) and directly-mapped physical entries
>   (Case 2) coexist as first-class citizens. All the common swap
>   code paths (swapout, swapin, swap freeing, swapoff, zswap
>   writeback, THP swapin, etc.) work for both. When CONFIG_VSWAP=n,
>   the vswap branches compile out and behavior should be identical to
>   today's swap-table P4 (at least that is my intention).
> - Pointer-tagged swap_table on physical clusters for rmap (physical
>   -> virtual) lookup.
> - Virtual swap slots not backed by physical swap are not charged to
>   memcg swap counters — only physical backing is charged (I made the
>   case for this in [7]).
> - Careful separation of vswap and physical swap allocation paths and
>   structures adds a lot of complexity, but is crucial to make sure
>   both paths are efficient and do not conflict with each other (for
>   correctness and performance). I do re-use a lot of the allocation
>   logic wherever possible though.

Thanks for working on this! I mostly looked at the high-level design and
the zswap parts, as the swap code has changed a lot since I was familiar
with it :)

It seems like the direction being taken here is that we have one
(massive) vswap swap device, and we keep normal physical swap devices
around as well.

A vswap entry can point at a physical swap entry, or zswap, or zeromap.
If a vswap entry points at a physical swap entry, then the physical swap
entry points back at the vswap entry (a reverse mapping).
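
To make sure I am reading the layout right, here is a rough userspace
sketch of the forward and reverse links. It is a toy model, not the
patch's code: struct toy_swap, the toy_* helpers, TOY_SLOTS and the
"vswap slot + 1" rmap encoding are all made up for illustration. Only
the tag values in the low 3 bits follow the layout quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Backend tags in the low 3 bits; values follow the quoted layout. */
enum { TAG_NONE = 0, TAG_PHYS = 1, TAG_ZERO = 2, TAG_ZSWAP = 3, TAG_FOLIO = 4 };

#define TAG_BITS  3
#define TAG_MASK  ((uint64_t)((1u << TAG_BITS) - 1))
#define TOY_SLOTS 16	/* hypothetical cluster size, illustration only */

/* Hypothetical toy model: one virtual cluster and one physical cluster. */
struct toy_swap {
	uint64_t virtual_table[TOY_SLOTS];	/* vswap slot -> tagged backend */
	uint64_t phys_rmap[TOY_SLOTS];		/* phys slot -> vswap slot + 1, 0 = free */
};

/* Shift the payload past the tag bits and OR in the tag. */
static inline uint64_t toy_pack(unsigned int tag, uint64_t payload)
{
	assert(tag <= TAG_MASK);
	assert(payload <= (UINT64_MAX >> TAG_BITS));
	return (payload << TAG_BITS) | tag;
}

static inline unsigned int toy_tag(uint64_t e)
{
	return (unsigned int)(e & TAG_MASK);
}

static inline uint64_t toy_payload(uint64_t e)
{
	return e >> TAG_BITS;
}

/* Back a vswap slot with a physical slot and record the reverse link. */
static inline void toy_back_with_phys(struct toy_swap *s,
				      unsigned int vslot, unsigned int pslot)
{
	assert(vslot < TOY_SLOTS && pslot < TOY_SLOTS);
	s->virtual_table[vslot] = toy_pack(TAG_PHYS, pslot);
	s->phys_rmap[pslot] = (uint64_t)vslot + 1;
}

/* Reverse lookup: the vswap slot backed by pslot, or -1 if there is none. */
static inline int toy_rmap_lookup(const struct toy_swap *s, unsigned int pslot)
{
	assert(pslot < TOY_SLOTS);
	return s->phys_rmap[pslot] ? (int)(s->phys_rmap[pslot] - 1) : -1;
}
```

The second array is the part a direct-mapped entry (Case 2) never needs,
which is why I read the reverse mapping as the main extra cost.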

I assume the main reason for keeping both is to avoid the extra
overhead of routing everything through vswap, which would mainly be
the reverse mapping
overhead? I guess there's also some simplicity that comes from reusing
the swap info infra as a whole, including the swap table.

I don't like that the code bifurcates for vswap vs. normal swap entries
though. Not sure if this is an issue that can be fixed with proper
abstractions to hide it, or if the design needs modifications. I was
honestly really hoping we wouldn't end up with this. I was hoping that
the physical swap device would no longer use a full swap table, and
that everything would go through vswap.

I was also hoping that if redirection isn't needed (e.g. zswap is
disabled), vswap could directly encode the physical swap slot so that
the reverse mapping isn't needed -- so we avoid the overhead without
keeping the physical swap device using a fully-fledged swap table.

All that being said, perhaps I am too out of touch with the code to
realize it's simply not possible.

Honestly, if the main reason we can't have a single swap table for vswap
is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
argument, even if we can't optimize the reverse mapping away. But maybe
I am also out of touch with RAM prices :)
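
To put a rough number on it: assuming 4 KiB pages and one 8-byte reverse
mapping word per physical slot (both are my assumptions, not figures
from the cover letter), the cost is 8/4096 of the swap size, or about
0.2%. The helper below is hypothetical and only does that arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes of per-slot metadata needed to cover swap_bytes of swap space,
 * assuming one slot per page and bytes_per_slot of metadata per slot.
 * Hypothetical helper for back-of-the-envelope estimates only.
 */
static inline uint64_t rmap_overhead_bytes(uint64_t swap_bytes,
					   uint64_t page_size,
					   uint64_t bytes_per_slot)
{
	assert(page_size != 0);
	return (swap_bytes / page_size) * bytes_per_slot;
}
```

With those assumptions, 64 GiB of swap costs 128 MiB of reverse mapping
and 1 TiB costs 2 GiB, if every slot carries the extra word.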

I at least hope that the current design is not painting us into a
corner (e.g. through userspace interfaces), and we can still achieve a
vswap-for-all implementation in the future (maybe that's what you have
in mind already?).

Aside from the swap code, the only sticking point for me is the logic
bifurcation in zswap. Why does zswap need to handle vswap vs. not vswap?
I thought the point of the design is to use vswap when zswap is used,
and otherwise use a normal swap table. In a way, one of the goals is to
make zswap a first-class swap citizen, but it doesn't seem like we are
achieving that?
