From: Nhat Pham <nphamcs@gmail.com>
To: akpm@linux-foundation.org
Cc: chrisl@kernel.org, kasong@tencent.com, hannes@cmpxchg.org,
mhocko@kernel.org, roman.gushchin@linux.dev,
shakeel.butt@linux.dev, yosry@kernel.org, david@kernel.org,
muchun.song@linux.dev, shikemeng@huaweicloud.com,
baoquan.he@linux.dev, baohua@kernel.org, youngjun.park@lge.com,
chengming.zhou@linux.dev, ljs@kernel.org, liam@infradead.org,
vbabka@kernel.org, rppt@kernel.org, surenb@google.com,
qi.zheng@linux.dev, axelrasmussen@google.com, yuanchu@google.com,
weixugc@google.com, riel@surriel.com, gourry@gourry.net,
haowenchao22@gmail.com, kernel-team@meta.com, nphamcs@gmail.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
cgroups@vger.kernel.org
Subject: [RFC PATCH v2 1/7] mm, swap: add virtual swap device infrastructure
Date: Fri, 12 Jun 2026 12:37:32 -0700 [thread overview]
Message-ID: <20260612193738.2183968-2-nphamcs@gmail.com> (raw)
In-Reply-To: <20260612193738.2183968-1-nphamcs@gmail.com>
Create a massive virtual swap device at boot, along with the
dynamic cluster infrastructure that the rest of the vswap layer
is built on:
* swap_cluster_info_dynamic: per-cluster dynamic info kept in
an xarray, allowing arbitrary-size devices without the static
cluster_info[] array.
* virtual_table: a per-slot side table for vswap backend metadata
(tag-encoded in low bits). The field itself is added in the
next patch; this commit only introduces the dynamic cluster
container that will hold it.
* The size of the vswap device is ALIGN_DOWN(UINT_MAX,
SWAPFILE_CLUSTER) pages.
Gated by a new CONFIG_VSWAP (depends on SWAP && 64BIT). For now,
the vswap device cannot be swapon'd or swapoff'd - it is created
unconditionally at boot when CONFIG_VSWAP=y and lives for the
lifetime of the kernel. The SWP_VSWAP flag and swap_is_vswap()
helper let hot paths skip per-device bookkeeping that doesn't
apply (avail-list management, percpu_ref get/put, hibernation
target lookup, etc.).
This patch is pure scaffolding. It wires the dynamic-cluster
allocator into cluster_alloc_swap_entry (via an SWP_VSWAP branch
that dispatches to alloc_swap_scan_dynamic), but the branch is
not yet reachable because vswap_si is kept off swap_avail_head
and swap_active_head and folio_alloc_swap has no path that calls
into vswap_si directly. Backends (zswap, zero, physical disk)
and the vswap-aware swap-out / swap-in / writeback paths arrive
in subsequent patches.
Suggested-by: Kairui Song <kasong@tencent.com>
Co-developed-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
MAINTAINERS | 1 +
include/linux/swap.h | 4 +
mm/Kconfig | 10 ++
mm/page_io.c | 18 ++-
mm/swap.h | 46 ++++++--
mm/swap_state.c | 43 ++++---
mm/swap_table.h | 2 +
mm/swapfile.c | 260 +++++++++++++++++++++++++++++++++++++++----
mm/vswap.h | 43 +++++++
mm/zswap.c | 10 +-
10 files changed, 387 insertions(+), 50 deletions(-)
create mode 100644 mm/vswap.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 65bd4328fe05..92dbb159459c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17061,6 +17061,7 @@ F: mm/swap.h
F: mm/swap_table.h
F: mm/swap_state.c
F: mm/swapfile.c
+F: mm/vswap.h
MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
M: Andrew Morton <akpm@linux-foundation.org>
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8f0f68e245ba..822b1c90db1c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -214,6 +214,7 @@ enum {
SWP_STABLE_WRITES = (1 << 11), /* no overwrite PG_writeback pages */
SWP_SYNCHRONOUS_IO = (1 << 12), /* synchronous IO is efficient */
SWP_HIBERNATION = (1 << 13), /* pinned for hibernation */
+ SWP_VSWAP = (1 << 14), /* virtual swap device */
/* add others here before... */
};
@@ -282,6 +283,7 @@ struct swap_info_struct {
struct work_struct reclaim_work; /* reclaim worker */
struct list_head discard_clusters; /* discard clusters list */
struct plist_node avail_list; /* entry in swap_avail_head */
+ struct xarray cluster_info_pool; /* Xarray for vswap dynamic cluster info */
};
static inline swp_entry_t page_swap_entry(struct page *page)
@@ -471,6 +473,8 @@ void swap_free_hibernation_slot(swp_entry_t entry);
static inline void put_swap_device(struct swap_info_struct *si)
{
+ if (si->flags & SWP_VSWAP)
+ return;
percpu_ref_put(&si->users);
}
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..ede1c639d226 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,16 @@ menuconfig SWAP
used to provide more virtual memory than the actual RAM present
in your computer. If unsure say Y.
+config VSWAP
+ bool "Virtual swap device"
+ depends on SWAP && 64BIT
+ help
+ Adds a virtual swap layer that decouples swap entries in page
+ tables from physical backing storage. Swap entries are allocated
+ from a virtual swap device and can be backed by zswap, a physical
+ swapfile, or kept in memory - with the backing changeable at
+ runtime without invalidating page table entries.
+
config ZSWAP
bool "Compressed cache for swap pages"
depends on SWAP
diff --git a/mm/page_io.c b/mm/page_io.c
index f2d8fe7fd057..8126be6e4cfb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -295,8 +295,7 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
}
rcu_read_unlock();
- __swap_writepage(folio, swap_plug);
- return 0;
+ return __swap_writepage(folio, swap_plug);
out_unlock:
folio_unlock(folio);
return ret;
@@ -458,11 +457,18 @@ static void swap_writepage_bdev_async(struct folio *folio,
submit_bio(bio);
}
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
{
struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (sis->flags & SWP_VSWAP) {
+ /* Prevent the page from getting reclaimed. */
+ folio_set_dirty(folio);
+ return AOP_WRITEPAGE_ACTIVATE;
+ }
+
/*
* ->flags can be updated non-atomically,
* but that will never affect SWP_FS_OPS, so the data_race
@@ -479,6 +485,7 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
swap_writepage_bdev_sync(folio, sis);
else
swap_writepage_bdev_async(folio, sis);
+ return 0;
}
void swap_write_unplug(struct swap_iocb *sio)
@@ -684,6 +691,11 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
if (zswap_load(folio) != -ENOENT)
goto finish;
+ if (unlikely(sis->flags & SWP_VSWAP)) {
+ folio_unlock(folio);
+ goto finish;
+ }
+
/* We have to read from slower devices. Increase zswap protection. */
zswap_folio_swapin(folio);
diff --git a/mm/swap.h b/mm/swap.h
index 77d2d14eda42..97493551edbd 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -65,6 +65,13 @@ struct swap_cluster_info {
struct list_head list;
};
+struct swap_cluster_info_dynamic {
+ struct swap_cluster_info ci; /* Underlying cluster info */
+ unsigned int index; /* for cluster_index() */
+ struct rcu_head rcu; /* For kfree_rcu deferred free */
+ /* Backend pointers (virtual_table) added in a later patch. */
+};
+
/* All on-list cluster must have a non-zero flag. */
enum swap_cluster_flags {
CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
@@ -75,6 +82,7 @@ enum swap_cluster_flags {
CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
CLUSTER_FLAG_FULL,
CLUSTER_FLAG_DISCARD,
+ CLUSTER_FLAG_DEAD, /* Vswap dynamic cluster pending kfree_rcu */
CLUSTER_FLAG_MAX,
};
@@ -108,9 +116,19 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
static inline struct swap_cluster_info *__swap_offset_to_cluster(
struct swap_info_struct *si, pgoff_t offset)
{
+ unsigned int cluster_idx = offset / SWAPFILE_CLUSTER;
+
VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
- return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+
+ if (si->flags & SWP_VSWAP) {
+ struct swap_cluster_info_dynamic *ci_dyn;
+
+ ci_dyn = xa_load(&si->cluster_info_pool, cluster_idx);
+ return ci_dyn ? &ci_dyn->ci : NULL;
+ }
+
+ return &si->cluster_info[cluster_idx];
}
static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
@@ -122,7 +140,7 @@ static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entr
static __always_inline struct swap_cluster_info *__swap_cluster_lock(
struct swap_info_struct *si, unsigned long offset, bool irq)
{
- struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+ struct swap_cluster_info *ci;
/*
* Nothing modifies swap cache in an IRQ context. All access to
@@ -135,10 +153,24 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
*/
VM_WARN_ON_ONCE(!in_task());
VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
- if (irq)
- spin_lock_irq(&ci->lock);
- else
- spin_lock(&ci->lock);
+
+ rcu_read_lock();
+ ci = __swap_offset_to_cluster(si, offset);
+ if (ci) {
+ if (irq)
+ spin_lock_irq(&ci->lock);
+ else
+ spin_lock(&ci->lock);
+
+ if (ci->flags == CLUSTER_FLAG_DEAD) {
+ if (irq)
+ spin_unlock_irq(&ci->lock);
+ else
+ spin_unlock(&ci->lock);
+ ci = NULL;
+ }
+ }
+ rcu_read_unlock();
return ci;
}
@@ -250,7 +282,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
}
void swap_write_unplug(struct swap_iocb *sio);
int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
/* linux/mm/swap_state.c */
extern struct address_space swap_space __read_mostly;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9c3a5cf99778..341ca8826507 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -90,8 +90,10 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
struct folio *folio;
for (;;) {
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
if (!swp_tb_is_folio(swp_tb))
return NULL;
folio = swp_tb_to_folio(swp_tb);
@@ -113,8 +115,10 @@ bool swap_cache_has_folio(swp_entry_t entry)
{
unsigned long swp_tb;
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
return swp_tb_is_folio(swp_tb);
}
@@ -130,8 +134,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
{
unsigned long swp_tb;
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
if (swp_tb_is_shadow(swp_tb))
return swp_tb_to_shadow(swp_tb);
return NULL;
@@ -400,14 +406,16 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
* -ENOENT / -EEXIST: Target swap entry is unavailable or cached, the caller
* should abort or try to use the cached folio instead
*/
-static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
- swp_entry_t targ_entry, gfp_t gfp,
+static struct folio *__swap_cache_alloc(swp_entry_t targ_entry, gfp_t gfp,
unsigned int order, struct vm_fault *vmf,
struct mempolicy *mpol, pgoff_t ilx)
{
int err;
swp_entry_t entry;
struct folio *folio;
+ struct swap_cluster_info *ci;
+ struct swap_info_struct *si = __swap_entry_to_info(targ_entry);
+ unsigned long offset = swp_offset(targ_entry);
void *shadow = NULL;
unsigned short memcg_id;
unsigned long address, nr_pages = 1UL << order;
@@ -417,9 +425,12 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
entry.val = round_down(targ_entry.val, nr_pages);
/* Check if the slot and range are available, skip allocation if not */
- spin_lock(&ci->lock);
- err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL);
- spin_unlock(&ci->lock);
+ err = -ENOENT;
+ ci = swap_cluster_lock(si, offset);
+ if (ci) {
+ err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL);
+ swap_cluster_unlock(ci);
+ }
if (unlikely(err))
return ERR_PTR(err);
@@ -440,10 +451,13 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
return ERR_PTR(-ENOMEM);
/* Double check the range is still not in conflict */
- spin_lock(&ci->lock);
- err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_id);
+ err = -ENOENT;
+ ci = swap_cluster_lock(si, offset);
+ if (ci)
+ err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_id);
if (unlikely(err)) {
- spin_unlock(&ci->lock);
+ if (ci)
+ swap_cluster_unlock(ci);
folio_put(folio);
return ERR_PTR(err);
}
@@ -451,13 +465,14 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
__swap_cache_do_add_folio(ci, folio, entry);
- spin_unlock(&ci->lock);
+ swap_cluster_unlock(ci);
if (mem_cgroup_swapin_charge_folio(folio, memcg_id,
vmf ? vmf->vma->vm_mm : NULL, gfp)) {
- spin_lock(&ci->lock);
+ /* The folio pins the cluster */
+ ci = swap_cluster_lock(si, offset);
__swap_cache_do_del_folio(ci, folio, entry, shadow);
- spin_unlock(&ci->lock);
+ swap_cluster_unlock(ci);
folio_unlock(folio);
/* nr_pages refs from swap cache, 1 from allocation */
folio_put_refs(folio, nr_pages + 1);
@@ -511,9 +526,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
{
int order, err;
struct folio *ret;
- struct swap_cluster_info *ci;
- ci = __swap_entry_to_cluster(targ_entry);
order = highest_order(orders);
/* orders must be non-zero, and must not exceed cluster size. */
@@ -521,12 +534,12 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
return ERR_PTR(-EINVAL);
do {
- ret = __swap_cache_alloc(ci, targ_entry, gfp, order,
+ ret = __swap_cache_alloc(targ_entry, gfp, order,
vmf, mpol, ilx);
if (!IS_ERR(ret))
break;
err = PTR_ERR(ret);
- if (!order || (err && err != -EBUSY && err != -ENOMEM))
+ if (err && err != -EBUSY && err != -ENOMEM)
break;
count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
order = next_order(&orders, order);
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..fd7f0fb9836a 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -255,6 +255,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
unsigned long swp_tb;
VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+ if (!ci)
+ return SWP_TB_NULL;
rcu_read_lock();
table = rcu_dereference(ci->table);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 78b49b0658ad..352c5fb2ab75 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,10 +42,12 @@
#include <linux/suspend.h>
#include <linux/zswap.h>
#include <linux/plist.h>
+#include <linux/major.h>
#include <asm/tlbflush.h>
#include <linux/leafops.h>
#include "swap_table.h"
+#include "vswap.h"
#include "internal.h"
#include "swap.h"
@@ -401,6 +403,8 @@ static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
static inline unsigned int cluster_index(struct swap_info_struct *si,
struct swap_cluster_info *ci)
{
+ if (si->flags & SWP_VSWAP)
+ return container_of(ci, struct swap_cluster_info_dynamic, ci)->index;
return ci - si->cluster_info;
}
@@ -733,6 +737,22 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
return;
}
+ if (si->flags & SWP_VSWAP) {
+ struct swap_cluster_info_dynamic *ci_dyn;
+
+ ci_dyn = container_of(ci, struct swap_cluster_info_dynamic, ci);
+ if (ci->flags != CLUSTER_FLAG_NONE) {
+ spin_lock(&si->lock);
+ list_del(&ci->list);
+ spin_unlock(&si->lock);
+ }
+ swap_cluster_free_table(ci);
+ xa_erase(&si->cluster_info_pool, ci_dyn->index);
+ ci->flags = CLUSTER_FLAG_DEAD;
+ kfree_rcu(ci_dyn, rcu);
+ return;
+ }
+
__free_cluster(si, ci);
}
@@ -835,14 +855,21 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
* stolen by a lower order). @usable will be set to false if that happens.
*/
static bool cluster_reclaim_range(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
+ struct swap_cluster_info **pcip,
unsigned long start, unsigned int order,
bool *usable)
{
+ struct swap_cluster_info *ci = *pcip;
unsigned int nr_pages = 1 << order;
unsigned long offset = start, end = start + nr_pages;
unsigned long swp_tb;
+ /*
+ * Take RCU read lock before releasing the cluster lock to keep ci
+ * alive - for vswap dynamic clusters, ci is freed via kfree_rcu
+ * and the grace period could otherwise elapse in the window.
+ */
+ rcu_read_lock();
spin_unlock(&ci->lock);
do {
swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
@@ -852,7 +879,15 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0)
break;
} while (++offset < end);
- spin_lock(&ci->lock);
+ rcu_read_unlock();
+
+ /* Re-lookup: dynamic cluster may have been freed while lock was dropped */
+ ci = swap_cluster_lock(si, start);
+ *pcip = ci;
+ if (!ci) {
+ *usable = false;
+ return false;
+ }
/*
* We just dropped ci->lock so cluster could be used by another
@@ -983,7 +1018,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
continue;
if (need_reclaim) {
- ret = cluster_reclaim_range(si, ci, offset, order, &usable);
+ ret = cluster_reclaim_range(si, &ci, offset, order,
+ &usable);
if (!usable)
goto out;
if (cluster_is_empty(ci))
@@ -1001,8 +1037,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
break;
}
out:
- relocate_cluster(si, ci);
- swap_cluster_unlock(ci);
+ if (ci) {
+ relocate_cluster(si, ci);
+ swap_cluster_unlock(ci);
+ }
if (si->flags & SWP_SOLIDSTATE) {
this_cpu_write(percpu_swap_cluster.offset[order], next);
this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -1034,6 +1072,41 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
return found;
}
+static unsigned int alloc_swap_scan_dynamic(struct swap_info_struct *si,
+ struct folio *folio)
+{
+ struct swap_cluster_info_dynamic *ci_dyn;
+ struct swap_cluster_info *ci;
+ unsigned long offset;
+
+ VM_WARN_ON(!(si->flags & SWP_VSWAP));
+
+ ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_ATOMIC);
+ if (!ci_dyn)
+ return SWAP_ENTRY_INVALID;
+
+ spin_lock_init(&ci_dyn->ci.lock);
+ INIT_LIST_HEAD(&ci_dyn->ci.list);
+
+ if (swap_cluster_alloc_table(&ci_dyn->ci, GFP_ATOMIC)) {
+ kfree(ci_dyn);
+ return SWAP_ENTRY_INVALID;
+ }
+
+ if (xa_alloc(&si->cluster_info_pool, &ci_dyn->index, ci_dyn,
+ XA_LIMIT(1, DIV_ROUND_UP(si->max, SWAPFILE_CLUSTER) - 1),
+ GFP_ATOMIC)) {
+ swap_cluster_free_table(&ci_dyn->ci);
+ kfree(ci_dyn);
+ return SWAP_ENTRY_INVALID;
+ }
+
+ ci = &ci_dyn->ci;
+ spin_lock(&ci->lock);
+ offset = cluster_offset(si, ci);
+ return alloc_swap_scan_cluster(si, ci, folio, offset);
+}
+
static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
{
long to_scan = 1;
@@ -1056,7 +1129,9 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
- spin_lock(&ci->lock);
+ ci = swap_cluster_lock(si, offset);
+ if (!ci)
+ goto next;
if (nr_reclaim) {
offset += abs(nr_reclaim);
continue;
@@ -1070,6 +1145,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
relocate_cluster(si, ci);
swap_cluster_unlock(ci);
+next:
if (to_scan <= 0)
break;
cond_resched();
@@ -1140,6 +1216,12 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
goto done;
}
+ if (si->flags & SWP_VSWAP) {
+ found = alloc_swap_scan_dynamic(si, folio);
+ if (found)
+ goto done;
+ }
+
if (!(si->flags & SWP_PAGE_DISCARD)) {
found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
if (found)
@@ -1258,6 +1340,13 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
goto skip;
}
+ /*
+ * Keep vswap off the avail list - it is not allocated from by
+ * the physical swap allocator (swap_alloc_fast/slow).
+ */
+ if (swap_is_vswap(si))
+ goto skip;
+
plist_add(&si->avail_list, &swap_avail_head);
skip:
@@ -1340,6 +1429,10 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
static bool get_swap_device_info(struct swap_info_struct *si)
{
+ /* vswap device is always alive - no ref counting needed */
+ if (swap_is_vswap(si))
+ return true;
+
if (!percpu_ref_tryget_live(&si->users))
return false;
/*
@@ -1375,11 +1468,11 @@ static bool swap_alloc_fast(struct folio *folio)
return false;
ci = swap_cluster_lock(si, offset);
- if (cluster_is_usable(ci, order)) {
+ if (ci && cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
alloc_swap_scan_cluster(si, ci, folio, offset);
- } else {
+ } else if (ci) {
swap_cluster_unlock(ci);
}
@@ -1501,6 +1594,7 @@ int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
if (!si)
return 0;
+ /* Entry is in use (being faulted in), so its cluster is alive. */
ci = __swap_offset_to_cluster(si, offset);
ret = swap_extend_table_alloc(si, ci, swp_cluster_offset(entry), gfp);
@@ -1736,6 +1830,7 @@ int folio_alloc_swap(struct folio *folio)
unsigned int order = folio_order(folio);
unsigned int size = 1 << order;
+ VM_WARN_ON_FOLIO(folio_test_swapcache(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -1898,7 +1993,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
put_out:
pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
- percpu_ref_put(&si->users);
+ if (!swap_is_vswap(si))
+ percpu_ref_put(&si->users);
return NULL;
}
@@ -2030,6 +2126,7 @@ static bool folio_maybe_swapped(struct folio *folio)
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ /* Folio is locked and in swap cache, so ci->count > 0: cluster is alive. */
ci = __swap_entry_to_cluster(entry);
ci_off = swp_cluster_offset(entry);
ci_end = ci_off + folio_nr_pages(folio);
@@ -2217,6 +2314,9 @@ static int __find_hibernation_swap_type(dev_t device, sector_t offset)
if (!(sis->flags & SWP_WRITEOK))
continue;
+ /* vswap has no bdev - never a hibernation target */
+ if (swap_is_vswap(sis))
+ continue;
if (device == sis->bdev->bd_dev) {
struct swap_extent *se = first_se(sis);
@@ -2343,6 +2443,9 @@ int find_first_swap(dev_t *device)
if (!(sis->flags & SWP_WRITEOK))
continue;
+ /* vswap has no bdev - never a hibernation target */
+ if (swap_is_vswap(sis))
+ continue;
*device = sis->bdev->bd_dev;
spin_unlock(&swap_lock);
return type;
@@ -2554,8 +2657,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
&vmf);
}
if (!folio) {
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
swp_cluster_offset(entry));
+ rcu_read_unlock();
if (swp_tb_get_count(swp_tb) <= 0)
continue;
return -ENOMEM;
@@ -2701,8 +2806,10 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
* allocations from this area (while holding swap_lock).
*/
for (i = prev + 1; i < si->max; i++) {
+ rcu_read_lock();
swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
i % SWAPFILE_CLUSTER);
+ rcu_read_unlock();
if (!swp_tb_is_null(swp_tb) && !swp_tb_is_bad(swp_tb))
break;
if ((i % LATENCY_LIMIT) == 0)
@@ -2941,6 +3048,11 @@ static int setup_swap_extents(struct swap_info_struct *sis,
struct inode *inode = mapping->host;
int ret;
+ if (sis->flags & SWP_VSWAP) {
+ *span = 0;
+ return 0;
+ }
+
if (S_ISBLK(inode->i_mode)) {
ret = add_swap_extent(sis, 0, sis->max, 0);
*span = sis->pages;
@@ -2965,15 +3077,22 @@ static int setup_swap_extents(struct swap_info_struct *sis,
static void _enable_swap_info(struct swap_info_struct *si)
{
- atomic_long_add(si->pages, &nr_swap_pages);
- total_swap_pages += si->pages;
+ if (!swap_is_vswap(si)) {
+ atomic_long_add(si->pages, &nr_swap_pages);
+ total_swap_pages += si->pages;
+ }
assert_spin_locked(&swap_lock);
- plist_add(&si->list, &swap_active_head);
-
- /* Add back to available list */
- add_to_avail_list(si, true);
+ /*
+ * Vswap has no backing file and no swapoff support - keep it
+ * off swap_active_head (used by swapoff filename lookup and
+ * swap_sync_discard) and swap_avail_head (physical allocator).
+ */
+ if (!swap_is_vswap(si)) {
+ plist_add(&si->list, &swap_active_head);
+ add_to_avail_list(si, true);
+ }
}
/*
@@ -3010,6 +3129,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
struct swap_cluster_info *ci;
BUG_ON(si->flags & SWP_WRITEOK);
+ if (si->flags & SWP_VSWAP)
+ return;
for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
ci = swap_cluster_lock(si, offset);
@@ -3148,7 +3269,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
destroy_swap_extents(p, p->swap_file);
- if (!(p->flags & SWP_SOLIDSTATE))
+ if (!(p->flags & SWP_VSWAP) &&
+ !(p->flags & SWP_SOLIDSTATE))
atomic_dec(&nr_rotate_swap);
mutex_lock(&swapon_mutex);
@@ -3258,6 +3380,19 @@ static void swap_stop(struct seq_file *swap, void *v)
mutex_unlock(&swapon_mutex);
}
+static const char *swap_type_str(struct swap_info_struct *si)
+{
+ struct file *file = si->swap_file;
+
+ if (si->flags & SWP_VSWAP)
+ return "vswap\t";
+
+ if (S_ISBLK(file_inode(file)->i_mode))
+ return "partition";
+
+ return "file\t";
+}
+
static int swap_show(struct seq_file *swap, void *v)
{
struct swap_info_struct *si = v;
@@ -3277,8 +3412,7 @@ static int swap_show(struct seq_file *swap, void *v)
len = seq_file_path(swap, file, " \t\n\\");
seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\n",
len < 40 ? 40 - len : 1, " ",
- S_ISBLK(file_inode(file)->i_mode) ?
- "partition" : "file\t",
+ swap_type_str(si),
bytes, bytes < 10000000 ? "\t" : "",
inuse, inuse < 10000000 ? "\t" : "",
si->prio);
@@ -3410,7 +3544,6 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
return 0;
}
-
/*
* Find out how many pages are allowed for a single swap device. There
* are two limiting factors:
@@ -3516,10 +3649,43 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
unsigned long maxpages)
{
unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
- struct swap_cluster_info *cluster_info;
+ struct swap_cluster_info *cluster_info = NULL;
+ struct swap_cluster_info_dynamic *ci_dyn;
int err = -ENOMEM;
unsigned long i;
+ /* For SWP_VSWAP files, initialize Xarray pool instead of static array */
+ if (si->flags & SWP_VSWAP) {
+ /*
+ * Pre-allocate cluster 0 and mark slot 0 (header page)
+ * as bad so the allocator never hands out page offset 0.
+ */
+ ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_KERNEL);
+ if (!ci_dyn)
+ goto err;
+ spin_lock_init(&ci_dyn->ci.lock);
+ INIT_LIST_HEAD(&ci_dyn->ci.list);
+
+ nr_clusters = 0;
+ xa_init_flags(&si->cluster_info_pool, XA_FLAGS_ALLOC);
+ err = xa_insert(&si->cluster_info_pool, 0, ci_dyn, GFP_KERNEL);
+ if (err) {
+ kfree(ci_dyn);
+ goto err;
+ }
+
+ err = swap_cluster_setup_bad_slot(si, &ci_dyn->ci, 0, false);
+ if (err) {
+ xa_erase(&si->cluster_info_pool, 0);
+ swap_cluster_free_table(&ci_dyn->ci);
+ kfree(ci_dyn);
+ xa_destroy(&si->cluster_info_pool);
+ goto err;
+ }
+
+ goto setup_cluster_info;
+ }
+
cluster_info = kvzalloc_objs(*cluster_info, nr_clusters);
if (!cluster_info)
goto err;
@@ -3544,6 +3710,10 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
err = swap_cluster_setup_bad_slot(si, cluster_info, 0, false);
if (err)
goto err;
+
+ if (!swap_header)
+ goto setup_cluster_info;
+
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
@@ -3563,6 +3733,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
goto err;
}
+setup_cluster_info:
INIT_LIST_HEAD(&si->free_clusters);
INIT_LIST_HEAD(&si->full_clusters);
INIT_LIST_HEAD(&si->discard_clusters);
@@ -3599,7 +3770,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
struct dentry *dentry;
int prio;
int error;
- union swap_header *swap_header;
+ union swap_header *swap_header = NULL;
int nr_extents;
sector_t span;
unsigned long maxpages;
@@ -3673,7 +3844,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
swap_header = kmap_local_folio(folio, 0);
-
maxpages = read_swap_header(si, swap_header, inode);
if (unlikely(!maxpages)) {
error = -EINVAL;
@@ -3708,7 +3878,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (si->bdev && !bdev_rot(si->bdev)) {
si->flags |= SWP_SOLIDSTATE;
- } else {
+ } else if (!(si->flags & SWP_SOLIDSTATE)) {
atomic_inc(&nr_rotate_swap);
inced_nr_rotate_swap = true;
}
@@ -3930,3 +4100,47 @@ static int __init swapfile_init(void)
return 0;
}
subsys_initcall(swapfile_init);
+
+#ifdef CONFIG_VSWAP
+struct swap_info_struct *vswap_si;
+
+static int __init vswap_init(void)
+{
+ struct swap_info_struct *si;
+ unsigned long maxpages;
+ int err;
+
+ si = alloc_swap_info();
+ if (IS_ERR(si))
+ return PTR_ERR(si);
+
+ maxpages = min(swapfile_maximum_size,
+ ALIGN_DOWN((unsigned long)UINT_MAX, SWAPFILE_CLUSTER));
+ si->flags |= SWP_VSWAP | SWP_SOLIDSTATE | SWP_WRITEOK;
+ si->bdev = NULL;
+ si->max = maxpages;
+ si->pages = maxpages - 1;
+ si->prio = SHRT_MAX;
+ si->list.prio = -si->prio;
+ si->avail_list.prio = -si->prio;
+
+ err = setup_swap_clusters_info(si, NULL, maxpages);
+ if (err)
+ goto fail;
+
+ mutex_lock(&swapon_mutex);
+ enable_swap_info(si);
+ mutex_unlock(&swapon_mutex);
+
+ vswap_si = si;
+ pr_info("vswap: created virtual swap device (%lu pages)\n", maxpages);
+ return 0;
+
+fail:
+ spin_lock(&swap_lock);
+ si->flags = 0;
+ spin_unlock(&swap_lock);
+ return err;
+}
+late_initcall(vswap_init);
+#endif
diff --git a/mm/vswap.h b/mm/vswap.h
new file mode 100644
index 000000000000..a1fd7f7e568f
--- /dev/null
+++ b/mm/vswap.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Virtual swap space
+ *
+ * Copyright (C) 2026 Nhat Pham
+ */
+#ifndef _MM_VSWAP_H
+#define _MM_VSWAP_H
+
+#include <linux/swap.h>
+
+#ifdef CONFIG_VSWAP
+
+extern struct swap_info_struct *vswap_si;
+
+static inline bool swap_is_vswap(struct swap_info_struct *si)
+{
+ return si->flags & SWP_VSWAP;
+}
+
+#else
+
+static inline bool swap_is_vswap(struct swap_info_struct *si)
+{
+ return false;
+}
+
+#endif /* CONFIG_VSWAP */
+
+#ifdef CONFIG_SWAP
+#include "swap.h"
+static inline bool is_vswap_entry(swp_entry_t entry)
+{
+ return swap_is_vswap(__swap_entry_to_info(entry));
+}
+#else
+static inline bool is_vswap_entry(swp_entry_t entry)
+{
+ return false;
+}
+#endif /* CONFIG_SWAP */
+
+#endif /* _MM_VSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..993406074d58 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -994,11 +994,16 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
struct swap_info_struct *si;
int ret = 0;
- /* try to allocate swap cache folio */
si = get_swap_device(swpentry);
if (!si)
return -EEXIST;
+ if (si->flags & SWP_VSWAP) {
+ put_swap_device(si);
+ return -EINVAL;
+ }
+
+ /* try to allocate swap cache folio */
mpol = get_task_policy(current);
folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
NO_INTERLEAVE_INDEX);
@@ -1049,7 +1054,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
folio_set_reclaim(folio);
/* start writeback */
- __swap_writepage(folio, NULL);
+ ret = __swap_writepage(folio, NULL);
+ WARN_ON_ONCE(ret);
out:
if (ret) {
--
2.53.0-Meta
next prev parent reply other threads:[~2026-06-12 19:37 UTC|newest]
Thread overview: 13+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-12 19:37 [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition) Nhat Pham
2026-06-12 19:37 ` Nhat Pham [this message]
2026-06-12 19:37 ` [RFC PATCH v2 2/7] mm, swap: support zswap and zeroswap as vswap backends Nhat Pham
2026-06-12 19:37 ` [RFC PATCH v2 3/7] mm, swap: support physical swap as a vswap backend Nhat Pham
2026-06-12 19:37 ` [RFC PATCH v2 4/7] mm, swap: only charge physical swap entries Nhat Pham
2026-06-12 19:37 ` [RFC PATCH v2 5/7] mm, swap: add debugfs counters for vswap Nhat Pham
2026-06-12 19:37 ` [RFC PATCH v2 6/7] mm, swap: defer memcg_table allocation on physical clusters Nhat Pham
2026-06-12 19:37 ` [RFC PATCH v2 7/7] mm, swap: widen swap_info_struct max/pages to unsigned long Nhat Pham
2026-06-14 8:20 ` [RFC PATCH v2 0/7] mm, swap: Virtual Swap Space (Swap Table Edition) YoungJun Park
2026-06-15 2:38 ` Nhat Pham
2026-06-15 19:56 ` Yosry Ahmed
2026-06-16 1:29 ` YoungJun Park
2026-06-16 12:15 ` Nhat Pham
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260612193738.2183968-2-nphamcs@gmail.com \
--to=nphamcs@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=axelrasmussen@google.com \
--cc=baohua@kernel.org \
--cc=baoquan.he@linux.dev \
--cc=cgroups@vger.kernel.org \
--cc=chengming.zhou@linux.dev \
--cc=chrisl@kernel.org \
--cc=david@kernel.org \
--cc=gourry@gourry.net \
--cc=hannes@cmpxchg.org \
--cc=haowenchao22@gmail.com \
--cc=kasong@tencent.com \
--cc=kernel-team@meta.com \
--cc=liam@infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@kernel.org \
--cc=muchun.song@linux.dev \
--cc=qi.zheng@linux.dev \
--cc=riel@surriel.com \
--cc=roman.gushchin@linux.dev \
--cc=rppt@kernel.org \
--cc=shakeel.butt@linux.dev \
--cc=shikemeng@huaweicloud.com \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
--cc=weixugc@google.com \
--cc=yosry@kernel.org \
--cc=youngjun.park@lge.com \
--cc=yuanchu@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox