* [RFC PATCH v2 00/18] Virtual Swap Space
From: Nhat Pham @ 2025-04-29 23:38 UTC
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
Changelog:
* v2:
* Use a single atomic type (swap_refs) for reference counting
purposes. This brings the size of the swap descriptor from 64 bytes
down to 48 bytes (a 25% reduction). Suggested by Yosry Ahmed.
* Zeromap bitmap is removed in the virtual swap implementation.
This saves one bit per physical swapfile slot.
* Rearrange the patches and the code change to make things more
reviewable. Suggested by Johannes Weiner.
* Update the cover letter a bit.
This RFC implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).
I. Motivation
Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is
purely disk space and swapoff is rare.
However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
mobile and embedded devices), users cannot adopt zswap, and are forced
to use zram. This is confusing for users, and creates an extra burden
for developers, who have to develop and maintain similar features for
two separate swap backends (writeback, cgroup charging, THP support,
etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
we have swapfiles on the order of tens to hundreds of GBs, which are
mostly unused and only exist to enable zswap and the zero-filled page
swap optimization. This also implicitly limits the memory-saving
potential of these optimizations to the static size of the swapfile,
which is especially problematic in high-memory systems
that can have up to TBs worth of memory.
* Operationally, the old design poses significant challenges, because
the sysadmin has to prescribe how much swap is needed a priori, for
each combination of (memory size x disk space x workload usage). It
is even more complicated when we take into account the variance of
memory compression, which changes the reclaim dynamics (and as a
result, the swap space requirement). The problem is further exacerbated
for users who rely on swap utilization (and exhaustion) as an OOM
signal.
Another motivation for a swap redesign is to simplify swapoff, which
is both complicated and expensive in the current design. Tight coupling
between a swap entry and its backing storage means that swapoff
requires a full page table walk to update all the page table entries
that refer to the affected swap entries, as well as updating all the
associated swap data structures (swap cache, etc.).
II. High Level Design Overview
To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:
struct swp_desc {
        union {
                swp_slot_t slot;
                struct folio *folio;
                struct zswap_entry *zswap_entry;
        };
        struct rcu_head rcu;
        rwlock_t lock;
        enum swap_type type;
#ifdef CONFIG_MEMCG
        atomic_t memcgid;
#endif
        atomic_t swap_refs;
};
The size of the swap descriptor (without debug config options) is 48
bytes.
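For illustration only, here is roughly what resolving a virtual swap
slot to its backing state could look like. The lookup helper
(vswap_desc_lookup()) and the VSWAP_* backend tags are made-up names
for this sketch, not the actual API in the series:
static swp_slot_t vswap_resolve_slot(swp_entry_t entry)
{
        /* assumed helper: map the virtual slot to its descriptor */
        struct swp_desc *desc = vswap_desc_lookup(entry);
        swp_slot_t slot = (swp_slot_t) { 0 };

        read_lock(&desc->lock);
        /* only swapfile-backed entries have a physical slot */
        if (desc->type == VSWAP_SWAPFILE)
                slot = desc->slot;
        read_unlock(&desc->lock);

        return slot;
}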
This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
simply associate the virtual swap slot with one of the supported
backends: a zswap entry, a zero-filled swap page, a slot on the
swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
have the virtual swap slot point to the page instead of the on-disk
physical swap slot. No page table walk is needed (see the sketch
below).
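A rough sketch of the swapoff point above (again with made-up names,
not the code in the series): after faulting the page in, swapoff simply
repoints the descriptor at the in-memory folio, and the page table
entries holding the virtual swap slot need no update:
static void vswap_retarget_to_folio(struct swp_desc *desc, struct folio *folio)
{
        write_lock(&desc->lock);
        desc->folio = folio;
        /* VSWAP_FOLIO is a made-up backend tag for this sketch */
        desc->type = VSWAP_FOLIO;
        write_unlock(&desc->lock);
}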
Please see the attached patches for implementation details.
Note that I do not remove the old implementation for now. Users can
select between the old and the new implementation via the
CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
new design, and iteratively optimize upon it (without having to include
everything in an even more massive patch series).
III. Future Use Cases
While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:
* Multi-tier swapping (as mentioned in [5]), with transparent
transferring (promotion/demotion) of pages across tiers (see [8] and
[9]). As with swapoff, the old design would require an expensive
page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
Huang in [6]).
* Mixed backing THP swapin (see [7]): once the backing stores of a
THP's subpages are pinned down, each range of subpages can be
dispatched to the appropriate swapin handler.
* Swapping a folio out with discontiguous physical swap slots
(see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
physical swap space for pages when they enter the zswap pool, giving
the kernel no flexibility at writeback time. With the virtual swap
implementation, the backends are decoupled, and physical swap space
is allocated on-demand at writeback time, at which point we can make
much smarter decisions: we can batch multiple zswap writeback
operations into a single IO request, allocating contiguous physical
swap slots for that request. We can even perform compressed writeback
(i.e writing these pages without decompressing them) (see [12]).
IV. Potential Issues
Here are a couple of issues I can think of, along with some potential
solutions:
1. Space overhead: we need one swap descriptor per swap entry.
* Note that this overhead is dynamic, i.e only incurred when we actually
need to swap a page out.
* The swap descriptor replaces many other swap data structures:
swap_cgroup arrays, zeromap, etc.
* It can be further offset by swap_map reduction: we only need 3 states
for each entry in the swap_map (unallocated, allocated, bad). The
last two states are potentially mergeable, reducing the swap_map to a
bitmap (see the sketch below).
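Purely for illustration (swap_bitmap is a hypothetical field, not part
of this series), the reduced map could then be queried as a plain
bitmap:
static inline bool swap_slot_allocated(struct swap_info_struct *si,
                                       pgoff_t offset)
{
        /* swap_bitmap is a hypothetical replacement for swap_map */
        return test_bit(offset, si->swap_bitmap);
}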
2. Lock contention: since the virtual swap space is dynamic/unbounded,
we cannot naively range partition it anymore. This can increase lock
contention on swap-related data structures (swap cache, zswap’s xarray,
etc.).
* The problem is slightly alleviated by the lockless nature of the new
reference counting scheme, as well as the per-entry locking for
backing store information.
* Johannes suggested that I implement a dynamic partition scheme, in
which new partitions (along with associated data structures) are
allocated on demand. It adds one extra layer of indirection, but
global locking is only done on partition allocation, rather than on
each access. All other accesses either take local (per-partition)
locks or are completely lockless (such as partition lookup); see the
sketch below.
This idea is very similar to Kairui's work to optimize the (physical)
swap allocator. He is currently also working on a swap redesign (see
[11]) - perhaps we can combine the two efforts to take advantage of
the swap allocator's efficiency for virtual swap.
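A very rough sketch of what such a partition scheme could look like
(every name here is invented for illustration; this is not code from
the series). Lookup is lockless via an xarray, and the global lock is
only taken on the allocation slow path:
/*
 * Hypothetical sketch of the dynamic partition scheme. Lookup is
 * lockless; the global lock is only taken when a new partition has to
 * be published.
 */
struct vswap_partition {
        spinlock_t lock;        /* protects this partition only */
        struct xarray slots;    /* virtual slot -> swap descriptor */
};

static DEFINE_SPINLOCK(vswap_partitions_lock);
static DEFINE_XARRAY(vswap_partitions);

static struct vswap_partition *vswap_get_partition(unsigned long id)
{
        struct vswap_partition *part, *new;

        /* fast path: lockless lookup */
        part = xa_load(&vswap_partitions, id);
        if (part)
                return part;

        /* slow path: allocate outside the lock, publish under it */
        new = kzalloc(sizeof(*new), GFP_KERNEL);
        if (!new)
                return NULL;
        spin_lock_init(&new->lock);
        xa_init(&new->slots);

        spin_lock(&vswap_partitions_lock);
        part = xa_load(&vswap_partitions, id);
        if (!part) {
                xa_store(&vswap_partitions, id, new, GFP_ATOMIC);
                part = new;
                new = NULL;
        }
        spin_unlock(&vswap_partitions_lock);

        kfree(new); /* no-op unless we lost the race */
        return part;
}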
V. Benchmarking
As a proof of concept, I ran the prototype through some simple
benchmarks:
1. usemem: 16 threads, 2G each, memory.max = 16G
I benchmarked the following usemem command:
time usemem --init-time -w -O -s 10 -n 16 2g
Baseline:
real: 33.96s
user: 25.31s
sys: 341.09s
average throughput: 111295.45 KB/s
average free time: 2079258.68 usecs
New Design:
real: 35.87s
user: 25.15s
sys: 373.01s
average throughput: 106965.46 KB/s
average free time: 3192465.62 usecs
To root cause this regression, I ran perf on the usemem program, as
well as on the following stress-ng program:
perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng --pageswap $(nproc) --pageswap-ops 100000
and observed the (predicted) increase in lock contention on swap cache
accesses. This regression is alleviated if I put together the
following hack: limit the virtual swap space to a sufficient size for
the benchmark, range partition the swap-related data structures (swap
cache, zswap tree, etc.) based on the limit, and distribute the
allocation of virtual swap slots among these partitions (on a per-CPU
basis):
real: 34.94s
user: 25.28s
sys: 360.25s
average throughput: 108181.15 KB/s
average free time: 2680890.24 usecs
As mentioned above, I will implement proper dynamic virtual swap space
partitioning in a follow-up work, or adopt Kairui's solution.
2. Kernel building: zswap enabled, 52 workers (one per processor),
memory.max = 3G.
Baseline:
real: 183.55s
user: 5119.01s
sys: 655.16s
New Design:
real: mean: 184.5s
user: mean: 5117.4s
sys: mean: 695.23s
New Design (Static Partition):
real: 183.95s
user: 5119.29s
sys: 664.24s
3. Swapoff: 32 GB swapfile, 50% full, with a process that mmap-ed a
128GB file.
Baseline:
real: 25.54s
user: 0.00s
sys: 11.48s
New Design:
real: 11.69s
user: 0.00s
sys: 9.96s
The new design reduces the kernel CPU time by about 13%. There is also
a reduction in real time, but this is mostly due to more asynchronous IO
(rather than the design itself) :)
VI. TODO list
This RFC includes a feature-complete prototype on top of 6.14. Here are
some action items:
Short-term: needs to be done before merging
* More clean-ups and stress-testing.
* Add more documentation of the new design and its API.
Medium-term: optimizations required to make the virtual swap
implementation the default:
* Shrinking the swap map.
* Range partition the virtual swap space.
* More benchmarking and experiments in a variety of use cases.
Long-term: removal of the old implementation and other non-blocking
opportunities
* Remove the old implementation once no major regressions,
bottlenecks, etc. remain with the new design.
* Merge more existing swap data structures into this layer (e.g. the
MTE swap xarray).
* Add new use cases :)
VII. References
[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
Nhat Pham (18):
swap: rearrange the swap header file
swapfile: rearrange functions
swapfile: rearrange freeing steps
mm: swap: add an abstract API for locking out swapoff
mm: swap: add a separate type for physical swap slots
mm: create scaffolds for the new virtual swap implementation
mm: swap: zswap: swap cache and zswap support for virtualized swap
mm: swap: allocate a virtual swap slot for each swapped out page
swap: implement the swap_cgroup API using virtual swap
swap: manage swap entry lifetime at the virtual swap layer
mm: swap: temporarily disable THP swapin and batched freeing swap
mm: swap: decouple virtual swap slot from backing store
zswap: do not start zswap shrinker if there is no physical swap slots
memcg: swap: only charge physical swap slots
vswap: support THP swapin and batch free_swap_and_cache
swap: simplify swapoff using virtual swap
swapfile: move zeromap setup out of enable_swap_info
swapfile: remove zeromap in virtual swap implementation
MAINTAINERS | 7 +
include/linux/mm_types.h | 7 +
include/linux/shmem_fs.h | 3 +
include/linux/swap.h | 263 ++++++-
include/linux/swap_slots.h | 2 +-
include/linux/swapops.h | 37 +
kernel/power/swap.c | 6 +-
mm/Kconfig | 25 +
mm/Makefile | 3 +
mm/huge_memory.c | 5 +-
mm/internal.h | 25 +-
mm/memcontrol.c | 166 +++--
mm/memory.c | 103 ++-
mm/migrate.c | 1 +
mm/page_io.c | 60 +-
mm/shmem.c | 29 +-
mm/swap.h | 45 +-
mm/swap_cgroup.c | 10 +-
mm/swap_slots.c | 28 +-
mm/swap_state.c | 140 +++-
mm/swapfile.c | 831 +++++++++++++--------
mm/userfaultfd.c | 11 +-
mm/vmscan.c | 26 +-
mm/vswap.c | 1400 ++++++++++++++++++++++++++++++++++++
mm/zswap.c | 80 ++-
25 files changed, 2799 insertions(+), 514 deletions(-)
create mode 100644 mm/vswap.c
base-commit: 922ceb9d4bb4dae66c37e24621687e0b4991f5a4
--
2.47.1
* [RFC PATCH v2 01/18] swap: rearrange the swap header file
From: Nhat Pham @ 2025-04-29 23:38 UTC
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
In the swap header file (include/linux/swap.h), group the swap API into
the following categories:
1. Lifetime swap functions (i.e the functions that change the reference
count of the swap entry).
2. Swap cache API.
3. Physical swapfile allocator and swap device API.
Also remove the extern keyword from the function declarations that are
rearranged.
This is purely a cleanup. No functional change intended.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 63 +++++++++++++++++++++++---------------------
1 file changed, 33 insertions(+), 30 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index b13b72645db3..8b8c10356a5c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -453,24 +453,40 @@ extern void __meminit kswapd_stop(int nid);
#ifdef CONFIG_SWAP
-int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
- unsigned long nr_pages, sector_t start_block);
-int generic_swapfile_activate(struct swap_info_struct *, struct file *,
- sector_t *);
-
+/* Lifetime swap API (mm/swapfile.c) */
+swp_entry_t folio_alloc_swap(struct folio *folio);
+bool folio_free_swap(struct folio *folio);
+void put_swap_folio(struct folio *folio, swp_entry_t entry);
+void swap_shmem_alloc(swp_entry_t, int);
+int swap_duplicate(swp_entry_t);
+int swapcache_prepare(swp_entry_t entry, int nr);
+void swap_free_nr(swp_entry_t entry, int nr_pages);
+void free_swap_and_cache_nr(swp_entry_t entry, int nr);
+int __swap_count(swp_entry_t entry);
+int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
+int swp_swapcount(swp_entry_t entry);
+
+/* Swap cache API (mm/swap_state.c) */
static inline unsigned long total_swapcache_pages(void)
{
return global_node_page_state(NR_SWAPCACHE);
}
-
-void free_swap_cache(struct folio *folio);
void free_page_and_swap_cache(struct page *);
void free_pages_and_swap_cache(struct encoded_page **, int);
-/* linux/mm/swapfile.c */
+void free_swap_cache(struct folio *folio);
+int init_swap_address_space(unsigned int type, unsigned long nr_pages);
+void exit_swap_address_space(unsigned int type);
+
+/* Physical swap allocator and swap device API (mm/swapfile.c) */
+int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
+ unsigned long nr_pages, sector_t start_block);
+int generic_swapfile_activate(struct swap_info_struct *, struct file *,
+ sector_t *);
+
extern atomic_long_t nr_swap_pages;
extern long total_swap_pages;
extern atomic_t nr_rotate_swap;
-extern bool has_usable_swap(void);
+bool has_usable_swap(void);
/* Swap 50% full? Release swapcache more aggressively.. */
static inline bool vm_swap_full(void)
@@ -483,31 +499,18 @@ static inline long get_nr_swap_pages(void)
return atomic_long_read(&nr_swap_pages);
}
-extern void si_swapinfo(struct sysinfo *);
-swp_entry_t folio_alloc_swap(struct folio *folio);
-bool folio_free_swap(struct folio *folio);
-void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
-extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
-extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t, int);
-extern int swap_duplicate(swp_entry_t);
-extern int swapcache_prepare(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
+void si_swapinfo(struct sysinfo *);
+swp_entry_t get_swap_page_of_type(int);
+int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
+int add_swap_count_continuation(swp_entry_t, gfp_t);
+void swapcache_free_entries(swp_entry_t *entries, int n);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
-extern unsigned int count_swap_pages(int, int);
-extern sector_t swapdev_block(int, pgoff_t);
-extern int __swap_count(swp_entry_t entry);
-extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry);
-extern int swp_swapcount(swp_entry_t entry);
+unsigned int count_swap_pages(int, int);
+sector_t swapdev_block(int, pgoff_t);
struct swap_info_struct *swp_swap_info(swp_entry_t entry);
struct backing_dev_info;
-extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
-extern void exit_swap_address_space(unsigned int type);
-extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
static inline void put_swap_device(struct swap_info_struct *si)
--
2.47.1
* [RFC PATCH v2 02/18] swapfile: rearrange functions
From: Nhat Pham @ 2025-04-29 23:38 UTC
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
Rearrange some functions in preparation for the rest of the series. No
functional change intended.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/swapfile.c | 332 +++++++++++++++++++++++++-------------------------
1 file changed, 166 insertions(+), 166 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index df7c4e8b089c..426674d35983 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -124,11 +124,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
return READ_ONCE(swap_info[type]); /* rcu_dereference() */
}
-static inline unsigned char swap_count(unsigned char ent)
-{
- return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
-}
-
/*
* Use the second highest bit of inuse_pages counter as the indicator
* if one swap device is on the available plist, so the atomic can
@@ -161,6 +156,11 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
/* Reclaim directly, bypass the slot cache and don't touch device lock */
#define TTRS_DIRECT 0x8
+static inline unsigned char swap_count(unsigned char ent)
+{
+ return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
+}
+
static bool swap_is_has_cache(struct swap_info_struct *si,
unsigned long offset, int nr_pages)
{
@@ -1326,46 +1326,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
return NULL;
}
-static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
- unsigned long offset,
- unsigned char usage)
-{
- unsigned char count;
- unsigned char has_cache;
-
- count = si->swap_map[offset];
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE) {
- VM_BUG_ON(!has_cache);
- has_cache = 0;
- } else if (count == SWAP_MAP_SHMEM) {
- /*
- * Or we could insist on shmem.c using a special
- * swap_shmem_free() and free_shmem_swap_and_cache()...
- */
- count = 0;
- } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
- if (count == COUNT_CONTINUED) {
- if (swap_count_continued(si, offset, count))
- count = SWAP_MAP_MAX | COUNT_CONTINUED;
- else
- count = SWAP_MAP_MAX;
- } else
- count--;
- }
-
- usage = count | has_cache;
- if (usage)
- WRITE_ONCE(si->swap_map[offset], usage);
- else
- WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE);
-
- return usage;
-}
-
/*
* When we get a swap entry, if there aren't some other ways to
* prevent swapoff, such as the folio in swap cache is locked, RCU
@@ -1432,6 +1392,46 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
+static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
+ unsigned long offset,
+ unsigned char usage)
+{
+ unsigned char count;
+ unsigned char has_cache;
+
+ count = si->swap_map[offset];
+
+ has_cache = count & SWAP_HAS_CACHE;
+ count &= ~SWAP_HAS_CACHE;
+
+ if (usage == SWAP_HAS_CACHE) {
+ VM_BUG_ON(!has_cache);
+ has_cache = 0;
+ } else if (count == SWAP_MAP_SHMEM) {
+ /*
+ * Or we could insist on shmem.c using a special
+ * swap_shmem_free() and free_shmem_swap_and_cache()...
+ */
+ count = 0;
+ } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
+ if (count == COUNT_CONTINUED) {
+ if (swap_count_continued(si, offset, count))
+ count = SWAP_MAP_MAX | COUNT_CONTINUED;
+ else
+ count = SWAP_MAP_MAX;
+ } else
+ count--;
+ }
+
+ usage = count | has_cache;
+ if (usage)
+ WRITE_ONCE(si->swap_map[offset], usage);
+ else
+ WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE);
+
+ return usage;
+}
+
static unsigned char __swap_entry_free(struct swap_info_struct *si,
swp_entry_t entry)
{
@@ -1585,25 +1585,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
unlock_cluster(ci);
}
-void swapcache_free_entries(swp_entry_t *entries, int n)
-{
- int i;
- struct swap_cluster_info *ci;
- struct swap_info_struct *si = NULL;
-
- if (n <= 0)
- return;
-
- for (i = 0; i < n; ++i) {
- si = _swap_info_get(entries[i]);
- if (si) {
- ci = lock_cluster(si, swp_offset(entries[i]));
- swap_entry_range_free(si, ci, entries[i], 1);
- unlock_cluster(ci);
- }
- }
-}
-
int __swap_count(swp_entry_t entry)
{
struct swap_info_struct *si = swp_swap_info(entry);
@@ -1717,57 +1698,6 @@ static bool folio_swapped(struct folio *folio)
return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
}
-static bool folio_swapcache_freeable(struct folio *folio)
-{
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-
- if (!folio_test_swapcache(folio))
- return false;
- if (folio_test_writeback(folio))
- return false;
-
- /*
- * Once hibernation has begun to create its image of memory,
- * there's a danger that one of the calls to folio_free_swap()
- * - most probably a call from __try_to_reclaim_swap() while
- * hibernation is allocating its own swap pages for the image,
- * but conceivably even a call from memory reclaim - will free
- * the swap from a folio which has already been recorded in the
- * image as a clean swapcache folio, and then reuse its swap for
- * another page of the image. On waking from hibernation, the
- * original folio might be freed under memory pressure, then
- * later read back in from swap, now with the wrong data.
- *
- * Hibernation suspends storage while it is writing the image
- * to disk so check that here.
- */
- if (pm_suspended_storage())
- return false;
-
- return true;
-}
-
-/**
- * folio_free_swap() - Free the swap space used for this folio.
- * @folio: The folio to remove.
- *
- * If swap is getting full, or if there are no more mappings of this folio,
- * then call folio_free_swap to free its swap space.
- *
- * Return: true if we were able to release the swap space.
- */
-bool folio_free_swap(struct folio *folio)
-{
- if (!folio_swapcache_freeable(folio))
- return false;
- if (folio_swapped(folio))
- return false;
-
- delete_from_swap_cache(folio);
- folio_set_dirty(folio);
- return true;
-}
-
/**
* free_swap_and_cache_nr() - Release reference on range of swap entries and
* reclaim their cache if no more references remain.
@@ -1842,6 +1772,76 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
put_swap_device(si);
}
+void swapcache_free_entries(swp_entry_t *entries, int n)
+{
+ int i;
+ struct swap_cluster_info *ci;
+ struct swap_info_struct *si = NULL;
+
+ if (n <= 0)
+ return;
+
+ for (i = 0; i < n; ++i) {
+ si = _swap_info_get(entries[i]);
+ if (si) {
+ ci = lock_cluster(si, swp_offset(entries[i]));
+ swap_entry_range_free(si, ci, entries[i], 1);
+ unlock_cluster(ci);
+ }
+ }
+}
+
+static bool folio_swapcache_freeable(struct folio *folio)
+{
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+
+ if (!folio_test_swapcache(folio))
+ return false;
+ if (folio_test_writeback(folio))
+ return false;
+
+ /*
+ * Once hibernation has begun to create its image of memory,
+ * there's a danger that one of the calls to folio_free_swap()
+ * - most probably a call from __try_to_reclaim_swap() while
+ * hibernation is allocating its own swap pages for the image,
+ * but conceivably even a call from memory reclaim - will free
+ * the swap from a folio which has already been recorded in the
+ * image as a clean swapcache folio, and then reuse its swap for
+ * another page of the image. On waking from hibernation, the
+ * original folio might be freed under memory pressure, then
+ * later read back in from swap, now with the wrong data.
+ *
+ * Hibernation suspends storage while it is writing the image
+ * to disk so check that here.
+ */
+ if (pm_suspended_storage())
+ return false;
+
+ return true;
+}
+
+/**
+ * folio_free_swap() - Free the swap space used for this folio.
+ * @folio: The folio to remove.
+ *
+ * If swap is getting full, or if there are no more mappings of this folio,
+ * then call folio_free_swap to free its swap space.
+ *
+ * Return: true if we were able to release the swap space.
+ */
+bool folio_free_swap(struct folio *folio)
+{
+ if (!folio_swapcache_freeable(folio))
+ return false;
+ if (folio_swapped(folio))
+ return false;
+
+ delete_from_swap_cache(folio);
+ folio_set_dirty(folio);
+ return true;
+}
+
#ifdef CONFIG_HIBERNATION
swp_entry_t get_swap_page_of_type(int type)
@@ -1957,6 +1957,37 @@ unsigned int count_swap_pages(int type, int free)
}
#endif /* CONFIG_HIBERNATION */
+/*
+ * Scan swap_map from current position to next entry still in use.
+ * Return 0 if there are no inuse entries after prev till end of
+ * the map.
+ */
+static unsigned int find_next_to_unuse(struct swap_info_struct *si,
+ unsigned int prev)
+{
+ unsigned int i;
+ unsigned char count;
+
+ /*
+ * No need for swap_lock here: we're just looking
+ * for whether an entry is in use, not modifying it; false
+ * hits are okay, and sys_swapoff() has already prevented new
+ * allocations from this area (while holding swap_lock).
+ */
+ for (i = prev + 1; i < si->max; i++) {
+ count = READ_ONCE(si->swap_map[i]);
+ if (count && swap_count(count) != SWAP_MAP_BAD)
+ break;
+ if ((i % LATENCY_LIMIT) == 0)
+ cond_resched();
+ }
+
+ if (i == si->max)
+ i = 0;
+
+ return i;
+}
+
static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
{
return pte_same(pte_swp_clear_flags(pte), swp_pte);
@@ -2241,37 +2272,6 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
return ret;
}
-/*
- * Scan swap_map from current position to next entry still in use.
- * Return 0 if there are no inuse entries after prev till end of
- * the map.
- */
-static unsigned int find_next_to_unuse(struct swap_info_struct *si,
- unsigned int prev)
-{
- unsigned int i;
- unsigned char count;
-
- /*
- * No need for swap_lock here: we're just looking
- * for whether an entry is in use, not modifying it; false
- * hits are okay, and sys_swapoff() has already prevented new
- * allocations from this area (while holding swap_lock).
- */
- for (i = prev + 1; i < si->max; i++) {
- count = READ_ONCE(si->swap_map[i]);
- if (count && swap_count(count) != SWAP_MAP_BAD)
- break;
- if ((i % LATENCY_LIMIT) == 0)
- cond_resched();
- }
-
- if (i == si->max)
- i = 0;
-
- return i;
-}
-
static int try_to_unuse(unsigned int type)
{
struct mm_struct *prev_mm;
@@ -3525,6 +3525,26 @@ void si_swapinfo(struct sysinfo *val)
spin_unlock(&swap_lock);
}
+struct swap_info_struct *swp_swap_info(swp_entry_t entry)
+{
+ return swap_type_to_swap_info(swp_type(entry));
+}
+
+/*
+ * out-of-line methods to avoid include hell.
+ */
+struct address_space *swapcache_mapping(struct folio *folio)
+{
+ return swp_swap_info(folio->swap)->swap_file->f_mapping;
+}
+EXPORT_SYMBOL_GPL(swapcache_mapping);
+
+pgoff_t __folio_swap_cache_index(struct folio *folio)
+{
+ return swap_cache_index(folio->swap);
+}
+EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
+
/*
* Verify that nr swap entries are valid and increment their swap map counts.
*
@@ -3658,26 +3678,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE);
}
-struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
- return swap_type_to_swap_info(swp_type(entry));
-}
-
-/*
- * out-of-line methods to avoid include hell.
- */
-struct address_space *swapcache_mapping(struct folio *folio)
-{
- return swp_swap_info(folio->swap)->swap_file->f_mapping;
-}
-EXPORT_SYMBOL_GPL(swapcache_mapping);
-
-pgoff_t __folio_swap_cache_index(struct folio *folio)
-{
- return swap_cache_index(folio->swap);
-}
-EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
--
2.47.1
* [RFC PATCH v2 03/18] swapfile: rearrange freeing steps
From: Nhat Pham @ 2025-04-29 23:38 UTC
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
In the swap free path, certain steps (cgroup uncharging and shadow
clearing) will be handled at the virtual layer eventually. To facilitate
this change, rearrange these steps a bit within their callers. There
should not be any functional change.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/swapfile.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 426674d35983..e717d0e7ae6b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1129,6 +1129,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
unsigned int i;
+ clear_shadow_from_swap_cache(si->type, begin, end);
+
/*
* Use atomic clear_bit operations only on zeromap instead of non-atomic
* bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
@@ -1149,7 +1151,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
swap_slot_free_notify(si->bdev, offset);
offset++;
}
- clear_shadow_from_swap_cache(si->type, begin, end);
/*
* Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1502,6 +1503,8 @@ static void swap_entry_range_free(struct swap_info_struct *si,
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
+ mem_cgroup_uncharge_swap(entry, nr_pages);
+
/* It should never free entries across different clusters */
VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
VM_BUG_ON(cluster_is_empty(ci));
@@ -1513,7 +1516,6 @@ static void swap_entry_range_free(struct swap_info_struct *si,
*map = 0;
} while (++map < map_end);
- mem_cgroup_uncharge_swap(entry, nr_pages);
swap_range_free(si, offset, nr_pages);
if (!ci->count)
--
2.47.1
* [RFC PATCH v2 04/18] mm: swap: add an abstract API for locking out swapoff
From: Nhat Pham @ 2025-04-29 23:38 UTC
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
Currently, we get a reference to the backing swap device in order to
lock out swapoff and ensure its validity. This does not make sense in
the new virtual swap design, especially after the swap backends are
decoupled - a swap entry might not have any backing swap device at all,
and its backend might change at any time during its lifetime.
In preparation for this, abstract the behavior of locking out swapoff
behind a generic API.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 12 ++++++++++++
mm/memory.c | 13 +++++++------
mm/shmem.c | 7 +++----
mm/swap_state.c | 10 ++++------
mm/userfaultfd.c | 11 ++++++-----
5 files changed, 32 insertions(+), 21 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8b8c10356a5c..23eaf44791d4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -709,5 +709,17 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+static inline bool trylock_swapoff(swp_entry_t entry,
+ struct swap_info_struct **si)
+{
+ return get_swap_device(entry);
+}
+
+static inline void unlock_swapoff(swp_entry_t entry,
+ struct swap_info_struct *si)
+{
+ put_swap_device(si);
+}
+
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/memory.c b/mm/memory.c
index fb7b8dc75167..e92914df5ca7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4305,6 +4305,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
bool need_clear_cache = false;
+ bool swapoff_locked = false;
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
@@ -4365,8 +4366,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
/* Prevent swapoff from happening to us. */
- si = get_swap_device(entry);
- if (unlikely(!si))
+ swapoff_locked = trylock_swapoff(entry, &si);
+ if (unlikely(!swapoff_locked))
goto out;
folio = swap_cache_get_folio(entry, vma, vmf->address);
@@ -4713,8 +4714,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
- if (si)
- put_swap_device(si);
+ if (swapoff_locked)
+ unlock_swapoff(entry, si);
return ret;
out_nomap:
if (vmf->pte)
@@ -4732,8 +4733,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (waitqueue_active(&swapcache_wq))
wake_up(&swapcache_wq);
}
- if (si)
- put_swap_device(si);
+ if (swapoff_locked)
+ unlock_swapoff(entry, si);
return ret;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 1ede0800e846..8ef72dcc592e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2262,8 +2262,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (is_poisoned_swp_entry(swap))
return -EIO;
- si = get_swap_device(swap);
- if (!si) {
+ if (!trylock_swapoff(swap, &si)) {
if (!shmem_confirm_swap(mapping, index, swap))
return -EEXIST;
else
@@ -2411,7 +2410,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
folio_mark_dirty(folio);
swap_free_nr(swap, nr_pages);
- put_swap_device(si);
+ unlock_swapoff(swap, si);
*foliop = folio;
return 0;
@@ -2428,7 +2427,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_unlock(folio);
folio_put(folio);
}
- put_swap_device(si);
+ unlock_swapoff(swap, si);
return error;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ca42b2be64d9..81f69b2df550 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -419,12 +419,11 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
if (non_swap_entry(swp))
return ERR_PTR(-ENOENT);
/* Prevent swapoff from happening to us */
- si = get_swap_device(swp);
- if (!si)
+ if (!trylock_swapoff(swp, &si))
return ERR_PTR(-ENOENT);
index = swap_cache_index(swp);
folio = filemap_get_folio(swap_address_space(swp), index);
- put_swap_device(si);
+ unlock_swapoff(swp, si);
return folio;
}
@@ -439,8 +438,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
void *shadow = NULL;
*new_page_allocated = false;
- si = get_swap_device(entry);
- if (!si)
+ if (!trylock_swapoff(entry, &si))
return NULL;
for (;;) {
@@ -538,7 +536,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
put_swap_folio(new_folio, entry);
folio_unlock(new_folio);
put_and_return:
- put_swap_device(si);
+ unlock_swapoff(entry, si);
if (!(*new_page_allocated) && new_folio)
folio_put(new_folio);
return result;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index d06453fa8aba..f40bbfd09fd5 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1161,6 +1161,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
struct folio *src_folio = NULL;
struct anon_vma *src_anon_vma = NULL;
struct mmu_notifier_range range;
+ bool swapoff_locked = false;
int err = 0;
flush_cache_range(src_vma, src_addr, src_addr + PAGE_SIZE);
@@ -1367,8 +1368,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto out;
}
- si = get_swap_device(entry);
- if (unlikely(!si)) {
+ swapoff_locked = trylock_swapoff(entry, &si);
+ if (unlikely(!swapoff_locked)) {
err = -EAGAIN;
goto out;
}
@@ -1399,7 +1400,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
pte_unmap(src_pte);
pte_unmap(dst_pte);
src_pte = dst_pte = NULL;
- put_swap_device(si);
+ unlock_swapoff(entry, si);
si = NULL;
/* now we can block and wait */
folio_lock(src_folio);
@@ -1425,8 +1426,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
if (src_pte)
pte_unmap(src_pte);
mmu_notifier_invalidate_range_end(&range);
- if (si)
- put_swap_device(si);
+ if (swapoff_locked)
+ unlock_swapoff(entry, si);
return err;
}
--
2.47.1
* [RFC PATCH v2 05/18] mm: swap: add a separate type for physical swap slots
From: Nhat Pham @ 2025-04-29 23:38 UTC
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
In preparation for swap virtualization, add a new type to represent the
physical swap slots of a swapfile. This allows us to separate:
1. The logical view of the swap entry (i.e what is stored in page table
entries and used to index into the swap cache), represented by the
old swp_entry_t type.
from:
2. Its physical backing state (i.e the actual backing slot on the swap
device), represented by the new swp_slot_t type.
The functions that operate at the physical level (i.e on the swp_slot_t
type) are also renamed where appropriate (e.g. prefixed with
swp_slot_*).
Note that we have not made any behavioral change - the mapping between
the two types is the identity mapping. In later patches, we shall
dynamically allocate a virtual swap slot (of type swp_entry_t) for each
swapped out page to store in the page table entry, and associate it with
a backing store. A physical swap slot (i.e a slot on a physical swap
device) is one of the backing options.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/mm_types.h | 7 ++
include/linux/swap.h | 70 +++++++++--
include/linux/swap_slots.h | 2 +-
include/linux/swapops.h | 25 ++++
kernel/power/swap.c | 6 +-
mm/internal.h | 10 +-
mm/memory.c | 7 +-
mm/page_io.c | 33 +++--
mm/shmem.c | 21 +++-
mm/swap.h | 17 +--
mm/swap_cgroup.c | 10 +-
mm/swap_slots.c | 28 ++---
mm/swap_state.c | 28 +++--
mm/swapfile.c | 243 ++++++++++++++++++++-----------------
14 files changed, 324 insertions(+), 183 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0234f14f2aa6..7d93bb2c3dae 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -283,6 +283,13 @@ typedef struct {
unsigned long val;
} swp_entry_t;
+/*
+ * Physical (i.e disk-based) swap slot handle.
+ */
+typedef struct {
+ unsigned long val;
+} swp_slot_t;
+
/**
* struct folio - Represents a contiguous set of bytes.
* @flags: Identical to the page flags.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 23eaf44791d4..567fd2ebb0d3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -277,7 +277,7 @@ enum swap_cluster_flags {
* cluster to which it belongs being marked free. Therefore 0 is safe to use as
* a sentinel to indicate an entry is not valid.
*/
-#define SWAP_ENTRY_INVALID 0
+#define SWAP_SLOT_INVALID 0
#ifdef CONFIG_THP_SWAP
#define SWAP_NR_ORDERS (PMD_ORDER + 1)
@@ -471,12 +471,16 @@ static inline unsigned long total_swapcache_pages(void)
{
return global_node_page_state(NR_SWAPCACHE);
}
+
void free_page_and_swap_cache(struct page *);
void free_pages_and_swap_cache(struct encoded_page **, int);
void free_swap_cache(struct folio *folio);
int init_swap_address_space(unsigned int type, unsigned long nr_pages);
void exit_swap_address_space(unsigned int type);
+/* Swap slot cache API (mm/swap_slot.c) */
+swp_slot_t folio_alloc_swap_slot(struct folio *folio);
+
/* Physical swap allocator and swap device API (mm/swapfile.c) */
int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page,
unsigned long nr_pages, sector_t start_block);
@@ -500,36 +504,37 @@ static inline long get_nr_swap_pages(void)
}
void si_swapinfo(struct sysinfo *);
-swp_entry_t get_swap_page_of_type(int);
-int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
+swp_slot_t swap_slot_alloc_of_type(int);
+int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order);
+void swap_slot_free_nr(swp_slot_t slot, int nr_pages);
int add_swap_count_continuation(swp_entry_t, gfp_t);
-void swapcache_free_entries(swp_entry_t *entries, int n);
+void swap_slot_cache_free_slots(swp_slot_t *slots, int n);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
unsigned int count_swap_pages(int, int);
sector_t swapdev_block(int, pgoff_t);
-struct swap_info_struct *swp_swap_info(swp_entry_t entry);
+struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot);
struct backing_dev_info;
-struct swap_info_struct *get_swap_device(swp_entry_t entry);
+struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot);
sector_t swap_folio_sector(struct folio *folio);
-static inline void put_swap_device(struct swap_info_struct *si)
+static inline void swap_slot_put_swap_info(struct swap_info_struct *si)
{
percpu_ref_put(&si->users);
}
#else /* CONFIG_SWAP */
-static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
+static inline struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot)
{
return NULL;
}
-static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
+static inline struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot)
{
return NULL;
}
-static inline void put_swap_device(struct swap_info_struct *si)
+static inline void swap_slot_put_swap_info(struct swap_info_struct *si)
{
}
@@ -578,7 +583,7 @@ static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}
-static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
+static inline void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
}
@@ -609,12 +614,24 @@ static inline bool folio_free_swap(struct folio *folio)
return false;
}
+static inline swp_slot_t folio_alloc_swap_slot(struct folio *folio)
+{
+ swp_slot_t slot;
+
+ slot.val = 0;
+ return slot;
+}
+
static inline int add_swap_extent(struct swap_info_struct *sis,
unsigned long start_page,
unsigned long nr_pages, sector_t start_block)
{
return -EINVAL;
}
+
+static inline void swap_slot_free_nr(swp_slot_t slot, int nr_pages)
+{
+}
#endif /* CONFIG_SWAP */
static inline void free_swap_and_cache(swp_entry_t entry)
@@ -709,16 +726,43 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+/**
+ * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
+ * virtual swap slot.
+ * @entry: the virtual swap slot.
+ *
+ * Return: the physical swap slot corresponding to the virtual swap slot.
+ */
+static inline swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
+{
+ return (swp_slot_t) { entry.val };
+}
+
+/**
+ * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a
+ * physical swap slot.
+ * @slot: the physical swap slot.
+ *
+ * Return: the virtual swap slot corresponding to the physical swap slot.
+ */
+static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
+{
+ return (swp_entry_t) { slot.val };
+}
+
static inline bool trylock_swapoff(swp_entry_t entry,
struct swap_info_struct **si)
{
- return get_swap_device(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+
+ *si = swap_slot_tryget_swap_info(slot);
+ return *si;
}
static inline void unlock_swapoff(swp_entry_t entry,
struct swap_info_struct *si)
{
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
}
#endif /* __KERNEL__*/
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 840aec3523b2..1ac926d46389 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -13,7 +13,7 @@
struct swap_slots_cache {
bool lock_initialized;
struct mutex alloc_lock; /* protects slots, nr, cur */
- swp_entry_t *slots;
+ swp_slot_t *slots;
int nr;
int cur;
int n_ret;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 96f26e29fefe..2a4101c9bba4 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -618,5 +618,30 @@ static inline int non_swap_entry(swp_entry_t entry)
return swp_type(entry) >= MAX_SWAPFILES;
}
+/* Physical swap slots operations */
+
+/*
+ * Store a swap device type + offset into a swp_slot_t handle.
+ */
+static inline swp_slot_t swp_slot(unsigned long type, pgoff_t offset)
+{
+ swp_slot_t ret;
+
+ ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK);
+ return ret;
+}
+
+/* Extract the `type' field from a swp_slot_t. */
+static inline unsigned swp_slot_type(swp_slot_t slot)
+{
+ return (slot.val >> SWP_TYPE_SHIFT);
+}
+
+/* Extract the `offset' field from a swp_slot_t. */
+static inline pgoff_t swp_slot_offset(swp_slot_t slot)
+{
+ return slot.val & SWP_OFFSET_MASK;
+}
+
#endif /* CONFIG_MMU */
#endif /* _LINUX_SWAPOPS_H */
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 82b884b67152..32b236a81dbb 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -178,10 +178,10 @@ sector_t alloc_swapdev_block(int swap)
{
unsigned long offset;
- offset = swp_offset(get_swap_page_of_type(swap));
+ offset = swp_slot_offset(swap_slot_alloc_of_type(swap));
if (offset) {
if (swsusp_extents_insert(offset))
- swap_free(swp_entry(swap, offset));
+ swap_slot_free_nr(swp_slot(swap, offset), 1);
else
return swapdev_block(swap, offset);
}
@@ -203,7 +203,7 @@ void free_all_swap_pages(int swap)
ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
- swap_free_nr(swp_entry(swap, ext->start),
+ swap_slot_free_nr(swp_slot(swap, ext->start),
ext->end - ext->start + 1);
kfree(ext);
diff --git a/mm/internal.h b/mm/internal.h
index 20b3535935a3..2d63f6537e35 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -275,9 +275,13 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
*/
static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
- swp_entry_t entry = pte_to_swp_entry(pte);
- pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry),
- (swp_offset(entry) + delta)));
+ swp_entry_t entry = pte_to_swp_entry(pte), new_entry;
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ pte_t new;
+
+ new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot),
+ swp_slot_offset(slot) + delta));
+ new = swp_entry_to_pte(new_entry);
if (pte_swp_soft_dirty(pte))
new = pte_swp_mksoft_dirty(new);
diff --git a/mm/memory.c b/mm/memory.c
index e92914df5ca7..c44e845b5320 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4125,8 +4125,9 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
- struct swap_info_struct *si = swp_swap_info(entry);
- pgoff_t offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = swap_slot_swap_info(slot);
+ pgoff_t offset = swp_slot_offset(slot);
int i;
/*
@@ -4308,6 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
bool swapoff_locked = false;
bool exclusive = false;
swp_entry_t entry;
+ swp_slot_t slot;
pte_t pte;
vm_fault_t ret = 0;
void *shadow = NULL;
@@ -4369,6 +4371,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swapoff_locked = trylock_swapoff(entry, &si);
if (unlikely(!swapoff_locked))
goto out;
+ slot = swp_entry_to_swp_slot(entry);
folio = swap_cache_get_folio(entry, vma, vmf->address);
if (folio)
diff --git a/mm/page_io.c b/mm/page_io.c
index 9b983de351f9..182851c47f43 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,14 +204,17 @@ static bool is_folio_zero_filled(struct folio *folio)
static void swap_zeromap_folio_set(struct folio *folio)
{
struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis =
+ swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap));
int nr_pages = folio_nr_pages(folio);
swp_entry_t entry;
+ swp_slot_t slot;
unsigned int i;
for (i = 0; i < folio_nr_pages(folio); i++) {
entry = page_swap_entry(folio_page(folio, i));
- set_bit(swp_offset(entry), sis->zeromap);
+ slot = swp_entry_to_swp_slot(entry);
+ set_bit(swp_slot_offset(slot), sis->zeromap);
}
count_vm_events(SWPOUT_ZERO, nr_pages);
@@ -223,13 +226,16 @@ static void swap_zeromap_folio_set(struct folio *folio)
static void swap_zeromap_folio_clear(struct folio *folio)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis =
+ swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap));
swp_entry_t entry;
+ swp_slot_t slot;
unsigned int i;
for (i = 0; i < folio_nr_pages(folio); i++) {
entry = page_swap_entry(folio_page(folio, i));
- clear_bit(swp_offset(entry), sis->zeromap);
+ slot = swp_entry_to_swp_slot(entry);
+ clear_bit(swp_slot_offset(slot), sis->zeromap);
}
}
@@ -358,7 +364,8 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
* messages.
*/
pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n",
- ret, swap_dev_pos(page_swap_entry(page)));
+ ret,
+ swap_slot_pos(swp_entry_to_swp_slot(page_swap_entry(page))));
for (p = 0; p < sio->pages; p++) {
page = sio->bvec[p].bv_page;
set_page_dirty(page);
@@ -374,10 +381,11 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
struct swap_iocb *sio = NULL;
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swap_slot_swap_info(slot);
struct file *swap_file = sis->swap_file;
- loff_t pos = swap_dev_pos(folio->swap);
+ loff_t pos = swap_slot_pos(slot);
count_swpout_vm_event(folio);
folio_start_writeback(folio);
@@ -452,7 +460,8 @@ static void swap_writepage_bdev_async(struct folio *folio,
void __swap_writepage(struct folio *folio, struct writeback_control *wbc)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis =
+ swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap));
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
/*
@@ -543,9 +552,10 @@ static bool swap_read_folio_zeromap(struct folio *folio)
static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
+ struct swap_info_struct *sis = swap_slot_swap_info(slot);
struct swap_iocb *sio = NULL;
- loff_t pos = swap_dev_pos(folio->swap);
+ loff_t pos = swap_slot_pos(slot);
if (plug)
sio = *plug;
@@ -614,7 +624,8 @@ static void swap_read_folio_bdev_async(struct folio *folio,
void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis =
+ swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap));
bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
bool workingset = folio_test_workingset(folio);
unsigned long pflags;
diff --git a/mm/shmem.c b/mm/shmem.c
index 8ef72dcc592e..f8efa49eb499 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1387,6 +1387,7 @@ static int shmem_find_swap_entries(struct address_space *mapping,
XA_STATE(xas, &mapping->i_pages, start);
struct folio *folio;
swp_entry_t entry;
+ swp_slot_t slot;
rcu_read_lock();
xas_for_each(&xas, folio, ULONG_MAX) {
@@ -1397,11 +1398,13 @@ static int shmem_find_swap_entries(struct address_space *mapping,
continue;
entry = radix_to_swp_entry(folio);
+ slot = swp_entry_to_swp_slot(entry);
+
/*
* swapin error entries can be found in the mapping. But they're
* deliberately ignored here as we've done everything we can do.
*/
- if (swp_type(entry) != type)
+ if (swp_slot_type(slot) != type)
continue;
indices[folio_batch_count(fbatch)] = xas.xa_index;
@@ -1619,7 +1622,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
if (!swap.val) {
if (nr_pages > 1)
goto try_split;
-
goto redirty;
}
@@ -2164,6 +2166,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
void *alloced_shadow = NULL;
int alloced_order = 0, i;
+ swp_slot_t slot = swp_entry_to_swp_slot(swap);
/* Convert user data gfp flags to xarray node gfp flags */
gfp &= GFP_RECLAIM_MASK;
@@ -2202,11 +2205,14 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
*/
for (i = 0; i < 1 << order; i++) {
pgoff_t aligned_index = round_down(index, 1 << order);
- swp_entry_t tmp;
+ swp_entry_t tmp_entry;
+ swp_slot_t tmp_slot;
- tmp = swp_entry(swp_type(swap), swp_offset(swap) + i);
+ tmp_slot =
+ swp_slot(swp_slot_type(slot), swp_slot_offset(slot) + i);
+ tmp_entry = swp_slot_to_swp_entry(tmp_slot);
__xa_store(&mapping->i_pages, aligned_index + i,
- swp_to_radix_entry(tmp), 0);
+ swp_to_radix_entry(tmp_entry), 0);
}
}
@@ -2253,10 +2259,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct folio *folio = NULL;
bool skip_swapcache = false;
swp_entry_t swap;
+ swp_slot_t slot;
int error, nr_pages, order, split_order;
VM_BUG_ON(!*foliop || !xa_is_value(*foliop));
swap = radix_to_swp_entry(*foliop);
+ slot = swp_entry_to_swp_slot(swap);
*foliop = NULL;
if (is_poisoned_swp_entry(swap))
@@ -2328,7 +2336,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (split_order > 0) {
pgoff_t offset = index - round_down(index, 1 << split_order);
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
+ swap = swp_slot_to_swp_entry(swp_slot(
+ swp_slot_type(slot), swp_slot_offset(slot) + offset));
}
/* Here we actually start the io */
diff --git a/mm/swap.h b/mm/swap.h
index ad2f121de970..d5f8effa8015 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -32,12 +32,10 @@ extern struct address_space *swapper_spaces[];
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])
-/*
- * Return the swap device position of the swap entry.
- */
-static inline loff_t swap_dev_pos(swp_entry_t entry)
+/* Return the swap device position of the swap slot. */
+static inline loff_t swap_slot_pos(swp_slot_t slot)
{
- return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
+ return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
}
/*
@@ -78,7 +76,9 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
static inline unsigned int folio_swap_flags(struct folio *folio)
{
- return swp_swap_info(folio->swap)->flags;
+ swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap);
+
+ return swap_slot_swap_info(swp_slot)->flags;
}
/*
@@ -89,8 +89,9 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
bool *is_zeromap)
{
- struct swap_info_struct *sis = swp_swap_info(entry);
- unsigned long start = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *sis = swap_slot_swap_info(slot);
+ unsigned long start = swp_slot_offset(slot);
unsigned long end = start + max_nr;
bool first_bit;
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 1007c30f12e2..5e4c91d694a0 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -65,11 +65,12 @@ void swap_cgroup_record(struct folio *folio, unsigned short id,
swp_entry_t ent)
{
unsigned int nr_ents = folio_nr_pages(folio);
+ swp_slot_t slot = swp_entry_to_swp_slot(ent);
struct swap_cgroup *map;
pgoff_t offset, end;
unsigned short old;
- offset = swp_offset(ent);
+ offset = swp_slot_offset(slot);
end = offset + nr_ents;
map = swap_cgroup_ctrl[swp_type(ent)].map;
@@ -92,12 +93,12 @@ void swap_cgroup_record(struct folio *folio, unsigned short id,
*/
unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
{
- pgoff_t offset = swp_offset(ent);
+ swp_slot_t slot = swp_entry_to_swp_slot(ent);
+ pgoff_t offset = swp_slot_offset(slot);
pgoff_t end = offset + nr_ents;
struct swap_cgroup *map;
unsigned short old, iter = 0;
- offset = swp_offset(ent);
end = offset + nr_ents;
map = swap_cgroup_ctrl[swp_type(ent)].map;
@@ -120,12 +121,13 @@ unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents)
unsigned short lookup_swap_cgroup_id(swp_entry_t ent)
{
struct swap_cgroup_ctrl *ctrl;
+ swp_slot_t slot = swp_entry_to_swp_slot(ent);
if (mem_cgroup_disabled())
return 0;
ctrl = &swap_cgroup_ctrl[swp_type(ent)];
- return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent));
+ return __swap_cgroup_id_lookup(ctrl->map, swp_slot_offset(slot));
}
int swap_cgroup_swapon(int type, unsigned long max_pages)
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 9c7c171df7ba..4ec2de0c2756 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -111,14 +111,14 @@ static bool check_cache_active(void)
static int alloc_swap_slot_cache(unsigned int cpu)
{
struct swap_slots_cache *cache;
- swp_entry_t *slots;
+ swp_slot_t *slots;
/*
* Do allocation outside swap_slots_cache_mutex
* as kvzalloc could trigger reclaim and folio_alloc_swap,
* which can lock swap_slots_cache_mutex.
*/
- slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t),
+ slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_slot_t),
GFP_KERNEL);
if (!slots)
return -ENOMEM;
@@ -160,7 +160,7 @@ static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots)
cache = &per_cpu(swp_slots, cpu);
if (cache->slots) {
mutex_lock(&cache->alloc_lock);
- swapcache_free_entries(cache->slots + cache->cur, cache->nr);
+ swap_slot_cache_free_slots(cache->slots + cache->cur, cache->nr);
cache->cur = 0;
cache->nr = 0;
if (free_slots && cache->slots) {
@@ -238,22 +238,22 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
cache->cur = 0;
if (swap_slot_cache_active)
- cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
+ cache->nr = swap_slot_alloc(SWAP_SLOTS_CACHE_SIZE,
cache->slots, 0);
return cache->nr;
}
-swp_entry_t folio_alloc_swap(struct folio *folio)
+swp_slot_t folio_alloc_swap_slot(struct folio *folio)
{
- swp_entry_t entry;
+ swp_slot_t slot;
struct swap_slots_cache *cache;
- entry.val = 0;
+ slot.val = 0;
if (folio_test_large(folio)) {
if (IS_ENABLED(CONFIG_THP_SWAP))
- get_swap_pages(1, &entry, folio_order(folio));
+ swap_slot_alloc(1, &slot, folio_order(folio));
goto out;
}
@@ -273,7 +273,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
if (cache->slots) {
repeat:
if (cache->nr) {
- entry = cache->slots[cache->cur];
+ slot = cache->slots[cache->cur];
cache->slots[cache->cur++].val = 0;
cache->nr--;
} else if (refill_swap_slots_cache(cache)) {
@@ -281,15 +281,11 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
}
}
mutex_unlock(&cache->alloc_lock);
- if (entry.val)
+ if (slot.val)
goto out;
}
- get_swap_pages(1, &entry, 0);
+ swap_slot_alloc(1, &slot, 0);
out:
- if (mem_cgroup_try_charge_swap(folio, entry)) {
- put_swap_folio(folio, entry);
- entry.val = 0;
- }
- return entry;
+ return slot;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 81f69b2df550..cbd1532b6b24 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -167,6 +167,19 @@ void __delete_from_swap_cache(struct folio *folio,
__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
}
+swp_entry_t folio_alloc_swap(struct folio *folio)
+{
+ swp_slot_t slot = folio_alloc_swap_slot(folio);
+ swp_entry_t entry = swp_slot_to_swp_entry(slot);
+
+ if (entry.val && mem_cgroup_try_charge_swap(folio, entry)) {
+ put_swap_folio(folio, entry);
+ entry.val = 0;
+ }
+
+ return entry;
+}
+
/**
* add_to_swap - allocate swap space for a folio
* @folio: folio we want to move to swap
@@ -548,8 +561,8 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* A failure return means that either the page allocation failed or that
* the swap entry is no longer in use.
*
- * get/put_swap_device() aren't needed to call this function, because
- * __read_swap_cache_async() call them and swap_read_folio() holds the
+ * swap_slot_(tryget|put)_swap_info() aren't needed to call this function,
+ * because __read_swap_cache_async() call them and swap_read_folio() holds the
* swap cache folio lock.
*/
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
@@ -654,11 +667,12 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx)
{
struct folio *folio;
- unsigned long entry_offset = swp_offset(entry);
- unsigned long offset = entry_offset;
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ unsigned long slot_offset = swp_slot_offset(slot);
+ unsigned long offset = slot_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swap_slot_swap_info(slot);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
bool page_allocated;
@@ -679,13 +693,13 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
folio = __read_swap_cache_async(
- swp_entry(swp_type(entry), offset),
+ swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), offset)),
gfp_mask, mpol, ilx, &page_allocated, false);
if (!folio)
continue;
if (page_allocated) {
swap_read_folio(folio, &splug);
- if (offset != entry_offset) {
+ if (offset != slot_offset) {
folio_set_readahead(folio);
count_vm_event(SWAP_RA);
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e717d0e7ae6b..17cbf14bac72 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,9 +53,9 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entry_range_free(struct swap_info_struct *si,
+static void swap_slot_range_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages);
+ swp_slot_t slot, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static bool folio_swapcache_freeable(struct folio *folio);
@@ -203,7 +203,8 @@ static bool swap_is_last_map(struct swap_info_struct *si,
static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
- swp_entry_t entry = swp_entry(si->type, offset);
+ swp_entry_t entry = swp_slot_to_swp_entry(swp_slot(si->type, offset));
+ swp_slot_t slot;
struct address_space *address_space = swap_address_space(entry);
struct swap_cluster_info *ci;
struct folio *folio;
@@ -229,7 +230,8 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
/* offset could point to the middle of a large folio */
entry = folio->swap;
- offset = swp_offset(entry);
+ slot = swp_entry_to_swp_slot(entry);
+ offset = swp_slot_offset(slot);
need_reclaim = ((flags & TTRS_ANYWAY) ||
((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
@@ -263,7 +265,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
folio_set_dirty(folio);
ci = lock_cluster(si, offset);
- swap_entry_range_free(si, ci, entry, nr_pages);
+ swap_slot_range_free(si, ci, slot, nr_pages);
unlock_cluster(ci);
ret = nr_pages;
out_unlock:
@@ -344,12 +346,12 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
sector_t swap_folio_sector(struct folio *folio)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ swp_slot_t slot = swp_entry_to_swp_slot(folio->swap);
+ struct swap_info_struct *sis = swap_slot_swap_info(slot);
struct swap_extent *se;
sector_t sector;
- pgoff_t offset;
+ pgoff_t offset = swp_slot_offset(slot);
- offset = swp_offset(folio->swap);
se = offset_to_swap_extent(sis, offset);
sector = se->start_block + (offset - se->start_page);
return sector << (PAGE_SHIFT - 9);
@@ -387,15 +389,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
#ifdef CONFIG_THP_SWAP
#define SWAPFILE_CLUSTER HPAGE_PMD_NR
-#define swap_entry_order(order) (order)
+#define swap_slot_order(order) (order)
#else
#define SWAPFILE_CLUSTER 256
/*
- * Define swap_entry_order() as constant to let compiler to optimize
+ * Define swap_slot_order() as constant to let compiler to optimize
* out some code if !CONFIG_THP_SWAP
*/
-#define swap_entry_order(order) 0
+#define swap_slot_order(order) 0
#endif
#define LATENCY_LIMIT 256
@@ -779,7 +781,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
unsigned int order,
unsigned char usage)
{
- unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
+ unsigned int next = SWAP_SLOT_INVALID, found = SWAP_SLOT_INVALID;
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
unsigned int nr_pages = 1 << order;
@@ -883,7 +885,7 @@ static void swap_reclaim_work(struct work_struct *work)
* pool (a cluster). This might involve allocating a new cluster for current CPU
* too.
*/
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+static unsigned long cluster_alloc_swap_slot(struct swap_info_struct *si, int order,
unsigned char usage)
{
struct swap_cluster_info *ci;
@@ -1137,7 +1139,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
*/
for (i = 0; i < nr_entries; i++) {
clear_bit(offset + i, si->zeromap);
- zswap_invalidate(swp_entry(si->type, offset + i));
+ zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i)));
}
if (si->flags & SWP_BLKDEV)
@@ -1163,16 +1165,16 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
static int cluster_alloc_swap(struct swap_info_struct *si,
unsigned char usage, int nr,
- swp_entry_t slots[], int order)
+ swp_slot_t slots[], int order)
{
int n_ret = 0;
while (n_ret < nr) {
- unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
+ unsigned long offset = cluster_alloc_swap_slot(si, order, usage);
if (!offset)
break;
- slots[n_ret++] = swp_entry(si->type, offset);
+ slots[n_ret++] = swp_slot(si->type, offset);
}
return n_ret;
@@ -1180,7 +1182,7 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
static int scan_swap_map_slots(struct swap_info_struct *si,
unsigned char usage, int nr,
- swp_entry_t slots[], int order)
+ swp_slot_t slots[], int order)
{
unsigned int nr_pages = 1 << order;
@@ -1232,9 +1234,9 @@ static bool get_swap_device_info(struct swap_info_struct *si)
return true;
}
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
+int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order)
{
- int order = swap_entry_order(entry_order);
+ int order = swap_slot_order(entry_order);
unsigned long size = 1 << order;
struct swap_info_struct *si, *next;
long avail_pgs;
@@ -1261,8 +1263,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
- n_goal, swp_entries, order);
- put_swap_device(si);
+ n_goal, swp_slots, order);
+ swap_slot_put_swap_info(si);
if (n_ret || size > 1)
goto check_out;
}
@@ -1293,36 +1295,36 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
return n_ret;
}
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
+static struct swap_info_struct *_swap_info_get(swp_slot_t slot)
{
struct swap_info_struct *si;
unsigned long offset;
- if (!entry.val)
+ if (!slot.val)
goto out;
- si = swp_swap_info(entry);
+ si = swap_slot_swap_info(slot);
if (!si)
goto bad_nofile;
if (data_race(!(si->flags & SWP_USED)))
goto bad_device;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
if (offset >= si->max)
goto bad_offset;
- if (data_race(!si->swap_map[swp_offset(entry)]))
+ if (data_race(!si->swap_map[swp_slot_offset(slot)]))
goto bad_free;
return si;
bad_free:
- pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Unused_offset, slot.val);
goto out;
bad_offset:
- pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val);
goto out;
bad_device:
- pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Unused_file, slot.val);
goto out;
bad_nofile:
- pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val);
out:
return NULL;
}
@@ -1332,8 +1334,9 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
* prevent swapoff, such as the folio in swap cache is locked, RCU
* reader side is locked, etc., the swap entry may become invalid
* because of swapoff. Then, we need to enclose all swap related
- * functions with get_swap_device() and put_swap_device(), unless the
- * swap functions call get/put_swap_device() by themselves.
+ * functions with swap_slot_tryget_swap_info() and
+ * swap_slot_put_swap_info(), unless the swap functions call
+ * swap_slot_(tryget|put)_swap_info by themselves.
*
* RCU reader side lock (including any spinlock) is sufficient to
* prevent swapoff, because synchronize_rcu() is called in swapoff()
@@ -1342,11 +1345,11 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
* Check whether swap entry is valid in the swap device. If so,
* return pointer to swap_info_struct, and keep the swap entry valid
* via preventing the swap device from being swapoff, until
- * put_swap_device() is called. Otherwise return NULL.
+ * swap_slot_put_swap_info() is called. Otherwise return NULL.
*
* Notice that swapoff or swapoff+swapon can still happen before the
- * percpu_ref_tryget_live() in get_swap_device() or after the
- * percpu_ref_put() in put_swap_device() if there isn't any other way
+ * percpu_ref_tryget_live() in swap_slot_tryget_swap_info() or after the
+ * percpu_ref_put() in swap_slot_put_swap_info() if there isn't any other way
* to prevent swapoff. The caller must be prepared for that. For
* example, the following situation is possible.
*
@@ -1366,34 +1369,34 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
* changed with the page table locked to check whether the swap device
* has been swapoff or swapoff+swapon.
*/
-struct swap_info_struct *get_swap_device(swp_entry_t entry)
+struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot)
{
struct swap_info_struct *si;
unsigned long offset;
- if (!entry.val)
+ if (!slot.val)
goto out;
- si = swp_swap_info(entry);
+ si = swap_slot_swap_info(slot);
if (!si)
goto bad_nofile;
if (!get_swap_device_info(si))
goto out;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
if (offset >= si->max)
goto put_out;
return si;
bad_nofile:
- pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val);
out:
return NULL;
put_out:
- pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
+ pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val);
percpu_ref_put(&si->users);
return NULL;
}
-static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
+static unsigned char __swap_slot_free_locked(struct swap_info_struct *si,
unsigned long offset,
unsigned char usage)
{
@@ -1433,27 +1436,27 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
return usage;
}
-static unsigned char __swap_entry_free(struct swap_info_struct *si,
- swp_entry_t entry)
+static unsigned char __swap_slot_free(struct swap_info_struct *si,
+ swp_slot_t slot)
{
struct swap_cluster_info *ci;
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
unsigned char usage;
ci = lock_cluster(si, offset);
- usage = __swap_entry_free_locked(si, offset, 1);
+ usage = __swap_slot_free_locked(si, offset, 1);
if (!usage)
- swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1);
+ swap_slot_range_free(si, ci, swp_slot(si->type, offset), 1);
unlock_cluster(ci);
return usage;
}
-static bool __swap_entries_free(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
+static bool __swap_slots_free(struct swap_info_struct *si,
+ swp_slot_t slot, int nr)
{
- unsigned long offset = swp_offset(entry);
- unsigned int type = swp_type(entry);
+ unsigned long offset = swp_slot_offset(slot);
+ unsigned int type = swp_slot_type(slot);
struct swap_cluster_info *ci;
bool has_cache = false;
unsigned char count;
@@ -1473,7 +1476,7 @@ static bool __swap_entries_free(struct swap_info_struct *si,
for (i = 0; i < nr; i++)
WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
if (!has_cache)
- swap_entry_range_free(si, ci, entry, nr);
+ swap_slot_range_free(si, ci, slot, nr);
unlock_cluster(ci);
return has_cache;
@@ -1481,7 +1484,7 @@ static bool __swap_entries_free(struct swap_info_struct *si,
fallback:
for (i = 0; i < nr; i++) {
if (data_race(si->swap_map[offset + i])) {
- count = __swap_entry_free(si, swp_entry(type, offset + i));
+ count = __swap_slot_free(si, swp_slot(type, offset + i));
if (count == SWAP_HAS_CACHE)
has_cache = true;
} else {
@@ -1495,13 +1498,14 @@ static bool __swap_entries_free(struct swap_info_struct *si,
* Drop the last HAS_CACHE flag of swap entries, caller have to
* ensure all entries belong to the same cgroup.
*/
-static void swap_entry_range_free(struct swap_info_struct *si,
+static void swap_slot_range_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages)
+ swp_slot_t slot, unsigned int nr_pages)
{
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
+ swp_entry_t entry = swp_slot_to_swp_entry(slot);
mem_cgroup_uncharge_swap(entry, nr_pages);
@@ -1533,23 +1537,19 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
ci = lock_cluster(si, offset);
do {
- if (!__swap_entry_free_locked(si, offset, usage))
- swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1);
+ if (!__swap_slot_free_locked(si, offset, usage))
+ swap_slot_range_free(si, ci, swp_slot(si->type, offset), 1);
} while (++offset < end);
unlock_cluster(ci);
}
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free_nr(swp_entry_t entry, int nr_pages)
+void swap_slot_free_nr(swp_slot_t slot, int nr_pages)
{
int nr;
struct swap_info_struct *sis;
- unsigned long offset = swp_offset(entry);
+ unsigned long offset = swp_slot_offset(slot);
- sis = _swap_info_get(entry);
+ sis = _swap_info_get(slot);
if (!sis)
return;
@@ -1561,27 +1561,37 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
}
}
+/*
+ * Caller has made sure that the swap device corresponding to entry
+ * is still around or has not been recycled.
+ */
+void swap_free_nr(swp_entry_t entry, int nr_pages)
+{
+ swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages);
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
- unsigned long offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ unsigned long offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
- int size = 1 << swap_entry_order(folio_order(folio));
+ int size = 1 << swap_slot_order(folio_order(folio));
- si = _swap_info_get(entry);
+ si = _swap_info_get(slot);
if (!si)
return;
ci = lock_cluster(si, offset);
if (swap_is_has_cache(si, offset, size))
- swap_entry_range_free(si, ci, entry, size);
+ swap_slot_range_free(si, ci, slot, size);
else {
- for (int i = 0; i < size; i++, entry.val++) {
- if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE))
- swap_entry_range_free(si, ci, entry, 1);
+ for (int i = 0; i < size; i++, slot.val++) {
+ if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE))
+ swap_slot_range_free(si, ci, slot, 1);
}
}
unlock_cluster(ci);
@@ -1589,8 +1599,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
int __swap_count(swp_entry_t entry)
{
- struct swap_info_struct *si = swp_swap_info(entry);
- pgoff_t offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = swap_slot_swap_info(slot);
+ pgoff_t offset = swp_slot_offset(slot);
return swap_count(si->swap_map[offset]);
}
@@ -1602,7 +1613,8 @@ int __swap_count(swp_entry_t entry)
*/
int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
{
- pgoff_t offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ pgoff_t offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
int count;
@@ -1618,6 +1630,7 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
*/
int swp_swapcount(swp_entry_t entry)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
int count, tmp_count, n;
struct swap_info_struct *si;
struct swap_cluster_info *ci;
@@ -1625,11 +1638,11 @@ int swp_swapcount(swp_entry_t entry)
pgoff_t offset;
unsigned char *map;
- si = _swap_info_get(entry);
+ si = _swap_info_get(slot);
if (!si)
return 0;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
ci = lock_cluster(si, offset);
@@ -1661,10 +1674,11 @@ int swp_swapcount(swp_entry_t entry)
static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
swp_entry_t entry, int order)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
struct swap_cluster_info *ci;
unsigned char *map = si->swap_map;
unsigned int nr_pages = 1 << order;
- unsigned long roffset = swp_offset(entry);
+ unsigned long roffset = swp_slot_offset(slot);
unsigned long offset = round_down(roffset, nr_pages);
int i;
bool ret = false;
@@ -1689,7 +1703,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
static bool folio_swapped(struct folio *folio)
{
swp_entry_t entry = folio->swap;
- struct swap_info_struct *si = _swap_info_get(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ struct swap_info_struct *si = _swap_info_get(slot);
if (!si)
return false;
@@ -1712,7 +1727,8 @@ static bool folio_swapped(struct folio *folio)
*/
void free_swap_and_cache_nr(swp_entry_t entry, int nr)
{
- const unsigned long start_offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ const unsigned long start_offset = swp_slot_offset(slot);
const unsigned long end_offset = start_offset + nr;
struct swap_info_struct *si;
bool any_only_cache = false;
@@ -1721,7 +1737,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
if (non_swap_entry(entry))
return;
- si = get_swap_device(entry);
+ si = swap_slot_tryget_swap_info(slot);
if (!si)
return;
@@ -1731,7 +1747,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
/*
* First free all entries in the range.
*/
- any_only_cache = __swap_entries_free(si, entry, nr);
+ any_only_cache = __swap_slots_free(si, slot, nr);
/*
* Short-circuit the below loop if none of the entries had their
@@ -1744,7 +1760,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
* Now go back over the range trying to reclaim the swap cache. This is
* more efficient for large folios because we will only try to reclaim
* the swap once per folio in the common case. If we do
- * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the
+ * __swap_slot_free() and __try_to_reclaim_swap() in the same loop, the
* latter will get a reference and lock the folio for every individual
* page but will only succeed once the swap slot for every subpage is
* zero.
@@ -1771,10 +1787,10 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
}
out:
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
}
-void swapcache_free_entries(swp_entry_t *entries, int n)
+void swap_slot_cache_free_slots(swp_slot_t *slots, int n)
{
int i;
struct swap_cluster_info *ci;
@@ -1784,10 +1800,10 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
return;
for (i = 0; i < n; ++i) {
- si = _swap_info_get(entries[i]);
+ si = _swap_info_get(slots[i]);
if (si) {
- ci = lock_cluster(si, swp_offset(entries[i]));
- swap_entry_range_free(si, ci, entries[i], 1);
+ ci = lock_cluster(si, swp_slot_offset(slots[i]));
+ swap_slot_range_free(si, ci, slots[i], 1);
unlock_cluster(ci);
}
}
@@ -1846,22 +1862,22 @@ bool folio_free_swap(struct folio *folio)
#ifdef CONFIG_HIBERNATION
-swp_entry_t get_swap_page_of_type(int type)
+swp_slot_t swap_slot_alloc_of_type(int type)
{
struct swap_info_struct *si = swap_type_to_swap_info(type);
- swp_entry_t entry = {0};
+ swp_slot_t slot = {0};
if (!si)
goto fail;
/* This is called for allocating swap entry, not cache */
if (get_swap_device_info(si)) {
- if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+ if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &slot, 0))
atomic_long_dec(&nr_swap_pages);
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
}
fail:
- return entry;
+ return slot;
}
/*
@@ -2114,6 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long offset;
unsigned char swp_count;
swp_entry_t entry;
+ swp_slot_t slot;
int ret;
pte_t ptent;
@@ -2129,10 +2146,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
continue;
entry = pte_to_swp_entry(ptent);
- if (swp_type(entry) != type)
+ slot = swp_entry_to_swp_slot(entry);
+
+ if (swp_slot_type(slot) != type)
continue;
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
pte_unmap(pte);
pte = NULL;
@@ -2283,6 +2302,7 @@ static int try_to_unuse(unsigned int type)
struct swap_info_struct *si = swap_info[type];
struct folio *folio;
swp_entry_t entry;
+ swp_slot_t slot;
unsigned int i;
if (!swap_usage_in_pages(si))
@@ -2330,7 +2350,8 @@ static int try_to_unuse(unsigned int type)
!signal_pending(current) &&
(i = find_next_to_unuse(si, i)) != 0) {
- entry = swp_entry(type, i);
+ slot = swp_slot(type, i);
+ entry = swp_slot_to_swp_entry(slot);
folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
if (IS_ERR(folio))
continue;
@@ -2739,7 +2760,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
reenable_swap_slots_cache_unlock();
/*
- * Wait for swap operations protected by get/put_swap_device()
+ * Wait for swap operations protected by swap_slot_(tryget|put)_swap_info()
* to complete. Because of synchronize_rcu() here, all swap
* operations protected by RCU reader side lock (including any
* spinlock) will be waited too. This makes it easy to
@@ -3198,7 +3219,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
cluster = per_cpu_ptr(si->percpu_cluster, cpu);
for (i = 0; i < SWAP_NR_ORDERS; i++)
- cluster->next[i] = SWAP_ENTRY_INVALID;
+ cluster->next[i] = SWAP_SLOT_INVALID;
local_lock_init(&cluster->lock);
}
} else {
@@ -3207,7 +3228,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
if (!si->global_cluster)
goto err_free;
for (i = 0; i < SWAP_NR_ORDERS; i++)
- si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
+ si->global_cluster->next[i] = SWAP_SLOT_INVALID;
spin_lock_init(&si->global_cluster_lock);
}
@@ -3527,9 +3548,9 @@ void si_swapinfo(struct sysinfo *val)
spin_unlock(&swap_lock);
}
-struct swap_info_struct *swp_swap_info(swp_entry_t entry)
+struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot)
{
- return swap_type_to_swap_info(swp_type(entry));
+ return swap_type_to_swap_info(swp_slot_type(slot));
}
/*
@@ -3537,7 +3558,8 @@ struct swap_info_struct *swp_swap_info(swp_entry_t entry)
*/
struct address_space *swapcache_mapping(struct folio *folio)
{
- return swp_swap_info(folio->swap)->swap_file->f_mapping;
+ return swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap))
+ ->swap_file->f_mapping;
}
EXPORT_SYMBOL_GPL(swapcache_mapping);
@@ -3560,6 +3582,7 @@ EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
*/
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
struct swap_info_struct *si;
struct swap_cluster_info *ci;
unsigned long offset;
@@ -3567,13 +3590,13 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
unsigned char has_cache;
int err, i;
- si = swp_swap_info(entry);
+ si = swap_slot_swap_info(slot);
if (WARN_ON_ONCE(!si)) {
pr_err("%s%08lx\n", Bad_file, entry.val);
return -EINVAL;
}
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
VM_WARN_ON(usage == 1 && nr > 1);
ci = lock_cluster(si, offset);
@@ -3675,7 +3698,8 @@ int swapcache_prepare(swp_entry_t entry, int nr)
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
{
- unsigned long offset = swp_offset(entry);
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ unsigned long offset = swp_slot_offset(slot);
cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE);
}
@@ -3704,6 +3728,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
struct page *list_page;
pgoff_t offset;
unsigned char count;
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
int ret = 0;
/*
@@ -3712,7 +3737,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
*/
page = alloc_page(gfp_mask | __GFP_HIGHMEM);
- si = get_swap_device(entry);
+ si = swap_slot_tryget_swap_info(slot);
if (!si) {
/*
* An acceptable race has occurred since the failing
@@ -3721,7 +3746,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
goto outer;
}
- offset = swp_offset(entry);
+ offset = swp_slot_offset(slot);
ci = lock_cluster(si, offset);
@@ -3784,7 +3809,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
spin_unlock(&si->cont_lock);
out:
unlock_cluster(ci);
- put_swap_device(si);
+ swap_slot_put_swap_info(si);
outer:
if (page)
__free_page(page);
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 06/18] mm: create scaffolds for the new virtual swap implementation
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (4 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 05/18] mm: swap: add a separate type for physical swap slots Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 07/18] mm: swap: zswap: swap cache and zswap support for virtualized swap Nhat Pham
` (13 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
In preparation for the implementation of swap virtualization, add new
scaffolds for the new code:
1. Add a new mm/vswap.c source file, which currently only holds the
logic to set up the (for now, empty) vswap debugfs directory. Hook
this up in the swap setup step in mm/swap_state.c. Add a new
maintainer entry for the new source file.
2. Add a new config option (CONFIG_VIRTUAL_SWAP). We will only get new
behavior when the kernel is built with this config option. The entry
for the config option in mm/Kconfig summarizes the pros and cons of
the new virtual swap design, which the remainder of the patch series
will implement.
3. Set up vswap compilation in the Makefile.
Other than the debugfs directory, no behavioral change intended.
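For orientation, a condensed sketch of the scaffold's shape follows
(illustrative only; the actual hunks are below, and the error logging in
swap_init_sysfs() is elided here). The point is that with
CONFIG_VIRTUAL_SWAP disabled, vswap_init() compiles to an inline no-op,
so the existing setup path is untouched:

/* Condensed, illustrative sketch -- see the real hunks below. */
#ifdef CONFIG_VIRTUAL_SWAP
int vswap_init(void);                   /* implemented in mm/vswap.c */
#else
static inline int vswap_init(void)      /* no-op when the config is off */
{
        return 0;
}
#endif

static int __init swap_init_sysfs(void)
{
        int err = vswap_init();         /* new hook added by this patch */

        if (err)
                return err;             /* the real code also logs an error */
        /* ... existing sysfs setup continues unchanged ... */
        return 0;
}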
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
MAINTAINERS | 7 +++++++
include/linux/swap.h | 9 +++++++++
mm/Kconfig | 25 +++++++++++++++++++++++++
mm/Makefile | 1 +
mm/swap_state.c | 6 ++++++
mm/vswap.c | 35 +++++++++++++++++++++++++++++++++++
6 files changed, 83 insertions(+)
create mode 100644 mm/vswap.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 00e94bec401e..65108bf2a5f1 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -25290,6 +25290,13 @@ S: Maintained
F: Documentation/devicetree/bindings/iio/light/vishay,veml6030.yaml
F: drivers/iio/light/veml6030.c
+VIRTUAL SWAP SPACE
+M: Nhat Pham <nphamcs@gmail.com>
+M: Johannes Weiner <hannes@cmpxchg.org>
+L: linux-mm@kvack.org
+S: Maintained
+F: mm/vswap.c
+
VISHAY VEML6075 UVA AND UVB LIGHT SENSOR DRIVER
M: Javier Carrasco <javier.carrasco.cruz@gmail.com>
S: Maintained
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 567fd2ebb0d3..328f6aec9313 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -726,6 +726,15 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+#ifdef CONFIG_VIRTUAL_SWAP
+int vswap_init(void);
+#else /* CONFIG_VIRTUAL_SWAP */
+static inline int vswap_init(void)
+{
+ return 0;
+}
+#endif /* CONFIG_VIRTUAL_SWAP */
+
/**
* swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
* virtual swap slot.
diff --git a/mm/Kconfig b/mm/Kconfig
index 1b501db06417..2e8eb66c5888 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -22,6 +22,31 @@ menuconfig SWAP
used to provide more virtual memory than the actual RAM present
in your computer. If unsure say Y.
+config VIRTUAL_SWAP
+ bool "Swap space virtualization"
+ depends on SWAP
+ default n
+ help
+ When this is selected, the kernel is built with the new swap
+ design, where each swap entry is associated with a virtual swap
+ slot that is decoupled from a specific physical backing storage
+ location. As a result, swap entries that are:
+
+ 1. Zero-filled
+
+ 2. Stored in the zswap pool.
+
+ 3. Rejected by zswap/zram but cannot be written back to a
+ backing swap device.
+
+ no longer take up any disk storage (i.e they do not occupy any
+ slot in the backing swap device).
+
+ Swapoff is also more efficient.
+
+ There might be more lock contentions with heavy swap use, since
+ the swap cache is no longer range partitioned.
+
config ZSWAP
bool "Compressed cache for swap pages"
depends on SWAP
diff --git a/mm/Makefile b/mm/Makefile
index 850386a67b3e..b7216c714fa1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -76,6 +76,7 @@ ifdef CONFIG_MMU
endif
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o swap_slots.o
+obj-$(CONFIG_VIRTUAL_SWAP) += vswap.o
obj-$(CONFIG_ZSWAP) += zswap.o
obj-$(CONFIG_HAS_DMA) += dmapool.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
diff --git a/mm/swap_state.c b/mm/swap_state.c
index cbd1532b6b24..1607d23a3d7b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -930,6 +930,12 @@ static int __init swap_init_sysfs(void)
int err;
struct kobject *swap_kobj;
+ err = vswap_init();
+ if (err) {
+ pr_err("failed to initialize virtual swap space\n");
+ return err;
+ }
+
swap_kobj = kobject_create_and_add("swap", mm_kobj);
if (!swap_kobj) {
pr_err("failed to create swap kobject\n");
diff --git a/mm/vswap.c b/mm/vswap.c
new file mode 100644
index 000000000000..b9c28e819cca
--- /dev/null
+++ b/mm/vswap.c
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Virtual swap space
+ *
+ * Copyright (C) 2024 Meta Platforms, Inc., Nhat Pham
+ */
+ #include <linux/swap.h>
+
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+
+static struct dentry *vswap_debugfs_root;
+
+static int vswap_debug_fs_init(void)
+{
+ if (!debugfs_initialized())
+ return -ENODEV;
+
+ vswap_debugfs_root = debugfs_create_dir("vswap", NULL);
+ return 0;
+}
+#else
+static int vswap_debug_fs_init(void)
+{
+ return 0;
+}
+#endif
+
+int vswap_init(void)
+{
+ if (vswap_debug_fs_init())
+ pr_warn("Failed to initialize vswap debugfs\n");
+
+ return 0;
+}
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 07/18] mm: swap: zswap: swap cache and zswap support for virtualized swap
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (5 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 06/18] mm: create scaffolds for the new virtual swap implementation Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 08/18] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
` (12 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
Currently, the swap cache code assumes that the swap space is of a fixed
size. The virtual swap space is dynamically sized, so the existing
partitioning code cannot be easily reused. Dynamic partitioning is
planned, but for now keep the design simple and just use a flat
swapcache for vswap.
Similar to the swap cache, the zswap tree code, specifically the range
partitioning logic, can no longer easily be reused for the new virtual
swap space design. Use a simple unified zswap tree in the new
implementation for now. As with the swap cache, range partitioning is
planned as follow-up work.
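To make the indexing change concrete, here is a before/after sketch. The
old_/new_ function names are made up for the sake of contrast; the real
definitions are in the mm/swap.h, mm/swap_state.c and mm/zswap.c hunks
below:

/* Old scheme: one swap cache address_space (and one zswap xarray) per
 * 64MB chunk of each swapfile, indexed by the physical slot offset. */
static struct address_space *old_swap_address_space(swp_entry_t entry)
{
        return &swapper_spaces[swp_type(entry)]
                              [swp_offset(entry) >> SWAP_ADDRESS_SPACE_SHIFT];
}

/* New scheme (CONFIG_VIRTUAL_SWAP): virtual swap slots are not bounded
 * by a fixed-size swapfile, so a single flat swap cache (and a single
 * zswap tree) is used, keyed directly by the virtual slot value. */
static struct address_space *new_swap_address_space(swp_entry_t entry)
{
        return &swapper_space;          /* one global swap cache mapping */
}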
Since the vswap implementation has begun to diverge from the existing
one, we also introduce a new build config (CONFIG_VIRTUAL_SWAP). Users
who do not select this config will get the old implementation, with no
behavioral change.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/swap.h | 22 ++++++++++++++--------
mm/swap_state.c | 44 +++++++++++++++++++++++++++++++++++---------
mm/zswap.c | 38 ++++++++++++++++++++++++++++++++------
3 files changed, 81 insertions(+), 23 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index d5f8effa8015..06e20b1d79c4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -22,22 +22,27 @@ void swap_write_unplug(struct swap_iocb *sio);
int swap_writepage(struct page *page, struct writeback_control *wbc);
void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
-/* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
+/* Return the swap device position of the swap slot. */
+static inline loff_t swap_slot_pos(swp_slot_t slot)
+{
+ return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
+}
+
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
+
+/* linux/mm/swap_state.c */
+#ifdef CONFIG_VIRTUAL_SWAP
+extern struct address_space *swap_address_space(swp_entry_t entry);
+#define swap_cache_index(entry) entry.val
+#else
+/* One swap address space for each 64M swap space */
extern struct address_space *swapper_spaces[];
#define swap_address_space(entry) \
(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
>> SWAP_ADDRESS_SPACE_SHIFT])
-/* Return the swap device position of the swap slot. */
-static inline loff_t swap_slot_pos(swp_slot_t slot)
-{
- return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT;
-}
-
/*
* Return the swap cache index of the swap entry.
*/
@@ -46,6 +51,7 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
}
+#endif
void show_swap_cache_info(void);
bool add_to_swap(struct folio *folio);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 1607d23a3d7b..f677ebf9c5d0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -38,8 +38,18 @@ static const struct address_space_operations swap_aops = {
#endif
};
+#ifdef CONFIG_VIRTUAL_SWAP
+static struct address_space swapper_space __read_mostly;
+
+struct address_space *swap_address_space(swp_entry_t entry)
+{
+ return &swapper_space;
+}
+#else
struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
+#endif
+
static bool enable_vma_readahead __read_mostly = true;
#define SWAP_RA_ORDER_CEILING 5
@@ -718,23 +728,34 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
return folio;
}
+static void init_swapper_space(struct address_space *space)
+{
+ xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
+ atomic_set(&space->i_mmap_writable, 0);
+ space->a_ops = &swap_aops;
+ /* swap cache doesn't use writeback related tags */
+ mapping_set_no_writeback_tags(space);
+}
+
+#ifdef CONFIG_VIRTUAL_SWAP
+int init_swap_address_space(unsigned int type, unsigned long nr_pages)
+{
+ return 0;
+}
+
+void exit_swap_address_space(unsigned int type) {}
+#else
int init_swap_address_space(unsigned int type, unsigned long nr_pages)
{
- struct address_space *spaces, *space;
+ struct address_space *spaces;
unsigned int i, nr;
nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
if (!spaces)
return -ENOMEM;
- for (i = 0; i < nr; i++) {
- space = spaces + i;
- xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
- atomic_set(&space->i_mmap_writable, 0);
- space->a_ops = &swap_aops;
- /* swap cache doesn't use writeback related tags */
- mapping_set_no_writeback_tags(space);
- }
+ for (i = 0; i < nr; i++)
+ init_swapper_space(spaces + i);
nr_swapper_spaces[type] = nr;
swapper_spaces[type] = spaces;
@@ -752,6 +773,7 @@ void exit_swap_address_space(unsigned int type)
nr_swapper_spaces[type] = 0;
swapper_spaces[type] = NULL;
}
+#endif
static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
unsigned long *end)
@@ -930,6 +952,10 @@ static int __init swap_init_sysfs(void)
int err;
struct kobject *swap_kobj;
+#ifdef CONFIG_VIRTUAL_SWAP
+ init_swapper_space(&swapper_space);
+#endif
+
err = vswap_init();
if (err) {
pr_err("failed to initialize virtual swap space\n");
diff --git a/mm/zswap.c b/mm/zswap.c
index 23365e76a3ce..c1327569ce80 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -203,8 +203,6 @@ struct zswap_entry {
struct list_head lru;
};
-static struct xarray *zswap_trees[MAX_SWAPFILES];
-static unsigned int nr_zswap_trees[MAX_SWAPFILES];
/* RCU-protected iteration */
static LIST_HEAD(zswap_pools);
@@ -231,12 +229,28 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/
+#ifdef CONFIG_VIRTUAL_SWAP
+static DEFINE_XARRAY(zswap_tree);
+
+static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
+{
+ return &zswap_tree;
+}
+
+#define zswap_tree_index(entry) entry.val
+#else
+static struct xarray *zswap_trees[MAX_SWAPFILES];
+static unsigned int nr_zswap_trees[MAX_SWAPFILES];
+
static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
{
return &zswap_trees[swp_type(swp)][swp_offset(swp)
>> SWAP_ADDRESS_SPACE_SHIFT];
}
+#define zswap_tree_index(entry) swp_offset(entry)
+#endif
+
#define zswap_pool_debug(msg, p) \
pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \
zpool_get_type((p)->zpool))
@@ -1047,7 +1061,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
swp_entry_t swpentry)
{
struct xarray *tree;
- pgoff_t offset = swp_offset(swpentry);
+ pgoff_t offset = zswap_tree_index(swpentry);
struct folio *folio;
struct mempolicy *mpol;
bool folio_was_allocated;
@@ -1463,7 +1477,7 @@ static bool zswap_store_page(struct page *page,
goto compress_failed;
old = xa_store(swap_zswap_tree(page_swpentry),
- swp_offset(page_swpentry),
+ zswap_tree_index(page_swpentry),
entry, GFP_KERNEL);
if (xa_is_err(old)) {
int err = xa_err(old);
@@ -1612,7 +1626,7 @@ bool zswap_store(struct folio *folio)
bool zswap_load(struct folio *folio)
{
swp_entry_t swp = folio->swap;
- pgoff_t offset = swp_offset(swp);
+ pgoff_t offset = zswap_tree_index(swp);
bool swapcache = folio_test_swapcache(folio);
struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
@@ -1670,7 +1684,7 @@ bool zswap_load(struct folio *folio)
void zswap_invalidate(swp_entry_t swp)
{
- pgoff_t offset = swp_offset(swp);
+ pgoff_t offset = zswap_tree_index(swp);
struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry;
@@ -1682,6 +1696,16 @@ void zswap_invalidate(swp_entry_t swp)
zswap_entry_free(entry);
}
+#ifdef CONFIG_VIRTUAL_SWAP
+int zswap_swapon(int type, unsigned long nr_pages)
+{
+ return 0;
+}
+
+void zswap_swapoff(int type)
+{
+}
+#else
int zswap_swapon(int type, unsigned long nr_pages)
{
struct xarray *trees, *tree;
@@ -1718,6 +1742,8 @@ void zswap_swapoff(int type)
nr_zswap_trees[type] = 0;
zswap_trees[type] = NULL;
}
+#endif /* CONFIG_VIRTUAL_SWAP */
+
/*********************************
* debugfs functions
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 08/18] mm: swap: allocate a virtual swap slot for each swapped out page
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (6 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 07/18] mm: swap: zswap: swap cache and zswap support for virtualized swap Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 09/18] swap: implement the swap_cgroup API using virtual swap Nhat Pham
` (11 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
For the new virtual swap space design, dynamically allocate a virtual
slot (as well as an associated metadata structure) for each swapped-out
page, and associate it with the (physical) swap slot on the
swapfile/swap partition.
For now, there is always a physical slot in the swapfile associated with
each virtual swap slot (except those about to be freed). The virtual
swap slot's lifetime is still tied to the lifetime of its physical swap
slot.
We also maintain a backward map to look up the virtual swap slot from
its associated physical swap slot on the swapfile. This is used in
cluster readahead, as well as in several swapfile operations, such as
the swap slot reclamation that happens when the swapfile is almost
full. It will also
be used in a future patch that simplifies swapoff.
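Conceptually, this gives one lookup in each direction. Below is a
minimal sketch, assuming xarray-based forward and backward maps and a
made-up struct vswap_desc; the real descriptor layout, reference
counting and locking live in mm/vswap.c and are not reproduced here:

struct vswap_desc {                     /* assumed per-entry metadata */
        swp_entry_t entry;              /* the virtual swap slot itself */
        swp_slot_t slot;                /* backing physical slot, if any */
};

static DEFINE_XARRAY(vswap_map);        /* entry.val -> desc (forward) */
static DEFINE_XARRAY(vswap_rmap);       /* slot.val  -> desc (backward) */

/* Forward lookup: which physical slot backs this virtual swap slot? */
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
{
        struct vswap_desc *desc = xa_load(&vswap_map, entry.val);

        return desc ? desc->slot : (swp_slot_t){ 0 };
}

/* Backward lookup: which virtual swap slot owns this physical slot? */
swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
{
        struct vswap_desc *desc = xa_load(&vswap_rmap, slot.val);

        return desc ? desc->entry : (swp_entry_t){ 0 };
}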
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 17 +-
include/linux/swapops.h | 12 ++
mm/internal.h | 43 ++++-
mm/shmem.c | 10 +-
mm/swap.h | 2 +
mm/swap_state.c | 29 +++-
mm/swapfile.c | 24 ++-
mm/vswap.c | 342 +++++++++++++++++++++++++++++++++++++++-
8 files changed, 457 insertions(+), 22 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 328f6aec9313..0f1337431e27 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -456,7 +456,6 @@ extern void __meminit kswapd_stop(int nid);
/* Lifetime swap API (mm/swapfile.c) */
swp_entry_t folio_alloc_swap(struct folio *folio);
bool folio_free_swap(struct folio *folio);
-void put_swap_folio(struct folio *folio, swp_entry_t entry);
void swap_shmem_alloc(swp_entry_t, int);
int swap_duplicate(swp_entry_t);
int swapcache_prepare(swp_entry_t entry, int nr);
@@ -504,6 +503,7 @@ static inline long get_nr_swap_pages(void)
}
void si_swapinfo(struct sysinfo *);
+void swap_slot_put_folio(swp_slot_t slot, struct folio *folio);
swp_slot_t swap_slot_alloc_of_type(int);
int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order);
void swap_slot_free_nr(swp_slot_t slot, int nr_pages);
@@ -728,12 +728,19 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
#ifdef CONFIG_VIRTUAL_SWAP
int vswap_init(void);
+void vswap_exit(void);
+void vswap_free(swp_entry_t entry);
+swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry);
+swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot);
#else /* CONFIG_VIRTUAL_SWAP */
static inline int vswap_init(void)
{
return 0;
}
-#endif /* CONFIG_VIRTUAL_SWAP */
+
+static inline void vswap_exit(void)
+{
+}
/**
* swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
@@ -758,6 +765,12 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
{
return (swp_entry_t) { slot.val };
}
+#endif /* CONFIG_VIRTUAL_SWAP */
+
+static inline void put_swap_folio(struct folio *folio, swp_entry_t entry)
+{
+ swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio);
+}
static inline bool trylock_swapoff(swp_entry_t entry,
struct swap_info_struct **si)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 2a4101c9bba4..ba7364e1400a 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -27,6 +27,18 @@
#define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1)
+#ifdef CONFIG_VIRTUAL_SWAP
+#if SWP_TYPE_SHIFT > 32
+#define MAX_VSWAP U32_MAX
+#else
+/*
+ * The range of virtual swap slots is the same as the range of physical swap
+ * slots.
+ */
+#define MAX_VSWAP (((MAX_SWAPFILES - 1) << SWP_TYPE_SHIFT) | SWP_OFFSET_MASK)
+#endif
+#endif
+
/*
* Definitions only for PFN swap entries (see is_pfn_swap_entry()). To
* store PFN, we only need SWP_PFN_BITS bits. Each of the pfn swap entries
diff --git a/mm/internal.h b/mm/internal.h
index 2d63f6537e35..ca28729f822a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -262,6 +262,40 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
return min(ptep - start_ptep, max_nr);
}
+#ifdef CONFIG_VIRTUAL_SWAP
+static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
+{
+ return (swp_entry_t) { entry.val + n };
+}
+
+/* similar to swap_nth, but check the backing physical slots as well. */
+static inline swp_entry_t swap_move(swp_entry_t entry, long delta)
+{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot;
+ swp_entry_t next_entry = swap_nth(entry, delta);
+
+ next_slot = swp_entry_to_swp_slot(next_entry);
+ if (swp_slot_type(slot) != swp_slot_type(next_slot) ||
+ swp_slot_offset(slot) + delta != swp_slot_offset(next_slot))
+ next_entry.val = 0;
+
+ return next_entry;
+}
+#else
+static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
+{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+
+ return swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot),
+ swp_slot_offset(slot) + n));
+}
+
+static inline swp_entry_t swap_move(swp_entry_t entry, long delta)
+{
+ return swap_nth(entry, delta);
+}
+#endif
+
/**
* pte_move_swp_offset - Move the swap entry offset field of a swap pte
* forward or backward by delta
@@ -275,13 +309,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
*/
static inline pte_t pte_move_swp_offset(pte_t pte, long delta)
{
- swp_entry_t entry = pte_to_swp_entry(pte), new_entry;
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
- pte_t new;
-
- new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot),
- swp_slot_offset(slot) + delta));
- new = swp_entry_to_pte(new_entry);
+ swp_entry_t entry = pte_to_swp_entry(pte);
+ pte_t new = swp_entry_to_pte(swap_move(entry, delta));
if (pte_swp_soft_dirty(pte))
new = pte_swp_mksoft_dirty(new);
diff --git a/mm/shmem.c b/mm/shmem.c
index f8efa49eb499..4c00b4673468 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2166,7 +2166,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
XA_STATE_ORDER(xas, &mapping->i_pages, index, 0);
void *alloced_shadow = NULL;
int alloced_order = 0, i;
- swp_slot_t slot = swp_entry_to_swp_slot(swap);
/* Convert user data gfp flags to xarray node gfp flags */
gfp &= GFP_RECLAIM_MASK;
@@ -2205,12 +2204,8 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
*/
for (i = 0; i < 1 << order; i++) {
pgoff_t aligned_index = round_down(index, 1 << order);
- swp_entry_t tmp_entry;
- swp_slot_t tmp_slot;
+ swp_entry_t tmp_entry = swap_nth(swap, i);
- tmp_slot =
- swp_slot(swp_slot_type(slot), swp_slot_offset(slot) + i);
- tmp_entry = swp_slot_to_swp_entry(tmp_slot);
__xa_store(&mapping->i_pages, aligned_index + i,
swp_to_radix_entry(tmp_entry), 0);
}
@@ -2336,8 +2331,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (split_order > 0) {
pgoff_t offset = index - round_down(index, 1 << split_order);
- swap = swp_slot_to_swp_entry(swp_slot(
- swp_slot_type(slot), swp_slot_offset(slot) + offset));
+ swap = swap_nth(swap, offset);
}
/* Here we actually start the io */
diff --git a/mm/swap.h b/mm/swap.h
index 06e20b1d79c4..31c94671cb44 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -36,6 +36,8 @@ static inline loff_t swap_slot_pos(swp_slot_t slot)
#ifdef CONFIG_VIRTUAL_SWAP
extern struct address_space *swap_address_space(swp_entry_t entry);
#define swap_cache_index(entry) entry.val
+
+void virt_clear_shadow_from_swap_cache(swp_entry_t entry);
#else
/* One swap address space for each 64M swap space */
extern struct address_space *swapper_spaces[];
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f677ebf9c5d0..16abdb5ce07a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -177,6 +177,7 @@ void __delete_from_swap_cache(struct folio *folio,
__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
}
+#ifndef CONFIG_VIRTUAL_SWAP
swp_entry_t folio_alloc_swap(struct folio *folio)
{
swp_slot_t slot = folio_alloc_swap_slot(folio);
@@ -189,6 +190,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
return entry;
}
+#endif
/**
* add_to_swap - allocate swap space for a folio
@@ -270,6 +272,27 @@ void delete_from_swap_cache(struct folio *folio)
folio_ref_sub(folio, folio_nr_pages(folio));
}
+#ifdef CONFIG_VIRTUAL_SWAP
+/*
+ * In the virtual swap implementation, we index the swap cache by virtual swap
+ * slots rather than physical ones. As a result, we only clear the shadow when
+ * the virtual swap slot is freed (via virt_clear_shadow_from_swap_cache()),
+ * not when the physical swap slot is freed (via clear_shadow_from_swap_cache()
+ * in the old implementation).
+ */
+void virt_clear_shadow_from_swap_cache(swp_entry_t entry)
+{
+ struct address_space *address_space = swap_address_space(entry);
+ pgoff_t index = swap_cache_index(entry);
+ XA_STATE(xas, &address_space->i_pages, index);
+
+ xas_set_update(&xas, workingset_update_node);
+ xa_lock_irq(&address_space->i_pages);
+ if (xa_is_value(xas_load(&xas)))
+ xas_store(&xas, NULL);
+ xa_unlock_irq(&address_space->i_pages);
+}
+#else
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end)
{
@@ -300,6 +323,7 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin,
break;
}
}
+#endif
/*
* If we are the only user, then try to free up the swap cache.
@@ -965,7 +989,8 @@ static int __init swap_init_sysfs(void)
swap_kobj = kobject_create_and_add("swap", mm_kobj);
if (!swap_kobj) {
pr_err("failed to create swap kobject\n");
- return -ENOMEM;
+ err = -ENOMEM;
+ goto vswap_exit;
}
err = sysfs_create_group(swap_kobj, &swap_attr_group);
if (err) {
@@ -976,6 +1001,8 @@ static int __init swap_init_sysfs(void)
delete_obj:
kobject_put(swap_kobj);
+vswap_exit:
+ vswap_exit();
return err;
}
subsys_initcall(swap_init_sysfs);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 17cbf14bac72..849525810bbe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1126,12 +1126,18 @@ static void swap_range_alloc(struct swap_info_struct *si,
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries)
{
- unsigned long begin = offset;
unsigned long end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
unsigned int i;
+#ifndef CONFIG_VIRTUAL_SWAP
+ unsigned long begin = offset;
+ /*
+ * In the virtual swap design, the swap cache is indexed by virtual swap
+ * slots. We will clear the shadow when the virtual swap slots are freed.
+ */
clear_shadow_from_swap_cache(si->type, begin, end);
+#endif
/*
* Use atomic clear_bit operations only on zeromap instead of non-atomic
@@ -1506,8 +1512,21 @@ static void swap_slot_range_free(struct swap_info_struct *si,
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
swp_entry_t entry = swp_slot_to_swp_entry(slot);
+#ifdef CONFIG_VIRTUAL_SWAP
+ int i;
+ /* release all the associated (virtual) swap slots */
+ for (i = 0; i < nr_pages; i++) {
+ vswap_free(entry);
+ entry.val++;
+ }
+#else
+ /*
+ * In the new (i.e virtual swap) implementation, we will let the virtual
+ * swap layer handle the cgroup swap accounting and charging.
+ */
mem_cgroup_uncharge_swap(entry, nr_pages);
+#endif
/* It should never free entries across different clusters */
VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
@@ -1573,9 +1592,8 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
-void put_swap_folio(struct folio *folio, swp_entry_t entry)
+void swap_slot_put_folio(swp_slot_t slot, struct folio *folio)
{
- swp_slot_t slot = swp_entry_to_swp_slot(entry);
unsigned long offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
diff --git a/mm/vswap.c b/mm/vswap.c
index b9c28e819cca..23a05c3393d8 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -4,7 +4,75 @@
*
* Copyright (C) 2024 Meta Platforms, Inc., Nhat Pham
*/
- #include <linux/swap.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>
+#include <linux/swap_cgroup.h>
+#include "swap.h"
+
+/*
+ * Virtual Swap Space
+ *
+ * We associate with each swapped out page a virtual swap slot. This will allow
+ * us to change the backing state of a swapped out page without having to
+ * update every single page table entry referring to it.
+ *
+ * For now, there is a one-to-one correspondence between a virtual swap slot
+ * and its associated physical swap slot.
+ */
+
+/**
+ * Swap descriptor - metadata of a swapped out page.
+ *
+ * @slot: The handle to the physical swap slot backing this page.
+ * @rcu: The RCU head to free the descriptor with an RCU grace period.
+ */
+struct swp_desc {
+ swp_slot_t slot;
+ struct rcu_head rcu;
+};
+
+/* Virtual swap space - swp_entry_t -> struct swp_desc */
+static DEFINE_XARRAY_FLAGS(vswap_map, XA_FLAGS_TRACK_FREE);
+
+static const struct xa_limit vswap_map_limit = {
+ .max = MAX_VSWAP,
+ /* reserve the 0 virtual swap slot to indicate errors */
+ .min = 1,
+};
+
+/* Physical (swp_slot_t) to virtual (swp_entry_t) swap slots. */
+static DEFINE_XARRAY(vswap_rmap);
+
+/*
+ * For swapping a large folio of size n, we reserve an empty PMD-sized cluster
+ * of contiguous and aligned virtual swap slots, then allocate the first n
+ * virtual swap slots from the cluster.
+ */
+#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER
+#define VSWAP_CLUSTER_SIZE (1UL << VSWAP_CLUSTER_SHIFT)
+
+/*
+ * Map from a cluster id to the number of allocated virtual swap slots in the
+ * (PMD-sized) cluster. This allows us to quickly allocate an empty cluster
+ * for a large folio being swapped out.
+ */
+static DEFINE_XARRAY_FLAGS(vswap_cluster_map, XA_FLAGS_TRACK_FREE);
+
+static const struct xa_limit vswap_cluster_map_limit = {
+ /* Do not allocate from the last cluster if it does not have enough slots. */
+ .max = (((MAX_VSWAP + 1) >> (VSWAP_CLUSTER_SHIFT)) - 1),
+ /*
+ * First cluster is never handed out for large folios, since the 0 virtual
+ * swap slot is reserved for errors.
+ */
+ .min = 1,
+};
+
+static struct kmem_cache *swp_desc_cache;
+static atomic_t vswap_alloc_reject;
+static atomic_t vswap_used;
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
@@ -17,6 +85,10 @@ static int vswap_debug_fs_init(void)
return -ENODEV;
vswap_debugfs_root = debugfs_create_dir("vswap", NULL);
+ debugfs_create_atomic_t("alloc_reject", 0444,
+ vswap_debugfs_root, &vswap_alloc_reject);
+ debugfs_create_atomic_t("used", 0444, vswap_debugfs_root, &vswap_used);
+
return 0;
}
#else
@@ -26,10 +98,278 @@ static int vswap_debug_fs_init(void)
}
#endif
+/* Allocate a contiguous range of virtual swap slots */
+static swp_entry_t vswap_alloc(int nr)
+{
+ struct swp_desc **descs;
+ swp_entry_t entry;
+ u32 index, cluster_id;
+ void *cluster_entry;
+ unsigned long cluster_count;
+ int i, err;
+
+ entry.val = 0;
+ descs = kcalloc(nr, sizeof(*descs), GFP_KERNEL);
+ if (!descs) {
+ atomic_add(nr, &vswap_alloc_reject);
+ return (swp_entry_t){0};
+ }
+
+ if (unlikely(!kmem_cache_alloc_bulk(
+ swp_desc_cache, GFP_KERNEL, nr, (void **)descs))) {
+ atomic_add(nr, &vswap_alloc_reject);
+ kfree(descs);
+ return (swp_entry_t){0};
+ }
+
+ for (i = 0; i < nr; i++)
+ descs[i]->slot.val = 0;
+
+ xa_lock(&vswap_map);
+ if (nr == 1) {
+ if (__xa_alloc(&vswap_map, &index, descs[0], vswap_map_limit,
+ GFP_KERNEL))
+ goto unlock;
+ else {
+ /*
+ * Increment the allocation count of the cluster which the
+ * allocated virtual swap slot belongs to.
+ */
+ cluster_id = index >> VSWAP_CLUSTER_SHIFT;
+ cluster_entry = xa_load(&vswap_cluster_map, cluster_id);
+ cluster_count = cluster_entry ? xa_to_value(cluster_entry) : 0;
+ cluster_count++;
+ VM_WARN_ON(cluster_count > VSWAP_CLUSTER_SIZE);
+
+ if (xa_err(xa_store(&vswap_cluster_map, cluster_id,
+ xa_mk_value(cluster_count), GFP_KERNEL))) {
+ __xa_erase(&vswap_map, index);
+ goto unlock;
+ }
+ }
+ } else {
+ /* allocate an unused cluster */
+ cluster_entry = xa_mk_value(nr);
+ if (xa_alloc(&vswap_cluster_map, &cluster_id, cluster_entry,
+ vswap_cluster_map_limit, GFP_KERNEL))
+ goto unlock;
+
+ index = cluster_id << VSWAP_CLUSTER_SHIFT;
+
+ for (i = 0; i < nr; i++) {
+ err = __xa_insert(&vswap_map, index + i, descs[i], GFP_KERNEL);
+ VM_WARN_ON(err == -EBUSY);
+ if (err) {
+ while (--i >= 0)
+ __xa_erase(&vswap_map, index + i);
+ xa_erase(&vswap_cluster_map, cluster_id);
+ goto unlock;
+ }
+ }
+ }
+
+ VM_WARN_ON(!index);
+ VM_WARN_ON(index + nr - 1 > MAX_VSWAP);
+ entry.val = index;
+ atomic_add(nr, &vswap_used);
+unlock:
+ xa_unlock(&vswap_map);
+ if (!entry.val) {
+ atomic_add(nr, &vswap_alloc_reject);
+ kmem_cache_free_bulk(swp_desc_cache, nr, (void **)descs);
+ }
+ kfree(descs);
+ return entry;
+}
+
+static inline void release_vswap_slot(unsigned long index)
+{
+ unsigned long cluster_id = index >> VSWAP_CLUSTER_SHIFT, cluster_count;
+ void *cluster_entry;
+
+ xa_lock(&vswap_map);
+ __xa_erase(&vswap_map, index);
+ cluster_entry = xa_load(&vswap_cluster_map, cluster_id);
+ VM_WARN_ON(!cluster_entry);
+ cluster_count = xa_to_value(cluster_entry);
+ cluster_count--;
+
+ VM_WARN_ON(cluster_count < 0);
+
+ if (cluster_count)
+ xa_store(&vswap_cluster_map, cluster_id,
+ xa_mk_value(cluster_count), GFP_KERNEL);
+ else
+ xa_erase(&vswap_cluster_map, cluster_id);
+ xa_unlock(&vswap_map);
+ atomic_dec(&vswap_used);
+}
+
+/**
+ * vswap_free - free a virtual swap slot.
+ * @entry: the virtual swap slot to free
+ */
+void vswap_free(swp_entry_t entry)
+{
+ struct swp_desc *desc;
+
+ if (!entry.val)
+ return;
+
+ /* do not immediately erase the virtual slot to prevent its reuse */
+ desc = xa_load(&vswap_map, entry.val);
+ if (!desc)
+ return;
+
+ virt_clear_shadow_from_swap_cache(entry);
+
+ if (desc->slot.val) {
+ /* we only charge after linkage was established */
+ mem_cgroup_uncharge_swap(entry, 1);
+ xa_erase(&vswap_rmap, desc->slot.val);
+ }
+
+ /* erase forward mapping and release the virtual slot for reallocation */
+ release_vswap_slot(entry.val);
+ kfree_rcu(desc, rcu);
+}
+
+/**
+ * folio_alloc_swap - allocate virtual swap slots for a folio.
+ * @folio: the folio.
+ *
+ * Return: the first allocated slot on success, or the zero virtual swap slot
+ * on failure.
+ */
+swp_entry_t folio_alloc_swap(struct folio *folio)
+{
+ int i, err, nr = folio_nr_pages(folio);
+ bool manual_freeing = true;
+ struct swp_desc *desc;
+ swp_entry_t entry;
+ swp_slot_t slot;
+
+ entry = vswap_alloc(nr);
+ if (!entry.val)
+ return entry;
+
+ /*
+ * XXX: for now, we always allocate a physical swap slot for each virtual
+ * swap slot, and their lifetime are coupled. This will change once we
+ * decouple virtual swap slots from their backing states, and only allocate
+ * physical swap slots for them on demand (i.e on zswap writeback, or
+ * fallback from zswap store failure).
+ */
+ slot = folio_alloc_swap_slot(folio);
+ if (!slot.val)
+ goto vswap_free;
+
+ /* establish the virtual <-> physical swap slot linkages. */
+ for (i = 0; i < nr; i++) {
+ err = xa_insert(&vswap_rmap, slot.val + i,
+ xa_mk_value(entry.val + i), GFP_KERNEL);
+ VM_WARN_ON(err == -EBUSY);
+ if (err) {
+ while (--i >= 0)
+ xa_erase(&vswap_rmap, slot.val + i);
+ goto put_physical_swap;
+ }
+ }
+
+ i = 0;
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ desc->slot.val = slot.val + i;
+ i++;
+ }
+ rcu_read_unlock();
+
+ manual_freeing = false;
+ /*
+ * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
+ * swap slots allocation. This is acceptable because as noted above, each
+ * virtual swap slot corresponds to a physical swap slot. Once we have
+ * decoupled virtual and physical swap slots, we will only charge when we
+ * actually allocate a physical swap slot.
+ */
+ if (!mem_cgroup_try_charge_swap(folio, entry))
+ return entry;
+
+put_physical_swap:
+ /*
+ * There is no linkage between the virtual and physical swap slots yet. We
+ * have to manually and separately free the allocated virtual and physical
+ * swap slots.
+ */
+ swap_slot_put_folio(slot, folio);
+vswap_free:
+ if (manual_freeing) {
+ for (i = 0; i < nr; i++)
+ vswap_free((swp_entry_t){entry.val + i});
+ }
+ entry.val = 0;
+ return entry;
+}
+
+/**
+ * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a
+ * virtual swap slot.
+ * @entry: the virtual swap slot.
+ *
+ * Return: the physical swap slot corresponding to the virtual swap slot.
+ */
+swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
+{
+ struct swp_desc *desc;
+
+ if (!entry.val)
+ return (swp_slot_t){0};
+
+ desc = xa_load(&vswap_map, entry.val);
+ return desc ? desc->slot : (swp_slot_t){0};
+}
+
+/**
+ * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a
+ * physical swap slot.
+ * @slot: the physical swap slot.
+ *
+ * Return: the virtual swap slot corresponding to the physical swap slot.
+ */
+swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
+{
+ void *entry = xa_load(&vswap_rmap, slot.val);
+
+ /*
+ * entry can be NULL if we fail to link the virtual and physical swap slot
+ * during the swap slot allocation process.
+ */
+ return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0};
+}
+
int vswap_init(void)
{
+ swp_desc_cache = KMEM_CACHE(swp_desc, 0);
+ if (!swp_desc_cache)
+ return -ENOMEM;
+
+ if (xa_insert(&vswap_cluster_map, 0, xa_mk_value(1), GFP_KERNEL)) {
+ kmem_cache_destroy(swp_desc_cache);
+ return -ENOMEM;
+ }
+
if (vswap_debug_fs_init())
pr_warn("Failed to initialize vswap debugfs\n");
return 0;
}
+
+void vswap_exit(void)
+{
+ kmem_cache_destroy(swp_desc_cache);
+}
--
2.47.1
* [RFC PATCH v2 09/18] swap: implement the swap_cgroup API using virtual swap
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (7 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 08/18] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 10/18] swap: manage swap entry lifetime at the virtual swap layer Nhat Pham
` (10 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
Once we decouple a swap entry from its backing store via the virtual
swap layer, we can no longer statically allocate an array to store the
swap entries' cgroup information. Move it to the swap descriptor.
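As a rough illustration of the change (the types and names below are invented
for exposition and are not the kernel's data structures): the old scheme looks
the id up in a per-device array indexed by the physical slot offset, while the
new scheme reads it straight from the entry's swap descriptor, so no array has
to be sized at swapon time.

    #include <stdio.h>
    #include <stdlib.h>

    /* old scheme: a per-device array of ids, sized at swapon time */
    struct old_swap_cgroup_map {
            unsigned short *ids;
            unsigned long nr_slots;
    };

    /* new scheme: the id travels with the (virtual) swap entry itself */
    struct swp_desc_sketch {
            unsigned short memcgid;
    };

    static unsigned short old_lookup(struct old_swap_cgroup_map *map,
                                     unsigned long offset)
    {
            return map->ids[offset];
    }

    static unsigned short new_lookup(struct swp_desc_sketch *desc)
    {
            return desc->memcgid;
    }

    int main(void)
    {
            struct old_swap_cgroup_map map = { calloc(8, sizeof(unsigned short)), 8 };
            struct swp_desc_sketch desc = { .memcgid = 42 };

            map.ids[3] = 42;
            printf("old: %u, new: %u\n", old_lookup(&map, 3), new_lookup(&desc));
            free(map.ids);
            return 0;
    }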
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/Makefile | 2 ++
mm/vswap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 73 insertions(+), 1 deletion(-)
diff --git a/mm/Makefile b/mm/Makefile
index b7216c714fa1..35f2f282c8da 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -101,8 +101,10 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
ifdef CONFIG_SWAP
+ifndef CONFIG_VIRTUAL_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
+endif
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_GUP_TEST) += gup_test.o
obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
diff --git a/mm/vswap.c b/mm/vswap.c
index 23a05c3393d8..3792fa7f766b 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -27,10 +27,14 @@
*
* @slot: The handle to the physical swap slot backing this page.
* @rcu: The RCU head to free the descriptor with an RCU grace period.
+ * @memcgid: The memcg id of the owning memcg, if any.
*/
struct swp_desc {
swp_slot_t slot;
struct rcu_head rcu;
+#ifdef CONFIG_MEMCG
+ atomic_t memcgid;
+#endif
};
/* Virtual swap space - swp_entry_t -> struct swp_desc */
@@ -122,8 +126,10 @@ static swp_entry_t vswap_alloc(int nr)
return (swp_entry_t){0};
}
- for (i = 0; i < nr; i++)
+ for (i = 0; i < nr; i++) {
descs[i]->slot.val = 0;
+ atomic_set(&descs[i]->memcgid, 0);
+ }
xa_lock(&vswap_map);
if (nr == 1) {
@@ -352,6 +358,70 @@ swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0};
}
+#ifdef CONFIG_MEMCG
+static unsigned short vswap_cgroup_record(swp_entry_t entry,
+ unsigned short memcgid, unsigned int nr_ents)
+{
+ struct swp_desc *desc;
+ unsigned short oldid, iter = 0;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr_ents - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ oldid = atomic_xchg(&desc->memcgid, memcgid);
+ if (!iter)
+ iter = oldid;
+ VM_WARN_ON(iter != oldid);
+ }
+ rcu_read_unlock();
+
+ return oldid;
+}
+
+void swap_cgroup_record(struct folio *folio, unsigned short memcgid,
+ swp_entry_t entry)
+{
+ unsigned short oldid =
+ vswap_cgroup_record(entry, memcgid, folio_nr_pages(folio));
+
+ VM_WARN_ON(oldid);
+}
+
+unsigned short swap_cgroup_clear(swp_entry_t entry, unsigned int nr_ents)
+{
+ return vswap_cgroup_record(entry, 0, nr_ents);
+}
+
+unsigned short lookup_swap_cgroup_id(swp_entry_t entry)
+{
+ struct swp_desc *desc;
+ unsigned short ret;
+
+ /*
+ * Note that the virtual swap slot can be freed under us, for instance in
+ * the invocation of mem_cgroup_swapin_charge_folio. We need to wrap the
+ * entire lookup in RCU read-side critical section, and double check the
+ * existence of the swap descriptor.
+ */
+ rcu_read_lock();
+ desc = xa_load(&vswap_map, entry.val);
+ ret = desc ? atomic_read(&desc->memcgid) : 0;
+ rcu_read_unlock();
+ return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+ return 0;
+}
+
+void swap_cgroup_swapoff(int type) {}
+#endif /* CONFIG_MEMCG */
+
int vswap_init(void)
{
swp_desc_cache = KMEM_CACHE(swp_desc, 0);
--
2.47.1
* [RFC PATCH v2 10/18] swap: manage swap entry lifetime at the virtual swap layer
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (8 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 09/18] swap: implement the swap_cgroup API using virtual swap Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 11/18] mm: swap: temporarily disable THP swapin and batched freeing swap Nhat Pham
` (9 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
This patch moves swap entry lifetime management to the virtual swap
layer (when swap virtualization is enabled), by adding to the swap
descriptor an atomic field named "swap_refs" that accounts for:
1. Whether the swap entry is in swap cache (or about to be added). This
is indicated by the last bit of the field.
2. The swap count of the swap entry, which counts the number of page
table entries at which the swap entry is inserted. This is given by
the remaining bits of the field.
We also re-implement all of the swap entry lifetime API
(swap_duplicate(), swap_free_nr(), swapcache_prepare(), etc.) in the
virtual swap layer.
For now, we do not implement swap count continuation - the swap count
bits in the swap descriptor are wide enough to hold the maximum possible
swap count. This vastly simplifies the logic.
Note that the swapfile's swap map can now be reduced under the virtual swap
implementation, as each slot can now only have 3 states: unallocated,
allocated, and bad slot. However, I leave this simplification to future work,
to minimize the amount of code change for review here.
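To make the encoding concrete, here is a minimal userspace sketch of the same
idea (the helpers and names below are illustrative only and do not match the
kernel code): bit 0 is the swap cache pin, and each page table reference adds
one unit above it.

    #include <assert.h>
    #include <stdatomic.h>
    #include <stdbool.h>

    #define CACHE_PIN  1   /* bit 0: pinned in (or about to enter) the swap cache */
    #define COUNT_UNIT 2   /* one page table reference */

    static bool in_swap_cache(atomic_int *refs)
    {
            return atomic_load(refs) & CACHE_PIN;
    }

    static int swap_count(atomic_int *refs)
    {
            return atomic_load(refs) >> 1;
    }

    int main(void)
    {
            /* a freshly allocated entry starts at 1: about to be cached */
            atomic_int refs = CACHE_PIN;

            atomic_fetch_add(&refs, COUNT_UNIT);   /* mapped into one PTE */
            assert(in_swap_cache(&refs) && swap_count(&refs) == 1);

            atomic_fetch_sub(&refs, CACHE_PIN);    /* dropped from the swap cache */
            if (atomic_fetch_sub(&refs, COUNT_UNIT) == COUNT_UNIT)
                    assert(atomic_load(&refs) == 0);   /* last reference gone */
            return 0;
    }

In the patch below, the reference count reaching zero is what triggers
vswap_free().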
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 40 ++++-
mm/memory.c | 6 +
mm/swapfile.c | 124 +++++++++++---
mm/vswap.c | 400 ++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 536 insertions(+), 34 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0f1337431e27..798adfbd43cb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -225,6 +225,11 @@ enum {
#define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10)
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
+#ifdef CONFIG_VIRTUAL_SWAP
+/* Swapfile's swap map state */
+#define SWAP_MAP_ALLOCATED 0x01 /* Page is allocated */
+#define SWAP_MAP_BAD 0x02 /* Page is bad */
+#else
/* Bit flag in swap_map */
#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
#define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */
@@ -236,6 +241,7 @@ enum {
/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX 0x7f /* Max count */
+#endif /* CONFIG_VIRTUAL_SWAP */
/*
* We use this to track usage of a cluster. A cluster is a block of swap disk
@@ -453,7 +459,7 @@ extern void __meminit kswapd_stop(int nid);
#ifdef CONFIG_SWAP
-/* Lifetime swap API (mm/swapfile.c) */
+/* Lifetime swap API (mm/swapfile.c or mm/vswap.c) */
swp_entry_t folio_alloc_swap(struct folio *folio);
bool folio_free_swap(struct folio *folio);
void swap_shmem_alloc(swp_entry_t, int);
@@ -507,7 +513,9 @@ void swap_slot_put_folio(swp_slot_t slot, struct folio *folio);
swp_slot_t swap_slot_alloc_of_type(int);
int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order);
void swap_slot_free_nr(swp_slot_t slot, int nr_pages);
+#ifndef CONFIG_VIRTUAL_SWAP
int add_swap_count_continuation(swp_entry_t, gfp_t);
+#endif
void swap_slot_cache_free_slots(swp_slot_t *slots, int n);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
@@ -560,10 +568,12 @@ static inline void free_swap_cache(struct folio *folio)
{
}
+#ifndef CONFIG_VIRTUAL_SWAP
static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
{
return 0;
}
+#endif
static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
{
@@ -729,9 +739,14 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
#ifdef CONFIG_VIRTUAL_SWAP
int vswap_init(void);
void vswap_exit(void);
-void vswap_free(swp_entry_t entry);
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry);
swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot);
+bool vswap_tryget(swp_entry_t entry);
+void vswap_put(swp_entry_t entry);
+bool folio_swapped(struct folio *folio);
+bool vswap_swapcache_only(swp_entry_t entry, int nr);
+int non_swapcache_batch(swp_entry_t entry, int nr);
+void put_swap_folio(struct folio *folio, swp_entry_t entry);
#else /* CONFIG_VIRTUAL_SWAP */
static inline int vswap_init(void)
{
@@ -765,26 +780,41 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
{
return (swp_entry_t) { slot.val };
}
-#endif /* CONFIG_VIRTUAL_SWAP */
static inline void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio);
}
+#endif /* CONFIG_VIRTUAL_SWAP */
static inline bool trylock_swapoff(swp_entry_t entry,
struct swap_info_struct **si)
{
swp_slot_t slot = swp_entry_to_swp_slot(entry);
+ /*
+ * Note that in the virtual swap implementation, we do not need to do anything
+ * to guard against concurrent swapoff for the swap entry's metadata:
+ *
+ * 1. The swap descriptor (struct swp_desc) has its existence guaranteed by
+ * RCU + its reference count.
+ *
+ * 2. Swap cache, zswap trees, etc. are all statically declared, and never
+ * freed.
+ *
+ * We do, however, need a reference to the swap device itself, because we
+ * need swap device's metadata in certain scenarios, for example when we
+ * need to inspect the swap device flag in do_swap_page().
+ */
*si = swap_slot_tryget_swap_info(slot);
- return *si;
+ return IS_ENABLED(CONFIG_VIRTUAL_SWAP) || *si;
}
static inline void unlock_swapoff(swp_entry_t entry,
struct swap_info_struct *si)
{
- swap_slot_put_swap_info(si);
+ if (si)
+ swap_slot_put_swap_info(si);
}
#endif /* __KERNEL__*/
diff --git a/mm/memory.c b/mm/memory.c
index c44e845b5320..a8c418104f28 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1202,10 +1202,14 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (ret == -EIO) {
VM_WARN_ON_ONCE(!entry.val);
+ /* virtual swap implementation does not have swap count continuation */
+ VM_WARN_ON_ONCE(IS_ENABLED(CONFIG_VIRTUAL_SWAP));
+#ifndef CONFIG_VIRTUAL_SWAP
if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
ret = -ENOMEM;
goto out;
}
+#endif
entry.val = 0;
} else if (ret == -EBUSY || unlikely(ret == -EHWPOISON)) {
goto out;
@@ -4123,6 +4127,7 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#ifndef CONFIG_VIRTUAL_SWAP
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
swp_slot_t slot = swp_entry_to_swp_slot(entry);
@@ -4143,6 +4148,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
return i;
}
+#endif
/*
* Check if the PTEs within a range are contiguous swap entries
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 849525810bbe..c09011867263 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -50,8 +50,10 @@
#include "internal.h"
#include "swap.h"
+#ifndef CONFIG_VIRTUAL_SWAP
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
+#endif
static void free_swap_count_continuations(struct swap_info_struct *);
static void swap_slot_range_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
@@ -156,6 +158,25 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
/* Reclaim directly, bypass the slot cache and don't touch device lock */
#define TTRS_DIRECT 0x8
+#ifdef CONFIG_VIRTUAL_SWAP
+static inline unsigned char swap_count(unsigned char ent)
+{
+ return ent;
+}
+
+static bool swap_is_has_cache(struct swap_info_struct *si,
+ unsigned long offset, int nr_pages)
+{
+ swp_entry_t entry = swp_slot_to_swp_entry(swp_slot(si->type, offset));
+
+ return vswap_swapcache_only(entry, nr_pages);
+}
+
+static bool swap_cache_only(struct swap_info_struct *si, unsigned long offset)
+{
+ return swap_is_has_cache(si, offset, 1);
+}
+#else
static inline unsigned char swap_count(unsigned char ent)
{
return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
@@ -176,6 +197,11 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
return true;
}
+static bool swap_cache_only(struct swap_info_struct *si, unsigned long offset)
+{
+ return READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE;
+}
+
static bool swap_is_last_map(struct swap_info_struct *si,
unsigned long offset, int nr_pages, bool *has_cache)
{
@@ -194,6 +220,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
*has_cache = !!(count & SWAP_HAS_CACHE);
return true;
}
+#endif
/*
* returns number of pages in the folio that backs the swap entry. If positive,
@@ -250,7 +277,11 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
if (!need_reclaim)
goto out_unlock;
- if (!(flags & TTRS_DIRECT)) {
+ /*
+ * For now, virtual swap implementation only supports freeing through the
+ * swap slot cache...
+ */
+ if (!(flags & TTRS_DIRECT) || IS_ENABLED(CONFIG_VIRTUAL_SWAP)) {
/* Free through slot cache */
delete_from_swap_cache(folio);
folio_set_dirty(folio);
@@ -700,7 +731,12 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
case 0:
offset++;
break;
+#ifdef CONFIG_VIRTUAL_SWAP
+ /* __try_to_reclaim_swap() checks if the slot is in-cache only */
+ case SWAP_MAP_ALLOCATED:
+#else
case SWAP_HAS_CACHE:
+#endif
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
if (nr_reclaim > 0)
offset += nr_reclaim;
@@ -731,19 +767,20 @@ static bool cluster_scan_range(struct swap_info_struct *si,
{
unsigned long offset, end = start + nr_pages;
unsigned char *map = si->swap_map;
+ unsigned char count;
for (offset = start; offset < end; offset++) {
- switch (READ_ONCE(map[offset])) {
- case 0:
+ count = READ_ONCE(map[offset]);
+ if (!count)
continue;
- case SWAP_HAS_CACHE:
+
+ if (swap_cache_only(si, offset)) {
if (!vm_swap_full())
return false;
*need_reclaim = true;
continue;
- default:
- return false;
}
+ return false;
}
return true;
@@ -836,7 +873,6 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
long to_scan = 1;
unsigned long offset, end;
struct swap_cluster_info *ci;
- unsigned char *map = si->swap_map;
int nr_reclaim;
if (force)
@@ -848,7 +884,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+ if (swap_cache_only(si, offset)) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY | TTRS_DIRECT);
@@ -1175,6 +1211,10 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
{
int n_ret = 0;
+#ifdef CONFIG_VIRTUAL_SWAP
+ VM_WARN_ON(usage != SWAP_MAP_ALLOCATED);
+#endif
+
while (n_ret < nr) {
unsigned long offset = cluster_alloc_swap_slot(si, order, usage);
@@ -1192,6 +1232,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
{
unsigned int nr_pages = 1 << order;
+#ifdef CONFIG_VIRTUAL_SWAP
+ VM_WARN_ON(usage != SWAP_MAP_ALLOCATED);
+#endif
+
/*
* We try to cluster swap pages by allocating them sequentially
* in swap. Once we've allocated SWAPFILE_CLUSTER pages this
@@ -1248,7 +1292,13 @@ int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order)
long avail_pgs;
int n_ret = 0;
int node;
+ unsigned char usage;
+#ifdef CONFIG_VIRTUAL_SWAP
+ usage = SWAP_MAP_ALLOCATED;
+#else
+ usage = SWAP_HAS_CACHE;
+#endif
spin_lock(&swap_avail_lock);
avail_pgs = atomic_long_read(&nr_swap_pages) / size;
@@ -1268,8 +1318,7 @@ int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order)
plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
- n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
- n_goal, swp_slots, order);
+ n_ret = scan_swap_map_slots(si, usage, n_goal, swp_slots, order);
swap_slot_put_swap_info(si);
if (n_ret || size > 1)
goto check_out;
@@ -1402,6 +1451,17 @@ struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot)
return NULL;
}
+#ifdef CONFIG_VIRTUAL_SWAP
+static unsigned char __swap_slot_free_locked(struct swap_info_struct *si,
+ unsigned long offset,
+ unsigned char usage)
+{
+ VM_WARN_ON(usage != 1);
+ VM_WARN_ON(si->swap_map[offset] != SWAP_MAP_ALLOCATED);
+
+ return 0;
+}
+#else
static unsigned char __swap_slot_free_locked(struct swap_info_struct *si,
unsigned long offset,
unsigned char usage)
@@ -1499,6 +1559,7 @@ static bool __swap_slots_free(struct swap_info_struct *si,
}
return has_cache;
}
+#endif /* CONFIG_VIRTUAL_SWAP */
/*
* Drop the last HAS_CACHE flag of swap entries, caller have to
@@ -1511,21 +1572,17 @@ static void swap_slot_range_free(struct swap_info_struct *si,
unsigned long offset = swp_slot_offset(slot);
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
- swp_entry_t entry = swp_slot_to_swp_entry(slot);
-#ifdef CONFIG_VIRTUAL_SWAP
- int i;
+ unsigned char usage;
- /* release all the associated (virtual) swap slots */
- for (i = 0; i < nr_pages; i++) {
- vswap_free(entry);
- entry.val++;
- }
+#ifdef CONFIG_VIRTUAL_SWAP
+ usage = SWAP_MAP_ALLOCATED;
#else
/*
* In the new (i.e virtual swap) implementation, we will let the virtual
* swap layer handle the cgroup swap accounting and charging.
*/
- mem_cgroup_uncharge_swap(entry, nr_pages);
+ mem_cgroup_uncharge_swap(swp_slot_to_swp_entry(slot), nr_pages);
+ usage = SWAP_HAS_CACHE;
#endif
/* It should never free entries across different clusters */
@@ -1535,7 +1592,7 @@ static void swap_slot_range_free(struct swap_info_struct *si,
ci->count -= nr_pages;
do {
- VM_BUG_ON(*map != SWAP_HAS_CACHE);
+ VM_BUG_ON(*map != usage);
*map = 0;
} while (++map < map_end);
@@ -1580,6 +1637,7 @@ void swap_slot_free_nr(swp_slot_t slot, int nr_pages)
}
}
+#ifndef CONFIG_VIRTUAL_SWAP
/*
* Caller has made sure that the swap device corresponding to entry
* is still around or has not been recycled.
@@ -1588,9 +1646,11 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
{
swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages);
}
+#endif
/*
- * Called after dropping swapcache to decrease refcnt to swap entries.
+ * This should only be called in contexts in which the slot has
+ * been allocated but not associated with any swap entries.
*/
void swap_slot_put_folio(swp_slot_t slot, struct folio *folio)
{
@@ -1598,23 +1658,31 @@ void swap_slot_put_folio(swp_slot_t slot, struct folio *folio)
struct swap_cluster_info *ci;
struct swap_info_struct *si;
int size = 1 << swap_slot_order(folio_order(folio));
+ unsigned char usage;
si = _swap_info_get(slot);
if (!si)
return;
+#ifdef CONFIG_VIRTUAL_SWAP
+ usage = SWAP_MAP_ALLOCATED;
+#else
+ usage = SWAP_HAS_CACHE;
+#endif
+
ci = lock_cluster(si, offset);
if (swap_is_has_cache(si, offset, size))
swap_slot_range_free(si, ci, slot, size);
else {
for (int i = 0; i < size; i++, slot.val++) {
- if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE))
+ if (!__swap_slot_free_locked(si, offset + i, usage))
swap_slot_range_free(si, ci, slot, 1);
}
}
unlock_cluster(ci);
}
+#ifndef CONFIG_VIRTUAL_SWAP
int __swap_count(swp_entry_t entry)
{
swp_slot_t slot = swp_entry_to_swp_slot(entry);
@@ -1785,7 +1853,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
*/
for (offset = start_offset; offset < end_offset; offset += nr) {
nr = 1;
- if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+ if (swap_cache_only(si, offset)) {
/*
* Folios are always naturally aligned in swap so
* advance forward to the next boundary. Zero means no
@@ -1807,6 +1875,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
out:
swap_slot_put_swap_info(si);
}
+#endif /* CONFIG_VIRTUAL_SWAP */
void swap_slot_cache_free_slots(swp_slot_t *slots, int n)
{
@@ -3587,6 +3656,14 @@ pgoff_t __folio_swap_cache_index(struct folio *folio)
}
EXPORT_SYMBOL_GPL(__folio_swap_cache_index);
+#ifdef CONFIG_VIRTUAL_SWAP
+/*
+ * We do not use continuation in virtual swap implementation.
+ */
+static void free_swap_count_continuations(struct swap_info_struct *si)
+{
+}
+#else /* CONFIG_VIRTUAL_SWAP */
/*
* Verify that nr swap entries are valid and increment their swap map counts.
*
@@ -3944,6 +4021,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
}
}
}
+#endif /* CONFIG_VIRTUAL_SWAP */
#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
diff --git a/mm/vswap.c b/mm/vswap.c
index 3792fa7f766b..513d000a134c 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -18,8 +18,23 @@
* us to change the backing state of a swapped out page without having to
* update every single page table entry referring to it.
*
- * For now, there is a one-to-one correspondence between a virtual swap slot
- * and its associated physical swap slot.
+ *
+ * I. Swap Entry Lifetime
+ *
+ * The swap entry's lifetime is now managed at the virtual swap layer. We
+ * assign each virtual swap slot a reference count, which includes:
+ *
+ * 1. The number of page table entries that refer to the virtual swap slot, i.e
+ * its swap count.
+ *
+ * 2. Whether the virtual swap slot has been added to the swap cache - if so,
+ * its reference count is incremented by 1.
+ *
+ * Each virtual swap slot starts out with a reference count of 1 (since it is
+ * about to be added to the swap cache). Its reference count is incremented or
+ * decremented every time it is mapped to or unmapped from a PTE, as well as
+ * when it is added to or removed from the swap cache. Finally, when its
+ * reference count reaches 0, the virtual swap slot is freed.
*/
/**
@@ -27,14 +42,24 @@
*
* @slot: The handle to the physical swap slot backing this page.
* @rcu: The RCU head to free the descriptor with an RCU grace period.
+ * @lock: The lock protecting the swap slot backing field.
* @memcgid: The memcg id of the owning memcg, if any.
+ * @swap_refs: This field stores all the references to the swap entry. The
+ * least significant bit indicates whether the swap entry is (about
+ * to be) pinned in swap cache. The remaining bits tell us the
+ * number of page table entries that refer to the swap entry.
*/
struct swp_desc {
swp_slot_t slot;
struct rcu_head rcu;
+
+ rwlock_t lock;
+
#ifdef CONFIG_MEMCG
atomic_t memcgid;
#endif
+
+ atomic_t swap_refs;
};
/* Virtual swap space - swp_entry_t -> struct swp_desc */
@@ -78,6 +103,11 @@ static struct kmem_cache *swp_desc_cache;
static atomic_t vswap_alloc_reject;
static atomic_t vswap_used;
+/* least significant bit is for swap cache pin, the rest is for swap count. */
+#define SWAP_CACHE_SHIFT 1
+#define SWAP_CACHE_INC 1
+#define SWAP_COUNT_INC 2
+
#ifdef CONFIG_DEBUG_FS
#include <linux/debugfs.h>
@@ -129,6 +159,9 @@ static swp_entry_t vswap_alloc(int nr)
for (i = 0; i < nr; i++) {
descs[i]->slot.val = 0;
atomic_set(&descs[i]->memcgid, 0);
+ /* swap entry is about to be added to the swap cache */
+ atomic_set(&descs[i]->swap_refs, 1);
+ rwlock_init(&descs[i]->lock);
}
xa_lock(&vswap_map);
@@ -215,7 +248,7 @@ static inline void release_vswap_slot(unsigned long index)
* vswap_free - free a virtual swap slot.
* @entry: the virtual swap slot to free
*/
-void vswap_free(swp_entry_t entry)
+static void vswap_free(swp_entry_t entry)
{
struct swp_desc *desc;
@@ -233,6 +266,7 @@ void vswap_free(swp_entry_t entry)
/* we only charge after linkage was established */
mem_cgroup_uncharge_swap(entry, 1);
xa_erase(&vswap_rmap, desc->slot.val);
+ swap_slot_free_nr(desc->slot, 1);
}
/* erase forward mapping and release the virtual slot for reallocation */
@@ -332,12 +366,24 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
{
struct swp_desc *desc;
+ swp_slot_t slot;
if (!entry.val)
return (swp_slot_t){0};
+ rcu_read_lock();
desc = xa_load(&vswap_map, entry.val);
- return desc ? desc->slot : (swp_slot_t){0};
+ if (!desc) {
+ rcu_read_unlock();
+ return (swp_slot_t){0};
+ }
+
+ read_lock(&desc->lock);
+ slot = desc->slot;
+ read_unlock(&desc->lock);
+ rcu_read_unlock();
+
+ return slot;
}
/**
@@ -349,13 +395,355 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
*/
swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
{
- void *entry = xa_load(&vswap_rmap, slot.val);
+ swp_entry_t ret;
+ void *entry;
+ rcu_read_lock();
/*
* entry can be NULL if we fail to link the virtual and physical swap slot
* during the swap slot allocation process.
*/
- return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0};
+ entry = xa_load(&vswap_rmap, slot.val);
+ if (!entry)
+ ret.val = 0;
+ else
+ ret = (swp_entry_t){xa_to_value(entry)};
+ rcu_read_unlock();
+ return ret;
+}
+
+/*
+ * Decrease the swap count of nr contiguous swap entries by 1 (when the swap
+ * entries are removed from a range of PTEs), and check if any of the swap
+ * entries are in swap cache only after its swap count is decreased.
+ *
+ * The check is racy, but it is OK because free_swap_and_cache_nr() only use
+ * the result as a hint.
+ */
+static bool vswap_free_nr_any_cache_only(swp_entry_t entry, int nr)
+{
+ struct swp_desc *desc;
+ bool ret = false;
+ int end = entry.val + nr - 1;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, end) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ /* 1 page table entry ref + 1 swap cache ref == 11 (binary) */
+ ret |= (atomic_read(&desc->swap_refs) == 3);
+ if (atomic_sub_and_test(SWAP_COUNT_INC, &desc->swap_refs))
+ vswap_free(entry);
+ entry.val++;
+ }
+ rcu_read_unlock();
+ return ret;
+}
+
+/**
+ * swap_free_nr - decrease the swap count of nr contiguous swap entries by 1
+ * (when the swap entries are removed from a range of PTEs).
+ * @entry: the first entry in the range.
+ * @nr: the number of entries in the range.
+ */
+void swap_free_nr(swp_entry_t entry, int nr)
+{
+ vswap_free_nr_any_cache_only(entry, nr);
+}
+
+static int swap_duplicate_nr(swp_entry_t entry, int nr)
+{
+ struct swp_desc *desc;
+ int i = 0;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ if (!desc || !atomic_add_unless(&desc->swap_refs, SWAP_COUNT_INC, 0))
+ goto done;
+ i++;
+ }
+done:
+ rcu_read_unlock();
+ if (i && i < nr)
+ swap_free_nr(entry, i);
+
+ return i == nr ? 0 : -ENOENT;
+}
+
+/**
+ * swap_duplicate - increase the swap count of the swap entry by 1 (i.e when
+ * the swap entry is stored at a new PTE).
+ * @entry: the swap entry.
+ *
+ * Return: 0 (always).
+ *
+ * Note that according to the existing API, we ALWAYS returns 0 unless a swap
+ * continuation is required (which is no longer the case in the new design).
+ */
+int swap_duplicate(swp_entry_t entry)
+{
+ swap_duplicate_nr(entry, 1);
+ return 0;
+}
+
+static int vswap_swap_count(atomic_t *swap_refs)
+{
+ return atomic_read(swap_refs) >> SWAP_CACHE_SHIFT;
+}
+
+bool folio_swapped(struct folio *folio)
+{
+ swp_entry_t entry = folio->swap;
+ int nr = folio_nr_pages(folio);
+ struct swp_desc *desc;
+ bool swapped = false;
+
+ if (!entry.val)
+ return false;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ if (desc && vswap_swap_count(&desc->swap_refs)) {
+ swapped = true;
+ break;
+ }
+ }
+ rcu_read_unlock();
+ return swapped;
+}
+
+/**
+ * swp_swapcount - return the swap count of the swap entry.
+ * @entry: the swap entry.
+ *
+ * Note that all the swap count functions are identical in the new design,
+ * since we no longer need swap count continuation.
+ *
+ * Return: the swap count of the swap entry.
+ */
+int swp_swapcount(swp_entry_t entry)
+{
+ struct swp_desc *desc;
+ unsigned int ret;
+
+ rcu_read_lock();
+ desc = xa_load(&vswap_map, entry.val);
+ ret = desc ? vswap_swap_count(&desc->swap_refs) : 0;
+ rcu_read_unlock();
+
+ return ret;
+}
+
+int __swap_count(swp_entry_t entry)
+{
+ return swp_swapcount(entry);
+}
+
+int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
+{
+ return swp_swapcount(entry);
+}
+
+void swap_shmem_alloc(swp_entry_t entry, int nr)
+{
+ swap_duplicate_nr(entry, nr);
+}
+
+void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
+{
+ struct swp_desc *desc;
+ int end = entry.val + nr - 1;
+
+ if (!nr)
+ return;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, end) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ if (atomic_dec_and_test(&desc->swap_refs))
+ vswap_free(entry);
+ entry.val++;
+ }
+ rcu_read_unlock();
+}
+
+int swapcache_prepare(swp_entry_t entry, int nr)
+{
+ struct swp_desc *desc;
+ int old, new, i = 0, ret = 0;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ if (!desc) {
+ ret = -ENOENT;
+ goto done;
+ }
+
+ old = atomic_read(&desc->swap_refs);
+ do {
+ new = old;
+ ret = 0;
+
+ if (!old)
+ ret = -ENOENT;
+ else if (old & SWAP_CACHE_INC)
+ ret = -EEXIST;
+ else
+ new += SWAP_CACHE_INC;
+ } while (!atomic_try_cmpxchg(&desc->swap_refs, &old, new));
+
+ if (ret)
+ goto done;
+
+ i++;
+ }
+done:
+ rcu_read_unlock();
+ if (i && i < nr)
+ swapcache_clear(NULL, entry, i);
+ if (i < nr && !ret)
+ ret = -ENOENT;
+ return ret;
+}
+
+/**
+ * vswap_swapcache_only - check if all the slots in the range are still valid,
+ * and are in swap cache only (i.e not stored in any
+ * PTEs).
+ * @entry: the first slot in the range.
+ * @nr: the number of slots in the range.
+ *
+ * Return: true if all the slots in the range are still valid, and are in swap
+ * cache only, or false otherwise.
+ */
+bool vswap_swapcache_only(swp_entry_t entry, int nr)
+{
+ struct swp_desc *desc;
+ int i = 0;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ if (!desc || atomic_read(&desc->swap_refs) != SWAP_CACHE_INC)
+ goto done;
+ i++;
+ }
+done:
+ rcu_read_unlock();
+ return i == nr;
+}
+
+/**
+ * non_swapcache_batch - count the longest run of swap slots, starting from a
+ * given slot, that are still valid but not in swap cache.
+ * @entry: the first slot to check.
+ * @max_nr: the maximum number of slots to check.
+ *
+ * Return: the number of slots in the longest range that are still valid, but
+ * not in swap cache.
+ */
+int non_swapcache_batch(swp_entry_t entry, int max_nr)
+{
+ struct swp_desc *desc;
+ int swap_refs, i = 0;
+
+ if (!entry.val)
+ return 0;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + max_nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ swap_refs = atomic_read(&desc->swap_refs);
+ if (!(swap_refs & SWAP_CACHE_INC) && (swap_refs >> SWAP_CACHE_SHIFT))
+ goto done;
+ i++;
+ }
+done:
+ rcu_read_unlock();
+ return i;
+}
+
+/**
+ * free_swap_and_cache_nr() - Release a swap count on range of swap entries and
+ * reclaim their cache if no more references remain.
+ * @entry: First entry of range.
+ * @nr: Number of entries in range.
+ *
+ * For each swap entry in the contiguous range, release a swap count. If any
+ * swap entries have their swap count decremented to zero, try to reclaim their
+ * associated swap cache pages.
+ */
+void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+{
+ int i = 0, incr = 1;
+ struct folio *folio;
+
+ if (non_swap_entry(entry))
+ return;
+
+ if (vswap_free_nr_any_cache_only(entry, nr)) {
+ while (i < nr) {
+ incr = 1;
+ if (vswap_swapcache_only(entry, 1)) {
+ folio = filemap_get_folio(swap_address_space(entry),
+ swap_cache_index(entry));
+ if (IS_ERR(folio))
+ goto next;
+ if (!folio_trylock(folio)) {
+ folio_put(folio);
+ goto next;
+ }
+ incr = folio_nr_pages(folio);
+ folio_free_swap(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+next:
+ i += incr;
+ entry.val += incr;
+ }
+ }
+}
+
+/*
+ * Called after dropping swapcache to decrease refcnt to swap entries.
+ */
+void put_swap_folio(struct folio *folio, swp_entry_t entry)
+{
+ int nr = folio_nr_pages(folio);
+
+ VM_WARN_ON(!folio_test_locked(folio));
+ swapcache_clear(NULL, entry, nr);
}
#ifdef CONFIG_MEMCG
--
2.47.1
* [RFC PATCH v2 11/18] mm: swap: temporarily disable THP swapin and batched freeing swap
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (9 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 10/18] swap: manage swap entry lifetime at the virtual swap layer Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 12/18] mm: swap: decouple virtual swap slot from backing store Nhat Pham
` (8 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
Disable THP swapin on the virtual swap implementation, for now. Similarly,
only operate on one swap entry at a time when we zap a PTE range. There
is no real reason why we cannot build support for this in the new
design; this is simply to keep the following patch, which decouples swap
backends, smaller and more manageable for reviewers. These capabilities
will be restored in a later patch.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/internal.h | 16 ++++++++--------
mm/memory.c | 4 +++-
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index ca28729f822a..51061691a731 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -268,17 +268,12 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
return (swp_entry_t) { entry.val + n };
}
-/* similar to swap_nth, but check the backing physical slots as well. */
+/* temporary disallow batched swap operations */
static inline swp_entry_t swap_move(swp_entry_t entry, long delta)
{
- swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot;
- swp_entry_t next_entry = swap_nth(entry, delta);
-
- next_slot = swp_entry_to_swp_slot(next_entry);
- if (swp_slot_type(slot) != swp_slot_type(next_slot) ||
- swp_slot_offset(slot) + delta != swp_slot_offset(next_slot))
- next_entry.val = 0;
+ swp_entry_t next_entry;
+ next_entry.val = 0;
return next_entry;
}
#else
@@ -349,6 +344,8 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
* max_nr must be at least one and must be limited by the caller so scanning
* cannot exceed a single page table.
*
+ * Note that for virtual swap space, we will not batch anything for now.
+ *
* Return: the number of table entries in the batch.
*/
static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
@@ -363,6 +360,9 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
VM_WARN_ON(!is_swap_pte(pte));
VM_WARN_ON(non_swap_entry(entry));
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP))
+ return 1;
+
cgroup_id = lookup_swap_cgroup_id(entry);
while (ptep < end_ptep) {
pte = ptep_get(ptep);
diff --git a/mm/memory.c b/mm/memory.c
index a8c418104f28..2a8fd26fb31d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4230,8 +4230,10 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* A large swapped out folio could be partially or fully in zswap. We
* lack handling for such cases, so fallback to swapping in order-0
* folio.
+ *
+ * We also disable THP swapin on the virtual swap implementation, for now.
*/
- if (!zswap_never_enabled())
+ if (!zswap_never_enabled() || IS_ENABLED(CONFIG_VIRTUAL_SWAP))
goto fallback;
entry = pte_to_swp_entry(vmf->orig_pte);
--
2.47.1
* [RFC PATCH v2 12/18] mm: swap: decouple virtual swap slot from backing store
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (10 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 11/18] mm: swap: temporarily disable THP swapin and batched freeing swap Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 13/18] zswap: do not start zswap shrinker if there is no physical swap slots Nhat Pham
` (7 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
This patch presents the first real use case of the new virtual swap
design. It leverages the virtualization of the swap space to decouple a
swap entry from its backing storage. A swap entry can now be backed by
one of the following options:
1. A slot on a physical swapfile/swap partition.
2. A "zero swap page".
3. A compressed object in the zswap pool.
4. An in-memory page. This can happen when a page is loaded
(exclusively) from the zswap pool, or if the page is rejected by
zswap and zswap writeback is disabled.
This allows us to use zswap and the zero swap page optimization, without
having to reserve a slot on a swapfile, or a swapfile at all. This
translates to tens to hundreds of GBs of disk savings on hosts and
workloads that have high memory usage, and removes this spurious limit
on the usage of these optimizations.
For now, we still charge virtual swap slots towards the memcg's swap
usage. In a following patch, we will change this behavior and only
charge physical (i.e on swapfile) swap slots towards the memcg's swap
usage.
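The decoupling can be pictured with a small illustrative sketch (the enum,
union and names below are invented for exposition and do not correspond to the
actual swp_desc fields):

    #include <stdio.h>

    enum swap_backing {
            SWAP_BACKING_DISK,     /* slot on a physical swapfile/partition */
            SWAP_BACKING_ZERO,     /* zero-filled page, nothing to store */
            SWAP_BACKING_ZSWAP,    /* compressed object in the zswap pool */
            SWAP_BACKING_FOLIO,    /* still (or again) an in-memory page */
    };

    struct vswap_desc_sketch {
            enum swap_backing type;
            union {
                    unsigned long disk_slot;   /* SWAP_BACKING_DISK */
                    void *zswap_entry;         /* SWAP_BACKING_ZSWAP */
                    void *folio;               /* SWAP_BACKING_FOLIO */
            };
    };

    static const char *backing_name(const struct vswap_desc_sketch *d)
    {
            switch (d->type) {
            case SWAP_BACKING_DISK:  return "disk";
            case SWAP_BACKING_ZERO:  return "zero page";
            case SWAP_BACKING_ZSWAP: return "zswap";
            case SWAP_BACKING_FOLIO: return "in-memory folio";
            }
            return "?";
    }

    int main(void)
    {
            struct vswap_desc_sketch d = { .type = SWAP_BACKING_ZSWAP };

            printf("entry is backed by: %s\n", backing_name(&d));
            return 0;
    }

In the actual patch, helpers such as vswap_disk_backed(), vswap_folio_backed()
and vswap_assoc_zswap() expose this backing state to the rest of mm.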
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 66 +++++-
mm/huge_memory.c | 5 +-
mm/memcontrol.c | 70 ++++--
mm/memory.c | 69 ++++--
mm/migrate.c | 1 +
mm/page_io.c | 31 ++-
mm/shmem.c | 7 +-
mm/swap.h | 10 +
mm/swap_state.c | 23 +-
mm/swapfile.c | 22 +-
mm/vmscan.c | 26 ++-
mm/vswap.c | 528 ++++++++++++++++++++++++++++++++++++++-----
mm/zswap.c | 34 ++-
13 files changed, 743 insertions(+), 149 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 798adfbd43cb..9c92a982d546 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -462,6 +462,7 @@ extern void __meminit kswapd_stop(int nid);
/* Lifetime swap API (mm/swapfile.c or mm/vswap.c) */
swp_entry_t folio_alloc_swap(struct folio *folio);
bool folio_free_swap(struct folio *folio);
+void put_swap_folio(struct folio *folio, swp_entry_t entry);
void swap_shmem_alloc(swp_entry_t, int);
int swap_duplicate(swp_entry_t);
int swapcache_prepare(swp_entry_t entry, int nr);
@@ -509,7 +510,6 @@ static inline long get_nr_swap_pages(void)
}
void si_swapinfo(struct sysinfo *);
-void swap_slot_put_folio(swp_slot_t slot, struct folio *folio);
swp_slot_t swap_slot_alloc_of_type(int);
int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order);
void swap_slot_free_nr(swp_slot_t slot, int nr_pages);
@@ -736,9 +736,12 @@ static inline bool mem_cgroup_swap_full(struct folio *folio)
}
#endif
+struct zswap_entry;
+
#ifdef CONFIG_VIRTUAL_SWAP
int vswap_init(void);
void vswap_exit(void);
+swp_slot_t vswap_alloc_swap_slot(struct folio *folio);
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry);
swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot);
bool vswap_tryget(swp_entry_t entry);
@@ -746,7 +749,13 @@ void vswap_put(swp_entry_t entry);
bool folio_swapped(struct folio *folio);
bool vswap_swapcache_only(swp_entry_t entry, int nr);
int non_swapcache_batch(swp_entry_t entry, int nr);
-void put_swap_folio(struct folio *folio, swp_entry_t entry);
+void vswap_split_huge_page(struct folio *head, struct folio *subpage);
+void vswap_migrate(struct folio *src, struct folio *dst);
+bool vswap_disk_backed(swp_entry_t entry, int nr);
+bool vswap_folio_backed(swp_entry_t entry, int nr);
+void vswap_store_folio(swp_entry_t entry, struct folio *folio);
+void swap_zeromap_folio_set(struct folio *folio);
+void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry);
#else /* CONFIG_VIRTUAL_SWAP */
static inline int vswap_init(void)
{
@@ -781,9 +790,37 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
return (swp_entry_t) { slot.val };
}
-static inline void put_swap_folio(struct folio *folio, swp_entry_t entry)
+static inline swp_slot_t vswap_alloc_swap_slot(struct folio *folio)
+{
+ return swp_entry_to_swp_slot(folio->swap);
+}
+
+static inline void vswap_split_huge_page(struct folio *head,
+ struct folio *subpage)
+{
+}
+
+static inline void vswap_migrate(struct folio *src, struct folio *dst)
+{
+}
+
+static inline bool vswap_disk_backed(swp_entry_t entry, int nr)
+{
+ return true;
+}
+
+static inline bool vswap_folio_backed(swp_entry_t entry, int nr)
+{
+ return false;
+}
+
+static inline void vswap_store_folio(swp_entry_t entry, struct folio *folio)
+{
+}
+
+static inline void vswap_assoc_zswap(swp_entry_t entry,
+ struct zswap_entry *zswap_entry)
{
- swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio);
}
#endif /* CONFIG_VIRTUAL_SWAP */
@@ -802,11 +839,22 @@ static inline bool trylock_swapoff(swp_entry_t entry,
* 2. Swap cache, zswap trees, etc. are all statically declared, and never
* freed.
*
- * We do, however, need a reference to the swap device itself, because we
+ * However, this function does not provide any guarantee that the virtual
+ * swap slot's backing state will be stable. This has several implications:
+ *
+ * 1. We have to obtain a reference to the swap device itself, because we
* need swap device's metadata in certain scenarios, for example when we
* need to inspect the swap device flag in do_swap_page().
+ *
+ * 2. The swap device we are looking up here might be outdated by the time we
+ * return to the caller. It is perfectly OK if the swap_info_struct is only
+ * used in a best-effort manner (i.e optimization). If we need the precise
+ * backing state, we need to re-check after the entry is pinned in swapcache.
*/
- *si = swap_slot_tryget_swap_info(slot);
+ if (vswap_disk_backed(entry, 1))
+ *si = swap_slot_tryget_swap_info(slot);
+ else
+ *si = NULL;
return IS_ENABLED(CONFIG_VIRTUAL_SWAP) || *si;
}
@@ -817,5 +865,11 @@ static inline void unlock_swapoff(swp_entry_t entry,
swap_slot_put_swap_info(si);
}
+static inline struct swap_info_struct *vswap_get_device(swp_entry_t entry)
+{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
+
+ return slot.val ? swap_slot_tryget_swap_info(slot) : NULL;
+}
#endif /* __KERNEL__*/
#endif /* _LINUX_SWAP_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 373781b21e5c..e6832ec2b07a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3172,6 +3172,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
{
struct page *head = &folio->page;
struct page *page_tail = head + tail;
+
/*
* Careful: new_folio is not a "real" folio before we cleared PageTail.
* Don't pass it around before clear_compound_head().
@@ -3227,8 +3228,10 @@ static void __split_huge_page_tail(struct folio *folio, int tail,
VM_WARN_ON_ONCE_PAGE(true, page_tail);
page_tail->private = 0;
}
- if (folio_test_swapcache(folio))
+ if (folio_test_swapcache(folio)) {
new_folio->swap.val = folio->swap.val + tail;
+ vswap_split_huge_page(folio, new_folio);
+ }
/* Page flags must be visible before we make the page non-compound. */
smp_wmb();
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a037ec92881d..126b2d0e6aaa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5095,10 +5095,23 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
rcu_read_unlock();
}
+static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg);
+
long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
- long nr_swap_pages = get_nr_swap_pages();
+ long nr_swap_pages, nr_zswap_pages = 0;
+
+ /*
+ * If swap is virtualized and zswap is enabled, we can still use zswap even
+ * if there is no space left in any swap file/partition.
+ */
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled() &&
+ (mem_cgroup_disabled() || do_memsw_account() ||
+ mem_cgroup_may_zswap(memcg))) {
+ nr_zswap_pages = PAGE_COUNTER_MAX;
+ }
+ nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages());
if (mem_cgroup_disabled() || do_memsw_account())
return nr_swap_pages;
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
@@ -5267,6 +5280,29 @@ static struct cftype swap_files[] = {
};
#ifdef CONFIG_ZSWAP
+static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg)
+{
+ struct mem_cgroup *memcg;
+
+ for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
+ memcg = parent_mem_cgroup(memcg)) {
+ unsigned long max = READ_ONCE(memcg->zswap_max);
+ unsigned long pages;
+
+ if (max == PAGE_COUNTER_MAX)
+ continue;
+ if (max == 0)
+ return false;
+
+ /* Force flush to get accurate stats for charging */
+ __mem_cgroup_flush_stats(memcg, true);
+ pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
+ if (pages >= max)
+ return false;
+ }
+ return true;
+}
+
/**
* obj_cgroup_may_zswap - check if this cgroup can zswap
* @objcg: the object cgroup
@@ -5281,34 +5317,15 @@ static struct cftype swap_files[] = {
*/
bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
{
- struct mem_cgroup *memcg, *original_memcg;
+ struct mem_cgroup *memcg;
bool ret = true;
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return true;
- original_memcg = get_mem_cgroup_from_objcg(objcg);
- for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
- memcg = parent_mem_cgroup(memcg)) {
- unsigned long max = READ_ONCE(memcg->zswap_max);
- unsigned long pages;
-
- if (max == PAGE_COUNTER_MAX)
- continue;
- if (max == 0) {
- ret = false;
- break;
- }
-
- /* Force flush to get accurate stats for charging */
- __mem_cgroup_flush_stats(memcg, true);
- pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
- if (pages < max)
- continue;
- ret = false;
- break;
- }
- mem_cgroup_put(original_memcg);
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ ret = mem_cgroup_may_zswap(memcg);
+ mem_cgroup_put(memcg);
return ret;
}
@@ -5452,6 +5469,11 @@ static struct cftype zswap_files[] = {
},
{ } /* terminate */
};
+#else
+static inline bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg)
+{
+ return false;
+}
#endif /* CONFIG_ZSWAP */
static int __init mem_cgroup_swap_init(void)
diff --git a/mm/memory.c b/mm/memory.c
index 2a8fd26fb31d..d9c382a5e157 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4311,12 +4311,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
struct folio *swapcache, *folio = NULL;
DECLARE_WAITQUEUE(wait, current);
struct page *page;
- struct swap_info_struct *si = NULL;
+ struct swap_info_struct *si = NULL, *stable_si;
rmap_t rmap_flags = RMAP_NONE;
bool need_clear_cache = false;
bool swapoff_locked = false;
bool exclusive = false;
- swp_entry_t entry;
+ swp_entry_t orig_entry, entry;
swp_slot_t slot;
pte_t pte;
vm_fault_t ret = 0;
@@ -4330,6 +4330,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
entry = pte_to_swp_entry(vmf->orig_pte);
+ /*
+ * entry might change if we get a large folio - remember the original entry
+ * for unlocking swapoff etc.
+ */
+ orig_entry = entry;
if (unlikely(non_swap_entry(entry))) {
if (is_migration_entry(entry)) {
migration_entry_wait(vma->vm_mm, vmf->pmd,
@@ -4387,7 +4392,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
swapcache = folio;
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
+ if (si && data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
/* skip swapcache */
folio = alloc_swap_folio(vmf);
@@ -4597,27 +4602,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* swapcache -> certainly exclusive.
*/
exclusive = true;
- } else if (exclusive && folio_test_writeback(folio) &&
- data_race(si->flags & SWP_STABLE_WRITES)) {
+ } else if (exclusive && folio_test_writeback(folio)) {
/*
- * This is tricky: not all swap backends support
- * concurrent page modifications while under writeback.
- *
- * So if we stumble over such a page in the swapcache
- * we must not set the page exclusive, otherwise we can
- * map it writable without further checks and modify it
- * while still under writeback.
- *
- * For these problematic swap backends, simply drop the
- * exclusive marker: this is perfectly fine as we start
- * writeback only if we fully unmapped the page and
- * there are no unexpected references on the page after
- * unmapping succeeded. After fully unmapped, no
- * further GUP references (FOLL_GET and FOLL_PIN) can
- * appear, so dropping the exclusive marker and mapping
- * it only R/O is fine.
+ * We need to look up the swap device again here, for the virtual
+ * swap case. The si we got from trylock_swapoff() is not
+ * guaranteed to be stable, as at that time we have not pinned
+ * the virtual swap slot's backing storage. With the folio locked
+ * and loaded into the swap cache, we can now guarantee a stable
+ * backing state.
*/
- exclusive = false;
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP))
+ stable_si = vswap_get_device(entry);
+ else
+ stable_si = si;
+ if (stable_si && data_race(stable_si->flags & SWP_STABLE_WRITES)) {
+ /*
+ * This is tricky: not all swap backends support
+ * concurrent page modifications while under writeback.
+ *
+ * So if we stumble over such a page in the swapcache
+ * we must not set the page exclusive, otherwise we can
+ * map it writable without further checks and modify it
+ * while still under writeback.
+ *
+ * For these problematic swap backends, simply drop the
+ * exclusive marker: this is perfectly fine as we start
+ * writeback only if we fully unmapped the page and
+ * there are no unexpected references on the page after
+ * unmapping succeeded. After fully unmapped, no
+ * further GUP references (FOLL_GET and FOLL_PIN) can
+ * appear, so dropping the exclusive marker and mapping
+ * it only R/O is fine.
+ */
+ exclusive = false;
+ }
+
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && stable_si)
+ swap_slot_put_swap_info(stable_si);
}
}
@@ -4726,7 +4747,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
wake_up(&swapcache_wq);
}
if (swapoff_locked)
- unlock_swapoff(entry, si);
+ unlock_swapoff(orig_entry, si);
return ret;
out_nomap:
if (vmf->pte)
@@ -4745,7 +4766,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
wake_up(&swapcache_wq);
}
if (swapoff_locked)
- unlock_swapoff(entry, si);
+ unlock_swapoff(orig_entry, si);
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 97f0edf0c032..3a2cf62f47ea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -523,6 +523,7 @@ static int __folio_migrate_mapping(struct address_space *mapping,
if (folio_test_swapcache(folio)) {
folio_set_swapcache(newfolio);
newfolio->private = folio_get_private(folio);
+ vswap_migrate(folio, newfolio);
entries = nr;
} else {
entries = 1;
diff --git a/mm/page_io.c b/mm/page_io.c
index 182851c47f43..83fc4a466db8 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -201,6 +201,12 @@ static bool is_folio_zero_filled(struct folio *folio)
return true;
}
+#ifdef CONFIG_VIRTUAL_SWAP
+static void swap_zeromap_folio_clear(struct folio *folio)
+{
+ vswap_store_folio(folio->swap, folio);
+}
+#else
static void swap_zeromap_folio_set(struct folio *folio)
{
struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
@@ -238,6 +244,7 @@ static void swap_zeromap_folio_clear(struct folio *folio)
clear_bit(swp_slot_offset(slot), sis->zeromap);
}
}
+#endif /* CONFIG_VIRTUAL_SWAP */
/*
* We may have stale swap cache pages in memory: notice
@@ -246,6 +253,7 @@ static void swap_zeromap_folio_clear(struct folio *folio)
int swap_writepage(struct page *page, struct writeback_control *wbc)
{
struct folio *folio = page_folio(page);
+ swp_slot_t slot;
int ret;
if (folio_free_swap(folio)) {
@@ -275,9 +283,8 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
return 0;
} else {
/*
- * Clear bits this folio occupies in the zeromap to prevent
- * zero data being read in from any previous zero writes that
- * occupied the same swap entries.
+ * Clear the zeromap state to prevent zero data being read in from any
+ * previous zero writes that occupied the same swap entries.
*/
swap_zeromap_folio_clear(folio);
}
@@ -291,6 +298,13 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
return AOP_WRITEPAGE_ACTIVATE;
}
+ /* fall back to physical swap device */
+ slot = vswap_alloc_swap_slot(folio);
+ if (!slot.val) {
+ folio_mark_dirty(folio);
+ return AOP_WRITEPAGE_ACTIVATE;
+ }
+
__swap_writepage(folio, wbc);
return 0;
}
@@ -624,14 +638,11 @@ static void swap_read_folio_bdev_async(struct folio *folio,
void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis =
- swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap));
- bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
- bool workingset = folio_test_workingset(folio);
+ struct swap_info_struct *sis;
+ bool synchronous, workingset = folio_test_workingset(folio);
unsigned long pflags;
bool in_thrashing;
- VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio);
@@ -657,6 +668,10 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
/* We have to read from slower devices. Increase zswap protection. */
zswap_folio_swapin(folio);
+ sis = swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap));
+ synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio);
+
if (data_race(sis->flags & SWP_FS_OPS)) {
swap_read_folio_fs(folio, plug);
} else if (synchronous) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 4c00b4673468..609971a2b365 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1404,7 +1404,7 @@ static int shmem_find_swap_entries(struct address_space *mapping,
* swapin error entries can be found in the mapping. But they're
* deliberately ignored here as we've done everything we can do.
*/
- if (swp_slot_type(slot) != type)
+ if (!slot.val || swp_slot_type(slot) != type)
continue;
indices[folio_batch_count(fbatch)] = xas.xa_index;
@@ -1554,7 +1554,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
if ((info->flags & VM_LOCKED) || sbinfo->noswap)
goto redirty;
- if (!total_swap_pages)
+ if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP) && !total_swap_pages)
goto redirty;
/*
@@ -2295,7 +2295,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
fallback_order0 = true;
/* Skip swapcache for synchronous device. */
- if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
+ if (!fallback_order0 && si &&
+ data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
if (!IS_ERR(folio)) {
skip_swapcache = true;
diff --git a/mm/swap.h b/mm/swap.h
index 31c94671cb44..411282d08a15 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -86,9 +86,18 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap);
+ /*
+ * In the virtual swap implementation, the folio might not be backed by any
+ * physical swap slots (e.g. if it is only backed by zswap).
+ */
+ if (!swp_slot.val)
+ return 0;
return swap_slot_swap_info(swp_slot)->flags;
}
+#ifdef CONFIG_VIRTUAL_SWAP
+int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap);
+#else
/*
* Return the count of contiguous swap entries that share the same
* zeromap status as the starting entry. If is_zeromap is not NULL,
@@ -114,6 +123,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
else
return find_next_bit(sis->zeromap, end, start) - start;
}
+#endif
#else /* CONFIG_SWAP */
struct swap_iocb;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 16abdb5ce07a..19c0c01f3c6b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -490,6 +490,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
for (;;) {
int err;
+
/*
* First check the swap cache. Since this is normally
* called after swap_cache_get_folio() failed, re-calling
@@ -527,8 +528,20 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* Swap entry may have been freed since our caller observed it.
*/
err = swapcache_prepare(entry, 1);
- if (!err)
+ if (!err) {
+ /* This might be invoked by swap_cluster_readahead(), which can
+ * race with shmem_swapin_folio(). The latter might have already
+ * called delete_from_swap_cache(), allowing swapcache_prepare()
+ * to succeed here. This can lead to reading bogus data to populate
+ * the page. To prevent this, skip folio-backed virtual swap slots,
+ * and let caller retry if necessary.
+ */
+ if (vswap_folio_backed(entry, 1)) {
+ swapcache_clear(si, entry, 1);
+ goto put_and_return;
+ }
break;
+ }
else if (err != -EEXIST)
goto put_and_return;
@@ -711,6 +724,14 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
struct swap_iocb *splug = NULL;
bool page_allocated;
+ /*
+ * If swap is virtualized, the swap entry might not be backed by any
+ * physical swap slot. In that case, just skip readahead and bring in the
+ * target entry.
+ */
+ if (!slot.val)
+ goto skip;
+
mask = swapin_nr_pages(offset) - 1;
if (!mask)
goto skip;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c09011867263..83016d86eb1c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1164,8 +1164,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
{
unsigned long end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
- unsigned int i;
#ifndef CONFIG_VIRTUAL_SWAP
+ unsigned int i;
unsigned long begin = offset;
/*
@@ -1173,16 +1173,20 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
* slots. We will clear the shadow when the virtual swap slots are freed.
*/
clear_shadow_from_swap_cache(si->type, begin, end);
-#endif
/*
* Use atomic clear_bit operations only on zeromap instead of non-atomic
* bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
+ *
+ * Note that in the virtual swap implementation, we do not need to perform
+ * these operations, since zswap and zero-filled pages are not backed by
+ * any physical swapfile.
*/
for (i = 0; i < nr_entries; i++) {
clear_bit(offset + i, si->zeromap);
zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i)));
}
+#endif
if (si->flags & SWP_BLKDEV)
swap_slot_free_notify =
@@ -1646,43 +1650,35 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
{
swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages);
}
-#endif
/*
* This should only be called in contexts in which the slot has
* been allocated but not associated with any swap entries.
*/
-void swap_slot_put_folio(swp_slot_t slot, struct folio *folio)
+void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
+ swp_slot_t slot = swp_entry_to_swp_slot(entry);
unsigned long offset = swp_slot_offset(slot);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
int size = 1 << swap_slot_order(folio_order(folio));
- unsigned char usage;
si = _swap_info_get(slot);
if (!si)
return;
-#ifdef CONFIG_VIRTUAL_SWAP
- usage = SWAP_MAP_ALLOCATED;
-#else
- usage = SWAP_HAS_CACHE;
-#endif
-
ci = lock_cluster(si, offset);
if (swap_is_has_cache(si, offset, size))
swap_slot_range_free(si, ci, slot, size);
else {
for (int i = 0; i < size; i++, slot.val++) {
- if (!__swap_slot_free_locked(si, offset + i, usage))
+ if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE))
swap_slot_range_free(si, ci, slot, 1);
}
}
unlock_cluster(ci);
}
-#ifndef CONFIG_VIRTUAL_SWAP
int __swap_count(swp_entry_t entry)
{
swp_slot_t slot = swp_entry_to_swp_slot(entry);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7..db4178bf5f6f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -341,10 +341,15 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
{
if (memcg == NULL) {
/*
- * For non-memcg reclaim, is there
- * space in any swap device?
+ * For non-memcg reclaim:
+ *
+ * If swap is virtualized, we can still use zswap even if there is no
+ * space left in any swap file/partition.
+ *
+ * Otherwise, check whether there is space in any swap device.
*/
- if (get_nr_swap_pages() > 0)
+ if ((IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled()) ||
+ get_nr_swap_pages() > 0)
return true;
} else {
/* Is the memcg below its swap limit? */
@@ -2611,12 +2616,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
static bool can_age_anon_pages(struct pglist_data *pgdat,
struct scan_control *sc)
{
- /* Aging the anon LRU is valuable if swap is present: */
- if (total_swap_pages > 0)
- return true;
-
- /* Also valuable if anon pages can be demoted: */
- return can_demote(pgdat->node_id, sc);
+ /*
+ * Aging the anon LRU is valuable if:
+ * 1. Swap is virtualized and zswap is enabled.
+ * 2. There are physical swap slots available.
+ * 3. Anon pages can be demoted.
+ */
+ return (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled()) ||
+ total_swap_pages > 0 ||
+ can_demote(pgdat->node_id, sc);
}
#ifdef CONFIG_LRU_GEN
diff --git a/mm/vswap.c b/mm/vswap.c
index 513d000a134c..a42d346b7e93 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -34,26 +34,59 @@
* about to be added to the swap cache). Its reference count is incremented or
* decremented every time it is mapped to or unmapped from a PTE, as well as
* when it is added to or removed from the swap cache. Finally, when its
- * reference count reaches 0, the virtual swap slot is freed.
+ * reference count reaches 0, the virtual swap slot is freed, and its backing
+ * store released.
+ *
+ *
+ * II. Backing State
+ *
+ * Each virtual swap slot can be backed by one of:
+ *
+ * 1. A slot on a physical swap device (i.e a swapfile or a swap partition).
+ * 2. A swapped out zero-filled page.
+ * 3. A compressed object in zswap.
+ * 4. An in-memory folio that is backed by neither a physical swap device
+ * nor zswap (i.e. it exists only in the swap cache). This is used for pages
+ * rejected by zswap but not (yet) backed by a physical swap device (for
+ * example, due to zswap.writeback = 0), or for pages that were previously
+ * stored in zswap but have since been loaded back into memory (and have had
+ * their zswap copy invalidated).
*/
+/* The backing state options of a virtual swap slot */
+enum swap_type {
+ VSWAP_SWAPFILE,
+ VSWAP_ZERO,
+ VSWAP_ZSWAP,
+ VSWAP_FOLIO
+};
+
/**
* Swap descriptor - metadata of a swapped out page.
*
* @slot: The handle to the physical swap slot backing this page.
* @rcu: The RCU head to free the descriptor with an RCU grace period.
* @lock: The lock protecting the swap slot backing field.
+ * @folio: The folio that backs the virtual swap slot.
+ * @zswap_entry: The zswap entry that backs the virtual swap slot.
+ * @lock: The lock protecting the swap slot backing fields.
* @memcgid: The memcg id of the owning memcg, if any.
+ * @type: The backing store type of the swap entry.
* @swap_refs: This field stores all the references to the swap entry. The
* least significant bit indicates whether the swap entry is (about
* to be) pinned in swap cache. The remaining bits tell us the
* number of page table entries that refer to the swap entry.
*/
struct swp_desc {
- swp_slot_t slot;
+ union {
+ swp_slot_t slot;
+ struct folio *folio;
+ struct zswap_entry *zswap_entry;
+ };
struct rcu_head rcu;
rwlock_t lock;
+ enum swap_type type;
#ifdef CONFIG_MEMCG
atomic_t memcgid;
@@ -157,6 +190,7 @@ static swp_entry_t vswap_alloc(int nr)
}
for (i = 0; i < nr; i++) {
+ descs[i]->type = VSWAP_SWAPFILE;
descs[i]->slot.val = 0;
atomic_set(&descs[i]->memcgid, 0);
/* swap entry is about to be added to the swap cache */
@@ -244,6 +278,72 @@ static inline void release_vswap_slot(unsigned long index)
atomic_dec(&vswap_used);
}
+/*
+ * Caller needs to handle races with other operations themselves.
+ *
+ * For instance, this function is safe to be called in contexts where the swap
+ * entry has been added to the swap cache and the associated folio is locked.
+ * We cannot race with other accessors, and the swap entry is guaranteed to be
+ * valid the whole time (since swap cache implies one refcount).
+ *
+ * We also need to make sure the backing state of the entire range matches.
+ * This is usually already checked by upstream callers.
+ */
+static inline void release_backing(swp_entry_t entry, int nr)
+{
+ swp_slot_t slot = (swp_slot_t){0};
+ struct swap_info_struct *si;
+ struct folio *folio = NULL;
+ enum swap_type type;
+ struct swp_desc *desc;
+ int i = 0;
+
+ VM_WARN_ON(!entry.val);
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ VM_WARN_ON(!desc);
+ write_lock(&desc->lock);
+ if (!i) {
+ type = desc->type;
+ if (type == VSWAP_FOLIO)
+ folio = desc->folio;
+ else if (type == VSWAP_SWAPFILE)
+ slot = desc->slot;
+ } else {
+ VM_WARN_ON(type != desc->type);
+ VM_WARN_ON(type == VSWAP_FOLIO && desc->folio != folio);
+ VM_WARN_ON(type == VSWAP_SWAPFILE && slot.val &&
+ desc->slot.val != slot.val + i);
+ }
+
+ if (desc->type == VSWAP_ZSWAP)
+ zswap_invalidate((swp_entry_t){entry.val + i});
+ else if (desc->type == VSWAP_SWAPFILE) {
+ if (desc->slot.val) {
+ xa_erase(&vswap_rmap, desc->slot.val);
+ desc->slot.val = 0;
+ }
+ }
+ write_unlock(&desc->lock);
+ i++;
+ }
+ rcu_read_unlock();
+
+ if (slot.val) {
+ si = swap_slot_tryget_swap_info(slot);
+ if (si) {
+ swap_slot_free_nr(slot, nr);
+ swap_slot_put_swap_info(si);
+ }
+ }
+}
+
/**
* vswap_free - free a virtual swap slot.
* @id: the virtual swap slot to free
@@ -257,52 +357,88 @@ static void vswap_free(swp_entry_t entry)
/* do not immediately erase the virtual slot to prevent its reuse */
desc = xa_load(&vswap_map, entry.val);
- if (!desc)
- return;
virt_clear_shadow_from_swap_cache(entry);
-
- if (desc->slot.val) {
- /* we only charge after linkage was established */
- mem_cgroup_uncharge_swap(entry, 1);
- xa_erase(&vswap_rmap, desc->slot.val);
- swap_slot_free_nr(desc->slot, 1);
- }
-
+ release_backing(entry, 1);
+ mem_cgroup_uncharge_swap(entry, 1);
/* erase forward mapping and release the virtual slot for reallocation */
release_vswap_slot(entry.val);
kfree_rcu(desc, rcu);
}
/**
- * folio_alloc_swap - allocate virtual swap slots for a folio.
- * @folio: the folio.
+ * folio_alloc_swap - allocate virtual swap slots for a folio, and
+ * set their backing store to the folio.
+ * @folio: the folio to allocate virtual swap slots for.
*
* Return: the first allocated slot if success, or the zero virtuals swap slot
* on failure.
*/
swp_entry_t folio_alloc_swap(struct folio *folio)
{
- int i, err, nr = folio_nr_pages(folio);
- bool manual_freeing = true;
- struct swp_desc *desc;
swp_entry_t entry;
- swp_slot_t slot;
+ struct swp_desc *desc;
+ int i, nr = folio_nr_pages(folio);
entry = vswap_alloc(nr);
if (!entry.val)
return entry;
/*
- * XXX: for now, we always allocate a physical swap slot for each virtual
- * swap slot, and their lifetime are coupled. This will change once we
- * decouple virtual swap slots from their backing states, and only allocate
- * physical swap slots for them on demand (i.e on zswap writeback, or
- * fallback from zswap store failure).
+ * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
+ * swap slots allocation. This will be changed soon - we will only charge on
+ * physical swap slots allocation.
+ */
+ if (mem_cgroup_try_charge_swap(folio, entry)) {
+ for (i = 0; i < nr; i++) {
+ vswap_free(entry);
+ entry.val++;
+ }
+ atomic_add(nr, &vswap_alloc_reject);
+ entry.val = 0;
+ return entry;
+ }
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ desc->folio = folio;
+ desc->type = VSWAP_FOLIO;
+ }
+ rcu_read_unlock();
+ return entry;
+}
+
+/**
+ * vswap_alloc_swap_slot - allocate physical swap space for a folio that is
+ * already associated with virtual swap slots.
+ * @folio: folio we want to allocate physical swap space for.
+ *
+ * Return: the first allocated physical swap slot, or the zero slot on failure.
+ */
+swp_slot_t vswap_alloc_swap_slot(struct folio *folio)
+{
+ int i, err, nr = folio_nr_pages(folio);
+ swp_slot_t slot = { .val = 0 };
+ swp_entry_t entry = folio->swap;
+ struct swp_desc *desc;
+ bool fallback = false;
+
+ /*
+ * We might have already allocated a backing physical swap slot in past
+ * attempts (for instance, when we disable zswap).
*/
+ slot = swp_entry_to_swp_slot(entry);
+ if (slot.val)
+ return slot;
+
slot = folio_alloc_swap_slot(folio);
if (!slot.val)
- goto vswap_free;
+ return slot;
/* establish the vrtual <-> physical swap slots linkages. */
for (i = 0; i < nr; i++) {
@@ -312,7 +448,13 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
if (err) {
while (--i >= 0)
xa_erase(&vswap_rmap, slot.val + i);
- goto put_physical_swap;
+ /*
+ * We have not updated the backing type of the virtual swap slot.
+ * Simply free up the physical swap slots here!
+ */
+ swap_slot_free_nr(slot, nr);
+ slot.val = 0;
+ return slot;
}
}
@@ -324,36 +466,31 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
if (xas_retry(&xas, desc))
continue;
+ write_lock(&desc->lock);
+ if (desc->type == VSWAP_FOLIO) {
+ /* case 1: fallback from zswap store failure */
+ fallback = true;
+ if (!folio)
+ folio = desc->folio;
+ else
+ VM_WARN_ON(folio != desc->folio);
+ } else {
+ /*
+ * Case 2: zswap writeback.
+ *
+ * No need to free the zswap entry here - it will be freed once zswap
+ * writeback succeeds.
+ */
+ VM_WARN_ON(desc->type != VSWAP_ZSWAP);
+ VM_WARN_ON(fallback);
+ }
+ desc->type = VSWAP_SWAPFILE;
desc->slot.val = slot.val + i;
+ write_unlock(&desc->lock);
i++;
}
rcu_read_unlock();
-
- manual_freeing = false;
- /*
- * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
- * swap slots allocation. This is acceptable because as noted above, each
- * virtual swap slot corresponds to a physical swap slot. Once we have
- * decoupled virtual and physical swap slots, we will only charge when we
- * actually allocate a physical swap slot.
- */
- if (!mem_cgroup_try_charge_swap(folio, entry))
- return entry;
-
-put_physical_swap:
- /*
- * There is no any linkage between virtual and physical swap slots yet. We
- * have to manually and separately free the allocated virtual and physical
- * swap slots.
- */
- swap_slot_put_folio(slot, folio);
-vswap_free:
- if (manual_freeing) {
- for (i = 0; i < nr; i++)
- vswap_free((swp_entry_t){entry.val + i});
- }
- entry.val = 0;
- return entry;
+ return slot;
}
/**
@@ -361,7 +498,9 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
* virtual swap slot.
* @entry: the virtual swap slot.
*
- * Return: the physical swap slot corresponding to the virtual swap slot.
+ * Return: the physical swap slot corresponding to the virtual swap slot if
+ * one exists, or the zero physical swap slot if the virtual swap slot is not
+ * backed by any physical slot on a swapfile.
*/
swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
{
@@ -379,7 +518,10 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
}
read_lock(&desc->lock);
- slot = desc->slot;
+ if (desc->type != VSWAP_SWAPFILE)
+ slot = (swp_slot_t){0};
+ else
+ slot = desc->slot;
read_unlock(&desc->lock);
rcu_read_unlock();
@@ -693,6 +835,286 @@ int non_swapcache_batch(swp_entry_t entry, int max_nr)
return i;
}
+/**
+ * vswap_split_huge_page - update a subpage's swap descriptor to point to the
+ * newly split-out subpage folio.
+ * @head: the original (pre-split) head folio.
+ * @subpage: the split-out subpage folio.
+ */
+void vswap_split_huge_page(struct folio *head, struct folio *subpage)
+{
+ struct swp_desc *desc = xa_load(&vswap_map, subpage->swap.val);
+
+ write_lock(&desc->lock);
+ if (desc->type == VSWAP_FOLIO) {
+ VM_WARN_ON(desc->folio != head);
+ desc->folio = subpage;
+ }
+ write_unlock(&desc->lock);
+}
+
+/**
+ * vswap_migrate - update the swap entries of the original folio to refer to
+ * the new folio for migration.
+ * @src: the source (old) folio.
+ * @dst: the destination (new) folio.
+ */
+void vswap_migrate(struct folio *src, struct folio *dst)
+{
+ long nr = folio_nr_pages(src), nr_folio_backed = 0;
+ struct swp_desc *desc;
+
+ VM_WARN_ON(!folio_test_locked(src));
+ VM_WARN_ON(!folio_test_swapcache(src));
+
+ XA_STATE(xas, &vswap_map, src->swap.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, src->swap.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ write_lock(&desc->lock);
+ if (desc->type == VSWAP_FOLIO) {
+ VM_WARN_ON(desc->folio != src);
+ desc->folio = dst;
+ nr_folio_backed++;
+ }
+ write_unlock(&desc->lock);
+ }
+ rcu_read_unlock();
+
+ /* we should not see mixed backing states for swap entries in swap cache */
+ VM_WARN_ON(nr_folio_backed && nr_folio_backed != nr);
+}
+
+/**
+ * vswap_store_folio - set a folio as the backing of a range of virtual swap
+ * slots.
+ * @entry: the first virtual swap slot in the range.
+ * @folio: the folio.
+ */
+void vswap_store_folio(swp_entry_t entry, struct folio *folio)
+{
+ int nr = folio_nr_pages(folio);
+ struct swp_desc *desc;
+
+ VM_BUG_ON(!folio_test_locked(folio));
+ VM_BUG_ON(folio->swap.val != entry.val);
+
+ release_backing(entry, nr);
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ write_lock(&desc->lock);
+ desc->type = VSWAP_FOLIO;
+ desc->folio = folio;
+ write_unlock(&desc->lock);
+ }
+ rcu_read_unlock();
+}
+
+/**
+ * vswap_assoc_zswap - associate a virtual swap slot to a zswap entry.
+ * @entry: the virtual swap slot.
+ * @zswap_entry: the zswap entry.
+ */
+void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry)
+{
+ struct swp_desc *desc;
+
+ release_backing(entry, 1);
+
+ desc = xa_load(&vswap_map, entry.val);
+ write_lock(&desc->lock);
+ desc->type = VSWAP_ZSWAP;
+ desc->zswap_entry = zswap_entry;
+ write_unlock(&desc->lock);
+}
+
+/**
+ * swap_zeromap_folio_set - mark a range of virtual swap slots corresponding to
+ * a folio as zero-filled.
+ * @folio: the folio
+ */
+void swap_zeromap_folio_set(struct folio *folio)
+{
+ struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
+ swp_entry_t entry = folio->swap;
+ int nr = folio_nr_pages(folio);
+ struct swp_desc *desc;
+
+ VM_BUG_ON(!folio_test_locked(folio));
+ VM_BUG_ON(!entry.val);
+
+ release_backing(entry, nr);
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ write_lock(&desc->lock);
+ desc->type = VSWAP_ZERO;
+ write_unlock(&desc->lock);
+ }
+ rcu_read_unlock();
+
+ count_vm_events(SWPOUT_ZERO, nr);
+ if (objcg) {
+ count_objcg_events(objcg, SWPOUT_ZERO, nr);
+ obj_cgroup_put(objcg);
+ }
+}
+
+/*
+ * Iterate through the entire range of virtual swap slots, returning the length
+ * of the longest prefix of slots (starting from the first slot) that satisfies:
+ *
+ * 1. If the first slot is zero-mapped, the entire range should be
+ * zero-mapped.
+ * 2. If the first slot is backed by a swapfile, the entire range should
+ * be backed by a range of contiguous swap slots on the same swapfile.
+ * 3. If the first slot is zswap-backed, the entire range should be
+ * zswap-backed.
+ * 4. If the first slot is backed by a folio, the entire range should
+ * be backed by the same folio.
+ *
+ * Note that this check is racy unless we can ensure that the entire range
+ * has their backing state stable - for instance, if the caller was the one
+ * who set the in_swapcache flag of the entire field.
+ */
+static int vswap_check_backing(swp_entry_t entry, enum swap_type *type, int nr)
+{
+ unsigned int swapfile_type;
+ enum swap_type first_type;
+ struct swp_desc *desc;
+ pgoff_t first_offset;
+ struct folio *folio;
+ int i = 0;
+
+ if (!entry.val || non_swap_entry(entry))
+ return 0;
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ if (!desc)
+ goto done;
+
+ read_lock(&desc->lock);
+ if (!i) {
+ first_type = desc->type;
+ if (first_type == VSWAP_SWAPFILE) {
+ swapfile_type = swp_slot_type(desc->slot);
+ first_offset = swp_slot_offset(desc->slot);
+ } else if (first_type == VSWAP_FOLIO) {
+ folio = desc->folio;
+ }
+ } else if (desc->type != first_type) {
+ read_unlock(&desc->lock);
+ goto done;
+ } else if (first_type == VSWAP_SWAPFILE &&
+ (swp_slot_type(desc->slot) != swapfile_type ||
+ swp_slot_offset(desc->slot) != first_offset + i)) {
+ read_unlock(&desc->lock);
+ goto done;
+ } else if (first_type == VSWAP_FOLIO && desc->folio != folio) {
+ read_unlock(&desc->lock);
+ goto done;
+ }
+ read_unlock(&desc->lock);
+ i++;
+ }
+done:
+ rcu_read_unlock();
+ if (type)
+ *type = first_type;
+ return i;
+}
+
+/**
+ * vswap_disk_backed - check if the virtual swap slots are backed by physical
+ * swap slots.
+ * @entry: the first entry in the range.
+ * @nr: the number of entries in the range.
+ */
+bool vswap_disk_backed(swp_entry_t entry, int nr)
+{
+ enum swap_type type;
+
+ return vswap_check_backing(entry, &type, nr) == nr &&
+ type == VSWAP_SWAPFILE;
+}
+
+/**
+ * vswap_folio_backed - check if the virtual swap slots are backed by in-memory
+ * pages.
+ * @entry: the first virtual swap slot in the range.
+ * @nr: the number of slots in the range.
+ */
+bool vswap_folio_backed(swp_entry_t entry, int nr)
+{
+ enum swap_type type;
+
+ return vswap_check_backing(entry, &type, nr) == nr &&
+ type == VSWAP_FOLIO;
+}
+
+/*
+ * Return the count of contiguous swap entries that share the same
+ * VSWAP_ZERO status as the starting entry. If is_zeromap is not NULL,
+ * it will return the VSWAP_ZERO status of the starting entry.
+ */
+int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap)
+{
+ struct swp_desc *desc;
+ int i = 0;
+ bool is_zero = false;
+
+ VM_WARN_ON(!entry.val || non_swap_entry(entry));
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + max_nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ if (!desc)
+ goto done;
+
+ read_lock(&desc->lock);
+ if (!i) {
+ is_zero = (desc->type == VSWAP_ZERO);
+ } else {
+ if ((desc->type == VSWAP_ZERO) != is_zero) {
+ read_unlock(&desc->lock);
+ goto done;
+ }
+ }
+ read_unlock(&desc->lock);
+ i++;
+ }
+done:
+ rcu_read_unlock();
+ if (i && is_zeromap)
+ *is_zeromap = is_zero;
+
+ return i;
+}
+
/**
* free_swap_and_cache_nr() - Release a swap count on range of swap entries and
* reclaim their cache if no more references remain.
diff --git a/mm/zswap.c b/mm/zswap.c
index c1327569ce80..15429825d667 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1068,6 +1068,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
struct writeback_control wbc = {
.sync_mode = WB_SYNC_NONE,
};
+ struct zswap_entry *new_entry;
+ swp_slot_t slot;
/* try to allocate swap cache folio */
mpol = get_task_policy(current);
@@ -1088,6 +1090,10 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
return -EEXIST;
}
+ slot = vswap_alloc_swap_slot(folio);
+ if (!slot.val)
+ goto release_folio;
+
/*
* folio is locked, and the swapcache is now secured against
* concurrent swapping to and from the slot, and concurrent
@@ -1098,12 +1104,9 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
* be dereferenced.
*/
tree = swap_zswap_tree(swpentry);
- if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) {
- delete_from_swap_cache(folio);
- folio_unlock(folio);
- folio_put(folio);
- return -ENOMEM;
- }
+ new_entry = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL);
+ if (entry != new_entry)
+ goto fail;
zswap_decompress(entry, folio);
@@ -1124,6 +1127,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
folio_put(folio);
return 0;
+
+fail:
+ vswap_assoc_zswap(swpentry, new_entry);
+release_folio:
+ delete_from_swap_cache(folio);
+ folio_unlock(folio);
+ folio_put(folio);
+ return -ENOMEM;
}
/*********************************
@@ -1487,6 +1498,8 @@ static bool zswap_store_page(struct page *page,
goto store_failed;
}
+ vswap_assoc_zswap(page_swpentry, entry);
+
/*
* We may have had an existing entry that became stale when
* the folio was redirtied and now the new version is being
@@ -1608,7 +1621,7 @@ bool zswap_store(struct folio *folio)
*/
if (!ret) {
unsigned type = swp_type(swp);
- pgoff_t offset = swp_offset(swp);
+ pgoff_t offset = zswap_tree_index(swp);
struct zswap_entry *entry;
struct xarray *tree;
@@ -1618,6 +1631,12 @@ bool zswap_store(struct folio *folio)
if (entry)
zswap_entry_free(entry);
}
+
+ /*
+ * We might have also partially associated some virtual swap slots with
+ * zswap entries. Undo this.
+ */
+ vswap_store_folio(swp, folio);
}
return ret;
@@ -1674,6 +1693,7 @@ bool zswap_load(struct folio *folio)
count_objcg_events(entry->objcg, ZSWPIN, 1);
if (swapcache) {
+ vswap_store_folio(swp, folio);
zswap_entry_free(entry);
folio_mark_dirty(folio);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 13/18] zswap: do not start zswap shrinker if there is no physical swap slots
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (11 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 12/18] mm: swap: decouple virtual swap slot from backing store Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 14/18] memcg: swap: only charge " Nhat Pham
` (6 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
When swap is virtualized, we no longer pre-allocate a slot on the swapfile
for each zswap entry. Do not start the zswap shrinker if there are no
physical swap slots available, since writeback would have nowhere to go.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/zswap.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/mm/zswap.c b/mm/zswap.c
index 15429825d667..f2f412cc1911 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1277,6 +1277,14 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
return 0;
+ /*
+ * When swap is virtualized, we do not have any swap slots on the swapfile
+ * preallocated for zswap objects. If there are no slots available, we
+ * cannot write back and should just bail out here.
+ */
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && !get_nr_swap_pages())
+ return 0;
+
/*
* The shrinker resumes swap writeback, which will enter block
* and may enter fs. XXX: Harmonize with vmscan.c __GFP_FS
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 14/18] memcg: swap: only charge physical swap slots
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (12 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 13/18] zswap: do not start zswap shrinker if there is no physical swap slots Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 15/18] vswap: support THP swapin and batch free_swap_and_cache Nhat Pham
` (5 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
Now that zswap and the zero-filled swap page optimization no longer take
up any physical swap space, we should not charge towards the swap usage
and limits of the memcg in these cases. We will only record the memcg id
on virtual swap slot allocation, and defer physical swap charging (i.e.
towards memory.swap.current) until the virtual swap slot is backed by an
actual physical swap slot (on fallback from zswap store failure, or on
zswap writeback).
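For review purposes, the intended two-phase accounting can be summarized
by the stand-alone sketch below. The struct and helper names here
(memcg_stub, record_swap_owner(), charge_physical_swap()) are illustrative
stand-ins for this sketch only; the actual changes are to
mem_cgroup_record_swap() (called from folio_alloc_swap()) and
mem_cgroup_try_charge_swap() (called from vswap_alloc_swap_slot()), as
shown in the diff below.

#include <stdbool.h>

/* Illustrative stand-ins for the memcg swap accounting state. */
struct memcg_stub {
	unsigned short id;	/* cgroup id recorded for the swap entry */
	long swap_current;	/* memory.swap.current, in pages */
	long swap_max;		/* memory.swap.max, in pages */
};

static unsigned short swap_owner_id;	/* stand-in for the swap cgroup map */

/* Phase 1: virtual swap slot allocation - only remember the owning cgroup. */
static void record_swap_owner(struct memcg_stub *memcg)
{
	swap_owner_id = memcg->id;
}

/*
 * Phase 2: physical swap slot allocation (zswap writeback, or fallback from
 * a zswap store failure) - only now charge against the swap limit.
 */
static bool charge_physical_swap(struct memcg_stub *memcg, long nr_pages)
{
	if (memcg->swap_current + nr_pages > memcg->swap_max)
		return false;	/* the real code emits MEMCG_SWAP_MAX/FAIL here */
	memcg->swap_current += nr_pages;
	return true;
}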
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 17 ++++++++
mm/memcontrol.c | 102 ++++++++++++++++++++++++++++++++++---------
mm/vswap.c | 43 ++++++++----------
3 files changed, 118 insertions(+), 44 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9c92a982d546..a65b22de4cdd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -690,6 +690,23 @@ static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp)
#if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP)
void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry);
+
+void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry);
+static inline void mem_cgroup_record_swap(struct folio *folio,
+ swp_entry_t entry)
+{
+ if (!mem_cgroup_disabled())
+ __mem_cgroup_record_swap(folio, entry);
+}
+
+void __mem_cgroup_unrecord_swap(swp_entry_t entry, unsigned int nr_pages);
+static inline void mem_cgroup_unrecord_swap(swp_entry_t entry,
+ unsigned int nr_pages)
+{
+ if (!mem_cgroup_disabled())
+ __mem_cgroup_unrecord_swap(entry, nr_pages);
+}
+
int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry);
static inline int mem_cgroup_try_charge_swap(struct folio *folio,
swp_entry_t entry)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 126b2d0e6aaa..c6bee12f2016 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5020,6 +5020,46 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
css_put(&memcg->css);
}
+/**
+ * __mem_cgroup_record_swap - record the folio's cgroup for the swap entries.
+ * @folio: folio being swapped out.
+ * @entry: the first swap entry in the range.
+ *
+ * In the virtual swap implementation, we only record the folio's cgroup
+ * for the virtual swap slots on their allocation. We will only charge
+ * physical swap slots towards the cgroup's swap usage, i.e. when physical swap
+ * slots are allocated for zswap writeback or fallback from zswap store
+ * failure.
+ */
+void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry)
+{
+ unsigned int nr_pages = folio_nr_pages(folio);
+ struct mem_cgroup *memcg;
+
+ memcg = folio_memcg(folio);
+
+ VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
+ if (!memcg)
+ return;
+
+ memcg = mem_cgroup_id_get_online(memcg);
+ if (nr_pages > 1)
+ mem_cgroup_id_get_many(memcg, nr_pages - 1);
+ swap_cgroup_record(folio, mem_cgroup_id(memcg), entry);
+}
+
+void __mem_cgroup_unrecord_swap(swp_entry_t entry, unsigned int nr_pages)
+{
+ unsigned short id = swap_cgroup_clear(entry, nr_pages);
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg)
+ mem_cgroup_id_put_many(memcg, nr_pages);
+ rcu_read_unlock();
+}
+
/**
* __mem_cgroup_try_charge_swap - try charging swap space for a folio
* @folio: folio being added to swap
@@ -5038,34 +5078,47 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry)
if (do_memsw_account())
return 0;
- memcg = folio_memcg(folio);
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) {
+ /*
+ * In the virtual swap implementation, we already record the cgroup
+ * on virtual swap allocation. Note that the virtual swap slot holds
+ * a reference to memcg, so this lookup should be safe.
+ */
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(lookup_swap_cgroup_id(entry));
+ rcu_read_unlock();
+ } else {
+ memcg = folio_memcg(folio);
- VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
- if (!memcg)
- return 0;
+ VM_WARN_ON_ONCE_FOLIO(!memcg, folio);
+ if (!memcg)
+ return 0;
- if (!entry.val) {
- memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
- return 0;
- }
+ if (!entry.val) {
+ memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
+ return 0;
+ }
- memcg = mem_cgroup_id_get_online(memcg);
+ memcg = mem_cgroup_id_get_online(memcg);
+ }
if (!mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
memcg_memory_event(memcg, MEMCG_SWAP_MAX);
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
- mem_cgroup_id_put(memcg);
+ if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP))
+ mem_cgroup_id_put(memcg);
return -ENOMEM;
}
- /* Get references for the tail pages, too */
- if (nr_pages > 1)
- mem_cgroup_id_get_many(memcg, nr_pages - 1);
+ if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP)) {
+ /* Get references for the tail pages, too */
+ if (nr_pages > 1)
+ mem_cgroup_id_get_many(memcg, nr_pages - 1);
+ swap_cgroup_record(folio, mem_cgroup_id(memcg), entry);
+ }
mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
- swap_cgroup_record(folio, mem_cgroup_id(memcg), entry);
-
return 0;
}
@@ -5079,7 +5132,11 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
struct mem_cgroup *memcg;
unsigned short id;
- id = swap_cgroup_clear(entry, nr_pages);
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP))
+ id = lookup_swap_cgroup_id(entry);
+ else
+ id = swap_cgroup_clear(entry, nr_pages);
+
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg) {
@@ -5090,7 +5147,8 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
page_counter_uncharge(&memcg->swap, nr_pages);
}
mod_memcg_state(memcg, MEMCG_SWAP, -nr_pages);
- mem_cgroup_id_put_many(memcg, nr_pages);
+ if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP))
+ mem_cgroup_id_put_many(memcg, nr_pages);
}
rcu_read_unlock();
}
@@ -5099,7 +5157,7 @@ static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg);
long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
- long nr_swap_pages, nr_zswap_pages = 0;
+ long nr_swap_pages;
/*
* If swap is virtualized and zswap is enabled, we can still use zswap even
@@ -5108,10 +5166,14 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled() &&
(mem_cgroup_disabled() || do_memsw_account() ||
mem_cgroup_may_zswap(memcg))) {
- nr_zswap_pages = PAGE_COUNTER_MAX;
+ /*
+ * No need to check swap cgroup limits, since zswap is not charged
+ * towards swap consumption.
+ */
+ return PAGE_COUNTER_MAX;
}
- nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages());
+ nr_swap_pages = get_nr_swap_pages();
if (mem_cgroup_disabled() || do_memsw_account())
return nr_swap_pages;
for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
diff --git a/mm/vswap.c b/mm/vswap.c
index a42d346b7e93..c51ff5c54480 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -341,6 +341,7 @@ static inline void release_backing(swp_entry_t entry, int nr)
swap_slot_free_nr(slot, nr);
swap_slot_put_swap_info(si);
}
+ mem_cgroup_uncharge_swap(entry, nr);
}
}
@@ -360,7 +361,7 @@ static void vswap_free(swp_entry_t entry)
virt_clear_shadow_from_swap_cache(entry);
release_backing(entry, 1);
- mem_cgroup_uncharge_swap(entry, 1);
+ mem_cgroup_unrecord_swap(entry, 1);
/* erase forward mapping and release the virtual slot for reallocation */
release_vswap_slot(entry.val);
kfree_rcu(desc, rcu);
@@ -378,27 +379,13 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
{
swp_entry_t entry;
struct swp_desc *desc;
- int i, nr = folio_nr_pages(folio);
+ int nr = folio_nr_pages(folio);
entry = vswap_alloc(nr);
if (!entry.val)
return entry;
- /*
- * XXX: for now, we charge towards the memory cgroup's swap limit on virtual
- * swap slots allocation. This will be changed soon - we will only charge on
- * physical swap slots allocation.
- */
- if (mem_cgroup_try_charge_swap(folio, entry)) {
- for (i = 0; i < nr; i++) {
- vswap_free(entry);
- entry.val++;
- }
- atomic_add(nr, &vswap_alloc_reject);
- entry.val = 0;
- return entry;
- }
-
+ mem_cgroup_record_swap(folio, entry);
XA_STATE(xas, &vswap_map, entry.val);
rcu_read_lock();
@@ -440,6 +427,9 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio)
if (!slot.val)
return slot;
+ if (mem_cgroup_try_charge_swap(folio, entry))
+ goto free_phys_swap;
+
/* establish the vrtual <-> physical swap slots linkages. */
for (i = 0; i < nr; i++) {
err = xa_insert(&vswap_rmap, slot.val + i,
@@ -448,13 +438,7 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio)
if (err) {
while (--i >= 0)
xa_erase(&vswap_rmap, slot.val + i);
- /*
- * We have not updated the backing type of the virtual swap slot.
- * Simply free up the physical swap slots here!
- */
- swap_slot_free_nr(slot, nr);
- slot.val = 0;
- return slot;
+ goto uncharge;
}
}
@@ -491,6 +475,17 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio)
}
rcu_read_unlock();
return slot;
+
+uncharge:
+ mem_cgroup_uncharge_swap(entry, nr);
+free_phys_swap:
+ /*
+ * We have not updated the backing type of the virtual swap slot.
+ * Simply free up the physical swap slots here!
+ */
+ swap_slot_free_nr(slot, nr);
+ slot.val = 0;
+ return slot;
}
/**
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 15/18] vswap: support THP swapin and batch free_swap_and_cache
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (13 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 14/18] memcg: swap: only charge " Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 16/18] swap: simplify swapoff using virtual swap Nhat Pham
` (4 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
This patch implements the required functionality for THP swapin and
batched free_swap_and_cache() in the virtual swap space design.
The central requirement is that the range of entries we are working with
must have no mixed backing states (see the sketch after this list):
1. For now, zswap-backed entries are not supported for these batched
operations.
2. All the entries must be backed by the same backing type.
3. If the swap entries in the batch are backed by an in-memory folio, it
must be the same folio (i.e. they correspond to the subpages of that
folio).
4. If the swap entries in the batch are backed by slots on a swapfile, it
must be the same swapfile, and these physical swap slots must also be
contiguous.
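For illustration only, the backing-state rule above amounts to a scan along
the following lines. This is a stand-alone sketch with simplified, made-up
types (enum backing, struct desc_stub, batch_ok()); the in-tree check is
vswap_check_backing() in mm/vswap.c.

#include <stdbool.h>

enum backing { BACKED_BY_SWAPFILE, BACKED_BY_ZERO, BACKED_BY_ZSWAP, BACKED_BY_FOLIO };

struct desc_stub {
	enum backing type;
	const void *folio;	/* meaningful for BACKED_BY_FOLIO */
	int swapfile;		/* which swapfile, for BACKED_BY_SWAPFILE */
	long offset;		/* slot offset, for BACKED_BY_SWAPFILE */
};

/* Return true if descs[0..nr-1] may be handled as a single batch. */
static bool batch_ok(const struct desc_stub *descs, int nr)
{
	if (descs[0].type == BACKED_BY_ZSWAP)
		return false;				/* rule 1 */

	for (int i = 1; i < nr; i++) {
		if (descs[i].type != descs[0].type)
			return false;			/* rule 2 */
		if (descs[0].type == BACKED_BY_FOLIO &&
		    descs[i].folio != descs[0].folio)
			return false;			/* rule 3 */
		if (descs[0].type == BACKED_BY_SWAPFILE &&
		    (descs[i].swapfile != descs[0].swapfile ||
		     descs[i].offset != descs[0].offset + i))
			return false;			/* rule 4 */
	}
	return true;
}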
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 6 +++
mm/internal.h | 14 +------
mm/memory.c | 16 ++++++--
mm/vswap.c | 91 +++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 110 insertions(+), 17 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a65b22de4cdd..c5a16f1ca376 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -773,6 +773,7 @@ bool vswap_folio_backed(swp_entry_t entry, int nr);
void vswap_store_folio(swp_entry_t entry, struct folio *folio);
void swap_zeromap_folio_set(struct folio *folio);
void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry);
+bool vswap_can_swapin_thp(swp_entry_t entry, int nr);
#else /* CONFIG_VIRTUAL_SWAP */
static inline int vswap_init(void)
{
@@ -839,6 +840,11 @@ static inline void vswap_assoc_zswap(swp_entry_t entry,
struct zswap_entry *zswap_entry)
{
}
+
+static inline bool vswap_can_swapin_thp(swp_entry_t entry, int nr)
+{
+ return true;
+}
#endif /* CONFIG_VIRTUAL_SWAP */
static inline bool trylock_swapoff(swp_entry_t entry,
diff --git a/mm/internal.h b/mm/internal.h
index 51061691a731..6694e7a14745 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -268,14 +268,7 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
return (swp_entry_t) { entry.val + n };
}
-/* temporary disallow batched swap operations */
-static inline swp_entry_t swap_move(swp_entry_t entry, long delta)
-{
- swp_entry_t next_entry;
-
- next_entry.val = 0;
- return next_entry;
-}
+swp_entry_t swap_move(swp_entry_t entry, long delta);
#else
static inline swp_entry_t swap_nth(swp_entry_t entry, long n)
{
@@ -344,8 +337,6 @@ static inline pte_t pte_next_swp_offset(pte_t pte)
* max_nr must be at least one and must be limited by the caller so scanning
* cannot exceed a single page table.
*
- * Note that for virtual swap space, we will not batch anything for now.
- *
* Return: the number of table entries in the batch.
*/
static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
@@ -360,9 +351,6 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte)
VM_WARN_ON(!is_swap_pte(pte));
VM_WARN_ON(non_swap_entry(entry));
- if (IS_ENABLED(CONFIG_VIRTUAL_SWAP))
- return 1;
-
cgroup_id = lookup_swap_cgroup_id(entry);
while (ptep < end_ptep) {
pte = ptep_get(ptep);
diff --git a/mm/memory.c b/mm/memory.c
index d9c382a5e157..b0b23348d9be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4230,10 +4230,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* A large swapped out folio could be partially or fully in zswap. We
* lack handling for such cases, so fallback to swapping in order-0
* folio.
- *
- * We also disable THP swapin on the virtual swap implementation, for now.
*/
- if (!zswap_never_enabled() || IS_ENABLED(CONFIG_VIRTUAL_SWAP))
+ if (!zswap_never_enabled())
goto fallback;
entry = pte_to_swp_entry(vmf->orig_pte);
@@ -4423,6 +4421,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
}
need_clear_cache = true;
+ /*
+ * Recheck to make sure the entire range is still
+ * THP-swapin-able. Note that before we call
+ * swapcache_prepare(), entries in the range can
+ * still have their backing status changed.
+ */
+ if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) &&
+ !vswap_can_swapin_thp(entry, nr_pages)) {
+ schedule_timeout_uninterruptible(1);
+ goto out_page;
+ }
+
mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
shadow = get_shadow_from_swap_cache(entry);
diff --git a/mm/vswap.c b/mm/vswap.c
index c51ff5c54480..4aeb144921b8 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -9,6 +9,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/swap_cgroup.h>
+#include "internal.h"
#include "swap.h"
/*
@@ -984,7 +985,7 @@ void swap_zeromap_folio_set(struct folio *folio)
*
* Note that this check is racy unless we can ensure that the entire range
* has their backing state stable - for instance, if the caller was the one
- * who set the in_swapcache flag of the entire field.
+ * who set the swap cache pin.
*/
static int vswap_check_backing(swp_entry_t entry, enum swap_type *type, int nr)
{
@@ -1067,6 +1068,94 @@ bool vswap_folio_backed(swp_entry_t entry, int nr)
&& type == VSWAP_FOLIO;
}
+/**
+ * vswap_can_swapin_thp - check if the swap entries can be swapped in as a THP.
+ * @entry: the first virtual swap slot in the range.
+ * @nr: the number of slots in the range.
+ *
+ * For now, we can only swap in a THP if the entire range is zero-filled, or if
+ * the entire range is backed by a contiguous range of physical swap slots on a
+ * swapfile.
+ */
+bool vswap_can_swapin_thp(swp_entry_t entry, int nr)
+{
+ enum swap_type type;
+
+ return vswap_check_backing(entry, &type, nr) == nr &&
+ (type == VSWAP_ZERO || type == VSWAP_SWAPFILE);
+}
+
+/**
+ * swap_move - increment the swap slot by delta, checking the backing state
+ * and returning 0 if the backing state does not match (i.e. wrong
+ * backing type, or wrong offset in the backing store).
+ * @entry: the original virtual swap slot.
+ * @delta: the offset to increment the original slot.
+ *
+ * Note that this function is racy unless we can pin the backing state of these
+ * swap slots down with swapcache_prepare().
+ *
+ * Otherwise, callers should only rely on this function as a best-effort
+ * hint, and double-check after ensuring the whole range is pinned down.
+ *
+ * Return: the incremented virtual swap slot if the backing state matches, or
+ * 0 if the backing state does not match.
+ */
+swp_entry_t swap_move(swp_entry_t entry, long delta)
+{
+ struct swp_desc *desc, *next_desc;
+ swp_entry_t next_entry;
+ bool invalid = true;
+ struct folio *folio;
+ enum swap_type type;
+ swp_slot_t slot;
+
+ next_entry.val = entry.val + delta;
+
+ rcu_read_lock();
+ desc = xa_load(&vswap_map, entry.val);
+ next_desc = xa_load(&vswap_map, next_entry.val);
+
+ if (!desc || !next_desc) {
+ rcu_read_unlock();
+ return (swp_entry_t){0};
+ }
+
+ read_lock(&desc->lock);
+ if (desc->type == VSWAP_ZSWAP) {
+ read_unlock(&desc->lock);
+ goto rcu_unlock;
+ }
+
+ type = desc->type;
+ if (type == VSWAP_FOLIO)
+ folio = desc->folio;
+
+ if (type == VSWAP_SWAPFILE)
+ slot = desc->slot;
+ read_unlock(&desc->lock);
+
+ read_lock(&next_desc->lock);
+ if (next_desc->type != type)
+ goto next_unlock;
+
+ if (type == VSWAP_SWAPFILE &&
+ (swp_slot_type(next_desc->slot) != swp_slot_type(slot) ||
+ swp_slot_offset(next_desc->slot) !=
+ swp_slot_offset(slot) + delta))
+ goto next_unlock;
+
+ if (type == VSWAP_FOLIO && next_desc->folio != folio)
+ goto next_unlock;
+
+ invalid = false;
+next_unlock:
+ read_unlock(&next_desc->lock);
+rcu_unlock:
+ rcu_read_unlock();
+ return invalid ? (swp_entry_t){0} : next_entry;
+}
+
/*
* Return the count of contiguous swap entries that share the same
* VSWAP_ZERO status as the starting entry. If is_zeromap is not NULL,
--
2.47.1
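As a reading aid for the swap_move() contract above, the following is a
minimal usage sketch (illustrative only - probe_contiguous_backing is a
made-up helper, not part of this series), showing how a caller might probe
how far a range of virtual slots still shares contiguous backing before
pinning the range down:

/*
 * Illustrative sketch: count how many of the nr slots starting at entry
 * still share contiguous, same-type backing, per swap_move(). The result
 * is only a best-effort hint until the range is pinned (e.g. via
 * swapcache_prepare()) and re-checked.
 */
static int probe_contiguous_backing(swp_entry_t entry, int nr)
{
	int i;

	for (i = 1; i < nr; i++)
		if (!swap_move(entry, i).val)
			break;
	return i;
}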
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 16/18] swap: simplify swapoff using virtual swap
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (14 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 15/18] vswap: support THP swapin and batch free_swap_and_cache Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 17/18] swapfile: move zeromap setup out of enable_swap_info Nhat Pham
` (3 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
This patch presents the second application of the virtual swap design -
simplifying and optimizing swapoff.
With virtual swap slots stored in page table entries and used as indices
into various swap-related data structures, we no longer have to perform a
page table walk in swapoff. We simply iterate through all the allocated
swap slots on the swapfile, invoke the backward map, and fault them in.
This is significantly cleaner, as well as slightly more performant,
especially when there are a lot of unrelated VMAs (since the old swapoff
code would have to traverse through all of them).
In a simple benchmark, in which we swap off a 32 GB swapfile that is 50%
full while a process maps a 128 GB file into memory:
Baseline:
real: 25.54s
user: 0.00s
sys: 11.48s
New Design:
real: 11.69s
user: 0.00s
sys: 9.96s
Disregarding the real time reduction (which is mostly due to more IO
asynchrony), the new design reduces the kernel CPU time by about 13%.
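At a high level, the new try_to_unuse() (implemented in full in the
mm/swapfile.c hunk below) boils down to the following simplified sketch;
declarations, the retry loop, error handling, and the final drain of
concurrent swappers are omitted here:

	/* pass 1: submit reads for every allocated physical slot */
	for_each_allocated_offset(si, offset) {
		entry = swp_slot_to_swp_entry(swp_slot(type, offset));
		if (entry.val) {
			folio = pagein(entry, &splug, mpol);
			if (folio)
				folio_put(folio);
		}
	}

	/* pass 2: rebind each virtual slot from its physical slot to its folio */
	for_each_allocated_offset(si, offset) {
		slot = swp_slot(type, offset);
		entry = swp_slot_to_swp_entry(slot);
		folio = pagein(entry, &splug, mpol);
		folio_lock(folio);
		vswap_swapoff(entry, folio, slot);
		folio_unlock(folio);
		folio_put(folio);
	}

vswap_swapoff() then flips each descriptor from VSWAP_SWAPFILE to
VSWAP_FOLIO and releases the physical slots.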
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/shmem_fs.h | 3 +
include/linux/swap.h | 1 +
mm/shmem.c | 2 +
mm/swapfile.c | 127 +++++++++++++++++++++++++++++++++++++++
mm/vswap.c | 61 +++++++++++++++++++
5 files changed, 194 insertions(+)
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0b273a7b9f01..668b6add3b8f 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -108,7 +108,10 @@ extern void shmem_unlock_mapping(struct address_space *mapping);
extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
pgoff_t index, gfp_t gfp_mask);
extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
+
+#ifndef CONFIG_VIRTUAL_SWAP
int shmem_unuse(unsigned int type);
+#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
unsigned long shmem_allowable_huge_orders(struct inode *inode,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c5a16f1ca376..0c585103d228 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -774,6 +774,7 @@ void vswap_store_folio(swp_entry_t entry, struct folio *folio);
void swap_zeromap_folio_set(struct folio *folio);
void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry);
bool vswap_can_swapin_thp(swp_entry_t entry, int nr);
+void vswap_swapoff(swp_entry_t entry, struct folio *folio, swp_slot_t slot);
#else /* CONFIG_VIRTUAL_SWAP */
static inline int vswap_init(void)
{
diff --git a/mm/shmem.c b/mm/shmem.c
index 609971a2b365..fa792769e422 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1380,6 +1380,7 @@ static void shmem_evict_inode(struct inode *inode)
#endif
}
+#ifndef CONFIG_VIRTUAL_SWAP
static int shmem_find_swap_entries(struct address_space *mapping,
pgoff_t start, struct folio_batch *fbatch,
pgoff_t *indices, unsigned int type)
@@ -1525,6 +1526,7 @@ int shmem_unuse(unsigned int type)
return error;
}
+#endif /* CONFIG_VIRTUAL_SWAP */
/*
* Move the page from the page cache to the swap cache.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 83016d86eb1c..3aa3df10c3be 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2089,6 +2089,132 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
return i;
}
+#ifdef CONFIG_VIRTUAL_SWAP
+#define for_each_allocated_offset(si, offset) \
+ while (swap_usage_in_pages(si) && \
+ !signal_pending(current) && \
+ (offset = find_next_to_unuse(si, offset)) != 0)
+
+static struct folio *pagein(swp_entry_t entry, struct swap_iocb **splug,
+ struct mempolicy *mpol)
+{
+ bool folio_was_allocated;
+ struct folio *folio = __read_swap_cache_async(entry, GFP_KERNEL, mpol,
+ NO_INTERLEAVE_INDEX, &folio_was_allocated, false);
+
+ if (folio_was_allocated)
+ swap_read_folio(folio, splug);
+ return folio;
+}
+
+static int try_to_unuse(unsigned int type)
+{
+ struct swap_info_struct *si = swap_info[type];
+ struct swap_iocb *splug = NULL;
+ struct mempolicy *mpol;
+ struct blk_plug plug;
+ unsigned long offset;
+ struct folio *folio;
+ swp_entry_t entry;
+ swp_slot_t slot;
+ int ret = 0;
+
+ if (!atomic_long_read(&si->inuse_pages))
+ goto success;
+
+ mpol = get_task_policy(current);
+ blk_start_plug(&plug);
+
+ /* first round - submit the reads */
+ offset = 0;
+ for_each_allocated_offset(si, offset) {
+ slot = swp_slot(type, offset);
+ entry = swp_slot_to_swp_entry(slot);
+ if (!entry.val)
+ continue;
+
+ folio = pagein(entry, &splug, mpol);
+ if (folio)
+ folio_put(folio);
+ }
+ blk_finish_plug(&plug);
+ swap_read_unplug(splug);
+ lru_add_drain();
+
+ /* second round - updating the virtual swap slots' backing state */
+ offset = 0;
+ for_each_allocated_offset(si, offset) {
+ slot = swp_slot(type, offset);
+retry:
+ entry = swp_slot_to_swp_entry(slot);
+ if (!entry.val)
+ continue;
+
+ /* try to allocate swap cache folio */
+ folio = pagein(entry, &splug, mpol);
+ if (!folio) {
+ if (!swp_slot_to_swp_entry(swp_slot(type, offset)).val)
+ continue;
+
+ ret = -ENOMEM;
+ pr_err("swapoff: unable to allocate swap cache folio for %lu\n",
+ entry.val);
+ goto finish;
+ }
+
+ folio_lock(folio);
+ /*
+ * We need to check if the folio is still in swap cache. We can, for
+ * instance, race with zswap writeback, obtaining the temporary folio
+ * it allocated for decompression and writeback, which would be
+ * promptly deleted from swap cache. By the time we lock that folio,
+ * it might already contain stale data.
+ *
+ * Concurrent swap operations might have also come in before we
+ * reobtain the lock, deleting the folio from swap cache, invalidating
+ * the virtual swap slot, then swapping out the folio again.
+ *
+ * In all of these cases, we must retry the physical -> virtual lookup.
+ *
+ * Note that if everything is still valid, the virtual swap slot must
+ * correspond to the head page (since all previous swap slots have been
+ * freed).
+ */
+ if (!folio_test_swapcache(folio) || folio->swap.val != entry.val) {
+ folio_unlock(folio);
+ folio_put(folio);
+ if (signal_pending(current))
+ break;
+ schedule_timeout_uninterruptible(1);
+ goto retry;
+ }
+
+ folio_wait_writeback(folio);
+ vswap_swapoff(entry, folio, slot);
+ folio_unlock(folio);
+ folio_put(folio);
+ }
+
+finish:
+ if (ret == -ENOMEM)
+ return ret;
+
+ /* concurrent swappers might still be releasing physical swap slots... */
+ while (swap_usage_in_pages(si)) {
+ if (signal_pending(current))
+ return -EINTR;
+ schedule_timeout_uninterruptible(1);
+ }
+
+success:
+ /*
+ * Make sure that further cleanups after try_to_unuse() returns happen
+ * after swap_range_free() reduces si->inuse_pages to 0.
+ */
+ smp_mb();
+ return 0;
+}
+#else
static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
{
return pte_same(pte_swp_clear_flags(pte), swp_pte);
@@ -2479,6 +2605,7 @@ static int try_to_unuse(unsigned int type)
smp_mb();
return 0;
}
+#endif /* CONFIG_VIRTUAL_SWAP */
/*
* After a successful try_to_unuse, if no swap is now in use, we know
diff --git a/mm/vswap.c b/mm/vswap.c
index 4aeb144921b8..35261b5664ee 100644
--- a/mm/vswap.c
+++ b/mm/vswap.c
@@ -1252,6 +1252,67 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
swapcache_clear(NULL, entry, nr);
}
+/**
+ * vswap_swapoff - unlink a range of virtual swap slots from their backing
+ * physical swap slots on a swapfile that is being swapped off,
+ * and associate them with the swapped-in folio.
+ * @entry: the first virtual swap slot in the range.
+ * @folio: the folio swapped in and loaded into swap cache.
+ * @slot: the first physical swap slot in the range.
+ */
+void vswap_swapoff(swp_entry_t entry, struct folio *folio, swp_slot_t slot)
+{
+ int i = 0, nr = folio_nr_pages(folio);
+ struct swp_desc *desc;
+ unsigned int type = swp_slot_type(slot);
+ unsigned int offset = swp_slot_offset(slot);
+
+ XA_STATE(xas, &vswap_map, entry.val);
+
+ rcu_read_lock();
+ xas_for_each(&xas, desc, entry.val + nr - 1) {
+ if (xas_retry(&xas, desc))
+ continue;
+
+ write_lock(&desc->lock);
+ /*
+ * Concurrent swap operations might invalidate the originally obtained
+ * virtual swap slot, allowing it to be re-allocated, or change its
+ * backing state.
+ *
+ * We must re-check here to make sure we are not performing bogus backing
+ * store changes.
+ */
+ if (desc->type != VSWAP_SWAPFILE ||
+ swp_slot_type(desc->slot) != type) {
+ /* there should not be mixed backing states among the subpages */
+ VM_WARN_ON(i);
+ write_unlock(&desc->lock);
+ break;
+ }
+
+ VM_WARN_ON(swp_slot_offset(desc->slot) != offset + i);
+
+ xa_erase(&vswap_rmap, desc->slot.val);
+ desc->type = VSWAP_FOLIO;
+ desc->folio = folio;
+ write_unlock(&desc->lock);
+ i++;
+ }
+ rcu_read_unlock();
+
+ if (i) {
+ /*
+ * If we updated the virtual swap slots' backing, mark the folio
+ * dirty so that reclaimers will try to page it out again.
+ */
+ folio_mark_dirty(folio);
+ swap_slot_free_nr(slot, nr);
+ /* folio is in swap cache, so entries are guaranteed to be valid */
+ mem_cgroup_uncharge_swap(entry, nr);
+ }
+}
+
#ifdef CONFIG_MEMCG
static unsigned short vswap_cgroup_record(swp_entry_t entry,
unsigned short memcgid, unsigned int nr_ents)
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 17/18] swapfile: move zeromap setup out of enable_swap_info
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (15 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 16/18] swap: simplify swapoff using virtual swap Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 18/18] swapfile: remove zeromap in virtual swap implementation Nhat Pham
` (2 subsequent siblings)
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
In preparation for removing the zeromap in the virtual swap
implementation, move the zeromap setup step out of enable_swap_info()
to its callers, where necessary.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
mm/swapfile.c | 9 ++++-----
1 file changed, 4 insertions(+), 5 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3aa3df10c3be..3ed7edc800fe 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2767,8 +2767,7 @@ static int swap_node(struct swap_info_struct *si)
static void setup_swap_info(struct swap_info_struct *si, int prio,
unsigned char *swap_map,
- struct swap_cluster_info *cluster_info,
- unsigned long *zeromap)
+ struct swap_cluster_info *cluster_info)
{
int i;
@@ -2793,7 +2792,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
}
si->swap_map = swap_map;
si->cluster_info = cluster_info;
- si->zeromap = zeromap;
}
static void _enable_swap_info(struct swap_info_struct *si)
@@ -2825,7 +2823,8 @@ static void enable_swap_info(struct swap_info_struct *si, int prio,
{
spin_lock(&swap_lock);
spin_lock(&si->lock);
- setup_swap_info(si, prio, swap_map, cluster_info, zeromap);
+ setup_swap_info(si, prio, swap_map, cluster_info);
+ si->zeromap = zeromap;
spin_unlock(&si->lock);
spin_unlock(&swap_lock);
/*
@@ -2843,7 +2842,7 @@ static void reinsert_swap_info(struct swap_info_struct *si)
{
spin_lock(&swap_lock);
spin_lock(&si->lock);
- setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap);
+ setup_swap_info(si, si->prio, si->swap_map, si->cluster_info);
_enable_swap_info(si);
spin_unlock(&si->lock);
spin_unlock(&swap_lock);
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [RFC PATCH v2 18/18] swapfile: remove zeromap in virtual swap implementation
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (16 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 17/18] swapfile: move zeromap setup out of enable_swap_info Nhat Pham
@ 2025-04-29 23:38 ` Nhat Pham
2025-04-29 23:51 ` [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
2025-05-30 6:47 ` YoungJun Park
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:38 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
The virtual swap implementation does not use the zeromap for swapped-out
zero-filled pages. Remove it. This saves one bit per physical swap slot.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 2 ++
mm/swapfile.c | 12 ++++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0c585103d228..408368d56dfb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -312,7 +312,9 @@ struct swap_info_struct {
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map */
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
+#ifndef CONFIG_VIRTUAL_SWAP
unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */
+#endif
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
struct list_head full_clusters; /* full clusters list */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3ed7edc800fe..3d99bd02ede9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2824,7 +2824,9 @@ static void enable_swap_info(struct swap_info_struct *si, int prio,
spin_lock(&swap_lock);
spin_lock(&si->lock);
setup_swap_info(si, prio, swap_map, cluster_info);
+#ifndef CONFIG_VIRTUAL_SWAP
si->zeromap = zeromap;
+#endif
spin_unlock(&si->lock);
spin_unlock(&swap_lock);
/*
@@ -2885,7 +2887,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
{
struct swap_info_struct *p = NULL;
unsigned char *swap_map;
+#ifndef CONFIG_VIRTUAL_SWAP
unsigned long *zeromap;
+#endif
struct swap_cluster_info *cluster_info;
struct file *swap_file, *victim;
struct address_space *mapping;
@@ -3000,8 +3004,10 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->max = 0;
swap_map = p->swap_map;
p->swap_map = NULL;
+#ifndef CONFIG_VIRTUAL_SWAP
zeromap = p->zeromap;
p->zeromap = NULL;
+#endif
cluster_info = p->cluster_info;
p->cluster_info = NULL;
spin_unlock(&p->lock);
@@ -3014,7 +3020,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
kfree(p->global_cluster);
p->global_cluster = NULL;
vfree(swap_map);
+#ifndef CONFIG_VIRTUAL_SWAP
kvfree(zeromap);
+#endif
kvfree(cluster_info);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
@@ -3601,6 +3609,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto bad_swap_unlock_inode;
}
+#ifndef CONFIG_VIRTUAL_SWAP
/*
* Use kvmalloc_array instead of bitmap_zalloc as the allocation order might
* be above MAX_PAGE_ORDER incase of a large swap file.
@@ -3611,6 +3620,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
error = -ENOMEM;
goto bad_swap_unlock_inode;
}
+#endif
if (si->bdev && bdev_stable_writes(si->bdev))
si->flags |= SWP_STABLE_WRITES;
@@ -3722,7 +3732,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
si->flags = 0;
spin_unlock(&swap_lock);
vfree(swap_map);
+#ifndef CONFIG_VIRTUAL_SWAP
kvfree(zeromap);
+#endif
kvfree(cluster_info);
if (inced_nr_rotate_swap)
atomic_dec(&nr_rotate_swap);
--
2.47.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (17 preceding siblings ...)
2025-04-29 23:38 ` [RFC PATCH v2 18/18] swapfile: remove zeromap in virtual swap implementation Nhat Pham
@ 2025-04-29 23:51 ` Nhat Pham
2025-05-30 6:47 ` YoungJun Park
19 siblings, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-04-29 23:51 UTC (permalink / raw)
To: linux-mm
Cc: akpm, hannes, hughd, yosry.ahmed, mhocko, roman.gushchin,
shakeel.butt, muchun.song, len.brown, chengming.zhou, kasong,
chrisl, huang.ying.caritas, ryan.roberts, viro, baohua, osalvador,
lorenzo.stoakes, christophe.leroy, pavel, kernel-team,
linux-kernel, cgroups, linux-pm, peterx
On Tue, Apr 29, 2025 at 4:38 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Changelog:
> * v2:
> * Use a single atomic type (swap_refs) for reference counting
> purpose. This brings the size of the swap descriptor from 64 KB
> down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
bytes, not kilobytes. 48KB would be an INSANE overhead :)
Apologies for the brainfart.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
` (18 preceding siblings ...)
2025-04-29 23:51 ` [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
@ 2025-05-30 6:47 ` YoungJun Park
2025-05-30 16:52 ` Nhat Pham
19 siblings, 1 reply; 30+ messages in thread
From: YoungJun Park @ 2025-05-30 6:47 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
viro, baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> Changelog:
> * v2:
> * Use a single atomic type (swap_refs) for reference counting
> purpose. This brings the size of the swap descriptor from 64 KB
> down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> * Zeromap bitmap is removed in the virtual swap implementation.
> This saves one bit per phyiscal swapfile slot.
> * Rearrange the patches and the code change to make things more
> reviewable. Suggested by Johannes Weiner.
> * Update the cover letter a bit.
Hi Nhat,
Thank you for sharing this patch series.
I’ve read through it with great interest.
I’m part of a kernel team working on features related to multi-tier swapping,
and this patch set appears quite relevant
to our ongoing discussions and early-stage implementation.
I had a couple of questions regarding the future direction.
> * Multi-tier swapping (as mentioned in [5]), with transparent
> transferring (promotion/demotion) of pages across tiers (see [8] and
> [9]). Similar to swapoff, with the old design we would need to
> perform the expensive page table walk.
Based on the discussion in [5], it seems there was some exploration
around enabling per-cgroup selection of multiple tiers.
Do you envision the current design evolving in a similar direction
to those past discussions, or is there a different direction you're aiming for?
> This idea is very similar to Kairui's work to optimize the (physical)
> swap allocator. He is currently also working on a swap redesign (see
> [11]) - perhaps we can combine the two efforts to take advantage of
> the swap allocator's efficiency for virtual swap.
I noticed that your patch appears to be aligned with the work from Kairui.
It seems like the overall architecture may be headed toward introducing
a virtual swap device layer.
I'm curious if there’s already been any concrete discussion
around this abstraction, especially regarding how it might be layered over
multiple physical swap devices?
From a naive perspective, I imagine that while today’s swap devices
are in a 1:1 mapping with physical devices,
this virtual layer could introduce a 1:N relationship —
one virtual swap device mapped to multiple physical ones.
Would this virtual device behave as a new swappable block device
exposed via `swapon`, or is the plan to abstract it differently?
Thanks again for your work,
and I would greatly appreciate any insights you could share.
Best regards,
YoungJun Park
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-05-30 6:47 ` YoungJun Park
@ 2025-05-30 16:52 ` Nhat Pham
2025-05-30 16:54 ` Nhat Pham
2025-06-01 12:56 ` YoungJun Park
0 siblings, 2 replies; 30+ messages in thread
From: Nhat Pham @ 2025-05-30 16:52 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
viro, baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > Changelog:
> > * v2:
> > * Use a single atomic type (swap_refs) for reference counting
> > purpose. This brings the size of the swap descriptor from 64 KB
> > down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > * Zeromap bitmap is removed in the virtual swap implementation.
> > This saves one bit per phyiscal swapfile slot.
> > * Rearrange the patches and the code change to make things more
> > reviewable. Suggested by Johannes Weiner.
> > * Update the cover letter a bit.
>
> Hi Nhat,
>
> Thank you for sharing this patch series.
> I’ve read through it with great interest.
>
> I’m part of a kernel team working on features related to multi-tier swapping,
> and this patch set appears quite relevant
> to our ongoing discussions and early-stage implementation.
May I ask - what's the use case you're thinking of here? Remote swapping?
>
> I had a couple of questions regarding the future direction.
>
> > * Multi-tier swapping (as mentioned in [5]), with transparent
> > transferring (promotion/demotion) of pages across tiers (see [8] and
> > [9]). Similar to swapoff, with the old design we would need to
> > perform the expensive page table walk.
>
> Based on the discussion in [5], it seems there was some exploration
> around enabling per-cgroup selection of multiple tiers.
> Do you envision the current design evolving in a similar direction
> to those past discussions, or is there a different direction you're aiming for?
IIRC, that past design focused on the interface aspect of the problem,
but never actually touched the mechanism to implement a multi-tier
swapping solution.
The simple reason is it's impossible, or at least highly inefficient
to do it in the current design, i.e without virtualizing swap. Storing
the physical swap location in PTEs means that changing the swap
backend requires a full page table walk to update all the PTEs that
refer to the old physical swap location. So you have to pick your
poison - either:
1. Pick your backend at swap out time, and never change it. You might
not have sufficient information to decide at that time. It prevents
you from adapting to the change in workload dynamics and working set -
the access frequency of pages might change, so their physical location
should change accordingly.
2. Reserve the space in every tier, and associate them with the same
handle. This is kinda what zswap is doing. It is space efficient, and
create a lot of operational issues in production.
3. Bite the bullet and perform the page table walk. This is what
swapoff is doing, basically. Raise your hands if you're excited about
a full page table walk every time you want to evict a page from zswap
to disk swap. Booo.
This new design will give us an efficient way to perform tier transfer
- you need to figure out how to obtain the right to perform the
transfer (for now, through the swap cache - but you can perhaps
envision some sort of locks), and then you can simply make the change
at the virtual layer.
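To make that concrete, here is a very rough sketch (not part of this
series - vswap_demote_one and new_slot are made-up names) of what a
single-entry demotion could look like once the caller already owns the
transfer right, e.g. via the swap cache pin; allocation of the new slot,
the vswap_rmap update and freeing of the old backing are all omitted:

/*
 * Illustrative only: retarget one virtual swap slot to new_slot, a
 * physical slot on a lower tier that is assumed to already hold the
 * data. The page tables are never touched - PTEs keep pointing at the
 * same virtual slot, only the descriptor behind it changes.
 */
static void vswap_demote_one(swp_entry_t entry, swp_slot_t new_slot)
{
	struct swp_desc *desc;

	rcu_read_lock();
	desc = xa_load(&vswap_map, entry.val);
	if (desc) {
		write_lock(&desc->lock);
		desc->type = VSWAP_SWAPFILE;	/* now backed by the lower tier */
		desc->slot = new_slot;
		write_unlock(&desc->lock);
	}
	rcu_read_unlock();
}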
>
> > This idea is very similar to Kairui's work to optimize the (physical)
> > swap allocator. He is currently also working on a swap redesign (see
> > [11]) - perhaps we can combine the two efforts to take advantage of
> > the swap allocator's efficiency for virtual swap.
>
> I noticed that your patch appears to be aligned with the work from Kairui.
> It seems like the overall architecture may be headed toward introducing
> a virtual swap device layer.
> I'm curious if there’s already been any concrete discussion
> around this abstraction, especially regarding how it might be layered over
> multiple physical swap devices?
>
> From a naive perspective, I imagine that while today’s swap devices
> are in a 1:1 mapping with physical devices,
> this virtual layer could introduce a 1:N relationship —
> one virtual swap device mapped to multiple physical ones.
> Would this virtual device behave as a new swappable block device
> exposed via `swapon`, or is the plan to abstract it differently?
That was one of the ideas I was thinking of. Problem is this is a very
special "device", and I'm not entirely sure opting in through swapon
like that won't cause issues. Imagine the following scenario:
1. We swap on a normal swapfile.
2. Users swap things with the swapfile.
2. Sysadmin then swapon a virtual swap device.
It will be quite nightmarish to manage things - we need to be extra
vigilant in handling a physical swap slot for e.g, since it can back a
PTE or a virtual swap slot. Also, swapoff becomes less efficient
again. And the physical swap allocator, even with the swap table
change, doesn't quite work out of the box for virtual swap yet (see
[1]).
I think it's better to just keep it separate, for now, and adopt
elements from Kairui's work to make virtual swap allocation more
efficient. Not a hill I will die on though,
[1]: https://lore.kernel.org/linux-mm/CAKEwX=MmD___ukRrx=hLo7d_m1J_uG_Ke+us7RQgFUV2OSg38w@mail.gmail.com/
>
> Thanks again for your work,
> and I would greatly appreciate any insights you could share.
>
> Best regards,
> YoungJun Park
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-05-30 16:52 ` Nhat Pham
@ 2025-05-30 16:54 ` Nhat Pham
2025-06-01 12:56 ` YoungJun Park
1 sibling, 0 replies; 30+ messages in thread
From: Nhat Pham @ 2025-05-30 16:54 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
viro, baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Fri, May 30, 2025 at 9:52 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > Changelog:
> > > * v2:
> > > * Use a single atomic type (swap_refs) for reference counting
> > > purpose. This brings the size of the swap descriptor from 64 KB
> > > down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > This saves one bit per phyiscal swapfile slot.
> > > * Rearrange the patches and the code change to make things more
> > > reviewable. Suggested by Johannes Weiner.
> > > * Update the cover letter a bit.
> >
> > Hi Nhat,
> >
> > Thank you for sharing this patch series.
> > I’ve read through it with great interest.
> >
> > I’m part of a kernel team working on features related to multi-tier swapping,
> > and this patch set appears quite relevant
> > to our ongoing discussions and early-stage implementation.
>
> May I ask - what's the use case you're thinking of here? Remote swapping?
>
> >
> > I had a couple of questions regarding the future direction.
> >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > [9]). Similar to swapoff, with the old design we would need to
> > > perform the expensive page table walk.
> >
> > Based on the discussion in [5], it seems there was some exploration
> > around enabling per-cgroup selection of multiple tiers.
> > Do you envision the current design evolving in a similar direction
> > to those past discussions, or is there a different direction you're aiming for?
To be extra clear, I don't have an issue with a cgroup-based interface
for swap tiering like that.
I think the only objection at the time was that we did not really have a use
case in mind?
>
> IIRC, that past design focused on the interface aspect of the problem,
> but never actually touched the mechanism to implement a multi-tier
> swapping solution.
>
> The simple reason is it's impossible, or at least highly inefficient
> to do it in the current design, i.e without virtualizing swap. Storing
> the physical swap location in PTEs means that changing the swap
> backend requires a full page table walk to update all the PTEs that
> refer to the old physical swap location. So you have to pick your
> poison - either:
>
> 1. Pick your backend at swap out time, and never change it. You might
> not have sufficient information to decide at that time. It prevents
> you from adapting to the change in workload dynamics and working set -
> the access frequency of pages might change, so their physical location
> should change accordingly.
>
> 2. Reserve the space in every tier, and associate them with the same
> handle. This is kinda what zswap is doing. It is space efficient, and
> create a lot of operational issues in production.
s/efficient/inefficient
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-05-30 16:52 ` Nhat Pham
2025-05-30 16:54 ` Nhat Pham
@ 2025-06-01 12:56 ` YoungJun Park
2025-06-01 16:14 ` Kairui Song
2025-06-01 21:08 ` Nhat Pham
1 sibling, 2 replies; 30+ messages in thread
From: YoungJun Park @ 2025-06-01 12:56 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
viro, baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > Changelog:
> > > * v2:
> > > * Use a single atomic type (swap_refs) for reference counting
> > > purpose. This brings the size of the swap descriptor from 64 KB
> > > down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > This saves one bit per phyiscal swapfile slot.
> > > * Rearrange the patches and the code change to make things more
> > > reviewable. Suggested by Johannes Weiner.
> > > * Update the cover letter a bit.
> >
> > Hi Nhat,
> >
> > Thank you for sharing this patch series.
> > I’ve read through it with great interest.
> >
> > I’m part of a kernel team working on features related to multi-tier swapping,
> > and this patch set appears quite relevant
> > to our ongoing discussions and early-stage implementation.
>
> May I ask - what's the use case you're thinking of here? Remote swapping?
>
Yes, that's correct.
Our usage scenario includes remote swap,
and we're experimenting with assigning swap tiers per cgroup
in order to improve specific scene of our target device performance.
We’ve explored several approaches and PoCs around this,
and in the process of evaluating
whether our direction could eventually be aligned
with the upstream kernel,
I came across your patchset and wanted to ask whether
similar efforts have been discussed or attempted before.
> >
> > I had a couple of questions regarding the future direction.
> >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > [9]). Similar to swapoff, with the old design we would need to
> > > perform the expensive page table walk.
> >
> > Based on the discussion in [5], it seems there was some exploration
> > around enabling per-cgroup selection of multiple tiers.
> > Do you envision the current design evolving in a similar direction
> > to those past discussions, or is there a different direction you're aiming for?
>
> IIRC, that past design focused on the interface aspect of the problem,
> but never actually touched the mechanism to implement a multi-tier
> swapping solution.
>
> The simple reason is it's impossible, or at least highly inefficient
> to do it in the current design, i.e without virtualizing swap. Storing
As you pointed out, there are certainly inefficiencies
in supporting this use case with the current design,
but if there is a valid use case,
I believe there’s room for it to be supported in the current model
—possibly in a less optimized form—
until a virtual swap device becomes available
and provides a more efficient solution.
What do you think about?
> the physical swap location in PTEs means that changing the swap
> backend requires a full page table walk to update all the PTEs that
> refer to the old physical swap location. So you have to pick your
> poison - either:
> 1. Pick your backend at swap out time, and never change it. You might
> not have sufficient information to decide at that time. It prevents
> you from adapting to the change in workload dynamics and working set -
> the access frequency of pages might change, so their physical location
> should change accordingly.
>
> 2. Reserve the space in every tier, and associate them with the same
> handle. This is kinda what zswap is doing. It is space efficient, and
> create a lot of operational issues in production.
>
> 3. Bite the bullet and perform the page table walk. This is what
> swapoff is doing, basically. Raise your hands if you're excited about
> a full page table walk every time you want to evict a page from zswap
> to disk swap. Booo.
>
> This new design will give us an efficient way to perform tier transfer
> - you need to figure out how to obtain the right to perform the
> transfer (for now, through the swap cache - but you can perhaps
> envision some sort of locks), and then you can simply make the change
> at the virtual layer.
>
One idea that comes to mind is whether the backend swap tier for
a page could be lazily adjusted at runtime—either reactively
or via an explicit interface—before the tier changes.
Alternatively, if it's preferable to leave pages untouched
when the tier configuration changes at runtime,
perhaps we could consider making this behavior configurable as well.
> >
> > > This idea is very similar to Kairui's work to optimize the (physical)
> > > swap allocator. He is currently also working on a swap redesign (see
> > > [11]) - perhaps we can combine the two efforts to take advantage of
> > > the swap allocator's efficiency for virtual swap.
> >
> > I noticed that your patch appears to be aligned with the work from Kairui.
> > It seems like the overall architecture may be headed toward introducing
> > a virtual swap device layer.
> > I'm curious if there’s already been any concrete discussion
> > around this abstraction, especially regarding how it might be layered over
> > multiple physical swap devices?
> >
> > From a naive perspective, I imagine that while today’s swap devices
> > are in a 1:1 mapping with physical devices,
> > this virtual layer could introduce a 1:N relationship —
> > one virtual swap device mapped to multiple physical ones.
> > Would this virtual device behave as a new swappable block device
> > exposed via `swapon`, or is the plan to abstract it differently?
>
> That was one of the ideas I was thinking of. Problem is this is a very
> special "device", and I'm not entirely sure opting in through swapon
> like that won't cause issues. Imagine the following scenario:
>
> 1. We swap on a normal swapfile.
>
> 2. Users swap things with the swapfile.
>
> 2. Sysadmin then swapon a virtual swap device.
>
> It will be quite nightmarish to manage things - we need to be extra
> vigilant in handling a physical swap slot for e.g, since it can back a
> PTE or a virtual swap slot. Also, swapoff becomes less efficient
> again. And the physical swap allocator, even with the swap table
> change, doesn't quite work out of the box for virtual swap yet (see
> [1]).
>
> I think it's better to just keep it separate, for now, and adopt
> elements from Kairui's work to make virtual swap allocation more
> efficient. Not a hill I will die on though,
>
> [1]: https://lore.kernel.org/linux-mm/CAKEwX=MmD___ukRrx=hLo7d_m1J_uG_Ke+us7RQgFUV2OSg38w@mail.gmail.com/
>
I also appreciate your thoughts on keeping the virtual
and physical swap paths separate for now.
Thanks for sharing your perspective
—it was helpful to understand the design direction.
Best regards,
YoungJun Park
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-06-01 12:56 ` YoungJun Park
@ 2025-06-01 16:14 ` Kairui Song
2025-06-02 15:17 ` YoungJun Park
2025-06-02 18:29 ` Nhat Pham
2025-06-01 21:08 ` Nhat Pham
1 sibling, 2 replies; 30+ messages in thread
From: Kairui Song @ 2025-06-01 16:14 UTC (permalink / raw)
To: YoungJun Park
Cc: Nhat Pham, linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts, viro,
baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Sun, Jun 1, 2025 at 8:56 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > Changelog:
> > > > * v2:
> > > > * Use a single atomic type (swap_refs) for reference counting
> > > > purpose. This brings the size of the swap descriptor from 64 KB
> > > > down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > > This saves one bit per phyiscal swapfile slot.
> > > > * Rearrange the patches and the code change to make things more
> > > > reviewable. Suggested by Johannes Weiner.
> > > > * Update the cover letter a bit.
> > >
> > > Hi Nhat,
> > >
> > > Thank you for sharing this patch series.
> > > I’ve read through it with great interest.
> > >
> > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > and this patch set appears quite relevant
> > > to our ongoing discussions and early-stage implementation.
> >
> > May I ask - what's the use case you're thinking of here? Remote swapping?
> >
>
> Yes, that's correct.
> Our usage scenario includes remote swap,
> and we're experimenting with assigning swap tiers per cgroup
> in order to improve specific scene of our target device performance.
>
> We’ve explored several approaches and PoCs around this,
> and in the process of evaluating
> whether our direction could eventually be aligned
> with the upstream kernel,
> I came across your patchset and wanted to ask whether
> similar efforts have been discussed or attempted before.
>
> > >
> > > I had a couple of questions regarding the future direction.
> > >
> > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > [9]). Similar to swapoff, with the old design we would need to
> > > > perform the expensive page table walk.
> > >
> > > Based on the discussion in [5], it seems there was some exploration
> > > around enabling per-cgroup selection of multiple tiers.
> > > Do you envision the current design evolving in a similar direction
> > > to those past discussions, or is there a different direction you're aiming for?
> >
> > IIRC, that past design focused on the interface aspect of the problem,
> > but never actually touched the mechanism to implement a multi-tier
> > swapping solution.
> >
> > The simple reason is it's impossible, or at least highly inefficient
> > to do it in the current design, i.e without virtualizing swap. Storing
>
> As you pointed out, there are certainly inefficiencies
> in supporting this use case with the current design,
> but if there is a valid use case,
> I believe there’s room for it to be supported in the current model
> —possibly in a less optimized form—
> until a virtual swap device becomes available
> and provides a more efficient solution.
> What do you think about?
Hi All,
I'd like to share some info from my side. Currently we have an
internal solution for multi-tier swap, implemented based on ZRAM and
writeback: 4 compression levels and multiple block-layer levels. The
ZRAM table serves a similar role to the swap table in the "swap table
series" or the virtual layer here.
We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
supports per-cgroup priority, and per-cgroup writeback control, and it
worked perfectly fine in production.
The interface looks something like this:
/sys/fs/cgroup/cg1/zram.prio: [1-4]
/sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
/sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
It's really nothing fancy or complex: the four priorities are simply the
four ZRAM compression streams that are already upstream, and you can
simply hardcode four *bdev in "struct zram" and reuse the bits, then
chain the write bio with a new bio to the underlying device... Getting the
priority info of a cgroup is even simpler once ZRAM is cgroup aware.
All interfaces can be adjusted dynamically at any time (e.g. by an
agent), and already swapped out pages won't be touched. The block
devices are specified in ZRAM's sys files during swapon.
It's easy to implement, but not a good idea for upstream at all:
redundant layers, and performance is bad (if not optimized):
- it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
SYNCHRONIZE_IO completely which actually improved the performance in
every aspect (I've been trying to upstream this for a while);
- ZRAM's block device allocator is just not good (just a bitmap) so we
want to use the SWAP allocator directly (which I'm also trying to
upstream with the swap table series);
- And many other bits and pieces like bio batching are kind of broken,
busy loop due to the ZRAM_WB bit, etc...
- Lacking support for things like effective migration/compaction,
doable but looks horrible.
So I definitely don't like this band-aid solution, but hey, it works.
I'm looking forward to replacing it with native upstream support.
That's one of the motivations behind the swap table series, which
I think would resolve the problems in an elegant and clean way
upstream. The initial tests do show it has a much lower overhead
and cleans up SWAP.
But maybe this is kind of similar to the "less optimized form" you
are talking about? As I mentioned, I'm already trying to upstream
some nice parts of it, and hope to eventually replace it with a proper
upstream solution.
I can try to upstream other parts of it if people are really
interested, but I strongly recommend that we focus on the
right approach instead, and not waste time on that or spam the
mailing list.
I have no special preference on how the final upstream interface
should look. But currently SWAP devices already have priorities,
so maybe we should just make use of that.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-06-01 12:56 ` YoungJun Park
2025-06-01 16:14 ` Kairui Song
@ 2025-06-01 21:08 ` Nhat Pham
2025-06-02 15:03 ` YoungJun Park
1 sibling, 1 reply; 30+ messages in thread
From: Nhat Pham @ 2025-06-01 21:08 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
viro, baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Sun, Jun 1, 2025 at 5:56 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > Changelog:
> > > > * v2:
> > > > * Use a single atomic type (swap_refs) for reference counting
> > > > purpose. This brings the size of the swap descriptor from 64 KB
> > > > down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > > This saves one bit per phyiscal swapfile slot.
> > > > * Rearrange the patches and the code change to make things more
> > > > reviewable. Suggested by Johannes Weiner.
> > > > * Update the cover letter a bit.
> > >
> > > Hi Nhat,
> > >
> > > Thank you for sharing this patch series.
> > > I’ve read through it with great interest.
> > >
> > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > and this patch set appears quite relevant
> > > to our ongoing discussions and early-stage implementation.
> >
> > May I ask - what's the use case you're thinking of here? Remote swapping?
> >
>
> Yes, that's correct.
> Our usage scenario includes remote swap,
> and we're experimenting with assigning swap tiers per cgroup
> in order to improve specific scene of our target device performance.
Hmm, that can be a start. Right now, we have only 2 swap tiers
essentially, so memory.(z)swap.max and memory.zswap.writeback is
usually sufficient to describe the tiering interface. But if you have
an alternative use case in mind feel free to send a RFC to explore
this!
>
> We’ve explored several approaches and PoCs around this,
> and in the process of evaluating
> whether our direction could eventually be aligned
> with the upstream kernel,
> I came across your patchset and wanted to ask whether
> similar efforts have been discussed or attempted before.
I think it is occasionally touched upon in discussion, but AFAICS
there has not been really an actual upstream patch to add such an
interface.
>
> > >
> > > I had a couple of questions regarding the future direction.
> > >
> > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > [9]). Similar to swapoff, with the old design we would need to
> > > > perform the expensive page table walk.
> > >
> > > Based on the discussion in [5], it seems there was some exploration
> > > around enabling per-cgroup selection of multiple tiers.
> > > Do you envision the current design evolving in a similar direction
> > > to those past discussions, or is there a different direction you're aiming for?
> >
> > IIRC, that past design focused on the interface aspect of the problem,
> > but never actually touched the mechanism to implement a multi-tier
> > swapping solution.
> >
> > The simple reason is it's impossible, or at least highly inefficient
> > to do it in the current design, i.e without virtualizing swap. Storing
>
> As you pointed out, there are certainly inefficiencies
> in supporting this use case with the current design,
> but if there is a valid use case,
> I believe there’s room for it to be supported in the current model
> —possibly in a less optimized form—
> until a virtual swap device becomes available
> and provides a more efficient solution.
> What do you think about?
Which less optimized form are you thinking of?
>
> > the physical swap location in PTEs means that changing the swap
> > backend requires a full page table walk to update all the PTEs that
> > refer to the old physical swap location. So you have to pick your
> > poison - either:
> > 1. Pick your backend at swap out time, and never change it. You might
> > not have sufficient information to decide at that time. It prevents
> > you from adapting to the change in workload dynamics and working set -
> > the access frequency of pages might change, so their physical location
> > should change accordingly.
> >
> > 2. Reserve the space in every tier, and associate them with the same
> > handle. This is kinda what zswap is doing. It is space efficient, and
> > create a lot of operational issues in production.
> >
> > 3. Bite the bullet and perform the page table walk. This is what
> > swapoff is doing, basically. Raise your hands if you're excited about
> > a full page table walk every time you want to evict a page from zswap
> > to disk swap. Booo.
> >
> > This new design will give us an efficient way to perform tier transfer
> > - you need to figure out how to obtain the right to perform the
> > transfer (for now, through the swap cache - but you can perhaps
> > envision some sort of locks), and then you can simply make the change
> > at the virtual layer.
> >
>
> One idea that comes to mind is whether the backend swap tier for
> a page could be lazily adjusted at runtime—either reactively
> or via an explicit interface—before the tier changes.
> Alternatively, if it's preferable to leave pages untouched
> when the tier configuration changes at runtime,
> perhaps we could consider making this behavior configurable as well.
>
I don't quite understand - could you expand on this?
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-06-01 21:08 ` Nhat Pham
@ 2025-06-02 15:03 ` YoungJun Park
0 siblings, 0 replies; 30+ messages in thread
From: YoungJun Park @ 2025-06-02 15:03 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, kasong, chrisl, huang.ying.caritas, ryan.roberts,
viro, baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Sun, Jun 01, 2025 at 02:08:22PM -0700, Nhat Pham wrote:
> On Sun, Jun 1, 2025 at 5:56 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > > >
> > > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > > Changelog:
> > > > > * v2:
> > > > > * Use a single atomic type (swap_refs) for reference counting
> > > > > purpose. This brings the size of the swap descriptor from 64 KB
> > > > > down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > > > This saves one bit per phyiscal swapfile slot.
> > > > > * Rearrange the patches and the code change to make things more
> > > > > reviewable. Suggested by Johannes Weiner.
> > > > > * Update the cover letter a bit.
> > > >
> > > > Hi Nhat,
> > > >
> > > > Thank you for sharing this patch series.
> > > > I’ve read through it with great interest.
> > > >
> > > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > > and this patch set appears quite relevant
> > > > to our ongoing discussions and early-stage implementation.
> > >
> > > May I ask - what's the use case you're thinking of here? Remote swapping?
> > >
> >
> > Yes, that's correct.
> > Our usage scenario includes remote swap,
> > and we're experimenting with assigning swap tiers per cgroup
> > in order to improve specific scene of our target device performance.
>
> Hmm, that can be a start. Right now, we have only 2 swap tiers
> essentially, so memory.(z)swap.max and memory.zswap.writeback is
> usually sufficient to describe the tiering interface. But if you have
> an alternative use case in mind feel free to send a RFC to explore
> this!
>
Yes, sounds good.
I've organized the details of our swap tiering approach
including the specific use case we are trying to solve.
This approach is based on leveraging
the existing priority mechanism in the swap subsystem.
I’ll be sharing it as an RFC shortly.
> >
> > We’ve explored several approaches and PoCs around this,
> > and in the process of evaluating
> > whether our direction could eventually be aligned
> > with the upstream kernel,
> > I came across your patchset and wanted to ask whether
> > similar efforts have been discussed or attempted before.
>
> I think it is occasionally touched upon in discussion, but AFAICS
> there has not been really an actual upstream patch to add such an
> interface.
>
> >
> > > >
> > > > I had a couple of questions regarding the future direction.
> > > >
> > > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > > [9]). Similar to swapoff, with the old design we would need to
> > > > > perform the expensive page table walk.
> > > >
> > > > Based on the discussion in [5], it seems there was some exploration
> > > > around enabling per-cgroup selection of multiple tiers.
> > > > Do you envision the current design evolving in a similar direction
> > > > to those past discussions, or is there a different direction you're aiming for?
> > >
> > > IIRC, that past design focused on the interface aspect of the problem,
> > > but never actually touched the mechanism to implement a multi-tier
> > > swapping solution.
> > >
> > > The simple reason is it's impossible, or at least highly inefficient
> > > to do it in the current design, i.e without virtualizing swap. Storing
> >
> > As you pointed out, there are certainly inefficiencies
> > in supporting this use case with the current design,
> > but if there is a valid use case,
> > I believe there’s room for it to be supported in the current model
> > —possibly in a less optimized form—
> > until a virtual swap device becomes available
> > and provides a more efficient solution.
> > What do you think about?
>
> Which less optimized form are you thinking of?
>
I just meant that the current swap design would be less optimized
regardless of the form of tiering applied,
not that my approach is less optimized.
That may have come across differently than I intended.
Please feel free to disregard that assumption —
I believe it would be more appropriate
to evaluate this based on the RFC I plan to share soon.
> >
> > > the physical swap location in PTEs means that changing the swap
> > > backend requires a full page table walk to update all the PTEs that
> > > refer to the old physical swap location. So you have to pick your
> > > poison - either:
> > > 1. Pick your backend at swap out time, and never change it. You might
> > > not have sufficient information to decide at that time. It prevents
> > > you from adapting to the change in workload dynamics and working set -
> > > the access frequency of pages might change, so their physical location
> > > should change accordingly.
> > >
> > > 2. Reserve the space in every tier, and associate them with the same
> > > handle. This is kinda what zswap is doing. It is space inefficient, and
> > > creates a lot of operational issues in production.
> > >
> > > 3. Bite the bullet and perform the page table walk. This is what
> > > swapoff is doing, basically. Raise your hands if you're excited about
> > > a full page table walk every time you want to evict a page from zswap
> > > to disk swap. Booo.
> > >
> > > This new design will give us an efficient way to perform tier transfer
> > > - you need to figure out how to obtain the right to perform the
> > > transfer (for now, through the swap cache - but you can perhaps
> > > envision some sort of locks), and then you can simply make the change
> > > at the virtual layer.
> > >
> >
> > One idea that comes to mind is whether the backend swap tier for
> > a page could be lazily adjusted at runtime—either reactively
> > or via an explicit interface—before the tier changes.
> > Alternatively, if it's preferable to leave pages untouched
> > when the tier configuration changes at runtime,
> > perhaps we could consider making this behavior configurable as well.
> >
>
> I don't quite understand - could you expand on this?
>
Regarding your point,
my understanding was that you were referring
to an immediate migration once a new swap tier is selected at runtime.
I was asking whether a lazy migration approach,
or even skipping migration altogether,
might be worth considering as alternatives.
I only mentioned it because, from our use case perspective,
immediate migration is not strictly necessary.
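
To make the tier-transfer argument quoted above concrete, here is a small
self-contained C model. It is not the actual series' data structure (the real
descriptor layout, locking and swap cache interactions are far more involved);
it only illustrates that once PTEs store an opaque handle, promotion/demotion
touches a single descriptor and never walks page tables.

#include <stdio.h>

enum tier { TIER_ZSWAP, TIER_SSD, TIER_REMOTE };

struct vswap_desc {
	enum tier tier;		/* which backend currently holds the data */
	unsigned long slot;	/* slot within that backend */
};

/* A tiny "descriptor table": handle == index, standing in for an xarray. */
static struct vswap_desc desc_table[16];

/* Many PTEs may encode the same handle; none of them change below. */
static unsigned long pte_of_page_a = 3, pte_of_page_a_cow_copy = 3;

static void transfer_tier(unsigned long handle, enum tier to,
			  unsigned long new_slot)
{
	/*
	 * In the real design one would first gain exclusive ownership
	 * (e.g. via the swap cache) and move the data; here we only show
	 * the metadata update, which touches a single descriptor.
	 */
	desc_table[handle].tier = to;
	desc_table[handle].slot = new_slot;
}

int main(void)
{
	desc_table[3] = (struct vswap_desc){ TIER_ZSWAP, 42 };

	transfer_tier(3, TIER_SSD, 7);	/* demote: zswap -> SSD swapfile */

	/* Both mappings still resolve through handle 3, now pointing at SSD. */
	printf("pte=%lu cow=%lu tier=%d slot=%lu\n",
	       pte_of_page_a, pte_of_page_a_cow_copy,
	       desc_table[3].tier, desc_table[3].slot);
	return 0;
}
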
Best regards,
YoungJun Park
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-06-01 16:14 ` Kairui Song
@ 2025-06-02 15:17 ` YoungJun Park
2025-06-02 18:29 ` Nhat Pham
1 sibling, 0 replies; 30+ messages in thread
From: YoungJun Park @ 2025-06-02 15:17 UTC (permalink / raw)
To: Kairui Song
Cc: Nhat Pham, linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts, viro,
baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Mon, Jun 02, 2025 at 12:14:53AM +0800, Kairui Song wrote:
> On Sun, Jun 1, 2025 at 8:56 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > > >
> > > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > > Changelog:
> > > > > * v2:
> > > > > * Use a single atomic type (swap_refs) for reference counting
> > > > > purpose. This brings the size of the swap descriptor from 64 KB
> > > > > down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > > > * Zeromap bitmap is removed in the virtual swap implementation.
> > > > > This saves one bit per physical swapfile slot.
> > > > > * Rearrange the patches and the code change to make things more
> > > > > reviewable. Suggested by Johannes Weiner.
> > > > > * Update the cover letter a bit.
> > > >
> > > > Hi Nhat,
> > > >
> > > > Thank you for sharing this patch series.
> > > > I’ve read through it with great interest.
> > > >
> > > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > > and this patch set appears quite relevant
> > > > to our ongoing discussions and early-stage implementation.
> > >
> > > May I ask - what's the use case you're thinking of here? Remote swapping?
> > >
> >
> > Yes, that's correct.
> > Our usage scenario includes remote swap,
> > and we're experimenting with assigning swap tiers per cgroup
> > in order to improve performance in specific scenarios on our target devices.
> >
> > We’ve explored several approaches and PoCs around this,
> > and in the process of evaluating
> > whether our direction could eventually be aligned
> > with the upstream kernel,
> > I came across your patchset and wanted to ask whether
> > similar efforts have been discussed or attempted before.
> >
> > > >
> > > > I had a couple of questions regarding the future direction.
> > > >
> > > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > > transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > > [9]). Similar to swapoff, with the old design we would need to
> > > > > perform the expensive page table walk.
> > > >
> > > > Based on the discussion in [5], it seems there was some exploration
> > > > around enabling per-cgroup selection of multiple tiers.
> > > > Do you envision the current design evolving in a similar direction
> > > > to those past discussions, or is there a different direction you're aiming for?
> > >
> > > IIRC, that past design focused on the interface aspect of the problem,
> > > but never actually touched the mechanism to implement a multi-tier
> > > swapping solution.
> > >
> > > The simple reason is it's impossible, or at least highly inefficient
> > > to do it in the current design, i.e without virtualizing swap. Storing
> >
> > As you pointed out, there are certainly inefficiencies
> > in supporting this use case with the current design,
> > but if there is a valid use case,
> > I believe there’s room for it to be supported in the current model
> > —possibly in a less optimized form—
> > until a virtual swap device becomes available
> > and provides a more efficient solution.
> > What do you think?
>
> Hi All,
>
> I'd like to share some info from my side. Currently we have an
> internal solution for multi-tier swap, implemented based on ZRAM and
> writeback: 4 compression levels and multiple block-layer levels. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
>
> We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
> supports per-cgroup priority, and per-cgroup writeback control, and it
> worked perfectly fine in production.
>
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
>
> It's really nothing fancy or complex: the four priorities are simply the
> four ZRAM compression streams that are already in upstream, and you can
> simply hardcode four *bdev in "struct zram" and reuse the bits, then
> chain the write bio with a new underlying bio... Getting the priority
> info of a cgroup is even simpler once ZRAM is cgroup aware.
>
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already swapped out pages won't be touched. The block
> devices are specified in ZRAM's sys files during swapon.
>
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
> SYNCHRONIZE_IO completely which actually improved the performance in
> every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (just a bitmap) so we
> want to use the SWAP allocator directly (which I'm also trying to
> upstream with the swap table series);
> - And many other bits and pieces like bio batching are kind of broken,
> busy loop due to the ZRAM_WB bit, etc...
> - Lacking support for things like effective migration/compaction,
> doable but looks horrible.
>
That's interesting — we've explored a similar idea as well,
although not by attaching it to ZRAM.
Instead, our concept involved creating a separate block device
capable of performing the tiering functionality, and using it as follows:
1. Prepare a block device that can manage multiple backend block devices.
2. Perform swapon on this block device.
3. Within the block device, use cgroup awareness
to carry out tiered swap operations across the prepared backend devices.
However, we ended up deferring this approach as a fallback option, mainly
due to the following concerns:
1. The idea of allocating physical slots but managing them internally
as logical slots felt inefficient.
2. Embedding cgroup awareness within a block device
seemed like a layer violation.
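
For illustration only, the routing step that both the ZRAM hack and a
separate tiering block device would need could be modeled roughly like this
(userspace C; the backend names, the cgroup_cfg struct and route_swapout()
are all made up and do not correspond to any real kernel interface):

#include <stdio.h>

#define NR_BACKENDS 4

struct backend {
	const char *name;	/* fastest tier first, slowest last */
};

static const struct backend backends[NR_BACKENDS] = {
	{ "zram-lz4" }, { "zram-zstd" }, { "local-nvme" }, { "remote-swap" },
};

struct cgroup_cfg {
	const char *path;
	int prio;		/* 1..NR_BACKENDS, like the zram.prio file above */
};

static const struct backend *route_swapout(const struct cgroup_cfg *cg)
{
	int idx = cg->prio - 1;

	if (idx < 0 || idx >= NR_BACKENDS)
		idx = 0;	/* fall back to the highest tier */
	return &backends[idx];
}

int main(void)
{
	struct cgroup_cfg latency_sensitive = { "/sys/fs/cgroup/cg1", 1 };
	struct cgroup_cfg background = { "/sys/fs/cgroup/cg2", 4 };

	printf("%s -> %s\n", latency_sensitive.path,
	       route_swapout(&latency_sensitive)->name);
	printf("%s -> %s\n", background.path,
	       route_swapout(&background)->name);
	return 0;
}
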
> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which
> I think would resolve the problems in an elegant and clean way
> upstream. The initial tests do show it has a much lower overhead
> and cleans up SWAP.
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned I'm already trying to upstream
> some nice parts of it, and hopefully replace it with an upstream
> solution finally.
>
> I can try upstreaming other parts of it if there are people really
> interested, but I strongly recommend that we should focus on the
> right approach instead and not waste time on that and spam the
> mail list.
I am in agreement with the points you’ve made.
> I have no special preference on how the final upstream interface
> should look. But currently SWAP devices already have priorities,
> so maybe we should just make use of that.
I have been exploring an interface design
that leverages the existing swap priority mechanism,
and I believe it would be valuable
to share this for further discussion and feedback.
As mentioned in my earlier response to Nhat,
I intend to submit this as an RFC to solicit broader input from the community.
Best regards,
YoungJun Park
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-06-01 16:14 ` Kairui Song
2025-06-02 15:17 ` YoungJun Park
@ 2025-06-02 18:29 ` Nhat Pham
2025-06-03 9:50 ` Kairui Song
1 sibling, 1 reply; 30+ messages in thread
From: Nhat Pham @ 2025-06-02 18:29 UTC (permalink / raw)
To: Kairui Song
Cc: YoungJun Park, linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts, viro,
baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> Hi All,
Thanks for sharing your setup, Kairui! I've always been curious about
multi-tier compression swapping.
>
> I'd like to share some info from my side. Currently we have an
> > internal solution for multi-tier swap, implemented based on ZRAM and
> > writeback: 4 compression levels and multiple block-layer levels. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
>
> We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
Hmmm this part seems a bit hacky to me too :-?
> supports per-cgroup priority, and per-cgroup writeback control, and it
> worked perfectly fine in production.
>
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
How do you do aging with multiple tiers like this? Or do you just rely
on time thresholds, and have userspace invoke writeback in a cron-job
style?
Tbh, I'm surprised that we see a performance win with recompression. I
understand that different workloads might benefit the most from
different points in the Pareto frontier of latency-memory saving:
latency-sensitive workloads might like a fast compression algorithm,
whereas other workloads might prefer a compression algorithm that
saves more memory. So a per-cgroup compressor selection can make
sense.
However, would the overhead of moving a page from one tier to the
other not eat up all the benefit from the (usually small) extra memory
savings?
>
> > It's really nothing fancy or complex: the four priorities are simply the
> > four ZRAM compression streams that are already in upstream, and you can
> > simply hardcode four *bdev in "struct zram" and reuse the bits, then
> > chain the write bio with a new underlying bio... Getting the priority
> info of a cgroup is even simpler once ZRAM is cgroup aware.
>
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already swapped out pages won't be touched. The block
> devices are specified in ZRAM's sys files during swapon.
>
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
> SYNCHRONIZE_IO completely which actually improved the performance in
> every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (just a bitmap) so we
> want to use the SWAP allocator directly (which I'm also trying to
> upstream with the swap table series);
> - And many other bits and pieces like bio batching are kind of broken,
Interesting, is zram doing writeback batching?
> busy loop due to the ZRAM_WB bit, etc...
Hmmm, this sounds like something swap cache can help with. It's the
approach zswap writeback is taking - concurrent accessors can get the
page in the swap cache, and OTOH zswap writeback backs off if it
detects swap cache contention (since the page is probably being
swapped in, freed, or written back by another thread).
But I'm not sure how zram writeback works...
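
A rough sketch of the ownership pattern described above, with a plain array
standing in for the swap cache. This is not zswap's actual code; it only
shows the idea that whoever inserts the entry first owns the writeback, and
everyone else backs off instead of spinning on a status bit:

#include <stdbool.h>
#include <stdio.h>

#define CACHE_SLOTS 64

static unsigned long swap_cache[CACHE_SLOTS];	/* 0 == empty slot */

/* Returns true if we inserted the entry (and thus own its writeback). */
static bool swap_cache_try_insert(unsigned long entry)
{
	unsigned long *slot = &swap_cache[entry % CACHE_SLOTS];

	if (*slot == entry)
		return false;	/* someone else already holds it */
	*slot = entry;
	return true;
}

static void writeback_entry(unsigned long entry)
{
	if (!swap_cache_try_insert(entry)) {
		/*
		 * Contention: the page is probably being swapped in, freed
		 * or written back by another thread, so just skip it.
		 */
		printf("entry %lu: contended, backing off\n", entry);
		return;
	}
	printf("entry %lu: owned, writing back\n", entry);
	/* ... move the data to the lower tier, then drop the cache ref ... */
}

int main(void)
{
	writeback_entry(5);	/* first caller wins */
	writeback_entry(5);	/* second caller backs off */
	return 0;
}
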
> - Lacking support for things like effective migration/compaction,
> doable but looks horrible.
>
> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which
> > I think would resolve the problems in an elegant and clean way
> > upstream. The initial tests do show it has a much lower overhead
> and cleans up SWAP.
>
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned I'm already trying to upstream
> some nice parts of it, and hopefully replace it with an upstream
> solution finally.
>
> > I can try upstreaming other parts of it if there are people really
> interested, but I strongly recommend that we should focus on the
> right approach instead and not waste time on that and spam the
> mail list.
I suppose a lot of this is specific to zram, but bits and pieces of it
sound upstreamable to me :)
We can wait for YoungJun's patches/RFC for further discussion, but perhaps:
1. A new cgroup interface to select swap backends for a cgroup.
2. Writeback/fallback order either designated by the above interface,
or by the priority of the swap backends.
>
> I have no special preference on how the final upstream interface
> should look. But currently SWAP devices already have priorities,
> so maybe we should just make use of that.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [RFC PATCH v2 00/18] Virtual Swap Space
2025-06-02 18:29 ` Nhat Pham
@ 2025-06-03 9:50 ` Kairui Song
0 siblings, 0 replies; 30+ messages in thread
From: Kairui Song @ 2025-06-03 9:50 UTC (permalink / raw)
To: Nhat Pham
Cc: YoungJun Park, linux-mm, akpm, hannes, hughd, yosry.ahmed, mhocko,
roman.gushchin, shakeel.butt, muchun.song, len.brown,
chengming.zhou, chrisl, huang.ying.caritas, ryan.roberts, viro,
baohua, osalvador, lorenzo.stoakes, christophe.leroy, pavel,
kernel-team, linux-kernel, cgroups, linux-pm, peterx, gunho.lee,
taejoon.song, iamjoonsoo.kim
On Tue, Jun 3, 2025 at 2:30 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> >
> > Hi All,
>
> Thanks for sharing your setup, Kairui! I've always been curious about
> multi-tier compression swapping.
>
> >
> > I'd like to share some info from my side. Currently we have an
> > internal solution for multi-tier swap, implemented based on ZRAM and
> > writeback: 4 compression levels and multiple block-layer levels. The
> > ZRAM table serves a similar role to the swap table in the "swap table
> > series" or the virtual layer here.
> >
> > We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
>
> Hmmm this part seems a bit hacky to me too :-?
Yeah, terribly hackish :P
One of the reasons why I'm trying to retire it.
>
> > supports per-cgroup priority, and per-cgroup writeback control, and it
> > worked perfectly fine in production.
> >
> > The interface looks something like this:
> > /sys/fs/cgroup/cg1/zram.prio: [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
>
> How do you do aging with multiple tiers like this? Or do you just rely
> on time thresholds, and have userspace invoke writeback in a cron-job
> style?
ZRAM already has a time threshold, and I added another LRU for
swapped-out entries. Aging is supposed to be done by userspace agents;
I didn't mention it here as these details are becoming more and more
irrelevant to an upstream implementation.
> Tbh, I'm surprised that we see a performance win with recompression. I
> understand that different workloads might benefit the most from
> different points in the Pareto frontier of latency-memory saving:
> latency-sensitive workloads might like a fast compression algorithm,
> whereas other workloads might prefer a compression algorithm that
> saves more memory. So a per-cgroup compressor selection can make
> sense.
>
> However, would the overhead of moving a page from one tier to the
> other not eat up all the benefit from the (usually small) extra memory
> savings?
So far we are not re-compressing things, but per-cgroup compression /
writeback level is useful indeed. Compressed memory gets written back
to the block device; that's a large gain.
> > It's really nothing fancy or complex: the four priorities are simply the
> > four ZRAM compression streams that are already in upstream, and you can
> > simply hardcode four *bdev in "struct zram" and reuse the bits, then
> > chain the write bio with a new underlying bio... Getting the priority
> > info of a cgroup is even simpler once ZRAM is cgroup aware.
> >
> > All interfaces can be adjusted dynamically at any time (e.g. by an
> > agent), and already swapped out pages won't be touched. The block
> > devices are specified in ZRAM's sys files during swapon.
> >
> > It's easy to implement, but not a good idea for upstream at all:
> > redundant layers, and performance is bad (if not optimized):
> > - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
> > SYNCHRONIZE_IO completely which actually improved the performance in
> > every aspect (I've been trying to upstream this for a while);
> > - ZRAM's block device allocator is just not good (just a bitmap) so we
> > want to use the SWAP allocator directly (which I'm also trying to
> > upstream with the swap table series);
> > - And many other bits and pieces like bio batching are kind of broken,
>
> Interesting, is zram doing writeback batching?
Nope, it even has a comment saying "XXX: A single page IO would be
inefficient for write". We managed to chain bios on the initial page
writeback, but it's still not an ideal design.
> > busy loop due to the ZRAM_WB bit, etc...
>
> Hmmm, this sounds like something swap cache can help with. It's the
> approach zswap writeback is taking - concurrent accessors can get the
> page in the swap cache, and OTOH zswap writeback backs off if it
> detects swap cache contention (since the page is probably being
> swapped in, freed, or written back by another thread).
>
> But I'm not sure how zram writeback works...
Yeah, any bit-lock design suffers a similar problem (like
SWAP_HAS_CACHE). I think we should just use the folio lock or folio
writeback flag in the long term; it works extremely well as generic
infrastructure (which I'm trying to push upstream), and we don't need
any extra locking, minimizing memory and design overhead.
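
As a side note, the difference can be illustrated with a tiny userspace
sketch (a pthread mutex standing in for the folio lock; the flag word and
all names below are invented): a bare status bit has no waiter list, so
callers can only poll it, while a real lock lets waiters sleep and be woken.

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_uint flags;		/* stand-in for a per-entry flag word */
#define ENTRY_UNDER_WRITEBACK	(1u << 0)

static pthread_mutex_t entry_lock = PTHREAD_MUTEX_INITIALIZER;

static void wait_bit_style(void)
{
	/* No wait queue: nothing to do but spin/yield until the bit clears. */
	while (atomic_load(&flags) & ENTRY_UNDER_WRITEBACK)
		sched_yield();
}

static void lock_style(void)
{
	/* A proper lock blocks the caller and wakes it when the owner unlocks. */
	pthread_mutex_lock(&entry_lock);
	/* ... access the entry ... */
	pthread_mutex_unlock(&entry_lock);
}

int main(void)
{
	wait_bit_style();	/* bit is clear, returns immediately */
	lock_style();
	printf("both paths completed\n");
	return 0;
}
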
> > - Lacking support for things like effective migration/compaction,
> > doable but looks horrible.
> >
> > So I definitely don't like this band-aid solution, but hey, it works.
> > I'm looking forward to replacing it with native upstream support.
> > That's one of the motivations behind the swap table series, which
> > I think would resolve the problems in an elegant and clean way
> > upstream. The initial tests do show it has a much lower overhead
> > and cleans up SWAP.
> >
> > But maybe this is kind of similar to the "less optimized form" you
> > are talking about? As I mentioned I'm already trying to upstream
> > some nice parts of it, and hopefully replace it with an upstream
> > solution finally.
> >
> > I can try upstreaming other parts of it if there are people really
> > interested, but I strongly recommend that we should focus on the
> > right approach instead and not waste time on that and spam the
> > mail list.
>
> I suppose a lot of this is specific to zram, but bits and pieces of it
> sound upstreamable to me :)
>
> We can wait for YoungJun's patches/RFC for further discussion, but perhaps:
>
> 1. A new cgroup interface to select swap backends for a cgroup.
>
> 2. Writeback/fallback order either designated by the above interface,
> or by the priority of the swap backends.
Fully agree, the final interface and features definitely need more
discussion and collab in upstream...
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2025-06-03 9:50 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-04-29 23:38 [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 01/18] swap: rearrange the swap header file Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 02/18] swapfile: rearrange functions Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 03/18] swapfile: rearrange freeing steps Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 04/18] mm: swap: add an abstract API for locking out swapoff Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 05/18] mm: swap: add a separate type for physical swap slots Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 06/18] mm: create scaffolds for the new virtual swap implementation Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 07/18] mm: swap: zswap: swap cache and zswap support for virtualized swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 08/18] mm: swap: allocate a virtual swap slot for each swapped out page Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 09/18] swap: implement the swap_cgroup API using virtual swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 10/18] swap: manage swap entry lifetime at the virtual swap layer Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 11/18] mm: swap: temporarily disable THP swapin and batched freeing swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 12/18] mm: swap: decouple virtual swap slot from backing store Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 13/18] zswap: do not start zswap shrinker if there is no physical swap slots Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 14/18] memcg: swap: only charge " Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 15/18] vswap: support THP swapin and batch free_swap_and_cache Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 16/18] swap: simplify swapoff using virtual swap Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 17/18] swapfile: move zeromap setup out of enable_swap_info Nhat Pham
2025-04-29 23:38 ` [RFC PATCH v2 18/18] swapfile: remove zeromap in virtual swap implementation Nhat Pham
2025-04-29 23:51 ` [RFC PATCH v2 00/18] Virtual Swap Space Nhat Pham
2025-05-30 6:47 ` YoungJun Park
2025-05-30 16:52 ` Nhat Pham
2025-05-30 16:54 ` Nhat Pham
2025-06-01 12:56 ` YoungJun Park
2025-06-01 16:14 ` Kairui Song
2025-06-02 15:17 ` YoungJun Park
2025-06-02 18:29 ` Nhat Pham
2025-06-03 9:50 ` Kairui Song
2025-06-01 21:08 ` Nhat Pham
2025-06-02 15:03 ` YoungJun Park
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).