* [PATCH 00/28] mm, swap: introduce swap table
@ 2025-05-14 20:17 Kairui Song
2025-05-14 20:17 ` [PATCH 01/28] mm, swap: don't scan every fragment cluster Kairui Song
` (27 more replies)
0 siblings, 28 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
This series implements the Swap Table idea proposed in the LSF/MM/BPF
topic "Integrate swap cache, swap maps with swap allocator" about one
month ago [1].
With this series, the swap subsystem gains roughly 20-30% in performance,
from basic sequential swap to heavy workloads, for both 4K and mTHP
folios. Idle memory usage is already much lower, average memory
consumption stays the same or will drop even further (with follow-up
work), and the better defined swap operations enable many more future
optimizations.
This series is stable and mergeable on both mm-unstable and mm-stable.
It is a long series, so it may be challenging to review, but it has been
holding up well under many stress tests.
You can also find the latest branch here:
https://github.com/ryncsn/linux/tree/kasong/devel/swap-table
With the swap table, a table entry becomes the fundamental, and the only
required, data structure for the swap cache, swap map and swap cgroup map.
This reduces memory usage, improves performance, and also provides more
flexibility and a better abstraction.
/*
 * Swap table entry type and bit layouts:
 *
 * NULL:    | ------------- 0 -------------|
 * Shadow:  | SWAP_COUNT |--- SHADOW_VAL ---|1|
 * Folio:   | SWAP_COUNT |------ PFN ------|10|
 * Pointer: |----------- Pointer ---------|100|
 */
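For illustration only, here is a minimal sketch of how such a tagged
entry could be packed and classified. The helper names, field widths and
tag checks below are assumptions made for this sketch, not the actual
definitions in mm/swap_table.h:

/* Sketch only: names and field widths are assumed, not from mm/swap_table.h */
#define SWP_TE_COUNT_BITS	8
#define SWP_TE_COUNT_SHIFT	(BITS_PER_LONG - SWP_TE_COUNT_BITS)
#define SWP_TE_VAL_MASK		(~(~0UL << SWP_TE_COUNT_SHIFT))

static inline bool swp_te_is_null(unsigned long te)
{
	return !te;			/* NULL: all bits zero */
}

static inline bool swp_te_is_folio(unsigned long te)
{
	return (te & 0x3) == 0x2;	/* Folio: lowest bits are "10" */
}

static inline unsigned long swp_te_pack_folio(struct folio *folio,
					      unsigned char count)
{
	/* Swap count in the top bits, PFN shifted above the "10" tag */
	return ((unsigned long)count << SWP_TE_COUNT_SHIFT) |
	       ((folio_pfn(folio) << 2) & SWP_TE_VAL_MASK) | 0x2;
}

static inline struct folio *swp_te_folio(unsigned long te)
{
	return pfn_folio((te & SWP_TE_VAL_MASK) >> 2);
}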
This series contains many cleanups and refactors made necessary by
historical issues with the SWAP subsystem, e.g. the fuzzy workflow and
definition of how swap entries are handled, and a lot of corner cases
mentioned in the LSF/MM/BPF talk.
There may be a temporary increase of complexity or memory consumption in
the middle of this series, but the end result is much simpler and more
sanitized. These patches depend on each other due to the current complex
swap design, which is why this has to be a long series.
This series cleans up most of those issues and improves the situation in
the following order:
- Simplification and optimizations (Patch 1 - 3)
- Tidy up swap info and cache lookup (Patch 4 - 6)
- Introduce basic swap table infrastructure (Patch 7 - 8)
- Remove swap cache bypassing for SWP_SYNCHRONOUS_IO, enabling mTHP
swap-in for more workloads (Patch 9 - 14).
- Simplify swap-in synchronization using the swap cache, eliminating
long tail latency issues and improving performance; swap-in can now be
synchronized with the folio lock (Patch 15 - 16).
- Make most swap operations folio based. We can now use folio based
helpers that keep swap entries stable under the folio lock, which also
makes more optimizations and sanity checks doable (Patch 17 - 18).
- Remove SWAP_HAS_CACHE (Patch 19 - 22).
- Completely rework swap counting using the swap table, and remove
COUNT_CONTINUED (Patch 23 - 27).
- Dynamic reclaim and allocation of the swap table (Patch 28).
And the performance looks great too:
vm-scalability usemem shows a great improvement:
Test using: usemem --init-time -O -y -x -n 31 1G (1G memcg, pmem as swap)
                Before:          After:
System time:    217.39s          161.59s        (-25.67%)
Throughput:     3933.58 MB/s     4975.55 MB/s   (+26.48%)
(Similar results with random usemem -R)
Build kernel with defconfig on tmpfs with ZRAM:
The results below show a test matrix using different memory cgroup
limits and job counts.
make -j<NR>| Total Sys Time (seconds) | Total Time (seconds)
(NR / Mem )| (Before / After / Delta) | (Before / After / Delta)
With 4k pages only:
6 / 192M | 5327 / 3915 / -26.5% | 1427 / 1141 / -20.0%
12 / 256M | 5373 / 4009 / -25.3% | 743 / 606 / -18.4%
24 / 384M | 6149 / 4523 / -26.4% | 438 / 353 / -19.4%
48 / 768M | 7285 / 4521 / -37.9% | 251 / 190 / -24.3%
With 64k mTHP:
24 / 512M | 4399 / 3328 / -24.3% | 345 / 289 / -16.2%
48 / 1G | 5072 / 3406 / -32.8% | 187 / 150 / -19.7%
Memory usage is also reduced. Although this series hasn't removed the
swap cgroup array yet, the peak usage per swap entry is already reduced
from 12 bytes to 10 bytes. And since the swap table is dynamically
allocated, idle memory usage is reduced by a lot.
Some other highlights and notes:
1. This series introduces a set of helpers, "folio_alloc_swap",
"folio_dup_swap", "folio_put_swap" and "folio_free_swap*", to make
most swap operations folio based, which should bring a clean boundary
between swap and the rest of mm. It also splits the hibernation swap
entry allocation out of the ordinary swap operations. (A rough usage
sketch follows after these notes.)
2. This series enables mTHP swap-in and readahead skipping for more
workloads, as it removes the swap cache bypassing path:
We currently do mTHP swap-in and readahead bypass only for
SWP_SYNCHRONOUS_IO devices, and only when the swap count of all related
entries equals one. This makes little sense; readahead and mTHP
behaviour should have nothing to do with the swap count. It is only a
defect of the current design that couples them with swap cache
bypassing.
This series removes that limitation while showing a major performance
improvement, and it should also reduce mTHP fragmentation.
3. By removing the old swap cache design, all swap cache is now
protected by fine grained cluster locks. This also removes the cluster
shuffle algorithm, which should improve the performance of SWAP on HDD
too (fixing [4]), and gets rid of the multiple swap address_space
instance design.
4. I dropped some doable future optimizations for now, e.g. the folio
based helpers will be an essential part of dropping the swap cgroup
control map, which will improve performance and reduce memory usage
even more. That can be done later, and more folio batched operations
can be built on top of this. So this series is not in its best possible
shape yet, but it already looks good enough.
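As referenced in note 1 above, here is a rough usage sketch of the folio
based helpers. Only folio_alloc_swap() and folio_free_swap() appear with
these signatures in the series; the single-argument forms of
folio_dup_swap() and folio_put_swap() used below are assumptions for
illustration, not the final API:

/* Swap-out side: allocate entries, then hand a reference to the owner. */
static int example_swapout(struct folio *folio)
{
	/* Allocate swap entries for the whole folio */
	if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
		return -ENOMEM;
	/* The owner (page table or shmem mapping) now holds a swap count */
	folio_dup_swap(folio);		/* assumed signature */
	/* ... unmap the folio and start writeback ... */
	return 0;
}

/* Swap-in side: once the folio is mapped back, drop the owner's count. */
static void example_swapin_finish(struct folio *folio)
{
	folio_put_swap(folio);		/* assumed signature */
	folio_free_swap(folio);		/* release cache and entries if unused */
}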
Future work items:
1. More tests, and maybe some of the patches need to be split into
smaller ones or need a few preparation series.
2. Integrate with Nhat Pham's virtual swap space [2]. While this series
improves performance and sanitizes the SWAP workflow, nothing changes
feature wise. The swap table idea is supposed to be able to handle
things like a virtual device in a cleaner way, with both lower overhead
and better flexibility, but more work is needed to figure out how to
implement it.
3. Some helpers from this series could be very helpful for future work,
e.g. the folio based swap helpers: locking a folio now stabilizes its
swap entries, which could also be used to stabilize the underlying
swap device's entries if a virtual device design is implemented, hence
simplifying the locking design. More entry types could also be added
for things like the zero map or shmem.
4. The unified swap-in path now enables mTHP swap-in for entries with
swap count > 1. This also makes unifying the readahead of shmem / anon
doable (as demonstrated a year ago [3]; that work conflicted with the
standalone mTHP swap-in path, but the paths are now unified).
A readahead based mTHP swap-in could also be implemented on top of this.
This needs more discussion.
Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://lore.kernel.org/lkml/20250407234223.1059191-1-nphamcs@gmail.com/ [2]
Link: https://lore.kernel.org/all/20240129175423.1987-1-ryncsn@gmail.com/ [3]
Link: https://lore.kernel.org/linux-mm/202504241621.f27743ec-lkp@intel.com/ [4]
Kairui Song (27):
mm, swap: don't scan every fragment cluster
mm, swap: consolidate the helper for mincore
mm, swap: split readahead update out of swap cache lookup
mm, swap: sanitize swap cache lookup convention
mm, swap: rearrange swap cluster definition and helpers
mm, swap: tidy up swap device and cluster info helpers
mm, swap: use swap table for the swap cache and switch API
mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc
mm, swap: add a swap helper for bypassing only read ahead
mm, swap: clean up and consolidate helper for mTHP swapin check
mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
mm/shmem, swap: avoid redundant Xarray lookup during swapin
mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
mm, swap: split locked entry freeing into a standalone helper
mm, swap: use swap cache as the swap in synchronize layer
mm, swap: sanitize swap entry management workflow
mm, swap: rename and introduce folio_free_swap_cache
mm, swap: clean up and improve swap entries batch freeing
mm, swap: check swap table directly for checking cache
mm, swap: add folio to swap cache directly on allocation
mm, swap: drop the SWAP_HAS_CACHE flag
mm, swap: remove no longer needed _swap_info_get
mm, swap: implement helpers for reserving data in swap table
mm/workingset: leave highest 8 bits empty for anon shadow
mm, swap: minor clean up for swapon
mm, swap: use swap table to track swap count
mm, swap: implement dynamic allocation of swap table
Nhat Pham (1):
mm/shmem, swap: remove SWAP_MAP_SHMEM
arch/s390/mm/pgtable.c | 2 +-
include/linux/swap.h | 119 +--
kernel/power/swap.c | 8 +-
mm/filemap.c | 20 +-
mm/huge_memory.c | 20 +-
mm/madvise.c | 2 +-
mm/memory-failure.c | 2 +-
mm/memory.c | 384 ++++-----
mm/migrate.c | 28 +-
mm/mincore.c | 49 +-
mm/page_io.c | 12 +-
mm/rmap.c | 7 +-
mm/shmem.c | 204 ++---
mm/swap.h | 316 ++++++--
mm/swap_state.c | 646 ++++++++-------
mm/swap_table.h | 231 ++++++
mm/swapfile.c | 1708 +++++++++++++++++-----------------------
mm/userfaultfd.c | 9 +-
mm/vmscan.c | 22 +-
mm/workingset.c | 39 +-
mm/zswap.c | 13 +-
21 files changed, 1981 insertions(+), 1860 deletions(-)
create mode 100644 mm/swap_table.h
--
2.49.0
* [PATCH 01/28] mm, swap: don't scan every fragment cluster
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 02/28] mm, swap: consolidate the helper for mincore Kairui Song
` (26 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Fragment clusters were already failing high order allocation; the reason
we still scan them is that a swap entry may get freed without releasing
its cache, so a swap map entry can end up in a HAS_CACHE-only state, and
the cluster won't be moved back to the non-full or free cluster list.
The chance of this is low, and it only happens when device usage is low
(!vm_swap_full()). It is especially unhelpful for SWP_SYNCHRONOUS_IO
devices, as the swap cache almost always gets freed when the count
reaches zero for these devices.
Besides, high order allocation failure isn't a critical issue, while
having this scan actually slows down mTHP allocation by a lot when the
fragment cluster list is long.
The HAS_CACHE issue will be fixed in a proper way later, so drop this
fragment cluster scanning design.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 -
mm/swapfile.c | 32 +++++++++-----------------------
2 files changed, 9 insertions(+), 24 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..817e427a47d2 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,7 +310,6 @@ struct swap_info_struct {
/* list of cluster that contains at least one free slot */
struct list_head frag_clusters[SWAP_NR_ORDERS];
/* list of cluster that are fragmented or contented */
- atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
unsigned int pages; /* total of usable pages of swap */
atomic_long_t inuse_pages; /* number of those currently in use */
struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 026090bf3efe..34188714479f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
else
list_move_tail(&ci->list, list);
spin_unlock(&si->lock);
-
- if (ci->flags == CLUSTER_FLAG_FRAG)
- atomic_long_dec(&si->frag_cluster_nr[ci->order]);
- else if (new_flags == CLUSTER_FLAG_FRAG)
- atomic_long_inc(&si->frag_cluster_nr[ci->order]);
ci->flags = new_flags;
}
@@ -926,32 +921,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
swap_reclaim_full_clusters(si, false);
if (order < PMD_ORDER) {
- unsigned int frags = 0, frags_existing;
-
while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
order, usage);
if (found)
goto done;
- /* Clusters failed to allocate are moved to frag_clusters */
- frags++;
}
- frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
- while (frags < frags_existing &&
- (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
- atomic_long_dec(&si->frag_cluster_nr[order]);
- /*
- * Rotate the frag list to iterate, they were all
- * failing high order allocation or moved here due to
- * per-CPU usage, but they could contain newly released
- * reclaimable (eg. lazy-freed swap cache) slots.
- */
+ /*
+ * Scan only one fragment cluster is good enough. Order 0
+ * allocation will surely success, and mTHP allocation failure
+ * is not critical, and scanning one cluster still keeps the
+ * list rotated and scanned (for reclaiming HAS_CACHE).
+ */
+ ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
+ if (ci) {
found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
- order, usage);
+ order, usage);
if (found)
goto done;
- frags++;
}
}
@@ -973,7 +961,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* allocation, but reclaim may drop si->lock and race with another user.
*/
while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
- atomic_long_dec(&si->frag_cluster_nr[o]);
found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
0, usage);
if (found)
@@ -3234,7 +3221,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
for (i = 0; i < SWAP_NR_ORDERS; i++) {
INIT_LIST_HEAD(&si->nonfull_clusters[i]);
INIT_LIST_HEAD(&si->frag_clusters[i]);
- atomic_long_set(&si->frag_cluster_nr[i], 0);
}
/*
--
2.49.0
* [PATCH 02/28] mm, swap: consolidate the helper for mincore
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
2025-05-14 20:17 ` [PATCH 01/28] mm, swap: don't scan every fragment cluster Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 03/28] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
` (25 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The mincore related logic is not used by anyone else, so consolidate it
and move it into mincore.c to simplify the code.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/mincore.c | 50 +++++++++++++++++++++++++++++++++++++++----------
mm/swap.h | 10 ----------
mm/swap_state.c | 38 -------------------------------------
3 files changed, 40 insertions(+), 58 deletions(-)
diff --git a/mm/mincore.c b/mm/mincore.c
index 42d6c9c8da86..7ee88113d44c 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -44,6 +44,36 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
return 0;
}
+static unsigned char mincore_swap(swp_entry_t entry)
+{
+ struct swap_info_struct *si;
+ struct folio *folio = NULL;
+ unsigned char present = 0;
+
+ /* There might be swapin error entries in shmem mapping. */
+ if (non_swap_entry(entry))
+ return 0;
+
+ if (!IS_ENABLED(CONFIG_SWAP)) {
+ WARN_ON_ONCE(1);
+ return 1;
+ }
+
+ /* Prevent swap device to being swapoff under us */
+ si = get_swap_device(entry);
+ if (si) {
+ folio = filemap_get_folio(swap_address_space(entry),
+ swap_cache_index(entry));
+ put_swap_device(si);
+ }
+ if (folio) {
+ present = folio_test_uptodate(folio);
+ folio_put(folio);
+ }
+
+ return present;
+}
+
/*
* Later we can get more picky about what "in core" means precisely.
* For now, simply check to see if the page is in the page cache,
@@ -61,8 +91,15 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t index)
* any other file mapping (ie. marked !present and faulted in with
* tmpfs's .fault). So swapped out tmpfs mappings are tested here.
*/
- folio = filemap_get_incore_folio(mapping, index);
- if (!IS_ERR(folio)) {
+ folio = filemap_get_entry(mapping, index);
+ if (folio) {
+ if (xa_is_value(folio)) {
+ if (shmem_mapping(mapping))
+ return mincore_swap(radix_to_swp_entry(folio));
+ else
+ return 0;
+ }
+
present = folio_test_uptodate(folio);
folio_put(folio);
}
@@ -141,7 +178,6 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
vec[i] = 1;
} else { /* pte is a swap entry */
swp_entry_t entry = pte_to_swp_entry(pte);
-
if (non_swap_entry(entry)) {
/*
* migration or hwpoison entries are always
@@ -149,13 +185,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
*/
*vec = 1;
} else {
-#ifdef CONFIG_SWAP
- *vec = mincore_page(swap_address_space(entry),
- swap_cache_index(entry));
-#else
- WARN_ON(1);
- *vec = 1;
-#endif
+ *vec = mincore_swap(entry);
}
}
vec += step;
diff --git a/mm/swap.h b/mm/swap.h
index 521bf510ec75..4f85195ab83d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -62,9 +62,6 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin,
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *swap_cache_get_folio(swp_entry_t entry,
struct vm_area_struct *vma, unsigned long addr);
-struct folio *filemap_get_incore_folio(struct address_space *mapping,
- pgoff_t index);
-
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
@@ -156,13 +153,6 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
return NULL;
}
-static inline
-struct folio *filemap_get_incore_folio(struct address_space *mapping,
- pgoff_t index)
-{
- return filemap_get_folio(mapping, index);
-}
-
static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
{
return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ac4e0994931c..4117ea4e7afc 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -324,44 +324,6 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
return folio;
}
-/**
- * filemap_get_incore_folio - Find and get a folio from the page or swap caches.
- * @mapping: The address_space to search.
- * @index: The page cache index.
- *
- * This differs from filemap_get_folio() in that it will also look for the
- * folio in the swap cache.
- *
- * Return: The found folio or %NULL.
- */
-struct folio *filemap_get_incore_folio(struct address_space *mapping,
- pgoff_t index)
-{
- swp_entry_t swp;
- struct swap_info_struct *si;
- struct folio *folio = filemap_get_entry(mapping, index);
-
- if (!folio)
- return ERR_PTR(-ENOENT);
- if (!xa_is_value(folio))
- return folio;
- if (!shmem_mapping(mapping))
- return ERR_PTR(-ENOENT);
-
- swp = radix_to_swp_entry(folio);
- /* There might be swapin error entries in shmem mapping. */
- if (non_swap_entry(swp))
- return ERR_PTR(-ENOENT);
- /* Prevent swapoff from happening to us */
- si = get_swap_device(swp);
- if (!si)
- return ERR_PTR(-ENOENT);
- index = swap_cache_index(swp);
- folio = filemap_get_folio(swap_address_space(swp), index);
- put_swap_device(si);
- return folio;
-}
-
struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists)
--
2.49.0
* [PATCH 03/28] mm/shmem, swap: remove SWAP_MAP_SHMEM
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
2025-05-14 20:17 ` [PATCH 01/28] mm, swap: don't scan every fragment cluster Kairui Song
2025-05-14 20:17 ` [PATCH 02/28] mm, swap: consolidate the helper for mincore Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 04/28] mm, swap: split readahead update out of swap cache lookup Kairui Song
` (24 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
From: Nhat Pham <nphamcs@gmail.com>
The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a
("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry
belongs to shmem during swapoff.
However, swapoff has since been rewritten in the commit b56a2d8af914
("mm: rid swapoff of quadratic complexity"). Now having swap count ==
SWAP_MAP_SHMEM value is basically the same as having swap count == 1,
and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only
difference of note is that swap_shmem_alloc() does not check for
-ENOMEM returned from __swap_duplicate(), but it is OK because shmem
never re-duplicates any swap entry it owns. This will still be safe if we
use (batched) swap_duplicate() instead.
This commit adds swap_duplicate_nr(), the batched variant of
swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the
associated swap_shmem_alloc() helper to simplify the state machine (both
mentally and in terms of actual code). We will also have an extra
state/special value that can be repurposed (for swap entries that never
gets re-duplicated).
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 15 +++++++--------
mm/shmem.c | 2 +-
mm/swapfile.c | 42 +++++++++++++++++-------------------------
3 files changed, 25 insertions(+), 34 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 817e427a47d2..0e52ac4e817d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -230,7 +230,6 @@ enum {
/* Special value in first swap_map */
#define SWAP_MAP_MAX 0x3e /* Max count */
#define SWAP_MAP_BAD 0x3f /* Note page is bad */
-#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */
/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX 0x7f /* Max count */
@@ -477,8 +476,7 @@ bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t, int);
-extern int swap_duplicate(swp_entry_t);
+extern int swap_duplicate_nr(swp_entry_t entry, int nr);
extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -541,11 +539,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
-{
-}
-
-static inline int swap_duplicate(swp_entry_t swp)
+static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
{
return 0;
}
@@ -596,6 +590,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
}
#endif /* CONFIG_SWAP */
+static inline int swap_duplicate(swp_entry_t entry)
+{
+ return swap_duplicate_nr(entry, 1);
+}
+
static inline void free_swap_and_cache(swp_entry_t entry)
{
free_swap_and_cache_nr(entry, 1);
diff --git a/mm/shmem.c b/mm/shmem.c
index 99327c30507c..972bd0eca439 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1641,7 +1641,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
shmem_recalc_inode(inode, 0, nr_pages);
- swap_shmem_alloc(folio->swap, nr_pages);
+ swap_duplicate_nr(folio->swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 34188714479f..6b115149b845 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -192,7 +192,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
unsigned char *map_end = map + nr_pages;
unsigned char count = *map;
- if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
+ if (swap_count(count) != 1)
return false;
while (++map < map_end) {
@@ -1359,12 +1359,6 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
if (usage == SWAP_HAS_CACHE) {
VM_BUG_ON(!has_cache);
has_cache = 0;
- } else if (count == SWAP_MAP_SHMEM) {
- /*
- * Or we could insist on shmem.c using a special
- * swap_shmem_free() and free_shmem_swap_and_cache()...
- */
- count = 0;
} else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
if (count == COUNT_CONTINUED) {
if (swap_count_continued(si, offset, count))
@@ -1478,7 +1472,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
if (nr <= 1)
goto fallback;
count = swap_count(data_race(si->swap_map[offset]));
- if (count != 1 && count != SWAP_MAP_SHMEM)
+ if (count != 1)
goto fallback;
ci = lock_cluster(si, offset);
@@ -1533,12 +1527,10 @@ static bool swap_entries_put_map_nr(struct swap_info_struct *si,
/*
* Check if it's the last ref of swap entry in the freeing path.
- * Qualified vlaue includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM.
*/
static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
{
- return (count == SWAP_HAS_CACHE) || (count == 1) ||
- (count == SWAP_MAP_SHMEM);
+ return (count == SWAP_HAS_CACHE) || (count == 1);
}
/*
@@ -3536,7 +3528,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
offset = swp_offset(entry);
VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- VM_WARN_ON(usage == 1 && nr > 1);
ci = lock_cluster(si, offset);
err = 0;
@@ -3596,27 +3587,28 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
return err;
}
-/*
- * Help swapoff by noting that swap entry belongs to shmem/tmpfs
- * (in which case its reference count is never incremented).
- */
-void swap_shmem_alloc(swp_entry_t entry, int nr)
-{
- __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
-}
-
-/*
- * Increase reference count of swap entry by 1.
+/**
+ * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
+ * by 1.
+ *
+ * @entry: first swap entry from which we want to increase the refcount.
+ * @nr: Number of entries in range.
+ *
* Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
* but could not be atomically allocated. Returns 0, just as if it succeeded,
* if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
* might occur if a page table entry has got corrupted.
+ *
+ * Note that we are currently not handling the case where nr > 1 and we need to
+ * add swap count continuation. This is OK, because no such user exists - shmem
+ * is the only user that can pass nr > 1, and it never re-duplicates any swap
+ * entry it owns.
*/
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate_nr(swp_entry_t entry, int nr)
{
int err = 0;
- while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
+ while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
--
2.49.0
* [PATCH 04/28] mm, swap: split readahead update out of swap cache lookup
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (2 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 03/28] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 05/28] mm, swap: sanitize swap cache lookup convention Kairui Song
` (23 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Decouple the readahead update from the swap cache lookup. No feature
change.
After this, swap_cache_get_folio is the only entry point for getting
folios from the swap cache, and only two of its callers want to update
readahead statistics.
Only three special cases access the swap cache space directly now:
huge memory splitting, migration and shmem replacing; they modify the
Xarray directly. A following commit will wrap their accesses to the
swap cache with special helpers too.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 6 ++-
mm/mincore.c | 3 +-
mm/shmem.c | 5 ++-
mm/swap.h | 13 +++++--
mm/swap_state.c | 99 +++++++++++++++++++++++-------------------------
mm/swapfile.c | 11 +++---
mm/userfaultfd.c | 5 +--
7 files changed, 72 insertions(+), 70 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 5cb48f262ab0..18b5a77a0a4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4567,9 +4567,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(!si))
goto out;
- folio = swap_cache_get_folio(entry, vma, vmf->address);
- if (folio)
+ folio = swap_cache_get_folio(entry);
+ if (folio) {
+ swap_update_readahead(folio, vma, vmf->address);
page = folio_file_page(folio, swp_offset(entry));
+ }
swapcache = folio;
if (!folio) {
diff --git a/mm/mincore.c b/mm/mincore.c
index 7ee88113d44c..a57a9ee9e93d 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -62,8 +62,7 @@ static unsigned char mincore_swap(swp_entry_t entry)
/* Prevent swap device to being swapoff under us */
si = get_swap_device(entry);
if (si) {
- folio = filemap_get_folio(swap_address_space(entry),
- swap_cache_index(entry));
+ folio = swap_cache_get_folio(entry);
put_swap_device(si);
}
if (folio) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 972bd0eca439..01f29cb31c7a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2259,7 +2259,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
/* Look it up and read it in.. */
- folio = swap_cache_get_folio(swap, NULL, 0);
+ folio = swap_cache_get_folio(swap);
+ if (folio)
+ swap_update_readahead(folio, NULL, 0);
order = xa_get_order(&mapping->i_pages, index);
if (!folio) {
bool fallback_order0 = false;
@@ -2350,7 +2352,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
}
-
alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
diff --git a/mm/swap.h b/mm/swap.h
index 4f85195ab83d..e83109ad1456 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -60,8 +60,7 @@ void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr);
+struct folio *swap_cache_get_folio(swp_entry_t entry);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
@@ -72,6 +71,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
+void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long addr);
static inline unsigned int folio_swap_flags(struct folio *folio)
{
@@ -138,6 +139,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}
+static inline void swap_update_readahead(struct folio *folio,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+}
+
static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
{
return 0;
@@ -147,8 +153,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
{
}
-static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr)
+static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 4117ea4e7afc..bca201100138 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -166,6 +166,21 @@ void __delete_from_swap_cache(struct folio *folio,
__lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
}
+/*
+ * Lookup a swap entry in the swap cache. A found folio will be returned
+ * unlocked and with its refcount incremented.
+ *
+ * Caller must hold a reference on the swap device.
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry)
+{
+ struct folio *folio = filemap_get_folio(swap_address_space(entry),
+ swap_cache_index(entry));
+ if (!IS_ERR(folio))
+ return folio;
+ return NULL;
+}
+
/*
* This must be called only on folios that have
* been verified to be in the swap cache and locked.
@@ -274,54 +289,40 @@ static inline bool swap_use_vma_readahead(void)
}
/*
- * Lookup a swap entry in the swap cache. A found folio will be returned
- * unlocked and with its refcount incremented - we rely on the kernel
- * lock getting page table operations atomic even if we drop the folio
- * lock before returning.
- *
- * Caller must lock the swap device or hold a reference to keep it valid.
+ * Update the readahead statistics of a vma or globally.
*/
-struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr)
+void swap_update_readahead(struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned long addr)
{
- struct folio *folio;
-
- folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
- if (!IS_ERR(folio)) {
- bool vma_ra = swap_use_vma_readahead();
- bool readahead;
+ bool readahead, vma_ra = swap_use_vma_readahead();
- /*
- * At the moment, we don't support PG_readahead for anon THP
- * so let's bail out rather than confusing the readahead stat.
- */
- if (unlikely(folio_test_large(folio)))
- return folio;
-
- readahead = folio_test_clear_readahead(folio);
- if (vma && vma_ra) {
- unsigned long ra_val;
- int win, hits;
-
- ra_val = GET_SWAP_RA_VAL(vma);
- win = SWAP_RA_WIN(ra_val);
- hits = SWAP_RA_HITS(ra_val);
- if (readahead)
- hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
- atomic_long_set(&vma->swap_readahead_info,
- SWAP_RA_VAL(addr, win, hits));
- }
-
- if (readahead) {
- count_vm_event(SWAP_RA_HIT);
- if (!vma || !vma_ra)
- atomic_inc(&swapin_readahead_hits);
- }
- } else {
- folio = NULL;
+ /*
+ * At the moment, we don't support PG_readahead for anon THP
+ * so let's bail out rather than confusing the readahead stat.
+ */
+ if (unlikely(folio_test_large(folio)))
+ return;
+
+ readahead = folio_test_clear_readahead(folio);
+ if (vma && vma_ra) {
+ unsigned long ra_val;
+ int win, hits;
+
+ ra_val = GET_SWAP_RA_VAL(vma);
+ win = SWAP_RA_WIN(ra_val);
+ hits = SWAP_RA_HITS(ra_val);
+ if (readahead)
+ hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+ atomic_long_set(&vma->swap_readahead_info,
+ SWAP_RA_VAL(addr, win, hits));
}
- return folio;
+ if (readahead) {
+ count_vm_event(SWAP_RA_HIT);
+ if (!vma || !vma_ra)
+ atomic_inc(&swapin_readahead_hits);
+ }
}
struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
@@ -337,14 +338,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*new_page_allocated = false;
for (;;) {
int err;
- /*
- * First check the swap cache. Since this is normally
- * called after swap_cache_get_folio() failed, re-calling
- * that would confuse statistics.
- */
- folio = filemap_get_folio(swap_address_space(entry),
- swap_cache_index(entry));
- if (!IS_ERR(folio))
+
+ /* Check the swap cache in case the folio is already there */
+ folio = swap_cache_get_folio(entry);
+ if (folio)
goto got_folio;
/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6b115149b845..29e918102355 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
swp_entry_t entry = swp_entry(si->type, offset);
- struct address_space *address_space = swap_address_space(entry);
struct swap_cluster_info *ci;
struct folio *folio;
int ret, nr_pages;
bool need_reclaim;
again:
- folio = filemap_get_folio(address_space, swap_cache_index(entry));
- if (IS_ERR(folio))
+ folio = swap_cache_get_folio(entry);
+ if (!folio)
return 0;
nr_pages = folio_nr_pages(folio);
@@ -2098,7 +2097,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
pte_unmap(pte);
pte = NULL;
- folio = swap_cache_get_folio(entry, vma, addr);
+ folio = swap_cache_get_folio(entry);
if (!folio) {
struct vm_fault vmf = {
.vma = vma,
@@ -2324,8 +2323,8 @@ static int try_to_unuse(unsigned int type)
(i = find_next_to_unuse(si, i)) != 0) {
entry = swp_entry(type, i);
- folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
- if (IS_ERR(folio))
+ folio = swap_cache_get_folio(entry);
+ if (!folio)
continue;
/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..e5a0db7f3331 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1389,9 +1389,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
* separately to allow proper handling.
*/
if (!src_folio)
- folio = filemap_get_folio(swap_address_space(entry),
- swap_cache_index(entry));
- if (!IS_ERR_OR_NULL(folio)) {
+ folio = swap_cache_get_folio(entry);
+ if (folio) {
if (folio_test_large(folio)) {
err = -EBUSY;
folio_put(folio);
--
2.49.0
* [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (3 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 04/28] mm, swap: split readahead update out of swap cache lookup Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-19 4:38 ` Barry Song
2025-05-14 20:17 ` [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers Kairui Song
` (22 subsequent siblings)
27 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Swap cache lookup is lockless, so the returned folio could be
invalidated at any time before it is locked. The caller therefore always
has to lock and check the folio before using it.
Introduce a helper for checking a swap cache folio, document this
convention, and avoid touching the folio until it has been verified.
Also update all current users to follow this convention.
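For reference, a minimal sketch of the resulting caller convention (the
wrapper function name here is made up for illustration; the helpers are
the ones used in this series):

static struct folio *example_get_locked_swap_folio(swp_entry_t entry)
{
	struct folio *folio = swap_cache_get_folio(entry);

	if (!folio)
		return NULL;
	/* The lookup was lockless, the folio may be invalidated any time */
	folio_lock(folio);
	if (!folio_swap_contains(folio, entry)) {
		/* Invalidated or reused before we got the lock, bail out */
		folio_unlock(folio);
		folio_put(folio);
		return NULL;
	}
	/* Locked and verified: the folio still backs 'entry' */
	return folio;
}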
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 31 ++++++++++++++-----------------
mm/shmem.c | 4 ++--
mm/swap.h | 21 +++++++++++++++++++++
mm/swap_state.c | 8 ++++++--
mm/swapfile.c | 10 ++++++++--
mm/userfaultfd.c | 4 ++++
6 files changed, 55 insertions(+), 23 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 18b5a77a0a4b..254be0e88801 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4568,12 +4568,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
folio = swap_cache_get_folio(entry);
- if (folio) {
- swap_update_readahead(folio, vma, vmf->address);
- page = folio_file_page(folio, swp_offset(entry));
- }
swapcache = folio;
-
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
@@ -4642,20 +4637,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
- page = folio_file_page(folio, swp_offset(entry));
- } else if (PageHWPoison(page)) {
- /*
- * hwpoisoned dirty swapcache pages are kept for killing
- * owner processes (which may be unknown at hwpoison time)
- */
- ret = VM_FAULT_HWPOISON;
- goto out_release;
}
ret |= folio_lock_or_retry(folio, vmf);
if (ret & VM_FAULT_RETRY)
goto out_release;
+ page = folio_file_page(folio, swp_offset(entry));
if (swapcache) {
/*
* Make sure folio_free_swap() or swapoff did not release the
@@ -4664,10 +4652,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* swapcache, we need to check that the page's swap has not
* changed.
*/
- if (unlikely(!folio_test_swapcache(folio) ||
- page_swap_entry(page).val != entry.val))
+ if (!folio_swap_contains(folio, entry))
goto out_page;
+ if (PageHWPoison(page)) {
+ /*
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
+ */
+ ret = VM_FAULT_HWPOISON;
+ goto out_page;
+ }
+
+ swap_update_readahead(folio, vma, vmf->address);
+
/*
* KSM sometimes has to copy on read faults, for example, if
* page->index of !PageKSM() pages would be nonlinear inside the
@@ -4682,8 +4680,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
ret = VM_FAULT_HWPOISON;
folio = swapcache;
goto out_page;
- }
- if (folio != swapcache)
+ } else if (folio != swapcache)
page = folio_page(folio, 0);
/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 01f29cb31c7a..43d9e3bf16f4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2260,8 +2260,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap);
- if (folio)
- swap_update_readahead(folio, NULL, 0);
order = xa_get_order(&mapping->i_pages, index);
if (!folio) {
bool fallback_order0 = false;
@@ -2362,6 +2360,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
error = -EEXIST;
goto unlock;
}
+ if (!skip_swapcache)
+ swap_update_readahead(folio, NULL, 0);
if (!folio_test_uptodate(folio)) {
error = -EIO;
goto failed;
diff --git a/mm/swap.h b/mm/swap.h
index e83109ad1456..34af06bf6fa4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -50,6 +50,22 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
}
+/*
+ * Check if a folio still contains a swap entry, must be called after a
+ * swap cache lookup as the folio might have been invalidated while
+ * it's unlocked.
+ */
+static inline bool folio_swap_contains(struct folio *folio, swp_entry_t entry)
+{
+ pgoff_t index = swp_offset(entry);
+ VM_WARN_ON_ONCE(!folio_test_locked(folio));
+ if (unlikely(!folio_test_swapcache(folio)))
+ return false;
+ if (unlikely(swp_type(entry) != swp_type(folio->swap)))
+ return false;
+ return (index - swp_offset(folio->swap)) < folio_nr_pages(folio);
+}
+
void show_swap_cache_info(void);
void *get_shadow_from_swap_cache(swp_entry_t entry);
int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
@@ -123,6 +139,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
return 0;
}
+static inline bool folio_swap_contains(struct folio *folio, swp_entry_t entry)
+{
+ return false;
+}
+
static inline void show_swap_cache_info(void)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bca201100138..07c41676486a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -170,7 +170,8 @@ void __delete_from_swap_cache(struct folio *folio,
* Lookup a swap entry in the swap cache. A found folio will be returned
* unlocked and with its refcount incremented.
*
- * Caller must hold a reference on the swap device.
+ * Caller must hold a reference of the swap device, and check if the
+ * returned folio is still valid after locking it (e.g. folio_swap_contains).
*/
struct folio *swap_cache_get_folio(swp_entry_t entry)
{
@@ -339,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
for (;;) {
int err;
- /* Check the swap cache in case the folio is already there */
+ /*
+ * Check the swap cache first, if a cached folio is found,
+ * return it unlocked. The caller will lock and check it.
+ */
folio = swap_cache_get_folio(entry);
if (folio)
goto got_folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 29e918102355..aa031fd27847 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* Offset could point to the middle of a large folio, or folio
* may no longer point to the expected offset before it's locked.
*/
- entry = folio->swap;
- if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
+ if (!folio_swap_contains(folio, entry)) {
folio_unlock(folio);
folio_put(folio);
goto again;
}
+ entry = folio->swap;
offset = swp_offset(entry);
need_reclaim = ((flags & TTRS_ANYWAY) ||
@@ -2117,6 +2117,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
folio_lock(folio);
+ if (!folio_swap_contains(folio, entry)) {
+ folio_unlock(folio);
+ folio_put(folio);
+ continue;
+ }
+
folio_wait_writeback(folio);
ret = unuse_pte(vma, pmd, addr, entry, folio);
if (ret < 0) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e5a0db7f3331..5b4f01aecf35 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
goto retry;
}
}
+ if (!folio_swap_contains(src_folio, entry)) {
+ err = -EBUSY;
+ goto out;
+ }
err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
dst_ptl, src_ptl, src_folio);
--
2.49.0
* [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (4 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 05/28] mm, swap: sanitize swap cache lookup convention Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-19 6:26 ` Barry Song
2025-05-14 20:17 ` [PATCH 07/28] mm, swap: tidy up swap device and cluster info helpers Kairui Song
` (21 subsequent siblings)
27 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
No feature change: move all cluster related definitions and helpers to
mm/swap.h, and tidy up and add a "swap_" prefix to all cluster
lock/unlock helpers, so they can be used more easily outside of the swap
files.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 34 ---------------
mm/swap.h | 62 ++++++++++++++++++++++++++
mm/swapfile.c | 102 +++++++++++++------------------------------
3 files changed, 92 insertions(+), 106 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0e52ac4e817d..1e7d9d55c39a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -234,40 +234,6 @@ enum {
/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX 0x7f /* Max count */
-/*
- * We use this to track usage of a cluster. A cluster is a block of swap disk
- * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
- * free clusters are organized into a list. We fetch an entry from the list to
- * get a free cluster.
- *
- * The flags field determines if a cluster is free. This is
- * protected by cluster lock.
- */
-struct swap_cluster_info {
- spinlock_t lock; /*
- * Protect swap_cluster_info fields
- * other than list, and swap_info_struct->swap_map
- * elements corresponding to the swap cluster.
- */
- u16 count;
- u8 flags;
- u8 order;
- struct list_head list;
-};
-
-/* All on-list cluster must have a non-zero flag. */
-enum swap_cluster_flags {
- CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
- CLUSTER_FLAG_FREE,
- CLUSTER_FLAG_NONFULL,
- CLUSTER_FLAG_FRAG,
- /* Clusters with flags above are allocatable */
- CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
- CLUSTER_FLAG_FULL,
- CLUSTER_FLAG_DISCARD,
- CLUSTER_FLAG_MAX,
-};
-
/*
* The first page in the swap file is the swap header, which is always marked
* bad to prevent it from being allocated as an entry. This also prevents the
diff --git a/mm/swap.h b/mm/swap.h
index 34af06bf6fa4..38d37d241f1c 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -5,10 +5,72 @@
struct mempolicy;
extern int page_cluster;
+#ifdef CONFIG_THP_SWAP
+#define SWAPFILE_CLUSTER HPAGE_PMD_NR
+#define swap_entry_order(order) (order)
+#else
+#define SWAPFILE_CLUSTER 256
+#define swap_entry_order(order) 0
+#endif
+
+/*
+ * We use this to track usage of a cluster. A cluster is a block of swap disk
+ * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
+ * free clusters are organized into a list. We fetch an entry from the list to
+ * get a free cluster.
+ *
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
+ */
+struct swap_cluster_info {
+ spinlock_t lock; /*
+ * Protect swap_cluster_info fields
+ * other than list, and swap_info_struct->swap_map
+ * elements corresponding to the swap cluster.
+ */
+ u16 count;
+ u8 flags;
+ u8 order;
+ struct list_head list;
+};
+
+/* All on-list cluster must have a non-zero flag. */
+enum swap_cluster_flags {
+ CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
+ CLUSTER_FLAG_FREE,
+ CLUSTER_FLAG_NONFULL,
+ CLUSTER_FLAG_FRAG,
+ /* Clusters with flags above are allocatable */
+ CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
+ CLUSTER_FLAG_FULL,
+ CLUSTER_FLAG_DISCARD,
+ CLUSTER_FLAG_MAX,
+};
+
#ifdef CONFIG_SWAP
#include <linux/swapops.h> /* for swp_offset */
#include <linux/blk_types.h> /* for bio_end_io_t */
+static inline struct swap_cluster_info *swp_offset_cluster(
+ struct swap_info_struct *si, pgoff_t offset)
+{
+ return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+}
+
+static inline struct swap_cluster_info *swap_lock_cluster(
+ struct swap_info_struct *si,
+ unsigned long offset)
+{
+ struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+ spin_lock(&ci->lock);
+ return ci;
+}
+
+static inline void swap_unlock_cluster(struct swap_cluster_info *ci)
+{
+ spin_unlock(&ci->lock);
+}
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index aa031fd27847..ba3fd99eb5fa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,9 +58,6 @@ static void swap_entries_free(struct swap_info_struct *si,
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
- unsigned long offset);
-static inline void unlock_cluster(struct swap_cluster_info *ci);
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -259,9 +256,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* swap_map is HAS_CACHE only, which means the slots have no page table
* reference or pending writeback, and can't be allocated to others.
*/
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
need_reclaim = swap_only_has_cache(si, offset, nr_pages);
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
if (!need_reclaim)
goto out_unlock;
@@ -386,21 +383,6 @@ static void discard_swap_cluster(struct swap_info_struct *si,
}
}
-#ifdef CONFIG_THP_SWAP
-#define SWAPFILE_CLUSTER HPAGE_PMD_NR
-
-#define swap_entry_order(order) (order)
-#else
-#define SWAPFILE_CLUSTER 256
-
-/*
- * Define swap_entry_order() as constant to let compiler to optimize
- * out some code if !CONFIG_THP_SWAP
- */
-#define swap_entry_order(order) 0
-#endif
-#define LATENCY_LIMIT 256
-
static inline bool cluster_is_empty(struct swap_cluster_info *info)
{
return info->count == 0;
@@ -426,34 +408,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
return ci - si->cluster_info;
}
-static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
- unsigned long offset)
-{
- return &si->cluster_info[offset / SWAPFILE_CLUSTER];
-}
-
static inline unsigned int cluster_offset(struct swap_info_struct *si,
struct swap_cluster_info *ci)
{
return cluster_index(si, ci) * SWAPFILE_CLUSTER;
}
-static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
- unsigned long offset)
-{
- struct swap_cluster_info *ci;
-
- ci = offset_to_cluster(si, offset);
- spin_lock(&ci->lock);
-
- return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
- spin_unlock(&ci->lock);
-}
-
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
enum swap_cluster_flags new_flags)
@@ -809,7 +769,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
}
out:
relocate_cluster(si, ci);
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
if (si->flags & SWP_SOLIDSTATE) {
this_cpu_write(percpu_swap_cluster.offset[order], next);
this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -853,7 +813,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
if (ci->flags == CLUSTER_FLAG_NONE)
relocate_cluster(si, ci);
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
if (to_scan <= 0)
break;
}
@@ -889,10 +849,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
/* Serialize HDD SWAP allocation for each device. */
spin_lock(&si->global_cluster_lock);
offset = si->global_cluster->next[order];
- if (offset == SWAP_ENTRY_INVALID)
- goto new_cluster;
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
/* Cluster could have been used by another order */
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
@@ -900,7 +858,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
found = alloc_swap_scan_cluster(si, ci, offset,
order, usage);
} else {
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
}
if (found)
goto done;
@@ -1178,7 +1136,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
if (!si || !offset || !get_swap_device_info(si))
return false;
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
@@ -1186,7 +1144,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
if (found)
*entry = swp_entry(si->type, found);
} else {
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
}
put_swap_device(si);
@@ -1449,14 +1407,14 @@ static void swap_entries_put_cache(struct swap_info_struct *si,
unsigned long offset = swp_offset(entry);
struct swap_cluster_info *ci;
- ci = lock_cluster(si, offset);
- if (swap_only_has_cache(si, offset, nr))
+ ci = swap_lock_cluster(si, offset);
+ if (swap_only_has_cache(si, offset, nr)) {
swap_entries_free(si, ci, entry, nr);
- else {
+ } else {
for (int i = 0; i < nr; i++, entry.val++)
swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
}
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
}
static bool swap_entries_put_map(struct swap_info_struct *si,
@@ -1474,7 +1432,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
if (count != 1)
goto fallback;
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
if (!swap_is_last_map(si, offset, nr, &has_cache)) {
goto locked_fallback;
}
@@ -1483,21 +1441,20 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
else
for (i = 0; i < nr; i++)
WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
return has_cache;
fallback:
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
locked_fallback:
for (i = 0; i < nr; i++, entry.val++) {
count = swap_entry_put_locked(si, ci, entry, 1);
if (count == SWAP_HAS_CACHE)
has_cache = true;
}
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
return has_cache;
-
}
/*
@@ -1545,7 +1502,7 @@ static void swap_entries_free(struct swap_info_struct *si,
unsigned char *map_end = map + nr_pages;
/* It should never free entries across different clusters */
- VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
+ VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
VM_BUG_ON(cluster_is_empty(ci));
VM_BUG_ON(ci->count < nr_pages);
@@ -1620,9 +1577,9 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
struct swap_cluster_info *ci;
int count;
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
count = swap_count(si->swap_map[offset]);
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
return !!count;
}
@@ -1645,7 +1602,7 @@ int swp_swapcount(swp_entry_t entry)
offset = swp_offset(entry);
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
count = swap_count(si->swap_map[offset]);
if (!(count & COUNT_CONTINUED))
@@ -1668,7 +1625,7 @@ int swp_swapcount(swp_entry_t entry)
n *= (SWAP_CONT_MAX + 1);
} while (tmp_count & COUNT_CONTINUED);
out:
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
return count;
}
@@ -1683,7 +1640,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
int i;
bool ret = false;
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
if (nr_pages == 1) {
if (swap_count(map[roffset]))
ret = true;
@@ -1696,7 +1653,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
}
}
unlock_out:
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
return ret;
}
@@ -2246,6 +2203,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
* Return 0 if there are no inuse entries after prev till end of
* the map.
*/
+#define LATENCY_LIMIT 256
static unsigned int find_next_to_unuse(struct swap_info_struct *si,
unsigned int prev)
{
@@ -2629,8 +2587,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
BUG_ON(si->flags & SWP_WRITEOK);
for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
- ci = lock_cluster(si, offset);
- unlock_cluster(ci);
+ ci = swap_lock_cluster(si, offset);
+ swap_unlock_cluster(ci);
}
}
@@ -3533,7 +3491,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
offset = swp_offset(entry);
VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
err = 0;
for (i = 0; i < nr; i++) {
@@ -3588,7 +3546,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
}
unlock_out:
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
return err;
}
@@ -3688,7 +3646,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
offset = swp_offset(entry);
- ci = lock_cluster(si, offset);
+ ci = swap_lock_cluster(si, offset);
count = swap_count(si->swap_map[offset]);
@@ -3748,7 +3706,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
out_unlock_cont:
spin_unlock(&si->cont_lock);
out:
- unlock_cluster(ci);
+ swap_unlock_cluster(ci);
put_swap_device(si);
outer:
if (page)
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 07/28] mm, swap: tidy up swap device and cluster info helpers
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (5 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 08/28] mm, swap: use swap table for the swap cache and switch API Kairui Song
` (20 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
swp_swap_info is the common helper used for retrieving swap info in many
places. It has an internal check that may lead to a NULL return value,
but almost none of its callers check the return value, which makes the
internal check pointless. In fact, most of these callers already ensure
the entry is valid during that period and never expect a NULL value.
Tidy this up. If the caller ensures the swap entry / type is valid and
the device is pinned, use swp_info / swp_type_info instead, which have
more debug checks and lower overhead as they are inlined.
Callers that may expect a NULL value can use swp_get_info /
swp_type_get_info instead.
No functional change: the rearranged checks had no real effect since
their results were mostly ignored anyway. Some new sanity checks are
added for debug builds to catch potential misuse.
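As a rough sketch of the intended split (illustrative only, not part of the
diff below; the example_* callers are hypothetical, and swp_get_info is
file-local to mm/swapfile.c after this patch):

/*
 * Sketch: a caller that already guarantees a valid entry and a pinned
 * device (e.g. via a locked swapcache folio) uses swp_info() and never
 * needs a NULL check.
 */
static unsigned int example_folio_swap_flags(struct folio *folio)
{
        return swp_info(folio->swap)->flags;
}

/*
 * Sketch: a caller that may be handed an invalid type or race with
 * swapoff uses swp_get_info() / swp_type_get_info() and must check for
 * NULL before dereferencing.
 */
static bool example_entry_has_device(swp_entry_t entry)
{
        struct swap_info_struct *si = swp_get_info(entry);

        return si && (si->flags & SWP_USED);
}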
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 6 ------
mm/memory.c | 2 +-
mm/page_io.c | 12 ++++++------
mm/swap.h | 32 ++++++++++++++++++++++++++++++--
mm/swap_state.c | 4 ++--
mm/swapfile.c | 35 ++++++++++++++++++-----------------
6 files changed, 57 insertions(+), 34 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 1e7d9d55c39a..4239852fd203 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -453,7 +453,6 @@ extern sector_t swapdev_block(int, pgoff_t);
extern int __swap_count(swp_entry_t entry);
extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
-struct swap_info_struct *swp_swap_info(swp_entry_t entry);
struct backing_dev_info;
extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
extern void exit_swap_address_space(unsigned int type);
@@ -466,11 +465,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
}
#else /* CONFIG_SWAP */
-static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
- return NULL;
-}
-
static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
return NULL;
diff --git a/mm/memory.c b/mm/memory.c
index 254be0e88801..cc1f6891cf99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4315,7 +4315,7 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
pgoff_t offset = swp_offset(entry);
int i;
diff --git a/mm/page_io.c b/mm/page_io.c
index 4bce19df557b..eaf6319b81ab 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,7 +204,7 @@ static bool is_folio_zero_filled(struct folio *folio)
static void swap_zeromap_folio_set(struct folio *folio)
{
struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
int nr_pages = folio_nr_pages(folio);
swp_entry_t entry;
unsigned int i;
@@ -223,7 +223,7 @@ static void swap_zeromap_folio_set(struct folio *folio)
static void swap_zeromap_folio_clear(struct folio *folio)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
swp_entry_t entry;
unsigned int i;
@@ -375,7 +375,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc)
{
struct swap_iocb *sio = NULL;
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
struct file *swap_file = sis->swap_file;
loff_t pos = swap_dev_pos(folio->swap);
@@ -452,7 +452,7 @@ static void swap_writepage_bdev_async(struct folio *folio,
void __swap_writepage(struct folio *folio, struct writeback_control *wbc)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
/*
@@ -543,7 +543,7 @@ static bool swap_read_folio_zeromap(struct folio *folio)
static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
struct swap_iocb *sio = NULL;
loff_t pos = swap_dev_pos(folio->swap);
@@ -614,7 +614,7 @@ static void swap_read_folio_bdev_async(struct folio *folio,
void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
bool workingset = folio_test_workingset(folio);
unsigned long pflags;
diff --git a/mm/swap.h b/mm/swap.h
index 38d37d241f1c..4982e6c2ad95 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -13,6 +13,8 @@ extern int page_cluster;
#define swap_entry_order(order) 0
#endif
+extern struct swap_info_struct *swap_info[];
+
/*
* We use this to track usage of a cluster. A cluster is a block of swap disk
* space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
@@ -51,9 +53,29 @@ enum swap_cluster_flags {
#include <linux/swapops.h> /* for swp_offset */
#include <linux/blk_types.h> /* for bio_end_io_t */
+/*
+ * All swp_* function callers must ensure the entry is valid, and hold the
+ * swap device reference or pin the device in other ways. E.g., a locked
+ * folio in the swap cache makes sure its entries (folio->swap) are valid
+ * and won't be freed; the device is also pinned by its entries.
+ */
+static inline struct swap_info_struct *swp_type_info(int type)
+{
+ struct swap_info_struct *si;
+ si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
+ VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+ return si;
+}
+
+static inline struct swap_info_struct *swp_info(swp_entry_t entry)
+{
+ return swp_type_info(swp_type(entry));
+}
+
static inline struct swap_cluster_info *swp_offset_cluster(
struct swap_info_struct *si, pgoff_t offset)
{
+ VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}
@@ -62,6 +84,7 @@ static inline struct swap_cluster_info *swap_lock_cluster(
unsigned long offset)
{
struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+ VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
spin_lock(&ci->lock);
return ci;
}
@@ -154,7 +177,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
static inline unsigned int folio_swap_flags(struct folio *folio)
{
- return swp_swap_info(folio->swap)->flags;
+ return swp_info(folio->swap)->flags;
}
/*
@@ -165,7 +188,7 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
bool *is_zeromap)
{
- struct swap_info_struct *sis = swp_swap_info(entry);
+ struct swap_info_struct *sis = swp_info(entry);
unsigned long start = swp_offset(entry);
unsigned long end = start + max_nr;
bool first_bit;
@@ -184,6 +207,11 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
#else /* CONFIG_SWAP */
struct swap_iocb;
+static inline struct swap_info_struct *swp_info(swp_entry_t entry)
+{
+ return NULL;
+}
+
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 07c41676486a..db9efa64f64e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -330,7 +330,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists)
{
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
struct folio *folio;
struct folio *new_folio = NULL;
struct folio *result = NULL;
@@ -554,7 +554,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
bool page_allocated;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ba3fd99eb5fa..2f834069b7ad 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
static struct plist_head *swap_avail_heads;
static DEFINE_SPINLOCK(swap_avail_lock);
-static struct swap_info_struct *swap_info[MAX_SWAPFILES];
+struct swap_info_struct *swap_info[MAX_SWAPFILES];
static DEFINE_MUTEX(swapon_mutex);
@@ -124,14 +124,20 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
.lock = INIT_LOCAL_LOCK(),
};
-static struct swap_info_struct *swap_type_to_swap_info(int type)
+/* May return NULL on invalid type, caller must check for NULL return */
+static struct swap_info_struct *swp_type_get_info(int type)
{
if (type >= MAX_SWAPFILES)
return NULL;
-
return READ_ONCE(swap_info[type]); /* rcu_dereference() */
}
+/* May return NULL on invalid entry, caller must check for NULL return */
+static struct swap_info_struct *swp_get_info(swp_entry_t entry)
+{
+ return swp_type_get_info(swp_type(entry));
+}
+
static inline unsigned char swap_count(unsigned char ent)
{
return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
@@ -343,7 +349,7 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
sector_t swap_folio_sector(struct folio *folio)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
struct swap_extent *se;
sector_t sector;
pgoff_t offset;
@@ -1272,7 +1278,7 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
if (!entry.val)
goto out;
- si = swp_swap_info(entry);
+ si = swp_get_info(entry);
if (!si)
goto bad_nofile;
if (data_race(!(si->flags & SWP_USED)))
@@ -1381,7 +1387,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
if (!entry.val)
goto out;
- si = swp_swap_info(entry);
+ si = swp_get_info(entry);
if (!si)
goto bad_nofile;
if (!get_swap_device_info(si))
@@ -1560,7 +1566,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
int __swap_count(swp_entry_t entry)
{
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
pgoff_t offset = swp_offset(entry);
return swap_count(si->swap_map[offset]);
@@ -1791,7 +1797,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
swp_entry_t get_swap_page_of_type(int type)
{
- struct swap_info_struct *si = swap_type_to_swap_info(type);
+ struct swap_info_struct *si = swp_type_get_info(type);
unsigned long offset;
swp_entry_t entry = {0};
@@ -1872,7 +1878,7 @@ int find_first_swap(dev_t *device)
*/
sector_t swapdev_block(int type, pgoff_t offset)
{
- struct swap_info_struct *si = swap_type_to_swap_info(type);
+ struct swap_info_struct *si = swp_type_get_info(type);
struct swap_extent *se;
if (!si || !(si->flags & SWP_WRITEOK))
@@ -2801,7 +2807,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
if (!l)
return SEQ_START_TOKEN;
- for (type = 0; (si = swap_type_to_swap_info(type)); type++) {
+ for (type = 0; (si = swp_type_get_info(type)); type++) {
if (!(si->flags & SWP_USED) || !si->swap_map)
continue;
if (!--l)
@@ -2822,7 +2828,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
type = si->type + 1;
++(*pos);
- for (; (si = swap_type_to_swap_info(type)); type++) {
+ for (; (si = swp_type_get_info(type)); type++) {
if (!(si->flags & SWP_USED) || !si->swap_map)
continue;
return si;
@@ -3483,7 +3489,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
unsigned char has_cache;
int err, i;
- si = swp_swap_info(entry);
+ si = swp_get_info(entry);
if (WARN_ON_ONCE(!si)) {
pr_err("%s%08lx\n", Bad_file, entry.val);
return -EINVAL;
@@ -3598,11 +3604,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
swap_entries_put_cache(si, entry, nr);
}
-struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
- return swap_type_to_swap_info(swp_type(entry));
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 08/28] mm, swap: use swap table for the swap cache and switch API
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (6 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 07/28] mm, swap: tidy up swap device and cluster info helpers Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 09/28] mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc Kairui Song
` (19 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Introduce the basic swap table infrastructure. For now it is just a
fixed-size flat array inside each swap cluster, with access wrappers.
Each table entry is an opaque atomic long that can hold one of three
types: a shadow (XA_VALUE), a folio (pointer), or NULL.
In this first step, it only supports storing a folio or a shadow, and it
is a drop-in replacement for the swap cache's underlying structure. This
commit converts all swap cache users to the new set of APIs.
Swap cache lookups (swap_cache_get_*) are still lock-less and require a
pin on the swap device to prevent the memory from being freed, unchanged
from before.
All swap cache updates are now protected by the swap cluster lock, which
is either taken by the new helpers internally, or by the caller before
using the __swap_cache_* functions. The cluster lock also replaces the
XArray lock wherever it was used.
At this point it may look like just a downgrade from an XArray to a flat
array. Later commits will implement a fully cluster-based, unified swap
table with dynamic allocation, which should reduce memory usage while
improving performance even further.
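As a rough sketch of the resulting locking rules (illustrative only; the
example_* callers are hypothetical, both folios are assumed locked and in
the swap cache, and error handling is trimmed):

/* Lock-less lookup: only a pin on the swap device is required. */
static struct folio *example_lookup(swp_entry_t entry)
{
        struct swap_info_struct *si = get_swap_device(entry);
        struct folio *folio = NULL;

        if (si) {
                folio = swap_cache_get_folio(entry);
                put_swap_device(si);
        }
        return folio;
}

/* Update: the caller holds the cluster lock around __swap_cache_* calls. */
static int example_replace(struct folio *old, struct folio *new)
{
        struct swap_cluster_info *ci;
        int err;

        ci = swap_lock_folio_cluster(old);
        err = __swap_cache_replace_folio(ci, old->swap, old, new);
        swap_unlock_cluster(ci);
        return err;
}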
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 2 -
mm/filemap.c | 20 +--
mm/huge_memory.c | 20 ++-
mm/memory.c | 2 +-
mm/migrate.c | 28 ++--
mm/shmem.c | 25 +---
mm/swap.h | 133 +++++++++++++------
mm/swap_state.c | 308 +++++++++++++++++++++----------------------
mm/swap_table.h | 103 +++++++++++++++
mm/swapfile.c | 108 ++++++++-------
mm/vmscan.c | 21 ++-
mm/zswap.c | 7 +-
12 files changed, 473 insertions(+), 304 deletions(-)
create mode 100644 mm/swap_table.h
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4239852fd203..58230f3e15e6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -454,8 +454,6 @@ extern int __swap_count(swp_entry_t entry);
extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
struct backing_dev_info;
-extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
-extern void exit_swap_address_space(unsigned int type);
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
diff --git a/mm/filemap.c b/mm/filemap.c
index 09d005848f0d..6840cd817ed3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4403,23 +4403,17 @@ static void filemap_cachestat(struct address_space *mapping,
#ifdef CONFIG_SWAP /* implies CONFIG_MMU */
if (shmem_mapping(mapping)) {
/* shmem file - in swap cache */
+ struct swap_info_struct *si;
swp_entry_t swp = radix_to_swp_entry(folio);
- /* swapin error results in poisoned entry */
- if (non_swap_entry(swp))
+ /* prevent swapoff from releasing the device */
+ si = get_swap_device(swp);
+ if (!si)
goto resched;
- /*
- * Getting a swap entry from the shmem
- * inode means we beat
- * shmem_unuse(). rcu_read_lock()
- * ensures swapoff waits for us before
- * freeing the swapper space. However,
- * we can race with swapping and
- * invalidation, so there might not be
- * a shadow in the swapcache (yet).
- */
- shadow = get_shadow_from_swap_cache(swp);
+ shadow = swap_cache_get_shadow(swp);
+ put_swap_device(si);
+
if (!shadow)
goto resched;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d3e66136e41a..126cf217293c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3457,9 +3457,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
bool uniform_split)
{
struct lruvec *lruvec;
- struct address_space *swap_cache = NULL;
struct folio *origin_folio = folio;
struct folio *next_folio = folio_next(folio);
+ struct swap_cluster_info *ci = NULL;
struct folio *new_folio;
struct folio *next;
int order = folio_order(folio);
@@ -3476,8 +3476,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
if (!uniform_split || new_order != 0)
return -EINVAL;
- swap_cache = swap_address_space(folio->swap);
- xa_lock(&swap_cache->i_pages);
+ ci = swap_lock_folio_cluster(folio);
}
if (folio_test_anon(folio))
@@ -3566,7 +3565,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
continue;
folio_ref_unfreeze(release, 1 +
- ((mapping || swap_cache) ?
+ ((mapping || ci) ?
folio_nr_pages(release) : 0));
lru_add_split_folio(origin_folio, release, lruvec,
@@ -3584,10 +3583,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
} else if (mapping) {
__xa_store(&mapping->i_pages,
release->index, release, 0);
- } else if (swap_cache) {
- __xa_store(&swap_cache->i_pages,
- swap_cache_index(release->swap),
- release, 0);
+ } else if (ci) {
+ __swap_cache_override_folio(ci, release->swap,
+ origin_folio, release);
}
}
}
@@ -3599,12 +3597,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
* see stale page cache entries.
*/
folio_ref_unfreeze(origin_folio, 1 +
- ((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
+ ((mapping || ci) ? folio_nr_pages(origin_folio) : 0));
unlock_page_lruvec(lruvec);
- if (swap_cache)
- xa_unlock(&swap_cache->i_pages);
+ if (ci)
+ swap_unlock_cluster(ci);
if (mapping)
xa_unlock(&mapping->i_pages);
diff --git a/mm/memory.c b/mm/memory.c
index cc1f6891cf99..f2897d9059f2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4603,7 +4603,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
memcg1_swapin(entry, nr_pages);
- shadow = get_shadow_from_swap_cache(entry);
+ shadow = swap_cache_get_shadow(entry);
if (shadow)
workingset_refault(folio, shadow);
diff --git a/mm/migrate.c b/mm/migrate.c
index 784ac2256d08..dad428b1a78f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -458,10 +458,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int expected_count)
{
XA_STATE(xas, &mapping->i_pages, folio_index(folio));
+ struct swap_cluster_info *ci = NULL;
struct zone *oldzone, *newzone;
int dirty;
long nr = folio_nr_pages(folio);
- long entries, i;
if (!mapping) {
/* Take off deferred split queue while frozen and memcg set */
@@ -487,9 +487,16 @@ static int __folio_migrate_mapping(struct address_space *mapping,
oldzone = folio_zone(folio);
newzone = folio_zone(newfolio);
- xas_lock_irq(&xas);
+ if (folio_test_swapcache(folio))
+ ci = swap_lock_folio_cluster_irq(folio);
+ else
+ xas_lock_irq(&xas);
+
if (!folio_ref_freeze(folio, expected_count)) {
- xas_unlock_irq(&xas);
+ if (ci)
+ swap_unlock_cluster_irq(ci);
+ else
+ xas_unlock_irq(&xas);
return -EAGAIN;
}
@@ -510,9 +517,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
if (folio_test_swapcache(folio)) {
folio_set_swapcache(newfolio);
newfolio->private = folio_get_private(folio);
- entries = nr;
- } else {
- entries = 1;
}
/* Move dirty while folio refs frozen and newfolio not yet exposed */
@@ -522,11 +526,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
folio_set_dirty(newfolio);
}
- /* Swap cache still stores N entries instead of a high-order entry */
- for (i = 0; i < entries; i++) {
+ if (folio_test_swapcache(folio))
+ WARN_ON_ONCE(__swap_cache_replace_folio(ci, folio->swap, folio, newfolio));
+ else
xas_store(&xas, newfolio);
- xas_next(&xas);
- }
/*
* Drop cache reference from old folio by unfreezing
@@ -535,8 +538,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,
*/
folio_ref_unfreeze(folio, expected_count - nr);
- xas_unlock(&xas);
/* Leave irq disabled to prevent preemption while updating stats */
+ if (ci)
+ swap_unlock_cluster(ci);
+ else
+ xas_unlock(&xas);
/*
* If moved to a different zone then also account
diff --git a/mm/shmem.c b/mm/shmem.c
index 43d9e3bf16f4..0da9e06eaee8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2009,7 +2009,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
new->swap = entry;
memcg1_swapin(entry, nr_pages);
- shadow = get_shadow_from_swap_cache(entry);
+ shadow = swap_cache_get_shadow(entry);
if (shadow)
workingset_refault(new, shadow);
folio_add_lru(new);
@@ -2038,13 +2038,11 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index,
struct vm_area_struct *vma)
{
+ struct swap_cluster_info *ci;
struct folio *new, *old = *foliop;
swp_entry_t entry = old->swap;
- struct address_space *swap_mapping = swap_address_space(entry);
- pgoff_t swap_index = swap_cache_index(entry);
- XA_STATE(xas, &swap_mapping->i_pages, swap_index);
int nr_pages = folio_nr_pages(old);
- int error = 0, i;
+ int error = 0;
/*
* We have arrived here because our zones are constrained, so don't
@@ -2073,25 +2071,14 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
new->swap = entry;
folio_set_swapcache(new);
- /* Swap cache still stores N entries instead of a high-order entry */
- xa_lock_irq(&swap_mapping->i_pages);
- for (i = 0; i < nr_pages; i++) {
- void *item = xas_load(&xas);
-
- if (item != old) {
- error = -ENOENT;
- break;
- }
-
- xas_store(&xas, new);
- xas_next(&xas);
- }
+ ci = swap_lock_folio_cluster_irq(old);
+ error = __swap_cache_replace_folio(ci, entry, old, new);
if (!error) {
mem_cgroup_replace_folio(old, new);
shmem_update_stats(new, nr_pages);
shmem_update_stats(old, -nr_pages);
}
- xa_unlock_irq(&swap_mapping->i_pages);
+ swap_unlock_cluster_irq(ci);
if (unlikely(error)) {
/*
diff --git a/mm/swap.h b/mm/swap.h
index 4982e6c2ad95..30cd257aecbb 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -15,6 +15,15 @@ extern int page_cluster;
extern struct swap_info_struct *swap_info[];
+/*
+ * A swap table entry represents the status of a swap slot
+ * on a swap (physical or virtual) device. The swap table is a
+ * 1:1 map of the swap device, composed of swap table entries.
+ *
+ * See mm/swap_table.h for details.
+ */
+typedef atomic_long_t swp_te_t;
+
/*
* We use this to track usage of a cluster. A cluster is a block of swap disk
* space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
@@ -25,14 +34,11 @@ extern struct swap_info_struct *swap_info[];
* protected by cluster lock.
*/
struct swap_cluster_info {
- spinlock_t lock; /*
- * Protect swap_cluster_info fields
- * other than list, and swap_info_struct->swap_map
- * elements corresponding to the swap cluster.
- */
+ spinlock_t lock; /* Protects all fields below except `list`. */
u16 count;
u8 flags;
u8 order;
+ swp_te_t *table;
struct list_head list;
};
@@ -79,21 +85,56 @@ static inline struct swap_cluster_info *swp_offset_cluster(
return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}
-static inline struct swap_cluster_info *swap_lock_cluster(
- struct swap_info_struct *si,
- unsigned long offset)
+static inline struct swap_cluster_info *swp_cluster(swp_entry_t entry)
+{
+ return swp_offset_cluster(swp_info(entry), swp_offset(entry));
+}
+
+/*
+ * Lock the swap cluster of the given offset. The caller must ensure the
+ * modification won't cross multiple clusters. swap_lock_folio_cluster is
+ * preferred when applicable, as it comes with more sanity checks.
+ */
+static inline struct swap_cluster_info *__swap_lock_cluster(
+ struct swap_info_struct *si, unsigned long offset, bool irq)
{
struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
- spin_lock(&ci->lock);
+ if (irq)
+ spin_lock_irq(&ci->lock);
+ else
+ spin_lock(&ci->lock);
return ci;
}
+#define swap_lock_cluster(si, offset) __swap_lock_cluster(si, offset, false)
+#define swap_lock_cluster_irq(si, offset) __swap_lock_cluster(si, offset, true)
+
+/*
+ * Lock the swap cluster that holds a folio's swap entries. This is safer as a
+ * locked folio in the swap cache always has its entries within one cluster,
+ * won't be freed, and pins the device.
+ */
+static inline struct swap_cluster_info *__swap_lock_folio_cluster(
+ struct folio *folio, bool irq)
+{
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ return __swap_lock_cluster(swp_info(folio->swap),
+ swp_offset(folio->swap), irq);
+}
+#define swap_lock_folio_cluster(folio) __swap_lock_folio_cluster(folio, false)
+#define swap_lock_folio_cluster_irq(folio) __swap_lock_folio_cluster(folio, true)
static inline void swap_unlock_cluster(struct swap_cluster_info *ci)
{
spin_unlock(&ci->lock);
}
+static inline void swap_unlock_cluster_irq(struct swap_cluster_info *ci)
+{
+ spin_unlock_irq(&ci->lock);
+}
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -109,14 +150,27 @@ int swap_writepage(struct page *page, struct writeback_control *wbc);
void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
/* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT 14
-#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
-#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
-extern struct address_space *swapper_spaces[];
-#define swap_address_space(entry) \
- (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
- >> SWAP_ADDRESS_SPACE_SHIFT])
+extern struct address_space swap_space __ro_after_init;
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+ return &swap_space;
+}
+
+/* The helpers below require the caller to pin the swap device. */
+extern struct folio *swap_cache_get_folio(swp_entry_t entry);
+extern int swap_cache_add_folio(swp_entry_t entry,
+ struct folio *folio, void **shadow);
+extern void *swap_cache_get_shadow(swp_entry_t entry);
+/* The helpers below require the caller to lock the swap cluster. */
+extern void __swap_cache_del_folio(swp_entry_t entry,
+ struct folio *folio, void *shadow);
+extern int __swap_cache_replace_folio(struct swap_cluster_info *ci,
+ swp_entry_t entry, struct folio *old,
+ struct folio *new);
+extern void __swap_cache_override_folio(struct swap_cluster_info *ci,
+ swp_entry_t entry, struct folio *old,
+ struct folio *new);
+extern void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
/*
* Return the swap device position of the swap entry.
@@ -131,8 +185,7 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
*/
static inline pgoff_t swap_cache_index(swp_entry_t entry)
{
- BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
- return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
+ return swp_offset(entry);
}
/*
@@ -152,16 +205,8 @@ static inline bool folio_swap_contains(struct folio *folio, swp_entry_t entry)
}
void show_swap_cache_info(void);
-void *get_shadow_from_swap_cache(swp_entry_t entry);
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
- gfp_t gfp, void **shadowp);
-void __delete_from_swap_cache(struct folio *folio,
- swp_entry_t entry, void *shadow);
void delete_from_swap_cache(struct folio *folio);
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
@@ -207,6 +252,14 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
#else /* CONFIG_SWAP */
struct swap_iocb;
+
+#define swap_lock_cluster(si, offset) NULL
+#define swap_lock_cluster_irq(si, offset) NULL
+#define swap_lock_folio_cluster(folio) NULL
+#define swap_lock_folio_cluster_irq(folio) NULL
+#define swap_unlock_cluster(ci) do {} while (0)
+#define swap_unlock_cluster_irq(ci) do {} while (0)
+
static inline struct swap_info_struct *swp_info(swp_entry_t entry)
{
return NULL;
@@ -269,28 +322,34 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
return NULL;
}
-static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadow)
{
- return NULL;
+ return -EINVAL;
}
-static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
- gfp_t gfp_mask, void **shadowp)
+static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
{
- return -1;
}
-static inline void __delete_from_swap_cache(struct folio *folio,
- swp_entry_t entry, void *shadow)
+static inline int __swap_cache_replace_folio(
+ struct swap_cluster_info *ci, swp_entry_t entry,
+ struct folio *old, struct folio *new)
{
+ return -EINVAL;
}
-static inline void delete_from_swap_cache(struct folio *folio)
+static inline void __swap_cache_override_folio(
+ struct swap_cluster_info *ci, swp_entry_t entry,
+ struct folio *old, struct folio *new)
{
}
-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end)
+static inline void *swap_cache_get_shadow(swp_entry_t entry)
+{
+ return NULL;
+}
+
+static inline void delete_from_swap_cache(struct folio *folio)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index db9efa64f64e..bef9633533ec 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,7 @@
#include <linux/huge_mm.h>
#include <linux/shmem_fs.h>
#include "internal.h"
+#include "swap_table.h"
#include "swap.h"
/*
@@ -37,8 +38,11 @@ static const struct address_space_operations swap_aops = {
#endif
};
-struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
-static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
+/* swap_space is read-only as the swap cache is handled by the swap table */
+struct address_space swap_space __ro_after_init = {
+ .a_ops = &swap_aops,
+};
+
static bool enable_vma_readahead __read_mostly = true;
#define SWAP_RA_ORDER_CEILING 5
@@ -70,164 +74,187 @@ void show_swap_cache_info(void)
printk("Total swap = %lukB\n", K(total_swap_pages));
}
-void *get_shadow_from_swap_cache(swp_entry_t entry)
+/* For huge page splitting, override an old folio with a smaller new one. */
+void __swap_cache_override_folio(struct swap_cluster_info *ci, swp_entry_t entry,
+ struct folio *old, struct folio *new)
{
- struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swap_cache_index(entry);
- void *shadow;
-
- shadow = xa_load(&address_space->i_pages, idx);
- if (xa_is_value(shadow))
- return shadow;
- return NULL;
+ pgoff_t offset = swp_offset(entry);
+ pgoff_t end = offset + folio_nr_pages(new);
+
+ VM_WARN_ON_ONCE(entry.val < old->swap.val || entry.val != new->swap.val);
+ VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
+
+ do {
+ VM_WARN_ON_ONCE(swp_te_folio(__swap_table_get(ci, offset)) != old);
+ __swap_table_set_folio(ci, offset, new);
+ } while (++offset < end);
}
-/*
- * add_to_swap_cache resembles filemap_add_folio on swapper_space,
- * but sets SwapCache flag and 'swap' instead of mapping and index.
- */
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
- gfp_t gfp, void **shadowp)
+/* For migration and shmem replacement, replace an old folio with a new one. */
+int __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
+ struct folio *old, struct folio *new)
{
- struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swap_cache_index(entry);
- XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
- unsigned long i, nr = folio_nr_pages(folio);
- void *old;
+ unsigned long nr_pages = folio_nr_pages(old);
+ pgoff_t offset = swp_offset(entry);
+ pgoff_t end = offset + nr_pages;
+
+ VM_WARN_ON_ONCE(entry.val != old->swap.val || entry.val != new->swap.val);
+ VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
- xas_set_update(&xas, workingset_update_node);
+ do {
+ if (swp_te_folio(__swap_table_get(ci, offset)) != old)
+ return -ENOENT;
+ __swap_table_set_folio(ci, offset, new);
+ } while (++offset < end);
+
+ return 0;
+}
+
+int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
+ void **shadow)
+{
+ swp_te_t exist;
+ pgoff_t end, start, offset;
+ struct swap_cluster_info *ci;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ start = swp_offset(entry);
+ end = start + nr_pages;
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
- folio_ref_add(folio, nr);
+ offset = start;
+ ci = swap_lock_cluster(swp_info(entry), offset);
+ do {
+ exist = __swap_table_get(ci, offset);
+ if (unlikely(swp_te_is_folio(exist)))
+ goto out_failed;
+ if (shadow && swp_te_is_shadow(exist))
+ *shadow = swp_te_shadow(exist);
+ __swap_table_set_folio(ci, offset, folio);
+ } while (++offset < end);
+
+ folio_ref_add(folio, nr_pages);
folio_set_swapcache(folio);
folio->swap = entry;
+ swap_unlock_cluster(ci);
- do {
- xas_lock_irq(&xas);
- xas_create_range(&xas);
- if (xas_error(&xas))
- goto unlock;
- for (i = 0; i < nr; i++) {
- VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
- if (shadowp) {
- old = xas_load(&xas);
- if (xa_is_value(old))
- *shadowp = old;
- }
- xas_store(&xas, folio);
- xas_next(&xas);
- }
- address_space->nrpages += nr;
- __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
- __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
-unlock:
- xas_unlock_irq(&xas);
- } while (xas_nomem(&xas, gfp));
+ node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
- if (!xas_error(&xas))
- return 0;
+ return 0;
- folio_clear_swapcache(folio);
- folio_ref_sub(folio, nr);
- return xas_error(&xas);
+out_failed:
+ /*
+ * We may lose shadow due to raced swapin, which should be
+ * fine; the caller should keep the previously returned shadow.
+ */
+ while (offset-- > start)
+ __swap_table_set_shadow(ci, offset, NULL);
+ swap_unlock_cluster(ci);
+
+ return -EEXIST;
}
/*
- * This must be called only on folios that have
- * been verified to be in the swap cache.
+ * This must be called only on folios that have been verified to
+ * be in the swap cache and locked. It will never put the folio
+ * into the free list; the caller has a reference on the folio.
*/
-void __delete_from_swap_cache(struct folio *folio,
- swp_entry_t entry, void *shadow)
+void __swap_cache_del_folio(swp_entry_t entry,
+ struct folio *folio, void *shadow)
{
- struct address_space *address_space = swap_address_space(entry);
- int i;
- long nr = folio_nr_pages(folio);
- pgoff_t idx = swap_cache_index(entry);
- XA_STATE(xas, &address_space->i_pages, idx);
-
- xas_set_update(&xas, workingset_update_node);
+ swp_te_t exist;
+ pgoff_t offset, start, end;
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long nr_pages = folio_nr_pages(folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
- for (i = 0; i < nr; i++) {
- void *entry = xas_store(&xas, shadow);
- VM_BUG_ON_PAGE(entry != folio, entry);
- xas_next(&xas);
- }
+ start = swp_offset(entry);
+ end = start + nr_pages;
+
+ si = swp_info(entry);
+ ci = swp_offset_cluster(si, start);
+ offset = start;
+ do {
+ exist = __swap_table_get(ci, offset);
+ VM_WARN_ON_ONCE(swp_te_folio(exist) != folio);
+ __swap_table_set_shadow(ci, offset, shadow);
+ } while (++offset < end);
+
folio->swap.val = 0;
folio_clear_swapcache(folio);
- address_space->nrpages -= nr;
- __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
- __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
+ node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
}
-/*
- * Lookup a swap entry in the swap cache. A found folio will be returned
- * unlocked and with its refcount incremented.
- *
- * Caller must hold a reference of the swap device, and check if the
- * returned folio is still valid after locking it (e.g. folio_swap_contains).
- */
-struct folio *swap_cache_get_folio(swp_entry_t entry)
+void delete_from_swap_cache(struct folio *folio)
{
- struct folio *folio = filemap_get_folio(swap_address_space(entry),
- swap_cache_index(entry));
- if (!IS_ERR(folio))
- return folio;
- return NULL;
+ struct swap_cluster_info *ci;
+ swp_entry_t entry = folio->swap;
+
+ ci = swap_lock_cluster(swp_info(entry), swp_offset(entry));
+ __swap_cache_del_folio(entry, folio, NULL);
+ swap_unlock_cluster(ci);
+
+ put_swap_folio(folio, entry);
+ folio_ref_sub(folio, folio_nr_pages(folio));
}
/*
- * This must be called only on folios that have
- * been verified to be in the swap cache and locked.
- * It will never put the folio into the free list,
- * the caller has a reference on the folio.
+ * Look up the shadow (workingset) value left behind for a swapped-out entry.
+ * Caller must hold a reference on the swap device.
*/
-void delete_from_swap_cache(struct folio *folio)
+void *swap_cache_get_shadow(swp_entry_t entry)
{
- swp_entry_t entry = folio->swap;
- struct address_space *address_space = swap_address_space(entry);
+ swp_te_t swp_te;
- xa_lock_irq(&address_space->i_pages);
- __delete_from_swap_cache(folio, entry, NULL);
- xa_unlock_irq(&address_space->i_pages);
+ pgoff_t offset = swp_offset(entry);
+ swp_te = __swap_table_get(swp_cluster(entry), offset);
- put_swap_folio(folio, entry);
- folio_ref_sub(folio, folio_nr_pages(folio));
+ return swp_te_is_shadow(swp_te) ? swp_te_shadow(swp_te) : NULL;
}
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end)
+void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
{
- unsigned long curr = begin;
- void *old;
+ struct swap_cluster_info *ci;
+ pgoff_t offset = swp_offset(entry), end;
- for (;;) {
- swp_entry_t entry = swp_entry(type, curr);
- unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
- struct address_space *address_space = swap_address_space(entry);
- XA_STATE(xas, &address_space->i_pages, index);
-
- xas_set_update(&xas, workingset_update_node);
-
- xa_lock_irq(&address_space->i_pages);
- xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
- if (!xa_is_value(old))
- continue;
- xas_store(&xas, NULL);
- }
- xa_unlock_irq(&address_space->i_pages);
+ ci = swp_offset_cluster(swp_info(entry), offset);
+ end = offset + nr_ents;
+ do {
+ WARN_ON_ONCE(swp_te_is_folio(__swap_table_get(ci, offset)));
+ __swap_table_set_null(ci, offset);
+ } while (++offset < end);
+}
- /* search the next swapcache until we meet end */
- curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
- if (curr > end)
- break;
- }
+/*
+ * Lookup a swap entry in the swap cache. A found folio will be returned
+ * unlocked and with its refcount incremented.
+ *
+ * Caller must hold a reference on the swap device, and check if the
+ * returned folio is still valid after locking it (e.g. folio_swap_contains).
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry)
+{
+ swp_te_t swp_te;
+ struct folio *folio;
+ swp_te = __swap_table_get(swp_cluster(entry), swp_offset(entry));
+
+ if (!swp_te_is_folio(swp_te))
+ return NULL;
+
+ folio = swp_te_folio(swp_te);
+ if (!folio_try_get(folio))
+ return NULL;
+
+ return folio;
}
/*
@@ -387,7 +414,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
goto put_and_return;
/*
- * We might race against __delete_from_swap_cache(), and
+ * We might race against __swap_cache_del_folio(), and
* stumble across a swap_map entry whose SWAP_HAS_CACHE
* has not yet been cleared. Or race against another
* __read_swap_cache_async(), which has set SWAP_HAS_CACHE
@@ -405,8 +432,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
goto fail_unlock;
- /* May fail (-ENOMEM) if XArray node allocation failed. */
- if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+ if (swap_cache_add_folio(entry, new_folio, &shadow))
goto fail_unlock;
memcg1_swapin(entry, 1);
@@ -600,41 +626,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
return folio;
}
-int init_swap_address_space(unsigned int type, unsigned long nr_pages)
-{
- struct address_space *spaces, *space;
- unsigned int i, nr;
-
- nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
- spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
- if (!spaces)
- return -ENOMEM;
- for (i = 0; i < nr; i++) {
- space = spaces + i;
- xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
- atomic_set(&space->i_mmap_writable, 0);
- space->a_ops = &swap_aops;
- /* swap cache doesn't use writeback related tags */
- mapping_set_no_writeback_tags(space);
- }
- nr_swapper_spaces[type] = nr;
- swapper_spaces[type] = spaces;
-
- return 0;
-}
-
-void exit_swap_address_space(unsigned int type)
-{
- int i;
- struct address_space *spaces = swapper_spaces[type];
-
- for (i = 0; i < nr_swapper_spaces[type]; i++)
- VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
- kvfree(spaces);
- nr_swapper_spaces[type] = 0;
- swapper_spaces[type] = NULL;
-}
-
static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
unsigned long *end)
{
@@ -807,7 +798,7 @@ static const struct attribute_group swap_attr_group = {
.attrs = swap_attrs,
};
-static int __init swap_init_sysfs(void)
+static int __init swap_init(void)
{
int err;
struct kobject *swap_kobj;
@@ -822,11 +813,12 @@ static int __init swap_init_sysfs(void)
pr_err("failed to register swap group\n");
goto delete_obj;
}
+ mapping_set_no_writeback_tags(&swap_space);
return 0;
delete_obj:
kobject_put(swap_kobj);
return err;
}
-subsys_initcall(swap_init_sysfs);
+subsys_initcall(swap_init);
#endif
diff --git a/mm/swap_table.h b/mm/swap_table.h
new file mode 100644
index 000000000000..69a074339444
--- /dev/null
+++ b/mm/swap_table.h
@@ -0,0 +1,103 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _MM_SWAP_TABLE_H
+#define _MM_SWAP_TABLE_H
+
+#include "swap.h"
+
+/*
+ * A swap table entry can be a pointer (folio), an XA_VALUE (shadow), or NULL.
+ */
+
+/*
+ * Helpers for casting one type of info into a swap table entry.
+ */
+static inline swp_te_t null_swp_te(void)
+{
+ swp_te_t swp_te = ATOMIC_LONG_INIT(0);
+ return swp_te;
+}
+
+static inline swp_te_t folio_swp_te(struct folio *folio)
+{
+ BUILD_BUG_ON(sizeof(swp_te_t) != sizeof(void *));
+ swp_te_t swp_te = { .counter = (unsigned long)folio };
+ return swp_te;
+}
+
+static inline swp_te_t shadow_swp_te(void *shadow)
+{
+ BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
+ BITS_PER_BYTE * sizeof(swp_te_t));
+ VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
+ swp_te_t swp_te = { .counter = ((unsigned long)shadow) };
+ return swp_te;
+}
+
+/*
+ * Helpers for swap table entry type checking.
+ */
+static inline bool swp_te_is_null(swp_te_t swp_te)
+{
+ return !swp_te.counter;
+}
+
+static inline bool swp_te_is_folio(swp_te_t swp_te)
+{
+ return !xa_is_value((void *)swp_te.counter) && !swp_te_is_null(swp_te);
+}
+
+static inline bool swp_te_is_shadow(swp_te_t swp_te)
+{
+ return xa_is_value((void *)swp_te.counter);
+}
+
+/*
+ * Helpers for retrieving info from swap table.
+ */
+static inline struct folio *swp_te_folio(swp_te_t swp_te)
+{
+ VM_WARN_ON(!swp_te_is_folio(swp_te));
+ return (void *)swp_te.counter;
+}
+
+static inline void *swp_te_shadow(swp_te_t swp_te)
+{
+ VM_WARN_ON(!swp_te_is_shadow(swp_te));
+ return (void *)swp_te.counter;
+}
+
+/*
+ * Helpers for accessing or modifying the swap table,
+ * the swap cluster must be locked.
+ */
+static inline void __swap_table_set(struct swap_cluster_info *ci, pgoff_t off,
+ swp_te_t swp_te)
+{
+ atomic_long_set(&ci->table[off % SWAPFILE_CLUSTER], swp_te.counter);
+}
+
+static inline swp_te_t __swap_table_get(struct swap_cluster_info *ci, pgoff_t off)
+{
+ swp_te_t swp_te = {
+ .counter = atomic_long_read(&ci->table[off % SWAPFILE_CLUSTER])
+ };
+ return swp_te;
+}
+
+static inline void __swap_table_set_folio(struct swap_cluster_info *ci, pgoff_t off,
+ struct folio *folio)
+{
+ __swap_table_set(ci, off, folio_swp_te(folio));
+}
+
+static inline void __swap_table_set_shadow(struct swap_cluster_info *ci, pgoff_t off,
+ void *shadow)
+{
+ __swap_table_set(ci, off, shadow_swp_te(shadow));
+}
+
+static inline void __swap_table_set_null(struct swap_cluster_info *ci, pgoff_t off)
+{
+ __swap_table_set(ci, off, null_swp_te());
+}
+#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2f834069b7ad..aaf7d21eaecb 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -46,6 +46,7 @@
#include <asm/tlbflush.h>
#include <linux/swapops.h>
#include <linux/swap_cgroup.h>
+#include "swap_table.h"
#include "internal.h"
#include "swap.h"
@@ -438,6 +439,30 @@ static void move_cluster(struct swap_info_struct *si,
ci->flags = new_flags;
}
+static int cluster_table_alloc(struct swap_cluster_info *ci)
+{
+ WARN_ON(ci->table);
+ ci->table = kzalloc(sizeof(swp_te_t) * SWAPFILE_CLUSTER,
+ GFP_KERNEL);
+ if (!ci->table)
+ return -ENOMEM;
+ return 0;
+}
+
+static void cluster_table_free(struct swap_cluster_info *ci)
+{
+ unsigned int offset;
+
+ if (!ci->table)
+ return;
+
+ for (offset = 0; offset < SWAPFILE_CLUSTER; offset++)
+ WARN_ON(!swp_te_is_null(__swap_table_get(ci, offset)));
+
+ kfree(ci->table);
+ ci->table = NULL;
+}
+
/* Add a cluster to discard list and schedule it to do discard */
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
struct swap_cluster_info *ci)
@@ -582,7 +607,9 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
static void partial_free_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci)
{
- VM_BUG_ON(!ci->count || ci->count == SWAPFILE_CLUSTER);
+ VM_BUG_ON(!ci->count);
+ VM_BUG_ON(ci->count == SWAPFILE_CLUSTER);
+
lockdep_assert_held(&ci->lock);
if (ci->flags != CLUSTER_FLAG_NONFULL)
@@ -707,6 +734,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
unsigned int order)
{
unsigned int nr_pages = 1 << order;
+ unsigned long offset, end = start + nr_pages;
lockdep_assert_held(&ci->lock);
@@ -720,7 +748,11 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
if (cluster_is_empty(ci))
ci->order = order;
- memset(si->swap_map + start, usage, nr_pages);
+ for (offset = start; offset < end; offset++) {
+ VM_WARN_ON_ONCE(swap_count(si->swap_map[offset]));
+ VM_WARN_ON_ONCE(!swp_te_is_null(__swap_table_get(ci, offset)));
+ si->swap_map[offset] = usage;
+ }
swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
@@ -1070,7 +1102,6 @@ static void swap_range_alloc(struct swap_info_struct *si,
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries)
{
- unsigned long begin = offset;
unsigned long end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
unsigned int i;
@@ -1089,13 +1120,13 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
si->bdev->bd_disk->fops->swap_slot_free_notify;
else
swap_slot_free_notify = NULL;
+ __swap_cache_clear_shadow(swp_entry(si->type, offset), nr_entries);
while (offset <= end) {
arch_swap_invalidate_page(si->type, offset);
if (swap_slot_free_notify)
swap_slot_free_notify(si->bdev, offset);
offset++;
}
- clear_shadow_from_swap_cache(si->type, begin, end);
/*
* Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1252,15 +1283,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
if (!entry.val)
return -ENOMEM;
- /*
- * XArray node allocations from PF_MEMALLOC contexts could
- * completely exhaust the page allocator. __GFP_NOMEMALLOC
- * stops emergency reserves from being allocated.
- *
- * TODO: this could cause a theoretical memory reclaim
- * deadlock in the swap out path.
- */
- if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
+ if (swap_cache_add_folio(entry, folio, NULL))
goto out_free;
atomic_long_sub(size, &nr_swap_pages);
@@ -2598,6 +2621,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
}
}
+static void free_cluster_info(struct swap_cluster_info *cluster_info,
+ unsigned long maxpages)
+{
+ int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
+
+ if (!cluster_info)
+ return;
+ for (i = 0; i < nr_clusters; i++)
+ cluster_table_free(&cluster_info[i]);
+ kvfree(cluster_info);
+}
+
/*
* Called after swap device's reference count is dead, so
* neither scan nor allocation will use it.
@@ -2738,6 +2773,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
zeromap = p->zeromap;
p->zeromap = NULL;
cluster_info = p->cluster_info;
+ free_cluster_info(cluster_info, p->max);
p->cluster_info = NULL;
spin_unlock(&p->lock);
spin_unlock(&swap_lock);
@@ -2748,10 +2784,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->global_cluster = NULL;
vfree(swap_map);
kvfree(zeromap);
- kvfree(cluster_info);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
- exit_swap_address_space(p->type);
inode = mapping->host;
@@ -3141,15 +3175,18 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
{
unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
struct swap_cluster_info *cluster_info;
- unsigned long i, j, idx;
int err = -ENOMEM;
+ unsigned long i;
cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
if (!cluster_info)
goto err;
- for (i = 0; i < nr_clusters; i++)
+ for (i = 0; i < nr_clusters; i++) {
spin_lock_init(&cluster_info[i].lock);
+ if (cluster_table_alloc(&cluster_info[i]))
+ goto err_free;
+ }
if (!(si->flags & SWP_SOLIDSTATE)) {
si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3184,31 +3221,19 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
INIT_LIST_HEAD(&si->frag_clusters[i]);
}
- /*
- * Reduce false cache line sharing between cluster_info and
- * sharing same address space.
- */
- for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
- for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
- struct swap_cluster_info *ci;
- idx = i * SWAP_CLUSTER_COLS + j;
- ci = cluster_info + idx;
- if (idx >= nr_clusters)
- continue;
- if (ci->count) {
- ci->flags = CLUSTER_FLAG_NONFULL;
- list_add_tail(&ci->list, &si->nonfull_clusters[0]);
- continue;
- }
+ for (i = 0; i < nr_clusters; i++) {
+ struct swap_cluster_info *ci = &cluster_info[i];
+ if (ci->count) {
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ list_add_tail(&ci->list, &si->nonfull_clusters[0]);
+ } else {
ci->flags = CLUSTER_FLAG_FREE;
list_add_tail(&ci->list, &si->free_clusters);
}
}
-
return cluster_info;
-
err_free:
- kvfree(cluster_info);
+ free_cluster_info(cluster_info, maxpages);
err:
return ERR_PTR(err);
}
@@ -3381,13 +3406,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
}
}
- error = init_swap_address_space(si->type, maxpages);
- if (error)
- goto bad_swap_unlock_inode;
-
error = zswap_swapon(si->type, maxpages);
if (error)
- goto free_swap_address_space;
+ goto bad_swap_unlock_inode;
/*
* Flush any pending IO and dirty mappings before we start using this
@@ -3422,8 +3443,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto out;
free_swap_zswap:
zswap_swapoff(si->type);
-free_swap_address_space:
- exit_swap_address_space(si->type);
bad_swap_unlock_inode:
inode_unlock(inode);
bad_swap:
@@ -3438,7 +3457,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
spin_unlock(&swap_lock);
vfree(swap_map);
kvfree(zeromap);
- kvfree(cluster_info);
+ if (cluster_info)
+ free_cluster_info(cluster_info, maxpages);
if (inced_nr_rotate_swap)
atomic_dec(&nr_rotate_swap);
if (swap_file)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7d6d1ce3921e..7b5f41b4147b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -743,13 +743,19 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
{
int refcount;
void *shadow = NULL;
+ struct swap_cluster_info *ci;
BUG_ON(!folio_test_locked(folio));
BUG_ON(mapping != folio_mapping(folio));
- if (!folio_test_swapcache(folio))
+ if (folio_test_swapcache(folio)) {
+ ci = swap_lock_cluster_irq(swp_info(folio->swap),
+ swp_offset(folio->swap));
+ } else {
spin_lock(&mapping->host->i_lock);
- xa_lock_irq(&mapping->i_pages);
+ xa_lock_irq(&mapping->i_pages);
+ }
+
/*
* The non racy check for a busy folio.
*
@@ -789,9 +795,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
if (reclaimed && !mapping_exiting(mapping))
shadow = workingset_eviction(folio, target_memcg);
- __delete_from_swap_cache(folio, swap, shadow);
+ __swap_cache_del_folio(swap, folio, shadow);
memcg1_swapout(folio, swap);
- xa_unlock_irq(&mapping->i_pages);
+ swap_unlock_cluster_irq(ci);
put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
@@ -829,9 +835,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
return 1;
cannot_free:
- xa_unlock_irq(&mapping->i_pages);
- if (!folio_test_swapcache(folio))
+ if (folio_test_swapcache(folio)) {
+ swap_unlock_cluster_irq(ci);
+ } else {
+ xa_unlock_irq(&mapping->i_pages);
spin_unlock(&mapping->host->i_lock);
+ }
return 0;
}
diff --git a/mm/zswap.c b/mm/zswap.c
index 455e9425c5f5..af954bda0b02 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -233,10 +233,13 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/
+/* One zswap tree (xarray) for every 64MB of swap space */
+#define ZSWAP_ADDRESS_SPACE_SHIFT 14
+#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
{
return &zswap_trees[swp_type(swp)][swp_offset(swp)
- >> SWAP_ADDRESS_SPACE_SHIFT];
+ >> ZSWAP_ADDRESS_SPACE_SHIFT];
}
#define zswap_pool_debug(msg, p) \
@@ -1741,7 +1744,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
struct xarray *trees, *tree;
unsigned int nr, i;
- nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+ nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
if (!trees) {
pr_err("alloc failed, zswap disabled for swap type %d\n", type);
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 09/28] mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (7 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 08/28] mm, swap: use swap table for the swap cache and switch API Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 10/28] mm, swap: add a swap helper for bypassing only read ahead Kairui Song
` (18 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
__read_swap_cache_async is widely used to allocate a folio and ensure it is
in the swap cache, or to return the folio if one is already there.
It's not async, and it doesn't do any read. Rename it to better
reflect its usage.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 2 +-
mm/swap_state.c | 20 ++++++++++----------
mm/swapfile.c | 2 +-
mm/zswap.c | 4 ++--
4 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 30cd257aecbb..fec7d6e751ae 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -210,7 +210,7 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
+struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists);
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bef9633533ec..fe71706e29d9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -353,7 +353,7 @@ void swap_update_readahead(struct folio *folio,
}
}
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
+struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists)
{
@@ -403,12 +403,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
goto put_and_return;
/*
- * Protect against a recursive call to __read_swap_cache_async()
+ * Protect against a recursive call to __swapin_cache_alloc()
* on the same entry waiting forever here because SWAP_HAS_CACHE
* is set but the folio is not the swap cache yet. This can
* happen today if mem_cgroup_swapin_charge_folio() below
* triggers reclaim through zswap, which may call
- * __read_swap_cache_async() in the writeback path.
+ * __swapin_cache_alloc() in the writeback path.
*/
if (skip_if_exists)
goto put_and_return;
@@ -417,7 +417,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* We might race against __swap_cache_del_folio(), and
* stumble across a swap_map entry whose SWAP_HAS_CACHE
* has not yet been cleared. Or race against another
- * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+ * __swapin_cache_alloc(), which has set SWAP_HAS_CACHE
* in swap_map, but not yet added its folio to swap cache.
*/
schedule_timeout_uninterruptible(1);
@@ -464,7 +464,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* the swap entry is no longer in use.
*
* get/put_swap_device() aren't needed to call this function, because
- * __read_swap_cache_async() call them and swap_read_folio() holds the
+ * __swapin_cache_alloc() call them and swap_read_folio() holds the
* swap cache folio lock.
*/
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
@@ -482,7 +482,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
return NULL;
mpol = get_vma_policy(vma, addr, 0, &ilx);
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = __swapin_cache_alloc(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
mpol_cond_put(mpol);
@@ -600,7 +600,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
- folio = __read_swap_cache_async(
+ folio = __swapin_cache_alloc(
swp_entry(swp_type(entry), offset),
gfp_mask, mpol, ilx, &page_allocated, false);
if (!folio)
@@ -619,7 +619,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = __swapin_cache_alloc(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
@@ -714,7 +714,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
continue;
pte_unmap(pte);
pte = NULL;
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = __swapin_cache_alloc(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (!folio)
continue;
@@ -734,7 +734,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
lru_add_drain();
skip:
/* The folio was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+ folio = __swapin_cache_alloc(targ_entry, gfp_mask, mpol, targ_ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index aaf7d21eaecb..62af67b6f7c2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1390,7 +1390,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* CPU1 CPU2
* do_swap_page()
* ... swapoff+swapon
- * __read_swap_cache_async()
+ * __swapin_cache_alloc()
* swapcache_prepare()
* __swap_duplicate()
* // check swap_map
diff --git a/mm/zswap.c b/mm/zswap.c
index af954bda0b02..87aebeee11ef 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1084,8 +1084,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
return -EEXIST;
mpol = get_task_policy(current);
- folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+ folio = __swapin_cache_alloc(swpentry, GFP_KERNEL, mpol,
+ NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
put_swap_device(si);
if (!folio)
return -ENOMEM;
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 10/28] mm, swap: add a swap helper for bypassing only read ahead
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (8 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 09/28] mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
` (17 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The swap cache now has a very low overhead, so bypassing it is no longer
helpful. To prepare for unifying the swap-in path, introduce a new
helper that only bypasses readahead and does not bypass the swap cache.
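As a rough usage sketch (not part of this patch: example_swapin_no_readahead
is hypothetical, and memcg charging plus most error handling are omitted),
a caller pre-allocates a folio and lets swapin_entry() handle the swap cache
insertion and the read:

static struct folio *example_swapin_no_readahead(swp_entry_t entry, gfp_t gfp)
{
	struct folio *folio, *swapcache;

	/* Order-0 folio for simplicity; large folios follow the same flow. */
	folio = folio_alloc(gfp, 0);
	if (!folio)
		return NULL;

	/*
	 * swapin_entry() adds @folio to the swap cache and starts the read.
	 * If another thread raced and inserted its own folio first, that
	 * folio is returned instead and our allocation must be dropped.
	 */
	swapcache = swapin_entry(entry, folio);
	if (swapcache != folio)
		folio_put(folio);

	return swapcache;	/* NULL if the swap entry is no longer valid */
}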
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 6 ++
mm/swap_state.c | 158 ++++++++++++++++++++++++++++++------------------
2 files changed, 105 insertions(+), 59 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index fec7d6e751ae..aab6bf9c3a8a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -217,6 +217,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
+struct folio *swapin_entry(swp_entry_t entry, struct folio *folio);
void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr);
@@ -303,6 +304,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}
+static inline struct folio *swapin_entry(swp_entry_t ent, struct folio *folio)
+{
+ return NULL;
+}
+
static inline void swap_update_readahead(struct folio *folio,
struct vm_area_struct *vma, unsigned long addr)
{
diff --git a/mm/swap_state.c b/mm/swap_state.c
index fe71706e29d9..d68687295f52 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -353,54 +353,26 @@ void swap_update_readahead(struct folio *folio,
}
}
-struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists)
+static struct folio *__swapin_cache_add_prepare(swp_entry_t entry,
+ struct folio *folio,
+ bool skip_if_exists)
{
- struct swap_info_struct *si = swp_info(entry);
- struct folio *folio;
- struct folio *new_folio = NULL;
- struct folio *result = NULL;
+ int nr_pages = folio_nr_pages(folio);
+ struct folio *exist;
void *shadow = NULL;
+ int err;
- *new_page_allocated = false;
for (;;) {
- int err;
-
/*
- * Check the swap cache first, if a cached folio is found,
- * return it unlocked. The caller will lock and check it.
+ * Caller should have checked the swap cache and swap count
+ * already; try to prepare the swap map directly. It will still
+ * fail with -ENOENT or -EEXIST if the entry is gone or raced.
*/
- folio = swap_cache_get_folio(entry);
- if (folio)
- goto got_folio;
-
- /*
- * Just skip read ahead for unused swap slot.
- */
- if (!swap_entry_swapped(si, entry))
- goto put_and_return;
-
- /*
- * Get a new folio to read into from swap. Allocate it now if
- * new_folio not exist, before marking swap_map SWAP_HAS_CACHE,
- * when -EEXIST will cause any racers to loop around until we
- * add it to cache.
- */
- if (!new_folio) {
- new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
- if (!new_folio)
- goto put_and_return;
- }
-
- /*
- * Swap entry may have been freed since our caller observed it.
- */
- err = swapcache_prepare(entry, 1);
+ err = swapcache_prepare(entry, nr_pages);
if (!err)
break;
else if (err != -EEXIST)
- goto put_and_return;
+ return NULL;
/*
* Protect against a recursive call to __swapin_cache_alloc()
@@ -411,7 +383,11 @@ struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_mask,
* __swapin_cache_alloc() in the writeback path.
*/
if (skip_if_exists)
- goto put_and_return;
+ return NULL;
+
+ exist = swap_cache_get_folio(entry);
+ if (exist)
+ return exist;
/*
* We might race against __swap_cache_del_folio(), and
@@ -426,35 +402,99 @@ struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_mask,
/*
* The swap entry is ours to swap in. Prepare the new folio.
*/
- __folio_set_locked(new_folio);
- __folio_set_swapbacked(new_folio);
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
- if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
- goto fail_unlock;
-
- if (swap_cache_add_folio(entry, new_folio, &shadow))
+ if (swap_cache_add_folio(entry, folio, &shadow))
goto fail_unlock;
memcg1_swapin(entry, 1);
if (shadow)
- workingset_refault(new_folio, shadow);
+ workingset_refault(folio, shadow);
/* Caller will initiate read into locked new_folio */
- folio_add_lru(new_folio);
- *new_page_allocated = true;
- folio = new_folio;
-got_folio:
- result = folio;
- goto put_and_return;
+ folio_add_lru(folio);
+ return folio;
fail_unlock:
- put_swap_folio(new_folio, entry);
- folio_unlock(new_folio);
-put_and_return:
- if (!(*new_page_allocated) && new_folio)
- folio_put(new_folio);
- return result;
+ put_swap_folio(folio, entry);
+ folio_unlock(folio);
+ return NULL;
+}
+
+struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
+ bool skip_if_exists)
+{
+ struct swap_info_struct *si = swp_info(entry);
+ struct folio *swapcache = NULL, *folio = NULL;
+
+ /*
+ * Check the swap cache first, if a cached folio is found,
+ * return it unlocked. The caller will lock and check it.
+ */
+ swapcache = swap_cache_get_folio(entry);
+ if (swapcache)
+ goto out;
+
+ /*
+ * Just skip read ahead for unused swap slot.
+ */
+ if (!swap_entry_swapped(si, entry))
+ goto out;
+
+ /*
+ * Get a new folio to read into from swap. Allocate it now if
+ * new_folio not exist, before marking swap_map SWAP_HAS_CACHE,
+ * when -EEXIST will cause any racers to loop around until we
+ * add it to cache.
+ */
+ folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+ if (!folio)
+ goto out;
+
+ if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
+ goto out;
+
+ swapcache = __swapin_cache_add_prepare(entry, folio, skip_if_exists);
+out:
+ if (swapcache && swapcache == folio) {
+ *new_page_allocated = true;
+ } else {
+ if (folio)
+ folio_put(folio);
+ *new_page_allocated = false;
+ }
+
+ return swapcache;
+}
+
+/**
+ * swapin_entry - swap-in one or multiple entries skipping readahead
+ *
+ * @entry: swap entry to swap in
+ * @folio: pre allocated folio
+ *
+ * Reads @entry into @folio. @folio will be added to the swap cache first. If
+ * this races with other users, only one user will successfully add its
+ * folio into the swap cache, and that folio will be returned to all callers.
+ *
+ * If @folio is a large folio, the entry will be rounded down to match
+ * the folio start and the whole folio will be read in.
+ */
+struct folio *swapin_entry(swp_entry_t entry, struct folio *folio)
+{
+ struct folio *swapcache;
+ pgoff_t offset = swp_offset(entry);
+ unsigned long nr_pages = folio_nr_pages(folio);
+ VM_WARN_ON_ONCE(nr_pages > SWAPFILE_CLUSTER);
+
+ entry = swp_entry(swp_type(entry), ALIGN_DOWN(offset, nr_pages));
+ swapcache = __swapin_cache_add_prepare(entry, folio, false);
+ if (swapcache == folio)
+ swap_read_folio(folio, NULL);
+ return swapcache;
}
/*
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (9 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 10/28] mm, swap: add a swap helper for bypassing only read ahead Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-15 9:31 ` Klara Modin
2025-05-19 7:08 ` Barry Song
2025-05-14 20:17 ` [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
` (16 subsequent siblings)
27 siblings, 2 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Move all mTHP swapin checks into can_swapin_thp() and use it for both the
pre-IO and post-IO checks. This consolidates the code and makes later
commits easier to maintain.
Also clean up the comments while at it. The current comment on
non_swapcache_batch() is not correct: a swap-in that bypasses the swap cache
won't reach the swap device as long as the entry is cached, because it still
sets the SWAP_HAS_CACHE flag. If the folio is already in the swap cache, a
racing swap-in will either fail with -EEXIST from swapcache_prepare(), or
see the cached folio.
The real reason non_swapcache_batch() is needed is that if a smaller folio
is in the swap cache but not mapped, an mTHP swap-in would be blocked
forever: it won't see the folio due to the index offset, nor can it set the
SWAP_HAS_CACHE bit, so it has to fall back to an order-0 swap-in.
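A condensed sketch of the resulting order-selection loop (simplified from
the alloc_swap_folio() change in the diff below; surrounding setup omitted):

	for (; orders; order = next_order(&orders, order)) {
		unsigned long nr_pages = 1 << order;
		swp_entry_t head = { .val = ALIGN_DOWN(entry.val, nr_pages) };

		addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
		/* The page table must still hold a contiguous swap entry run. */
		if (!can_swapin_thp(vmf, pte + pte_index(addr), addr, nr_pages))
			continue;
		/* A smaller folio already in the swap cache blocks this order. */
		if (non_swapcache_batch(head, nr_pages) != nr_pages)
			continue;
		/* Zeromap entries can't be part of a large swap-in yet. */
		if (swap_zeromap_batch(head, nr_pages, NULL) != nr_pages)
			continue;
		break;	/* this order is usable, allocate it below */
	}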
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 90 ++++++++++++++++++++++++-----------------------------
1 file changed, 41 insertions(+), 49 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index f2897d9059f2..1b6e192de6ec 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4319,12 +4319,6 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
pgoff_t offset = swp_offset(entry);
int i;
- /*
- * While allocating a large folio and doing swap_read_folio, which is
- * the case the being faulted pte doesn't have swapcache. We need to
- * ensure all PTEs have no cache as well, otherwise, we might go to
- * swap devices while the content is in swapcache.
- */
for (i = 0; i < max_nr; i++) {
if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
return i;
@@ -4334,34 +4328,30 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
}
/*
- * Check if the PTEs within a range are contiguous swap entries
- * and have consistent swapcache, zeromap.
+ * Check if the page table is still suitable for large folio swap-in.
+ * @vmf: The fault triggering the swap-in.
+ * @ptep: Pointer to the PTE that should be the head of the swapped-in folio.
+ * @addr: The address corresponding to the PTE.
+ * @nr_pages: Number of pages of the folio that are supposed to be swapped in.
*/
-static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
+ unsigned long addr, unsigned int nr_pages)
{
- unsigned long addr;
- swp_entry_t entry;
- int idx;
- pte_t pte;
-
- addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
- idx = (vmf->address - addr) / PAGE_SIZE;
- pte = ptep_get(ptep);
+ pte_t pte = ptep_get(ptep);
+ unsigned long addr_end = addr + (PAGE_SIZE * nr_pages);
+ unsigned long pte_offset = (vmf->address - addr) / PAGE_SIZE;
- if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+ VM_WARN_ON_ONCE(!IS_ALIGNED(addr, PAGE_SIZE) ||
+ addr > vmf->address || addr_end <= vmf->address);
+ if (unlikely(addr < max(addr & PMD_MASK, vmf->vma->vm_start) ||
+ addr_end > pmd_addr_end(addr, vmf->vma->vm_end)))
return false;
- entry = pte_to_swp_entry(pte);
- if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
- return false;
-
/*
- * swap_read_folio() can't handle the case a large folio is hybridly
- * from different backends. And they are likely corner cases. Similar
- * things might be added once zswap support large folios.
+	 * All swap entries must be from the same swap device, in the same
+	 * cgroup, with the same exclusiveness, differing only in offset.
*/
- if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
- return false;
- if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
+ if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -pte_offset)) ||
+ swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
return false;
return true;
@@ -4441,13 +4431,24 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
* completely swap entries with contiguous swap offsets.
*/
order = highest_order(orders);
- while (orders) {
- addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
- if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
- break;
- order = next_order(&orders, order);
+ for (; orders; order = next_order(&orders, order)) {
+ unsigned long nr_pages = 1 << order;
+ swp_entry_t swap_entry = { .val = ALIGN_DOWN(entry.val, nr_pages) };
+ addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+ if (!can_swapin_thp(vmf, pte + pte_index(addr), addr, nr_pages))
+ continue;
+ /*
+ * If there is already a smaller folio in cache, it will
+ * conflict with the larger folio in the swap cache layer
+ * and block the swap in.
+ */
+ if (unlikely(non_swapcache_batch(swap_entry, nr_pages) != nr_pages))
+ continue;
+ /* Zero map doesn't work with large folio yet. */
+ if (unlikely(swap_zeromap_batch(swap_entry, nr_pages, NULL) != nr_pages))
+ continue;
+ break;
}
-
pte_unmap_unlock(pte, ptl);
/* Try allocating the highest of the remaining orders. */
@@ -4731,27 +4732,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page_idx = 0;
address = vmf->address;
ptep = vmf->pte;
+
if (folio_test_large(folio) && folio_test_swapcache(folio)) {
- int nr = folio_nr_pages(folio);
+ unsigned long nr = folio_nr_pages(folio);
unsigned long idx = folio_page_idx(folio, page);
- unsigned long folio_start = address - idx * PAGE_SIZE;
- unsigned long folio_end = folio_start + nr * PAGE_SIZE;
- pte_t *folio_ptep;
- pte_t folio_pte;
+ unsigned long folio_address = address - idx * PAGE_SIZE;
+ pte_t *folio_ptep = vmf->pte - idx;
- if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
- goto check_folio;
- if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
- goto check_folio;
-
- folio_ptep = vmf->pte - idx;
- folio_pte = ptep_get(folio_ptep);
- if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
- swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+ if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
goto check_folio;
page_idx = idx;
- address = folio_start;
+ address = folio_address;
ptep = folio_ptep;
nr_pages = nr;
entry = folio->swap;
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (10 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 13/28] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
` (15 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the overhead of the swap cache is trivial to none, bypassing the
swap cache is no longer a valid optimization.
This commit is more than a code simplification; it changes the swap-in
behaviour in multiple ways:
We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
the indicator for bypassing both the swap cache and readahead. For many
workloads, bypassing readahead is the more helpful part on
SWP_SYNCHRONOUS_IO devices, as they have extremely low latency and
readahead isn't helpful there.
`SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` was never a good
indicator in the first place: obviously, readahead has nothing to do with
the swap count. It was a workaround for a limitation of the current
implementation, where readahead bypassing is strictly coupled with swap
cache bypassing; a swap count > 1 can't bypass the swap cache because that
would result in redundant IO or wasted CPU time.
So the first change in this commit is that readahead is now always
disabled for SWP_SYNCHRONOUS_IO devices. This is a good thing: these
devices have extremely low latency and are not affected by queued IO
(ZRAM, RAMDISK), so readahead isn't helpful.
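As a rough illustration (example_skip_readahead is a hypothetical helper,
not in the patch; the real change is in do_swap_page() in the diff below),
the readahead decision now depends only on the device type and never on the
swap count:

static inline bool example_skip_readahead(struct swap_info_struct *si,
					  swp_entry_t entry)
{
	/*
	 * Before this patch the check was effectively:
	 *     (si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1
	 * and a "true" result also meant bypassing the swap cache entirely.
	 * Now only readahead is skipped; the swap cache is always used.
	 */
	return data_race(si->flags & SWP_SYNCHRONOUS_IO);
}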
The second change is that mTHP swap-in is now enabled for all faults on
SWP_SYNCHRONOUS_IO devices. Previously, mTHP swap-in was also coupled with
swap cache bypassing, but again, it clearly doesn't make much sense that
an mTHP's swap count affects its swap-in behavior.
To catch potential issues with mTHP swap-in, especially around page
exclusiveness, more debug sanity checks and comments are added. The
code is still simpler overall, with reduced LOC.
For a real mTHP workload this may cause more serious thrashing; that isn't
a problem introduced by this commit but a generic mTHP issue. For a 4K
workload, this commit boosts the performance:
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 267 +++++++++++++++++++++++-----------------------------
1 file changed, 116 insertions(+), 151 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 1b6e192de6ec..0b41d15c6d7a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -87,6 +87,7 @@
#include <asm/tlbflush.h>
#include "pgalloc-track.h"
+#include "swap_table.h"
#include "internal.h"
#include "swap.h"
@@ -4477,7 +4478,33 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+/* Check if a folio should be exclusive, with sanity tests */
+static bool check_swap_exclusive(struct folio *folio, swp_entry_t entry,
+ pte_t *ptep, unsigned int fault_nr)
+{
+ pgoff_t offset = swp_offset(entry);
+ struct page *page = folio_file_page(folio, offset);
+
+ if (!pte_swp_exclusive(ptep_get(ptep)))
+ return false;
+
+ /* For exclusive swapin, it must not be mapped */
+ if (fault_nr == 1)
+ VM_WARN_ON_ONCE_PAGE(atomic_read(&page->_mapcount) != -1, page);
+ else
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
+ /*
+	 * Check if the swap count is consistent with exclusiveness. The folio
+	 * lock and the PTL keep the swap count stable.
+ */
+	if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+ for (int i = 0; i < fault_nr; i++) {
+ VM_WARN_ON_FOLIO(__swap_count(entry) != 1, folio);
+ entry.val++;
+ }
+ }
+ return true;
+}
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
@@ -4490,17 +4517,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct folio *swapcache, *folio = NULL;
- DECLARE_WAITQUEUE(wait, current);
+ struct folio *swapcache = NULL, *folio;
struct page *page;
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
- bool need_clear_cache = false;
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
vm_fault_t ret = 0;
- void *shadow = NULL;
int nr_pages;
unsigned long page_idx;
unsigned long address;
@@ -4571,56 +4595,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio = swap_cache_get_folio(entry);
swapcache = folio;
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
- __swap_count(entry) == 1) {
- /* skip swapcache */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
folio = alloc_swap_folio(vmf);
if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
- nr_pages = folio_nr_pages(folio);
- if (folio_test_large(folio))
- entry.val = ALIGN_DOWN(entry.val, nr_pages);
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread
- * may finish swapin first, free the entry, and
- * swapout reusing the same entry. It's
- * undetectable as pte_same() returns true due
- * to entry reuse.
- */
- if (swapcache_prepare(entry, nr_pages)) {
- /*
- * Relax a bit to prevent rapid
- * repeated page faults.
- */
- add_wait_queue(&swapcache_wq, &wait);
- schedule_timeout_uninterruptible(1);
- remove_wait_queue(&swapcache_wq, &wait);
- goto out_page;
- }
- need_clear_cache = true;
-
- memcg1_swapin(entry, nr_pages);
-
- shadow = swap_cache_get_shadow(entry);
- if (shadow)
- workingset_refault(folio, shadow);
-
- folio_add_lru(folio);
-
- /* To provide entry to swap_read_folio() */
- folio->swap = entry;
- swap_read_folio(folio, NULL);
- folio->private = NULL;
+ swapcache = swapin_entry(entry, folio);
+ if (swapcache != folio)
+ folio_put(folio);
}
} else {
- folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf);
- swapcache = folio;
+ swapcache = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
}
+ folio = swapcache;
if (!folio) {
/*
* Back out if somebody else faulted in this pte
@@ -4644,57 +4630,56 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (ret & VM_FAULT_RETRY)
goto out_release;
+ /*
+ * Make sure folio_free_swap() or swapoff did not release the
+ * swapcache from under us. The page pin, and pte_same test
+ * below, are not enough to exclude that. Even if it is still
+ * swapcache, we need to check that the page's swap has not
+ * changed.
+ */
+ if (!folio_swap_contains(folio, entry))
+ goto out_page;
page = folio_file_page(folio, swp_offset(entry));
- if (swapcache) {
- /*
- * Make sure folio_free_swap() or swapoff did not release the
- * swapcache from under us. The page pin, and pte_same test
- * below, are not enough to exclude that. Even if it is still
- * swapcache, we need to check that the page's swap has not
- * changed.
- */
- if (!folio_swap_contains(folio, entry))
- goto out_page;
- if (PageHWPoison(page)) {
- /*
- * hwpoisoned dirty swapcache pages are kept for killing
- * owner processes (which may be unknown at hwpoison time)
- */
- ret = VM_FAULT_HWPOISON;
- goto out_page;
- }
-
- swap_update_readahead(folio, vma, vmf->address);
+ /*
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
+ */
+ if (PageHWPoison(page)) {
+ ret = VM_FAULT_HWPOISON;
+ goto out_page;
+ }
- /*
- * KSM sometimes has to copy on read faults, for example, if
- * page->index of !PageKSM() pages would be nonlinear inside the
- * anon VMA -- PageKSM() is lost on actual swapout.
- */
- folio = ksm_might_need_to_copy(folio, vma, vmf->address);
- if (unlikely(!folio)) {
- ret = VM_FAULT_OOM;
- folio = swapcache;
- goto out_page;
- } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
- ret = VM_FAULT_HWPOISON;
- folio = swapcache;
- goto out_page;
- } else if (folio != swapcache)
- page = folio_page(folio, 0);
+ swap_update_readahead(folio, vma, vmf->address);
- /*
- * If we want to map a page that's in the swapcache writable, we
- * have to detect via the refcount if we're really the exclusive
- * owner. Try removing the extra reference from the local LRU
- * caches if required.
- */
- if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
- !folio_test_ksm(folio) && !folio_test_lru(folio))
- lru_add_drain();
+ /*
+ * KSM sometimes has to copy on read faults, for example, if
+ * page->index of !PageKSM() pages would be nonlinear inside the
+ * anon VMA -- PageKSM() is lost on actual swapout.
+ */
+ folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+ if (unlikely(!folio)) {
+ ret = VM_FAULT_OOM;
+ folio = swapcache;
+ goto out_page;
+ } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
+ ret = VM_FAULT_HWPOISON;
+ folio = swapcache;
+ goto out_page;
+ } else if (folio != swapcache) {
+ page = folio_file_page(folio, swp_offset(entry));
}
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * owner. Try removing the extra reference from the local LRU
+ * caches if required.
+ */
+ if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
+ !folio_test_ksm(folio) && !folio_test_lru(folio))
+ lru_add_drain();
+
folio_throttle_swaprate(folio, GFP_KERNEL);
/*
@@ -4710,44 +4695,41 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}
- /* allocated large folios for SWP_SYNCHRONOUS_IO */
- if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
- unsigned long nr = folio_nr_pages(folio);
- unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
- unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
- pte_t *folio_ptep = vmf->pte - idx;
- pte_t folio_pte = ptep_get(folio_ptep);
-
- if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
- swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
- goto out_nomap;
-
- page_idx = idx;
- address = folio_start;
- ptep = folio_ptep;
- goto check_folio;
- }
-
nr_pages = 1;
page_idx = 0;
address = vmf->address;
ptep = vmf->pte;
- if (folio_test_large(folio) && folio_test_swapcache(folio)) {
+ if (folio_test_large(folio)) {
unsigned long nr = folio_nr_pages(folio);
unsigned long idx = folio_page_idx(folio, page);
- unsigned long folio_address = address - idx * PAGE_SIZE;
+ unsigned long folio_address = vmf->address - idx * PAGE_SIZE;
pte_t *folio_ptep = vmf->pte - idx;
- if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
+ if (can_swapin_thp(vmf, folio_ptep, folio_address, nr)) {
+ page_idx = idx;
+ address = folio_address;
+ ptep = folio_ptep;
+ nr_pages = nr;
+ entry = folio->swap;
+ page = &folio->page;
goto check_folio;
-
- page_idx = idx;
- address = folio_address;
- ptep = folio_ptep;
- nr_pages = nr;
- entry = folio->swap;
- page = &folio->page;
+ }
+ /*
+	 * If it's a fresh large folio in the swap cache but the
+	 * page table supporting it is gone, drop it and fall back
+	 * to an order-0 swap-in again.
+ *
+ * The folio must be clean, nothing should have touched
+ * it, shmem removes the folio from swap cache upon
+ * swapin, and anon flag won't be gone once set.
+ * TODO: We might want to split or partially map it.
+ */
+ if (!folio_test_anon(folio)) {
+ WARN_ON_ONCE(folio_test_dirty(folio));
+ delete_from_swap_cache(folio);
+ goto out_nomap;
+ }
}
check_folio:
@@ -4767,7 +4749,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* the swap entry concurrently) for certainly exclusive pages.
*/
if (!folio_test_ksm(folio)) {
- exclusive = pte_swp_exclusive(vmf->orig_pte);
+ exclusive = check_swap_exclusive(folio, entry, ptep, nr_pages);
if (folio != swapcache) {
/*
* We have a fresh page that is not exposed to the
@@ -4805,15 +4787,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
arch_swap_restore(folio_swap(entry, folio), folio);
- /*
- * Remove the swap entry and conditionally try to free up the swapcache.
- * We're already holding a reference on the page but haven't mapped it
- * yet.
- */
- swap_free_nr(entry, nr_pages);
- if (should_try_to_free_swap(folio, vma, vmf->flags))
- folio_free_swap(folio);
-
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);
@@ -4849,14 +4822,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
- /*
- * We currently only expect small !anon folios which are either
- * fully exclusive or fully shared, or new allocated large
- * folios which are fully exclusive. If we ever get large
- * folios within swapcache here, we have to be careful.
- */
- VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
- VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -4869,7 +4836,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
arch_do_swap_page_nr(vma->vm_mm, vma, address,
pte, pte, nr_pages);
+ /*
+ * Remove the swap entry and conditionally try to free up the
+ * swapcache then unlock the folio. Do this after the PTEs are
+ * set, so raced faults will see updated PTEs.
+ */
+ swap_free_nr(entry, nr_pages);
+ if (should_try_to_free_swap(folio, vma, vmf->flags))
+ folio_free_swap(folio);
folio_unlock(folio);
+
if (folio != swapcache && swapcache) {
/*
* Hold the lock to avoid the swap entry to be reused
@@ -4896,12 +4872,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
- /* Clear the swap cache pin for direct swapin after PTL unlock */
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
@@ -4916,11 +4886,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(swapcache);
folio_put(swapcache);
}
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 13/28] mm/shmem, swap: avoid redundant Xarray lookup during swapin
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (11 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 14/28] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
` (14 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Currently shmem calls xa_get_order() multiple times to get the swap radix
entry order. This can be combined with the swap entry value check
(shmem_confirm_swap) to avoid the duplicated lookup, which should
improve performance.
This also provides the helper needed by later commits.
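A minimal usage sketch of the new helper (example_lookup is hypothetical;
the real callers are updated in the diff below): a single RCU walk both
validates the entry and returns its order.

static int example_lookup(struct address_space *mapping, pgoff_t index,
			  swp_entry_t swap)
{
	int order;

	/* Returns the entry order, or -1 if @swap is no longer at @index. */
	order = shmem_check_swap_entry(mapping, index, swap);
	if (order < 0)
		return -EEXIST;

	return 1 << order;	/* number of pages covered by the entry */
}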
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 67 ++++++++++++++++++++++++++++++------------------------
1 file changed, 37 insertions(+), 30 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 0da9e06eaee8..da80a8faa39e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -509,11 +509,26 @@ static int shmem_replace_entry(struct address_space *mapping,
*
* Checking folio is not enough: by the time a swapcache folio is locked, it
* might be reused, and again be swapcache, using the same swap as before.
+ *
+ * Check if the swap entry is still in the shmem mapping and get its order,
+ * return -1 if it's no longer valid.
*/
-static bool shmem_confirm_swap(struct address_space *mapping,
- pgoff_t index, swp_entry_t swap)
+static int shmem_check_swap_entry(struct address_space *mapping, pgoff_t index,
+ swp_entry_t swap)
{
- return xa_load(&mapping->i_pages, index) == swp_to_radix_entry(swap);
+ XA_STATE(xas, &mapping->i_pages, index);
+ int order = -1;
+ void *entry;
+
+ rcu_read_lock();
+ do {
+ entry = xas_load(&xas);
+ if (entry == swp_to_radix_entry(swap))
+ order = xas_get_order(&xas);
+ } while (xas_retry(&xas, entry));
+ rcu_read_unlock();
+
+ return order;
}
/*
@@ -2238,16 +2253,17 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
return -EIO;
si = get_swap_device(swap);
- if (!si) {
- if (!shmem_confirm_swap(mapping, index, swap))
- return -EEXIST;
- else
- return -EINVAL;
+ order = shmem_check_swap_entry(mapping, index, swap);
+ if (order < 0) {
+ if (si)
+ put_swap_device(si);
+ return -EEXIST;
}
+ if (!si)
+ return -EINVAL;
/* Look it up and read it in.. */
folio = swap_cache_get_folio(swap);
- order = xa_get_order(&mapping->i_pages, index);
if (!folio) {
bool fallback_order0 = false;
@@ -2303,7 +2319,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
*/
if (split_order > 0) {
pgoff_t offset = index - round_down(index, 1 << split_order);
-
swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
}
@@ -2325,25 +2340,20 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
error = split_order;
goto failed;
}
-
- /*
- * If the large swap entry has already been split, it is
- * necessary to recalculate the new swap entry based on
- * the old order alignment.
- */
- if (split_order > 0) {
- pgoff_t offset = index - round_down(index, 1 << split_order);
-
- swap = swp_entry(swp_type(swap), swp_offset(swap) + offset);
- }
}
alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
- if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- folio->swap.val != swap.val ||
- !shmem_confirm_swap(mapping, index, swap) ||
- xa_get_order(&mapping->i_pages, index) != folio_order(folio)) {
+ if (!skip_swapcache && !folio_swap_contains(folio, swap)) {
+ error = -EEXIST;
+ goto unlock;
+ }
+
+ nr_pages = folio_nr_pages(folio);
+ index = round_down(index, nr_pages);
+ swap = swp_entry(swp_type(swap), round_down(swp_offset(swap), nr_pages));
+
+ if (folio_order(folio) != shmem_check_swap_entry(mapping, index, swap)) {
error = -EEXIST;
goto unlock;
}
@@ -2354,7 +2364,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
folio_wait_writeback(folio);
- nr_pages = folio_nr_pages(folio);
/*
* Some architectures may have to restore extra metadata to the
@@ -2368,8 +2377,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
goto failed;
}
- error = shmem_add_to_page_cache(folio, mapping,
- round_down(index, nr_pages),
+ error = shmem_add_to_page_cache(folio, mapping, index,
swp_to_radix_entry(swap), gfp);
if (error)
goto failed;
@@ -2392,7 +2400,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
*foliop = folio;
return 0;
failed:
- if (!shmem_confirm_swap(mapping, index, swap))
+ if (shmem_check_swap_entry(mapping, index, swap) < 0)
error = -EEXIST;
if (error == -EIO)
shmem_set_folio_swapin_error(inode, index, folio, swap,
@@ -2405,7 +2413,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_put(folio);
}
put_swap_device(si);
-
return error;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 14/28] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (12 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 13/28] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 15/28] mm, swap: split locked entry freeing into a standalone helper Kairui Song
` (13 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the overhead of the swap cache is trivial to none, bypassing the
swap cache is no longer a valid optimization.
So remove the swap cache bypass path for simplification. Many helpers
and functions can be dropped now.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 109 ++++++++++++++++++--------------------------------
mm/swap.h | 4 --
mm/swapfile.c | 35 +++++-----------
3 files changed, 48 insertions(+), 100 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index da80a8faa39e..e87eff03c08b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -899,7 +899,9 @@ static int shmem_add_to_page_cache(struct folio *folio,
pgoff_t index, void *expected, gfp_t gfp)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, folio_order(folio));
- long nr = folio_nr_pages(folio);
+ unsigned long nr = folio_nr_pages(folio);
+ swp_entry_t iter, swap;
+ void *entry;
VM_BUG_ON_FOLIO(index != round_down(index, nr), folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -912,13 +914,19 @@ static int shmem_add_to_page_cache(struct folio *folio,
gfp &= GFP_RECLAIM_MASK;
folio_throttle_swaprate(folio, gfp);
+ if (expected)
+ swap = iter = radix_to_swp_entry(expected);
+
do {
xas_lock_irq(&xas);
- if (expected != xas_find_conflict(&xas)) {
- xas_set_err(&xas, -EEXIST);
- goto unlock;
+ xas_for_each_conflict(&xas, entry) {
+ if (!expected || entry != swp_to_radix_entry(iter)) {
+ xas_set_err(&xas, -EEXIST);
+ goto unlock;
+ }
+ iter.val += 1 << xas_get_order(&xas);
}
- if (expected && xas_find_conflict(&xas)) {
+ if (expected && iter.val - nr != swap.val) {
xas_set_err(&xas, -EEXIST);
goto unlock;
}
@@ -1973,14 +1981,12 @@ static struct folio *shmem_alloc_and_add_folio(struct vm_fault *vmf,
return ERR_PTR(error);
}
-static struct folio *shmem_swap_alloc_folio(struct inode *inode,
+static struct folio *shmem_swapin_folio_order(struct inode *inode,
struct vm_area_struct *vma, pgoff_t index,
swp_entry_t entry, int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
- struct folio *new;
- void *shadow;
- int nr_pages;
+ struct folio *new, *swapcache;
/*
* We have arrived here because our zones are constrained, so don't
@@ -1995,41 +2001,19 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
new = shmem_alloc_folio(gfp, order, info, index);
if (!new)
- return ERR_PTR(-ENOMEM);
+ return NULL;
- nr_pages = folio_nr_pages(new);
if (mem_cgroup_swapin_charge_folio(new, vma ? vma->vm_mm : NULL,
- gfp, entry)) {
+ gfp, entry)) {
folio_put(new);
- return ERR_PTR(-ENOMEM);
+ return NULL;
}
- /*
- * Prevent parallel swapin from proceeding with the swap cache flag.
- *
- * Of course there is another possible concurrent scenario as well,
- * that is to say, the swap cache flag of a large folio has already
- * been set by swapcache_prepare(), while another thread may have
- * already split the large swap entry stored in the shmem mapping.
- * In this case, shmem_add_to_page_cache() will help identify the
- * concurrent swapin and return -EEXIST.
- */
- if (swapcache_prepare(entry, nr_pages)) {
+ swapcache = swapin_entry(entry, new);
+ if (swapcache != new)
folio_put(new);
- return ERR_PTR(-EEXIST);
- }
- __folio_set_locked(new);
- __folio_set_swapbacked(new);
- new->swap = entry;
-
- memcg1_swapin(entry, nr_pages);
- shadow = swap_cache_get_shadow(entry);
- if (shadow)
- workingset_refault(new, shadow);
- folio_add_lru(new);
- swap_read_folio(new, NULL);
- return new;
+ return swapcache;
}
/*
@@ -2122,8 +2106,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
}
static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
- struct folio *folio, swp_entry_t swap,
- bool skip_swapcache)
+ struct folio *folio, swp_entry_t swap)
{
struct address_space *mapping = inode->i_mapping;
swp_entry_t swapin_error;
@@ -2139,8 +2122,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
- if (!skip_swapcache)
- delete_from_swap_cache(folio);
+ delete_from_swap_cache(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
* won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2241,7 +2223,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
struct shmem_inode_info *info = SHMEM_I(inode);
struct swap_info_struct *si;
struct folio *folio = NULL;
- bool skip_swapcache = false;
swp_entry_t swap;
int error, nr_pages, order, split_order;
@@ -2283,25 +2264,16 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
!zswap_never_enabled()))
fallback_order0 = true;
- /* Skip swapcache for synchronous device. */
+ /* Try mTHP swapin for synchronous device. */
if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
- folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp);
- if (!IS_ERR(folio)) {
- skip_swapcache = true;
+ folio = shmem_swapin_folio_order(inode, vma, index, swap, order, gfp);
+ if (folio)
goto alloced;
- }
-
- /*
- * Fallback to swapin order-0 folio unless the swap entry
- * already exists.
- */
- error = PTR_ERR(folio);
- folio = NULL;
- if (error == -EEXIST)
- goto failed;
}
/*
+ * Fallback to swapin order-0 folio.
+ *
* Now swap device can only swap in order 0 folio, then we
* should split the large swap entry stored in the pagecache
* if necessary.
@@ -2338,13 +2310,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
split_order = shmem_split_large_entry(inode, index, swap, gfp);
if (split_order < 0) {
error = split_order;
+ folio_put(folio);
+ folio = NULL;
goto failed;
}
}
alloced:
/* We have to do this with folio locked to prevent races */
folio_lock(folio);
- if (!skip_swapcache && !folio_swap_contains(folio, swap)) {
+ if (!folio_swap_contains(folio, swap)) {
error = -EEXIST;
goto unlock;
}
@@ -2353,12 +2327,15 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
index = round_down(index, nr_pages);
swap = swp_entry(swp_type(swap), round_down(swp_offset(swap), nr_pages));
- if (folio_order(folio) != shmem_check_swap_entry(mapping, index, swap)) {
+ /*
+	 * Swap-in must go through the swap cache layer; only a split may
+	 * happen without locking the swap cache.
+ */
+ if (folio_order(folio) < shmem_check_swap_entry(mapping, index, swap)) {
error = -EEXIST;
goto unlock;
}
- if (!skip_swapcache)
- swap_update_readahead(folio, NULL, 0);
+ swap_update_readahead(folio, NULL, 0);
if (!folio_test_uptodate(folio)) {
error = -EIO;
goto failed;
@@ -2387,12 +2364,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
- if (skip_swapcache) {
- folio->swap.val = 0;
- swapcache_clear(si, swap, nr_pages);
- } else {
- delete_from_swap_cache(folio);
- }
+ delete_from_swap_cache(folio);
folio_mark_dirty(folio);
swap_free_nr(swap, nr_pages);
put_swap_device(si);
@@ -2403,11 +2375,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (shmem_check_swap_entry(mapping, index, swap) < 0)
error = -EEXIST;
if (error == -EIO)
- shmem_set_folio_swapin_error(inode, index, folio, swap,
- skip_swapcache);
+ shmem_set_folio_swapin_error(inode, index, folio, swap);
unlock:
- if (skip_swapcache)
- swapcache_clear(si, swap, folio_nr_pages(folio));
if (folio) {
folio_unlock(folio);
folio_put(folio);
diff --git a/mm/swap.h b/mm/swap.h
index aab6bf9c3a8a..cad24a3abda8 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -319,10 +319,6 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
return 0;
}
-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
-}
-
static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 62af67b6f7c2..d3abd2149f8e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1430,22 +1430,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-static void swap_entries_put_cache(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- unsigned long offset = swp_offset(entry);
- struct swap_cluster_info *ci;
-
- ci = swap_lock_cluster(si, offset);
- if (swap_only_has_cache(si, offset, nr)) {
- swap_entries_free(si, ci, entry, nr);
- } else {
- for (int i = 0; i < nr; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
- }
- swap_unlock_cluster(ci);
-}
-
static bool swap_entries_put_map(struct swap_info_struct *si,
swp_entry_t entry, int nr)
{
@@ -1578,13 +1562,21 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset = swp_offset(entry);
int size = 1 << swap_entry_order(folio_order(folio));
si = _swap_info_get(entry);
if (!si)
return;
- swap_entries_put_cache(si, entry, size);
+ ci = swap_lock_cluster(si, offset);
+ if (swap_only_has_cache(si, offset, size))
+ swap_entries_free(si, ci, entry, size);
+ else
+ for (int i = 0; i < size; i++, entry.val++)
+ swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ swap_unlock_cluster(ci);
}
int __swap_count(swp_entry_t entry)
@@ -3615,15 +3607,6 @@ int swapcache_prepare(swp_entry_t entry, int nr)
return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
}
-/*
- * Caller should ensure entries belong to the same folio so
- * the entries won't span cross cluster boundary.
- */
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
- swap_entries_put_cache(si, entry, nr);
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 15/28] mm, swap: split locked entry freeing into a standalone helper
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (13 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 14/28] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 16/28] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
` (12 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
No functional change; split the common logic into a standalone helper to
be reused later.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 61 +++++++++++++++++++++++++++------------------------
1 file changed, 32 insertions(+), 29 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d3abd2149f8e..d01dc0646db9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3492,26 +3492,14 @@ void si_swapinfo(struct sysinfo *val)
* - swap-cache reference is requested but the entry is not used. -> ENOENT
* - swap-mapped reference requested but needs continued swap count. -> ENOMEM
*/
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int swap_dup_entries(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset,
+ unsigned char usage, int nr)
{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset;
- unsigned char count;
- unsigned char has_cache;
- int err, i;
-
- si = swp_get_info(entry);
- if (WARN_ON_ONCE(!si)) {
- pr_err("%s%08lx\n", Bad_file, entry.val);
- return -EINVAL;
- }
-
- offset = swp_offset(entry);
- VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- ci = swap_lock_cluster(si, offset);
+ int i;
+ unsigned char count, has_cache;
- err = 0;
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
@@ -3520,24 +3508,20 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* swap entry could be SWAP_MAP_BAD. Check here with lock held.
*/
if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
+ return -ENOENT;
}
has_cache = count & SWAP_HAS_CACHE;
count &= ~SWAP_HAS_CACHE;
if (!count && !has_cache) {
- err = -ENOENT;
+ return -ENOENT;
} else if (usage == SWAP_HAS_CACHE) {
if (has_cache)
- err = -EEXIST;
+ return -EEXIST;
} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
- err = -EINVAL;
+ return -EINVAL;
}
-
- if (err)
- goto unlock_out;
}
for (i = 0; i < nr; i++) {
@@ -3556,15 +3540,34 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* Don't need to rollback changes, because if
* usage == 1, there must be nr == 1.
*/
- err = -ENOMEM;
- goto unlock_out;
+ return -ENOMEM;
}
WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
}
-unlock_out:
+ return 0;
+}
+
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+{
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset;
+ int err;
+
+ si = swp_get_info(entry);
+ if (WARN_ON_ONCE(!si)) {
+ pr_err("%s%08lx\n", Bad_file, entry.val);
+ return -EINVAL;
+ }
+
+ offset = swp_offset(entry);
+ VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+ ci = swap_lock_cluster(si, offset);
+ err = swap_dup_entries(si, ci, offset, usage, nr);
swap_unlock_cluster(ci);
+
return err;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 16/28] mm, swap: use swap cache as the swap in synchronize layer
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (14 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 15/28] mm, swap: split locked entry freeing into a standalone helper Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 17/28] mm, swap: sanitize swap entry management workflow Kairui Song
` (11 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Swap-in synchronization is currently based mostly on the swap_map's
SWAP_HAS_CACHE bit: whoever sets the bit first does the actual work to
swap in a folio.
This has been causing many issues, as it is effectively a poor
implementation of a bit lock built on a busy loop. The busy loop is
relaxed with a schedule_timeout_uninterruptible(1), which is ugly and
causes long-tail latency and other performance issues. Besides, the
abuse of SWAP_HAS_CACHE has been causing trouble for maintenance.
We have just removed all swap-in paths that bypass the swap cache, so
now the swap-in synchronization can be resolved with the swap cache
layer directly (similar to the page cache). Whoever adds a folio into
the swap cache first takes care of the real IO. Racing threads will see
the newly inserted folio and can simply wait on its folio lock. This
way, racing swap-ins are synchronized with a proper lock.
This both simplifies the logic and should improve performance, and it
eliminates issues like the workaround in commit 01626a1823024
("mm: avoid unconditional one-tick sleep when swapcache_prepare fails")
and the "skip_if_exists" logic from commit a65b0e7607ccb ("zswap: make
shrinking memcg-aware").
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 6 ---
mm/swap.h | 17 ++++--
mm/swap_state.c | 120 +++++++++++++++++--------------------------
mm/swapfile.c | 32 ++++++------
mm/vmscan.c | 1 -
mm/zswap.c | 2 +-
6 files changed, 76 insertions(+), 102 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 58230f3e15e6..2da769cdc663 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -443,7 +443,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
@@ -502,11 +501,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
return 0;
}
-static inline int swapcache_prepare(swp_entry_t swp, int nr)
-{
- return 0;
-}
-
static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}
diff --git a/mm/swap.h b/mm/swap.h
index cad24a3abda8..2abfb40fc7ec 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -135,6 +135,13 @@ static inline void swap_unlock_cluster_irq(struct swap_cluster_info *ci)
spin_unlock_irq(&ci->lock);
}
+extern int __swap_cache_set_entry(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset);
+extern void __swap_cache_put_entries(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int size);
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -158,8 +165,8 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
/* Below helpers requires the caller to pin the swap device. */
extern struct folio *swap_cache_get_folio(swp_entry_t entry);
-extern int swap_cache_add_folio(swp_entry_t entry,
- struct folio *folio, void **shadow);
+extern struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
+ void **shadow, bool swapin);
extern void *swap_cache_get_shadow(swp_entry_t entry);
/* Below helpers requires the caller to lock the swap cluster. */
extern void __swap_cache_del_folio(swp_entry_t entry,
@@ -211,8 +218,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_flags,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists);
+ struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated);
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
@@ -324,7 +330,8 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
return NULL;
}
-static inline int swap_cache_add_folio(swp_entry_t end, struct folio *folio, void **shadow)
+static inline int swap_cache_add_folio(swp_entry_t end, struct folio *folio,
+ void **shadow, bool swapin)
{
return -EINVAL;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d68687295f52..715aff5aca57 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -110,12 +110,18 @@ int __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
return 0;
}
-int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
- void **shadow)
+/*
+ * Return the folio being added on success, or return the existing folio
+ * with conflicting index on failure.
+ */
+struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
+ void **shadow, bool swapin)
{
swp_te_t exist;
pgoff_t end, start, offset;
+ struct swap_info_struct *si;
struct swap_cluster_info *ci;
+ struct folio *existing = NULL;
unsigned long nr_pages = folio_nr_pages(folio);
start = swp_offset(entry);
@@ -124,12 +130,18 @@ int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
-
+again:
offset = start;
- ci = swap_lock_cluster(swp_info(entry), offset);
+ existing = NULL;
+ si = swp_info(entry);
+ ci = swap_lock_cluster(si, offset);
do {
exist = __swap_table_get(ci, offset);
- if (unlikely(swp_te_is_folio(exist)))
+ if (unlikely(swp_te_is_folio(exist))) {
+ existing = swp_te_folio(exist);
+ goto out_failed;
+ }
+ if (swapin && __swap_cache_set_entry(si, ci, offset))
goto out_failed;
if (shadow && swp_te_is_shadow(exist))
*shadow = swp_te_shadow(exist);
@@ -144,18 +156,27 @@ int swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
- return 0;
+ return folio;
out_failed:
/*
- * We may lose shadow due to raced swapin, which should be
- * fine, caller better keep the previous returned shadow.
+ * We may lose shadow here due to raced swapin, which is rare and OK,
+ * caller better keep the previous returned shadow.
*/
- while (offset-- > start)
+ while (offset-- > start) {
__swap_table_set_shadow(ci, offset, NULL);
+ __swap_cache_put_entries(si, ci, swp_entry(si->type, offset), 1);
+ }
swap_unlock_cluster(ci);
- return -EEXIST;
+ /*
+ * Need to grab the conflicting folio before return. If it's
+ * already gone, just try insert again.
+ */
+ if (existing && !folio_try_get(existing))
+ goto again;
+
+ return existing;
}
/*
@@ -192,6 +213,7 @@ void __swap_cache_del_folio(swp_entry_t entry,
folio_clear_swapcache(folio);
node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
+ __swap_cache_put_entries(si, ci, entry, nr_pages);
}
void delete_from_swap_cache(struct folio *folio)
@@ -203,7 +225,6 @@ void delete_from_swap_cache(struct folio *folio)
__swap_cache_del_folio(entry, folio, NULL);
swap_unlock_cluster(ci);
- put_swap_folio(folio, entry);
folio_ref_sub(folio, folio_nr_pages(folio));
}
@@ -354,59 +375,18 @@ void swap_update_readahead(struct folio *folio,
}
static struct folio *__swapin_cache_add_prepare(swp_entry_t entry,
- struct folio *folio,
- bool skip_if_exists)
+ struct folio *folio)
{
- int nr_pages = folio_nr_pages(folio);
- struct folio *exist;
void *shadow = NULL;
- int err;
+ struct folio *swapcache = NULL;
- for (;;) {
- /*
- * Caller should have checked swap cache and swap count
- * already, try prepare the swap map directly, it will still
- * fail with -ENOENT or -EEXIST if the entry is gone or raced.
- */
- err = swapcache_prepare(entry, nr_pages);
- if (!err)
- break;
- else if (err != -EEXIST)
- return NULL;
-
- /*
- * Protect against a recursive call to __swapin_cache_alloc()
- * on the same entry waiting forever here because SWAP_HAS_CACHE
- * is set but the folio is not the swap cache yet. This can
- * happen today if mem_cgroup_swapin_charge_folio() below
- * triggers reclaim through zswap, which may call
- * __swapin_cache_alloc() in the writeback path.
- */
- if (skip_if_exists)
- return NULL;
-
- exist = swap_cache_get_folio(entry);
- if (exist)
- return exist;
-
- /*
- * We might race against __swap_cache_del_folio(), and
- * stumble across a swap_map entry whose SWAP_HAS_CACHE
- * has not yet been cleared. Or race against another
- * __swapin_cache_alloc(), which has set SWAP_HAS_CACHE
- * in swap_map, but not yet added its folio to swap cache.
- */
- schedule_timeout_uninterruptible(1);
- }
-
- /*
- * The swap entry is ours to swap in. Prepare the new folio.
- */
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
-
- if (swap_cache_add_folio(entry, folio, &shadow))
- goto fail_unlock;
+ swapcache = swap_cache_add_folio(entry, folio, &shadow, true);
+ if (swapcache != folio) {
+ folio_unlock(folio);
+ return swapcache;
+ }
memcg1_swapin(entry, 1);
@@ -416,16 +396,10 @@ static struct folio *__swapin_cache_add_prepare(swp_entry_t entry,
/* Caller will initiate read into locked new_folio */
folio_add_lru(folio);
return folio;
-
-fail_unlock:
- put_swap_folio(folio, entry);
- folio_unlock(folio);
- return NULL;
}
struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists)
+ struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated)
{
struct swap_info_struct *si = swp_info(entry);
struct folio *swapcache = NULL, *folio = NULL;
@@ -457,7 +431,7 @@ struct folio *__swapin_cache_alloc(swp_entry_t entry, gfp_t gfp_mask,
if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry))
goto out;
- swapcache = __swapin_cache_add_prepare(entry, folio, skip_if_exists);
+ swapcache = __swapin_cache_add_prepare(entry, folio);
out:
if (swapcache && swapcache == folio) {
*new_page_allocated = true;
@@ -491,7 +465,7 @@ struct folio *swapin_entry(swp_entry_t entry, struct folio *folio)
VM_WARN_ON_ONCE(nr_pages > SWAPFILE_CLUSTER);
entry = swp_entry(swp_type(entry), ALIGN_DOWN(offset, nr_pages));
- swapcache = __swapin_cache_add_prepare(entry, folio, false);
+ swapcache = __swapin_cache_add_prepare(entry, folio);
if (swapcache == folio)
swap_read_folio(folio, NULL);
return swapcache;
@@ -523,7 +497,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
mpol = get_vma_policy(vma, addr, 0, &ilx);
folio = __swapin_cache_alloc(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
mpol_cond_put(mpol);
if (page_allocated)
@@ -642,7 +616,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
/* Ok, do the async read-ahead now */
folio = __swapin_cache_alloc(
swp_entry(swp_type(entry), offset),
- gfp_mask, mpol, ilx, &page_allocated, false);
+ gfp_mask, mpol, ilx, &page_allocated);
if (!folio)
continue;
if (page_allocated) {
@@ -660,7 +634,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
skip:
/* The page was likely read above, so no need for plugging here */
folio = __swapin_cache_alloc(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
return folio;
@@ -755,7 +729,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pte_unmap(pte);
pte = NULL;
folio = __swapin_cache_alloc(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (!folio)
continue;
if (page_allocated) {
@@ -775,7 +749,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
skip:
/* The folio was likely read above, so no need for plugging here */
folio = __swapin_cache_alloc(targ_entry, gfp_mask, mpol, targ_ilx,
- &page_allocated, false);
+ &page_allocated);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
return folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d01dc0646db9..8909d1655432 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1283,7 +1283,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
if (!entry.val)
return -ENOMEM;
- if (swap_cache_add_folio(entry, folio, NULL))
+ if (WARN_ON(swap_cache_add_folio(entry, folio, NULL, false) != folio))
goto out_free;
atomic_long_sub(size, &nr_swap_pages);
@@ -1556,6 +1556,17 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
}
}
+void __swap_cache_put_entries(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int size)
+{
+ if (swap_only_has_cache(si, swp_offset(entry), size))
+ swap_entries_free(si, ci, entry, size);
+ else
+ for (int i = 0; i < size; i++, entry.val++)
+ swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+}
+
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
@@ -1571,11 +1582,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
return;
ci = swap_lock_cluster(si, offset);
- if (swap_only_has_cache(si, offset, size))
- swap_entries_free(si, ci, entry, size);
- else
- for (int i = 0; i < size; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ __swap_cache_put_entries(si, ci, entry, size);
swap_unlock_cluster(ci);
}
@@ -3597,17 +3604,10 @@ int swap_duplicate_nr(swp_entry_t entry, int nr)
return err;
}
-/*
- * @entry: first swap entry from which we allocate nr swap cache.
- *
- * Called when allocating swap cache for existing swap entries,
- * This can return error codes. Returns 0 at success.
- * -EEXIST means there is a swap cache.
- * Note: return code is different from swap_duplicate().
- */
-int swapcache_prepare(swp_entry_t entry, int nr)
+int __swap_cache_set_entry(struct swap_info_struct *si,
+ struct swap_cluster_info *ci, unsigned long offset)
{
- return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
+ return swap_dup_entries(si, ci, offset, SWAP_HAS_CACHE, 1);
}
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b5f41b4147b..8b5498cae0d5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -798,7 +798,6 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
__swap_cache_del_folio(swap, folio, shadow);
memcg1_swapout(folio, swap);
swap_unlock_cluster_irq(ci);
- put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
diff --git a/mm/zswap.c b/mm/zswap.c
index 87aebeee11ef..65c1aff5c4a4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1085,7 +1085,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
mpol = get_task_policy(current);
folio = __swapin_cache_alloc(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+ NO_INTERLEAVE_INDEX, &folio_was_allocated);
put_swap_device(si);
if (!folio)
return -ENOMEM;
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 17/28] mm, swap: sanitize swap entry management workflow
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (15 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 16/28] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 18/28] mm, swap: rename and introduce folio_free_swap_cache Kairui Song
` (10 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The current swap entry allocation / freeing workflow has never had a
clear definition, which makes it hard to debug or add new optimizations.
This commit introduces a proper definition of how swap entries are
allocated and freed. Now most operations are folio-based, so they never
exceed one swap cluster, and we have a cleaner border between swap and
the rest of mm, making it much easier to follow and debug, especially
with sanity checks. This also makes more optimizations possible later:
Swap entries are allocated and freed either bound to a folio, or
directly from the page table / mapping.
When swap entries are bound to a folio (the folio is in the swap cache
and locked), they can be considered `stable`: they won't be freed
completely and the swap device can't be swapped off either.
(The hibernation subsystem is a bit different, see below.)
For swap operations from the mm (folio/page table) side:
- folio_alloc_swap() - Allocation (swapout)
This allocates one or a set of continuous swap entries for one folio
and binds the folio to them by adding it to the swap cache.
Context: The folio must be locked.
- folio_dup_swap() - On folio unmap (swapout)
This increases the ref count of swap entries allocated to a folio.
Context: The folio must be locked and in the swap cache.
- folio_put_swap() - On folio map (swap in)
This decreases the ref count of swap entries allocated to a folio.
Context: The folio must be locked and in the swap cache.
NOTE: this won't remove the folio from the swap cache, as the swap
cache is lazily freed. The allocator can reclaim clean swap cache
folios, which reduces IO and allocator overhead. The lazy freeing of
the swap cache could be further optimized later. (see folio_free_swap
below *)
- do_put_swap_entries() - On page table zapping / shmem truncate
This decreases the ref count of swap entries as the page table gets
zapped or the mapping gets truncated.
Context: The caller must ensure the entries won't be freed completely
during this period, which is currently done by holding the page table
lock (zapping) or the mapping cmpxchg (shmem) to prevent swap-in or
concurrent freeing.
- do_dup_swap_entry() - On page table copy (fork)
This increases the ref count of the swap entry as the page table gets
copied.
Context: The caller must ensure the entry won't be freed during this
period, which is currently done by holding the page table lock to
prevent swap-in or concurrent freeing.
* There is already a folio_free_swap(); it is a bit special as it will
try to free the swap entries pinned by a folio only if all the entries'
counts have dropped to zero. So it can be called after folio_put_swap()
has dropped all swap ref counts. It can be better optimized and maybe
merged into folio_put_swap() later.
For hibernation, two special helpers are provided:
- get_swap_page_of_type() - Allocate one entry from one device.
- free_swap_page_of_entry() - Free one entry allocated by the above helper.
All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
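To make the intended workflow concrete, below is a hedged walk-through
of the ordinary (non-hibernation) life cycle using the helpers above.
The function is purely illustrative and omits locking, error handling
and the page table details:
static void swap_lifecycle_sketch(struct folio *folio)
{
	/* Swap out: allocate entries and add the folio to the swap cache. */
	if (folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOWARN))
		return;

	/* Each page table entry that now holds a swap entry takes a ref. */
	folio_dup_swap(folio, NULL);

	/* ... writeback happens, the folio may be reclaimed ... */

	/* Swap in: the folio is mapped back and the page table ref dropped. */
	folio_put_swap(folio, NULL);

	/*
	 * The swap cache still pins the entries (lazy free); this is the
	 * best-effort attempt to release them once nothing maps them.
	 */
	folio_free_swap(folio);
}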
Signed-off-by: Kairui Song <kasong@tencent.com>
---
arch/s390/mm/pgtable.c | 2 +-
include/linux/swap.h | 57 ++++++++---------
kernel/power/swap.c | 8 ++-
mm/madvise.c | 2 +-
mm/memory.c | 20 +++---
mm/rmap.c | 7 ++-
mm/shmem.c | 8 +--
mm/swap.h | 29 +++++++++
mm/swapfile.c | 135 +++++++++++++++++++++++++++++------------
9 files changed, 176 insertions(+), 92 deletions(-)
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 9901934284ec..c402552bc8f3 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -715,7 +715,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
dec_mm_counter(mm, mm_counter(folio));
}
- free_swap_and_cache(entry);
+ do_put_swap_entries(entry, 1);
}
void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2da769cdc663..adac6d51da05 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -437,14 +437,8 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
-bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -456,6 +450,28 @@ struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
+/*
+ * For manipulating allocated swap table entries from page table or
+ * mapping (shmem) directly. Caller must ensure the entries won't be
+ * freed during the period.
+ *
+ * All entries must be allocated by folio_alloc_swap(), see
+ * mm/swap.h for more comments on it.
+ */
+extern int do_dup_swap_entry(swp_entry_t entry);
+extern void do_put_swap_entries(swp_entry_t entry, int nr);
+
+/*
+ * folio_free_swap is a bit special, it's a best effort try to
+ * free the swap entries pinned by a folio, and it need to be
+ * here to be called by other components
+ */
+bool folio_free_swap(struct folio *folio);
+
+/* Allocate / free (hibernation) exclusive entries */
+extern swp_entry_t get_swap_page_of_type(int);
+extern void free_swap_page_of_entry(swp_entry_t entry);
+
static inline void put_swap_device(struct swap_info_struct *si)
{
percpu_ref_put(&si->users);
@@ -483,10 +499,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
#define free_pages_and_swap_cache(pages, nr) \
release_pages((pages), (nr));
-static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
-}
-
static inline void free_swap_cache(struct folio *folio)
{
}
@@ -496,12 +508,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
+static inline int do_dup_swap_entry(swp_entry_t ent)
{
return 0;
}
-static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void do_put_swap_entries(swp_entry_t ent, int nr)
{
}
@@ -524,11 +536,6 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}
-static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask)
-{
- return -EINVAL;
-}
-
static inline bool folio_free_swap(struct folio *folio)
{
return false;
@@ -541,22 +548,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
return -EINVAL;
}
#endif /* CONFIG_SWAP */
-
-static inline int swap_duplicate(swp_entry_t entry)
-{
- return swap_duplicate_nr(entry, 1);
-}
-
-static inline void free_swap_and_cache(swp_entry_t entry)
-{
- free_swap_and_cache_nr(entry, 1);
-}
-
-static inline void swap_free(swp_entry_t entry)
-{
- swap_free_nr(entry, 1);
-}
-
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 80ff5f933a62..f94c4ea350cf 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -182,7 +182,7 @@ sector_t alloc_swapdev_block(int swap)
offset = swp_offset(get_swap_page_of_type(swap));
if (offset) {
if (swsusp_extents_insert(offset))
- swap_free(swp_entry(swap, offset));
+ free_swap_page_of_entry(swp_entry(swap, offset));
else
return swapdev_block(swap, offset);
}
@@ -197,6 +197,7 @@ sector_t alloc_swapdev_block(int swap)
void free_all_swap_pages(int swap)
{
+ unsigned long offset;
struct rb_node *node;
while ((node = swsusp_extents.rb_node)) {
@@ -204,8 +205,9 @@ void free_all_swap_pages(int swap)
ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
- swap_free_nr(swp_entry(swap, ext->start),
- ext->end - ext->start + 1);
+
+ for (offset = ext->start; offset <= ext->end; offset++)
+ free_swap_page_of_entry(swp_entry(swap, offset));
kfree(ext);
}
diff --git a/mm/madvise.c b/mm/madvise.c
index 8433ac9b27e0..36c62353d184 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -697,7 +697,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
max_nr = (end - addr) / PAGE_SIZE;
nr = swap_pte_batch(pte, max_nr, ptent);
nr_swap -= nr;
- free_swap_and_cache_nr(entry, nr);
+ do_put_swap_entries(entry, nr);
clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
} else if (is_hwpoison_entry(entry) ||
is_poisoned_swp_entry(entry)) {
diff --git a/mm/memory.c b/mm/memory.c
index 0b41d15c6d7a..c000e39b3eb2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -804,7 +804,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pte_to_swp_entry(orig_pte);
if (likely(!non_swap_entry(entry))) {
- if (swap_duplicate(entry) < 0)
+ if (do_dup_swap_entry(entry) < 0)
return -EIO;
/* make sure dst_mm is on swapoff's mmlist. */
@@ -1625,7 +1625,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
nr = swap_pte_batch(pte, max_nr, ptent);
rss[MM_SWAPENTS] -= nr;
- free_swap_and_cache_nr(entry, nr);
+ do_put_swap_entries(entry, nr);
} else if (is_migration_entry(entry)) {
struct folio *folio = pfn_swap_entry_folio(entry);
@@ -4783,7 +4783,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
- * so this must be called before swap_free().
+ * so this must be called before folio_put_swap().
*/
arch_swap_restore(folio_swap(entry, folio), folio);
@@ -4821,13 +4821,17 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(folio != swapcache && swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
+ folio_put_swap(swapcache, NULL);
} else if (!folio_test_anon(folio)) {
VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+ folio_put_swap(folio, NULL);
} else {
+ VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
- rmap_flags);
+ rmap_flags);
+ folio_put_swap(folio, nr_pages == folio_nr_pages(folio) ? NULL : page);
}
VM_BUG_ON(!folio_test_anon(folio) ||
@@ -4837,11 +4841,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte, pte, nr_pages);
/*
- * Remove the swap entry and conditionally try to free up the
- * swapcache then unlock the folio. Do this after the PTEs are
- * set, so raced faults will see updated PTEs.
+ * Conditionally try to free up the swapcache and unlock the folio
+ * after the PTEs are set, so raced faults will see updated PTEs.
*/
- swap_free_nr(entry, nr_pages);
if (should_try_to_free_swap(folio, vma, vmf->flags))
folio_free_swap(folio);
folio_unlock(folio);
@@ -4851,7 +4853,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
* (to avoid false positives from pte_same). For
- * further safety release the lock after the swap_free
+ * further safety release the lock after the folio_put_swap
* so that the swap count won't change under a
* parallel locked swapcache.
*/
diff --git a/mm/rmap.c b/mm/rmap.c
index fb63d9256f09..d2195ebb4c35 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -83,6 +83,7 @@
#include <trace/events/migrate.h>
#include "internal.h"
+#include "swap.h"
static struct kmem_cache *anon_vma_cachep;
static struct kmem_cache *anon_vma_chain_cachep;
@@ -2141,7 +2142,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto discard;
}
- if (swap_duplicate(entry) < 0) {
+ if (folio_dup_swap(folio, subpage) < 0) {
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
@@ -2152,7 +2153,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* so we'll not check/care.
*/
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- swap_free(entry);
+ folio_put_swap(folio, subpage);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
@@ -2160,7 +2161,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* See folio_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
folio_try_share_anon_rmap_pte(folio, subpage)) {
- swap_free(entry);
+ folio_put_swap(folio, subpage);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index e87eff03c08b..0d23c1c12204 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -981,7 +981,7 @@ static long shmem_free_swap(struct address_space *mapping,
old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
if (old != radswap)
return 0;
- free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+ do_put_swap_entries(radix_to_swp_entry(radswap), 1 << order);
return 1 << order;
}
@@ -1663,8 +1663,8 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
list_add(&info->swaplist, &shmem_swaplist);
if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
+ folio_dup_swap(folio, NULL);
shmem_recalc_inode(inode, 0, nr_pages);
- swap_duplicate_nr(folio->swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
mutex_unlock(&shmem_swaplist_mutex);
@@ -2122,6 +2122,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
+ folio_put_swap(folio, NULL);
delete_from_swap_cache(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2129,7 +2130,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
* in shmem_evict_inode().
*/
shmem_recalc_inode(inode, -nr_pages, -nr_pages);
- swap_free_nr(swap, nr_pages);
}
static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
@@ -2364,9 +2364,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
+ folio_put_swap(folio, NULL);
delete_from_swap_cache(folio);
folio_mark_dirty(folio);
- swap_free_nr(swap, nr_pages);
put_swap_device(si);
*foliop = folio;
diff --git a/mm/swap.h b/mm/swap.h
index 2abfb40fc7ec..4c4a71081895 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -142,6 +142,20 @@ extern void __swap_cache_put_entries(struct swap_info_struct *si,
struct swap_cluster_info *ci,
swp_entry_t entry, unsigned int size);
+/*
+ * All swap entries starts getting allocated by folio_alloc_swap(),
+ * and the folio will be added to swap cache.
+ *
+ * Swap out (pageout) unmaps a folio and increased the swap table entry
+ * count with folio_dup_swap.
+ *
+ * Swap in maps a folio in swap cache and decrease the swap table entry
+ * count with folio_put_swap.
+ */
+int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
+int folio_dup_swap(struct folio *folio, struct page *subpage);
+void folio_put_swap(struct folio *folio, struct page *subpage);
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -272,9 +286,24 @@ static inline struct swap_info_struct *swp_info(swp_entry_t entry)
return NULL;
}
+static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
+{
+ return -EINVAL;
+}
+
+static inline int folio_dup_swap(struct folio *folio, struct page *page)
+{
+ return -EINVAL;
+}
+
+static inline void folio_put_swap(struct folio *folio, struct page *page)
+{
+}
+
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
}
+
static inline void swap_write_unplug(struct swap_iocb *sio)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8909d1655432..daf7810bcb28 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,6 +42,7 @@
#include <linux/suspend.h>
#include <linux/zswap.h>
#include <linux/plist.h>
+#include <linux/align.h>
#include <asm/tlbflush.h>
#include <linux/swapops.h>
@@ -58,6 +59,9 @@ static void swap_entries_free(struct swap_info_struct *si,
swp_entry_t entry, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
+static bool swap_entries_put_map(struct swap_info_struct *si,
+ swp_entry_t entry, int nr);
static bool folio_swapcache_freeable(struct folio *folio);
static DEFINE_SPINLOCK(swap_lock);
@@ -1236,7 +1240,6 @@ static bool swap_alloc_slow(swp_entry_t *entry,
/**
* folio_alloc_swap - allocate swap space for a folio
* @folio: folio we want to move to swap
- * @gfp: gfp mask for shadow nodes
*
* Allocate swap space for the folio and add the folio to the
* swap cache.
@@ -1286,6 +1289,12 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
if (WARN_ON(swap_cache_add_folio(entry, folio, NULL, false) != folio))
goto out_free;
+ /*
+ * Allocator should always allocate aligned entries so folio based
+ * operations never crossed more than one cluster.
+ */
+ VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
+
atomic_long_sub(size, &nr_swap_pages);
return 0;
@@ -1294,6 +1303,57 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
return -ENOMEM;
}
+/*
+ * folio_dup_swap() - Increase ref count of swap entries allocated to a folio.
+ *
+ * @folio: the folio with swap entries allocated.
+ * @subpage: if not NULL, only increase the ref count of this subpage.
+ */
+int folio_dup_swap(struct folio *folio, struct page *subpage)
+{
+ int err = 0;
+ swp_entry_t entry = folio->swap;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (subpage) {
+ entry.val += folio_page_idx(folio, subpage);
+ nr_pages = 1;
+ }
+
+ while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
+ err = add_swap_count_continuation(entry, GFP_ATOMIC);
+
+ return err;
+}
+
+/*
+ * folio_put_swap() - Decrease ref count of swap entries allocated to a folio.
+ *
+ * @folio: the folio with swap entries allocated.
+ * @subpage: if not NULL, only decrease the ref count of this subpage.
+ *
+ * This won't remove the folio from swap cache, so the swap entry may
+ * still be pinned by the swap cache.
+ */
+void folio_put_swap(struct folio *folio, struct page *subpage)
+{
+ swp_entry_t entry = folio->swap;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (subpage) {
+ entry.val += folio_page_idx(folio, subpage);
+ nr_pages = 1;
+ }
+
+ swap_entries_put_map(swp_info(entry), entry, nr_pages);
+}
+
static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
{
struct swap_info_struct *si;
@@ -1538,24 +1598,6 @@ static void swap_entries_free(struct swap_info_struct *si,
* Caller has made sure that the swap device corresponding to entry
* is still around or has not been recycled.
*/
-void swap_free_nr(swp_entry_t entry, int nr_pages)
-{
- int nr;
- struct swap_info_struct *sis;
- unsigned long offset = swp_offset(entry);
-
- sis = _swap_info_get(entry);
- if (!sis)
- return;
-
- while (nr_pages) {
- nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- swap_entries_put_map(sis, swp_entry(sis->type, offset), nr);
- offset += nr;
- nr_pages -= nr;
- }
-}
-
void __swap_cache_put_entries(struct swap_info_struct *si,
struct swap_cluster_info *ci,
swp_entry_t entry, unsigned int size)
@@ -1751,16 +1793,21 @@ bool folio_free_swap(struct folio *folio)
}
/**
- * free_swap_and_cache_nr() - Release reference on range of swap entries and
- * reclaim their cache if no more references remain.
+ * do_put_swap_entries() - Release reference on range of swap entries and
+ * reclaim their cache if no more references remain.
* @entry: First entry of range.
* @nr: Number of entries in range.
*
* For each swap entry in the contiguous range, release a reference. If any swap
* entries become free, try to reclaim their underlying folios, if present. The
* offset range is defined by [entry.offset, entry.offset + nr).
+ *
+ * Context: Called when page table or mapping get released direct without swap
+ * in, caller must ensure the entries won't get completely freed during this
+ * period. For page table releasing, this is protected by page table lock.
+ * For shmem, this is protected by the cmpxchg of the mapping value.
*/
-void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+void do_put_swap_entries(swp_entry_t entry, int nr)
{
const unsigned long start_offset = swp_offset(entry);
const unsigned long end_offset = start_offset + nr;
@@ -1769,10 +1816,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
unsigned long offset;
si = get_swap_device(entry);
- if (!si)
+ if (WARN_ON_ONCE(!si))
return;
-
- if (WARN_ON(end_offset > si->max))
+ if (WARN_ON_ONCE(end_offset > si->max))
goto out;
/*
@@ -1816,7 +1862,6 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
}
#ifdef CONFIG_HIBERNATION
-
swp_entry_t get_swap_page_of_type(int type)
{
struct swap_info_struct *si = swp_type_get_info(type);
@@ -1841,6 +1886,24 @@ swp_entry_t get_swap_page_of_type(int type)
return entry;
}
+/*
+ * Free entries allocated by get_swap_page_of_type, these entries are
+ * exclusive for hibernation.
+ */
+void free_swap_page_of_entry(swp_entry_t entry)
+{
+ struct swap_info_struct *si = swp_info(entry);
+ pgoff_t offset = swp_offset(entry);
+ struct swap_cluster_info *ci;
+ if (!si)
+ return;
+ ci = swap_lock_cluster(si, offset);
+ WARN_ON(swap_count(swap_entry_put_locked(si, ci, entry, 1)));
+ /* It might got added to swap cache accidentally by read ahead */
+ __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+ swap_unlock_cluster(ci);
+}
+
/*
* Find the swap type that corresponds to given device (if any).
*
@@ -1995,7 +2058,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
- * so this must be called before swap_free().
+ * so this must be called before folio_put_swap().
*/
arch_swap_restore(folio_swap(entry, folio), folio);
@@ -2036,7 +2099,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mkuffd_wp(new_pte);
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
- swap_free(entry);
+ folio_put_swap(folio, page);
out:
if (pte)
pte_unmap_unlock(pte, ptl);
@@ -3579,27 +3642,23 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
}
/**
- * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
- * by 1.
+ * do_dup_swap_entry() - Increase reference count of a swap entry by one.
*
* @entry: first swap entry from which we want to increase the refcount.
- * @nr: Number of entries in range.
*
* Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
* but could not be atomically allocated. Returns 0, just as if it succeeded,
* if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
* might occur if a page table entry has got corrupted.
*
- * Note that we are currently not handling the case where nr > 1 and we need to
- * add swap count continuation. This is OK, because no such user exists - shmem
- * is the only user that can pass nr > 1, and it never re-duplicates any swap
- * entry it owns.
+ * Context: The caller must ensure the entry won't be completely freed during
+ * the period. Currently this is only used by forking, the page table is locked
+ * to protect the entry from being freed.
*/
-int swap_duplicate_nr(swp_entry_t entry, int nr)
+int do_dup_swap_entry(swp_entry_t entry)
{
int err = 0;
-
- while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
+ while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 18/28] mm, swap: rename and introduce folio_free_swap_cache
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (16 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 17/28] mm, swap: sanitize swap entry management workflow Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 19/28] mm, swap: clean up and improve swap entries batch freeing Kairui Song
` (9 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
We now have folio_alloc_swap, folio_dup_swap, folio_put_swap, and
folio_free_swap (which actually only tries to free). Also rename
delete_from_swap_cache to folio_free_swap_cache, because the swap cache
will always be the last reference of a folio-bound entry now. Freeing
the swap cache will also attempt to free the swap entries.
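As a hedged sketch of how the two remaining free helpers relate (the
wrapper below and its name are hypothetical; only the two calls it
makes exist in this series):
static void release_swap_backing_sketch(struct folio *folio, bool force)
{
	if (!force) {
		/* Best effort: fails if any entry of the folio is still mapped. */
		folio_free_swap(folio);
		return;
	}
	/*
	 * Unconditional: drop the folio from the swap cache; entries whose
	 * count already reached zero are freed with it.  A dirty folio must
	 * be written back first, or its data may be lost.
	 */
	folio_free_swap_cache(folio);
}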
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory-failure.c | 2 +-
mm/memory.c | 2 +-
mm/shmem.c | 4 ++--
mm/swap.h | 14 +++++++++-----
mm/swap_state.c | 12 ------------
mm/swapfile.c | 23 +++++++++++++++++++++--
mm/zswap.c | 2 +-
7 files changed, 35 insertions(+), 24 deletions(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b91a33fb6c69..ba96aaf96e83 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1185,7 +1185,7 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
struct folio *folio = page_folio(p);
int ret;
- delete_from_swap_cache(folio);
+ folio_free_swap_cache(folio);
ret = delete_from_lru_cache(folio) ? MF_FAILED : MF_RECOVERED;
folio_unlock(folio);
diff --git a/mm/memory.c b/mm/memory.c
index c000e39b3eb2..a70624a55aa2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4727,7 +4727,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
if (!folio_test_anon(folio)) {
WARN_ON_ONCE(folio_test_dirty(folio));
- delete_from_swap_cache(folio);
+ folio_free_swap_cache(folio);
goto out_nomap;
}
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 0d23c1c12204..c7475629365c 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2123,7 +2123,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
folio_put_swap(folio, NULL);
- delete_from_swap_cache(folio);
+ folio_free_swap_cache(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
* won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2365,7 +2365,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio_mark_accessed(folio);
folio_put_swap(folio, NULL);
- delete_from_swap_cache(folio);
+ folio_free_swap_cache(folio);
folio_mark_dirty(folio);
put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index 4c4a71081895..467996dafbae 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -151,10 +151,15 @@ extern void __swap_cache_put_entries(struct swap_info_struct *si,
*
* Swap in maps a folio in swap cache and decrease the swap table entry
* count with folio_put_swap.
+ *
+ * Swap uses lazy free, so a folio may stay in swap cache for a long time
+ * and pin the swap entry. folio_free_swap_cache and folio_free_swap can
+ * be used to reclaim the swap cache.
*/
int folio_alloc_swap(struct folio *folio, gfp_t gfp_mask);
int folio_dup_swap(struct folio *folio, struct page *subpage);
void folio_put_swap(struct folio *folio, struct page *subpage);
+void folio_free_swap_cache(struct folio *folio);
/* linux/mm/page_io.c */
int sio_pool_init(void);
@@ -226,7 +231,6 @@ static inline bool folio_swap_contains(struct folio *folio, swp_entry_t entry)
}
void show_swap_cache_info(void);
-void delete_from_swap_cache(struct folio *folio);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
@@ -300,6 +304,10 @@ static inline void folio_put_swap(struct folio *folio, struct page *page)
{
}
+static inline void folio_free_swap_cache(struct folio *folio)
+{
+}
+
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
}
@@ -387,10 +395,6 @@ static inline void *swap_cache_get_shadow(swp_entry_t end)
return NULL;
}
-static inline void delete_from_swap_cache(struct folio *folio)
-{
-}
-
static inline unsigned int folio_swap_flags(struct folio *folio)
{
return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 715aff5aca57..c8bb16835612 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -216,18 +216,6 @@ void __swap_cache_del_folio(swp_entry_t entry,
__swap_cache_put_entries(si, ci, entry, nr_pages);
}
-void delete_from_swap_cache(struct folio *folio)
-{
- struct swap_cluster_info *ci;
- swp_entry_t entry = folio->swap;
-
- ci = swap_lock_cluster(swp_info(entry), swp_offset(entry));
- __swap_cache_del_folio(entry, folio, NULL);
- swap_unlock_cluster(ci);
-
- folio_ref_sub(folio, folio_nr_pages(folio));
-}
-
/*
* Caller must hold a reference on the swap device, and check if the
* returned folio is still valid after locking it (e.g. folio_swap_contains).
diff --git a/mm/swapfile.c b/mm/swapfile.c
index daf7810bcb28..0a8b36ecbf08 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -273,7 +273,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
if (!need_reclaim)
goto out_unlock;
- delete_from_swap_cache(folio);
+ folio_free_swap_cache(folio);
folio_set_dirty(folio);
ret = nr_pages;
out_unlock:
@@ -1354,6 +1354,25 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
swap_entries_put_map(swp_info(entry), entry, nr_pages);
}
+/*
+ * folio_free_swap_cache() - Remove the folio from swap cache, and free
+ * all entires with zero count.
+ *
+ * NOTE: if the folio is dirty and any of its swap entries' count is not
+ * zero, freeing the swap cache without write back may cause data loss.
+ */
+void folio_free_swap_cache(struct folio *folio)
+{
+ struct swap_cluster_info *ci;
+ swp_entry_t entry = folio->swap;
+
+ ci = swap_lock_cluster(swp_info(entry), swp_offset(entry));
+ __swap_cache_del_folio(entry, folio, NULL);
+ swap_unlock_cluster(ci);
+
+ folio_ref_sub(folio, folio_nr_pages(folio));
+}
+
static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
{
struct swap_info_struct *si;
@@ -1787,7 +1806,7 @@ bool folio_free_swap(struct folio *folio)
if (folio_swapped(folio))
return false;
- delete_from_swap_cache(folio);
+ folio_free_swap_cache(folio);
folio_set_dirty(folio);
return true;
}
diff --git a/mm/zswap.c b/mm/zswap.c
index 65c1aff5c4a4..6bac50bc2bf5 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1141,7 +1141,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
out:
if (ret && ret != -EEXIST) {
- delete_from_swap_cache(folio);
+ folio_free_swap_cache(folio);
folio_unlock(folio);
}
folio_put(folio);
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 19/28] mm, swap: clean up and improve swap entries batch freeing
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (17 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 18/28] mm, swap: rename and introduce folio_free_swap_cache Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 20/28] mm, swap: check swap table directly for checking cache Kairui Song
` (8 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Introduce a helper that batch-frees all contiguous entries that have
only their last swap count left and no cache.
Compared to the current design, which scans the whole region first and
then frees it only if the whole region holds the same count, this new
helper avoids the two-pass scan, batch-frees more entries when the
region is fragmented, and is more robust thanks to sanity checks. It
also checks the swap table directly for the cache status instead of
looking at swap_map.
Also rename the related functions to better reflect their usage.
This simplifies the code and prepares for follow-up commits that clean
up the freeing of swap entries even more.
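The batching idea can be sketched generically as below. This is only an
illustration of the run-accumulation pattern (the plain refcount array
and the free_run callback are stand-ins); the real implementation is
swap_put_entries() in the diff that follows:
/* Illustration only: put a range of slots, freeing runs of last refs at once. */
static void put_slots_batched(unsigned char *map, unsigned long start,
			      unsigned long nr,
			      void (*free_run)(unsigned long head, unsigned long len))
{
	unsigned long offset, head = 0;
	bool in_run = false;

	for (offset = start; offset < start + nr; offset++) {
		if (map[offset] == 1) {
			/* Last reference: accumulate into the current run. */
			if (!in_run) {
				head = offset;
				in_run = true;
			}
			continue;
		}
		if (in_run) {
			/* The run ended here, free it in one batched call. */
			free_run(head, offset - head);
			in_run = false;
		}
		if (map[offset])
			map[offset]--;	/* ordinary put for slots with more refs */
		/* a zero count would be a caller bug; the real code warns here */
	}
	if (in_run)
		free_run(head, offset - head);
}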
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 165 ++++++++++++++++++++------------------------------
1 file changed, 67 insertions(+), 98 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0a8b36ecbf08..ef233466725e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -54,14 +54,16 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entries_free(struct swap_info_struct *si,
+static void swap_free_entries(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages);
+ unsigned long start, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
-static bool swap_entries_put_map(struct swap_info_struct *si,
- swp_entry_t entry, int nr);
+static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry,
+ unsigned char usage);
static bool folio_swapcache_freeable(struct folio *folio);
static DEFINE_SPINLOCK(swap_lock);
@@ -193,25 +195,6 @@ static bool swap_only_has_cache(struct swap_info_struct *si,
return true;
}
-static bool swap_is_last_map(struct swap_info_struct *si,
- unsigned long offset, int nr_pages, bool *has_cache)
-{
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
- unsigned char count = *map;
-
- if (swap_count(count) != 1)
- return false;
-
- while (++map < map_end) {
- if (*map != count)
- return false;
- }
-
- *has_cache = !!(count & SWAP_HAS_CACHE);
- return true;
-}
-
/*
* returns number of pages in the folio that backs the swap entry. If positive,
* the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
@@ -1237,6 +1220,56 @@ static bool swap_alloc_slow(swp_entry_t *entry,
return false;
}
+/*
+ * Put the ref count of entries, caller must ensure the entries'
+ * swap table count are not zero. This won't free up the swap cache.
+ */
+static bool swap_put_entries(struct swap_info_struct *si,
+ unsigned long start, int nr)
+{
+ unsigned long offset = start, end = start + nr, cluster_end;
+ unsigned long head = SWAP_ENTRY_INVALID;
+ struct swap_cluster_info *ci;
+ bool has_cache = false;
+ unsigned int count;
+ swp_te_t swp_te;
+next_cluster:
+ ci = swap_lock_cluster(si, offset);
+ cluster_end = min(cluster_offset(si, ci) + SWAPFILE_CLUSTER, end);
+ do {
+ swp_te = __swap_table_get(ci, offset);
+ count = si->swap_map[offset];
+ if (WARN_ON_ONCE(!swap_count(count))) {
+ goto skip;
+ } else if (swp_te_is_folio(swp_te)) {
+ VM_WARN_ON_ONCE(!(count & SWAP_HAS_CACHE));
+ /* Let the swap cache (folio) handle the final free */
+ has_cache = true;
+ } else if (count == 1) {
+ /* Free up continues last ref entries in batch */
+ head = head ? head : offset;
+ continue;
+ }
+ swap_put_entry_locked(si, ci, swp_entry(si->type, offset), 1);
+skip:
+ if (head) {
+ swap_free_entries(si, ci, head, offset - head);
+ head = SWAP_ENTRY_INVALID;
+ }
+ } while (++offset < cluster_end);
+
+ if (head) {
+ swap_free_entries(si, ci, head, offset - head);
+ head = SWAP_ENTRY_INVALID;
+ }
+
+ swap_unlock_cluster(ci);
+ if (unlikely(cluster_end < end))
+ goto next_cluster;
+
+ return has_cache;
+}
+
/**
* folio_alloc_swap - allocate swap space for a folio
* @folio: folio we want to move to swap
@@ -1351,7 +1384,7 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
nr_pages = 1;
}
- swap_entries_put_map(swp_info(entry), entry, nr_pages);
+ swap_put_entries(swp_info(entry), swp_offset(entry), nr_pages);
}
/*
@@ -1407,7 +1440,7 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
return NULL;
}
-static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
+static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
swp_entry_t entry,
unsigned char usage)
@@ -1438,7 +1471,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
if (usage)
WRITE_ONCE(si->swap_map[offset], usage);
else
- swap_entries_free(si, ci, entry, 1);
+ swap_free_entries(si, ci, offset, 1);
return usage;
}
@@ -1509,70 +1542,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-static bool swap_entries_put_map(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- unsigned long offset = swp_offset(entry);
- struct swap_cluster_info *ci;
- bool has_cache = false;
- unsigned char count;
- int i;
-
- if (nr <= 1)
- goto fallback;
- count = swap_count(data_race(si->swap_map[offset]));
- if (count != 1)
- goto fallback;
-
- ci = swap_lock_cluster(si, offset);
- if (!swap_is_last_map(si, offset, nr, &has_cache)) {
- goto locked_fallback;
- }
- if (!has_cache)
- swap_entries_free(si, ci, entry, nr);
- else
- for (i = 0; i < nr; i++)
- WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
- swap_unlock_cluster(ci);
-
- return has_cache;
-
-fallback:
- ci = swap_lock_cluster(si, offset);
-locked_fallback:
- for (i = 0; i < nr; i++, entry.val++) {
- count = swap_entry_put_locked(si, ci, entry, 1);
- if (count == SWAP_HAS_CACHE)
- has_cache = true;
- }
- swap_unlock_cluster(ci);
- return has_cache;
-}
-
-/*
- * Only functions with "_nr" suffix are able to free entries spanning
- * cross multi clusters, so ensure the range is within a single cluster
- * when freeing entries with functions without "_nr" suffix.
- */
-static bool swap_entries_put_map_nr(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- int cluster_nr, cluster_rest;
- unsigned long offset = swp_offset(entry);
- bool has_cache = false;
-
- cluster_rest = SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER;
- while (nr) {
- cluster_nr = min(nr, cluster_rest);
- has_cache |= swap_entries_put_map(si, entry, cluster_nr);
- cluster_rest = SWAPFILE_CLUSTER;
- nr -= cluster_nr;
- entry.val += cluster_nr;
- }
-
- return has_cache;
-}
-
/*
* Check if it's the last ref of swap entry in the freeing path.
*/
@@ -1585,11 +1554,11 @@ static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
* Drop the last ref of swap entries, caller have to ensure all entries
* belong to the same cgroup and cluster.
*/
-static void swap_entries_free(struct swap_info_struct *si,
+static void swap_free_entries(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages)
+ unsigned long offset, unsigned int nr_pages)
{
- unsigned long offset = swp_offset(entry);
+ swp_entry_t entry = swp_entry(si->type, offset);
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
@@ -1622,10 +1591,10 @@ void __swap_cache_put_entries(struct swap_info_struct *si,
swp_entry_t entry, unsigned int size)
{
if (swap_only_has_cache(si, swp_offset(entry), size))
- swap_entries_free(si, ci, entry, size);
+ swap_free_entries(si, ci, swp_offset(entry), size);
else
for (int i = 0; i < size; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ swap_put_entry_locked(si, ci, entry, SWAP_HAS_CACHE);
}
/*
@@ -1843,7 +1812,7 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
/*
* First free all entries in the range.
*/
- any_only_cache = swap_entries_put_map_nr(si, entry, nr);
+ any_only_cache = swap_put_entries(swp_info(entry), swp_offset(entry), nr);
/*
* Short-circuit the below loop if none of the entries had their
@@ -1917,7 +1886,7 @@ void free_swap_page_of_entry(swp_entry_t entry)
if (!si)
return;
ci = swap_lock_cluster(si, offset);
- WARN_ON(swap_count(swap_entry_put_locked(si, ci, entry, 1)));
+ WARN_ON(swap_count(swap_put_entry_locked(si, ci, entry, 1)));
/* It might got added to swap cache accidentally by read ahead */
__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
swap_unlock_cluster(ci);
@@ -3805,7 +3774,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
* into, carry if so, or else fail until a new continuation page is allocated;
* when the original swap_map count is decremented from 0 with continuation,
* borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or caller of swap_entry_put_locked()
+ * Called while __swap_duplicate() or caller of swap_put_entry_locked()
* holds cluster lock.
*/
static bool swap_count_continued(struct swap_info_struct *si,
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 20/28] mm, swap: check swap table directly for checking cache
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (18 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 19/28] mm, swap: clean up and improve swap entries batch freeing Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-06-19 10:38 ` Baoquan He
2025-05-14 20:17 ` [PATCH 21/28] mm, swap: add folio to swap cache directly on allocation Kairui Song
` (7 subsequent siblings)
27 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Instead of looking at the swap map, check the swap table directly to tell
if a swap entry has a cached folio. This prepares for removing SWAP_HAS_CACHE.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
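As an illustration only (not part of this patch; every name below is a mock
stand-in), a tiny userspace model of the idea: the low bits of a table entry
act as a type tag, so "is this entry cached" becomes a check on the entry
itself rather than on a separate flag byte:

/* Toy model only: mock names, not the mm/ implementation. */
#include <stdio.h>
#include <stdbool.h>

#define TAG_MASK	0x3UL
#define TAG_FOLIO	0x2UL	/* ...10: a folio is cached here */
#define TAG_SHADOW	0x1UL	/* ....1: shadow value, swapped out */

static bool entry_has_cache(unsigned long te)
{
	return (te & TAG_MASK) == TAG_FOLIO;
}

int main(void)
{
	unsigned long table[3] = {
		0,				/* unused slot */
		(0x1234UL << 2) | TAG_FOLIO,	/* cached folio, PFN 0x1234 */
		(0x40UL << 1) | TAG_SHADOW,	/* shadow, not cached */
	};

	for (int i = 0; i < 3; i++)
		printf("slot %d cached: %d\n", i, entry_has_cache(table[i]));
	return 0;
}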
mm/memory.c | 12 +++++------
mm/swap.h | 6 ++++++
mm/swap_state.c | 11 ++++++++++
mm/swapfile.c | 54 +++++++++++++++++++++++--------------------------
4 files changed, 48 insertions(+), 35 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index a70624a55aa2..a9a548575e72 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4314,15 +4314,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
+static inline int non_swapcache_batch(swp_entry_t entry, unsigned int max_nr)
{
- struct swap_info_struct *si = swp_info(entry);
- pgoff_t offset = swp_offset(entry);
- int i;
+ unsigned int i;
for (i = 0; i < max_nr; i++) {
- if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
- return i;
+ /* Page table lock pins the swap entries / swap device */
+ if (swap_cache_check_folio(entry))
+ break;
+ entry.val++;
}
return i;
diff --git a/mm/swap.h b/mm/swap.h
index 467996dafbae..2ae4624a0e48 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -186,6 +186,7 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
extern struct folio *swap_cache_get_folio(swp_entry_t entry);
extern struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
void **shadow, bool swapin);
+extern bool swap_cache_check_folio(swp_entry_t entry);
extern void *swap_cache_get_shadow(swp_entry_t entry);
/* Below helpers requires the caller to lock the swap cluster. */
extern void __swap_cache_del_folio(swp_entry_t entry,
@@ -395,6 +396,11 @@ static inline void *swap_cache_get_shadow(swp_entry_t end)
return NULL;
}
+static inline bool swap_cache_check_folio(swp_entry_t entry)
+{
+ return false;
+}
+
static inline unsigned int folio_swap_flags(struct folio *folio)
{
return 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c8bb16835612..ea6a1741db5c 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -266,6 +266,17 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
return folio;
}
+/*
+ * Check if a swap entry has a folio cached; may return a false positive.
+ * The caller must hold a reference to the swap device or pin it in other ways.
+ */
+bool swap_cache_check_folio(swp_entry_t entry)
+{
+ swp_te_t swp_te;
+ swp_te = __swap_table_get(swp_cluster(entry), swp_offset(entry));
+ return swp_te_is_folio(swp_te);
+}
+
/*
* If we are the only user, then try to free up the swap cache.
*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ef233466725e..0f2a499ff2c9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -181,15 +181,19 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
#define TTRS_FULL 0x4
static bool swap_only_has_cache(struct swap_info_struct *si,
- unsigned long offset, int nr_pages)
+ struct swap_cluster_info *ci,
+ unsigned long offset, int nr_pages)
{
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
+ swp_te_t entry;
do {
+ entry = __swap_table_get(ci, offset);
VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
- if (*map != SWAP_HAS_CACHE)
+ if (*map)
return false;
+ offset++;
} while (++map < map_end);
return true;
@@ -247,11 +251,11 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
/*
* It's safe to delete the folio from swap cache only if the folio's
- * swap_map is HAS_CACHE only, which means the slots have no page table
+ * entry is swap cache only, which means the slots have no page table
* reference or pending writeback, and can't be allocated to others.
*/
ci = swap_lock_cluster(si, offset);
- need_reclaim = swap_only_has_cache(si, offset, nr_pages);
+ need_reclaim = swap_only_has_cache(si, ci, offset, nr_pages);
swap_unlock_cluster(ci);
if (!need_reclaim)
goto out_unlock;
@@ -660,29 +664,21 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
spin_unlock(&ci->lock);
do {
- switch (READ_ONCE(map[offset])) {
- case 0:
- offset++;
+ if (swap_count(READ_ONCE(map[offset])))
break;
- case SWAP_HAS_CACHE:
- nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
- if (nr_reclaim > 0)
- offset += nr_reclaim;
- else
- goto out;
+ nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+ if (nr_reclaim > 0)
+ offset += nr_reclaim;
+ else if (nr_reclaim < 1)
break;
- default:
- goto out;
- }
- } while (offset < end);
-out:
+ } while (++offset < end);
spin_lock(&ci->lock);
/*
* Recheck the range no matter reclaim succeeded or not, the slot
* could have been be freed while we are not holding the lock.
*/
for (offset = start; offset < end; offset++)
- if (READ_ONCE(map[offset]))
+ if (map[offset] || !swp_te_is_null(__swap_table_get(ci, offset)))
return false;
return true;
@@ -700,16 +696,13 @@ static bool cluster_scan_range(struct swap_info_struct *si,
return true;
for (offset = start; offset < end; offset++) {
- switch (READ_ONCE(map[offset])) {
- case 0:
- continue;
- case SWAP_HAS_CACHE:
+ if (swap_count(map[offset]))
+ return false;
+ if (swp_te_is_folio(__swap_table_get(ci, offset))) {
+ VM_WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE));
if (!vm_swap_full())
return false;
*need_reclaim = true;
- continue;
- default:
- return false;
}
}
@@ -821,7 +814,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+ if (!swap_count(map[offset]) &&
+ swp_te_is_folio(__swap_table_get(ci, offset))) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
@@ -1590,7 +1584,7 @@ void __swap_cache_put_entries(struct swap_info_struct *si,
struct swap_cluster_info *ci,
swp_entry_t entry, unsigned int size)
{
- if (swap_only_has_cache(si, swp_offset(entry), size))
+ if (swap_only_has_cache(si, ci, swp_offset(entry), size))
swap_free_entries(si, ci, swp_offset(entry), size);
else
for (int i = 0; i < size; i++, entry.val++)
@@ -1802,6 +1796,7 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
struct swap_info_struct *si;
bool any_only_cache = false;
unsigned long offset;
+ swp_te_t swp_te;
si = get_swap_device(entry);
if (WARN_ON_ONCE(!si))
@@ -1826,7 +1821,8 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
*/
for (offset = start_offset; offset < end_offset; offset += nr) {
nr = 1;
- if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+ swp_te = __swap_table_get(swp_offset_cluster(si, offset), offset);
+ if (!swap_count(si->swap_map[offset]) && swp_te_is_folio(swp_te)) {
/*
* Folios are always naturally aligned in swap so
* advance forward to the next boundary. Zero means no
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 21/28] mm, swap: add folio to swap cache directly on allocation
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (19 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 20/28] mm, swap: check swap table directly for checking cache Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 22/28] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
` (6 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
All swap allocations are folio based now (except for hibernation), and the
swap cache is protected by the cluster lock too. So insert the folio
directly into the swap cache upon allocation, while still holding the
cluster lock, to avoid problems caused by dropping and re-acquiring the lock.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
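As an illustration only (not part of this patch; all names below are mock
stand-ins), a toy userspace model of the ordering this establishes: the
entries are reserved and the folio is published in the same critical
section, so nothing can observe entries that are allocated but not yet in
the swap cache:

#include <pthread.h>
#include <stdio.h>

#define CLUSTER_SIZE 8

struct mock_cluster {
	pthread_mutex_t lock;
	void *table[CLUSTER_SIZE];	/* NULL = free, else the owning "folio" */
};

/* Reserve nr aligned slots and publish the folio under one lock hold. */
static int alloc_and_add(struct mock_cluster *ci, void *folio, int nr)
{
	int start = -1;

	pthread_mutex_lock(&ci->lock);
	for (int i = 0; i + nr <= CLUSTER_SIZE && start < 0; i += nr) {
		int j;

		for (j = 0; j < nr; j++)
			if (ci->table[i + j])
				break;
		if (j == nr) {
			for (j = 0; j < nr; j++)
				ci->table[i + j] = folio;
			start = i;
		}
	}
	pthread_mutex_unlock(&ci->lock);
	return start;
}

int main(void)
{
	struct mock_cluster ci = { .lock = PTHREAD_MUTEX_INITIALIZER };
	int dummy_folio;

	printf("order-2 folio placed at offset %d\n",
	       alloc_and_add(&ci, &dummy_folio, 4));
	return 0;
}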
mm/swap.h | 8 ++--
mm/swap_state.c | 48 +++++++++++++++----
mm/swapfile.c | 122 ++++++++++++++++++++----------------------------
3 files changed, 93 insertions(+), 85 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 2ae4624a0e48..b042609e6eb2 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -185,7 +185,10 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
/* Below helpers requires the caller to pin the swap device. */
extern struct folio *swap_cache_get_folio(swp_entry_t entry);
extern struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
- void **shadow, bool swapin);
+ void **shadow);
+extern void __swap_cache_add_folio(swp_entry_t entry,
+ struct swap_cluster_info *ci,
+ struct folio *folio);
extern bool swap_cache_check_folio(swp_entry_t entry);
extern void *swap_cache_get_shadow(swp_entry_t entry);
/* Below helpers requires the caller to lock the swap cluster. */
@@ -368,8 +371,7 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
return NULL;
}
-static inline int swap_cache_add_folio(swp_entry_t end, struct folio *folio,
- void **shadow, bool swapin)
+static inline int swap_cache_add_folio(swp_entry_t end, struct folio *folio, void **shadow)
{
return -EINVAL;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ea6a1741db5c..9e7d40215958 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -110,12 +110,39 @@ int __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
return 0;
}
-/*
- * Return the folio being added on success, or return the existing folio
- * with conflicting index on failure.
- */
+/* For swap allocator's initial allocation of entries to a folio */
+void __swap_cache_add_folio(swp_entry_t entry, struct swap_cluster_info *ci,
+ struct folio *folio)
+{
+ pgoff_t offset = swp_offset(entry), end;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ /*
+ * The allocator should always allocate aligned entries so folio based
+ * operations never cross more than one cluster.
+ */
+ VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(offset, nr_pages), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_uptodate(folio), folio);
+
+ end = offset + nr_pages;
+ do {
+ WARN_ON_ONCE(!swp_te_is_null(__swap_table_get(ci, offset)));
+ __swap_table_set_folio(ci, offset, folio);
+ } while (++offset < end);
+
+ folio_ref_add(folio, nr_pages);
+ folio_set_swapcache(folio);
+ folio->swap = entry;
+
+ node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+}
+
+/* For swapping in or performing IO on an allocated swap entry. */
struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
- void **shadow, bool swapin)
+ void **shadow)
{
swp_te_t exist;
pgoff_t end, start, offset;
@@ -127,9 +154,10 @@ struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
start = swp_offset(entry);
end = start + nr_pages;
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
- VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(start, nr_pages), folio);
again:
offset = start;
existing = NULL;
@@ -141,7 +169,7 @@ struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
existing = swp_te_folio(exist);
goto out_failed;
}
- if (swapin && __swap_cache_set_entry(si, ci, offset))
+ if (__swap_cache_set_entry(si, ci, offset))
goto out_failed;
if (shadow && swp_te_is_shadow(exist))
*shadow = swp_te_shadow(exist);
@@ -381,7 +409,7 @@ static struct folio *__swapin_cache_add_prepare(swp_entry_t entry,
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
- swapcache = swap_cache_add_folio(entry, folio, &shadow, true);
+ swapcache = swap_cache_add_folio(entry, folio, &shadow);
if (swapcache != folio) {
folio_unlock(folio);
return swapcache;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0f2a499ff2c9..91025ba98653 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -709,18 +709,17 @@ static bool cluster_scan_range(struct swap_info_struct *si,
return true;
}
-static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
- unsigned int start, unsigned char usage,
- unsigned int order)
+static bool cluster_alloc_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ struct folio *folio,
+ unsigned int offset)
{
- unsigned int nr_pages = 1 << order;
- unsigned long offset, end = start + nr_pages;
-
- lockdep_assert_held(&ci->lock);
+ unsigned int order = folio ? folio_order(folio) : 0;
+ swp_entry_t entry = swp_entry(si->type, offset);
+ unsigned long nr_pages = 1 << order;
if (!(si->flags & SWP_WRITEOK))
return false;
-
/*
* The first allocation in a cluster makes the
* cluster exclusive to this order
@@ -728,28 +727,33 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
if (cluster_is_empty(ci))
ci->order = order;
- for (offset = start; offset < end; offset++) {
- VM_WARN_ON_ONCE(swap_count(si->swap_map[offset]));
- VM_WARN_ON_ONCE(!swp_te_is_null(__swap_table_get(ci, offset)));
- si->swap_map[offset] = usage;
- }
swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
+ if (folio) {
+ /* from folio_alloc_swap */
+ __swap_cache_add_folio(entry, ci, folio);
+ memset(&si->swap_map[offset], SWAP_HAS_CACHE, nr_pages);
+ } else {
+ /* from get_swap_page_of_type */
+ VM_WARN_ON_ONCE(si->swap_map[offset] || swap_cache_check_folio(entry));
+ si->swap_map[offset] = 1;
+ }
+
return true;
}
/* Try use a new cluster for current CPU and allocate from it. */
static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned int order,
- unsigned char usage)
+ struct folio *folio,
+ unsigned long offset)
{
unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
- unsigned int nr_pages = 1 << order;
+ unsigned int order = folio ? folio_order(folio) : 0;
+ unsigned long nr_pages = 1 << order;
bool need_reclaim, ret;
lockdep_assert_held(&ci->lock);
@@ -777,7 +781,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
if (!ret)
continue;
}
- if (!cluster_alloc_range(si, ci, offset, usage, order))
+ if (!cluster_alloc_range(si, ci, folio, offset))
break;
found = offset;
offset += nr_pages;
@@ -851,10 +855,11 @@ static void swap_reclaim_work(struct work_struct *work)
* Try to allocate swap entries with specified order and try set a new
* cluster for current CPU too.
*/
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
- unsigned char usage)
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
+ struct folio *folio)
{
struct swap_cluster_info *ci;
+ unsigned int order = folio ? folio_order(folio) : 0;
unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
/*
@@ -874,8 +879,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset,
- order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_unlock_cluster(ci);
}
@@ -886,8 +890,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
new_cluster:
ci = isolate_lock_cluster(si, &si->free_clusters);
if (ci) {
- found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
- order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, cluster_offset(si, ci));
if (found)
goto done;
}
@@ -898,8 +901,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (order < PMD_ORDER) {
while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
- found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
- order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, cluster_offset(si, ci));
if (found)
goto done;
}
@@ -912,8 +914,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
*/
ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
if (ci) {
- found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
- order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, cluster_offset(si, ci));
if (found)
goto done;
}
@@ -937,15 +938,13 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* allocation, but reclaim may drop si->lock and race with another user.
*/
while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
- found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
- 0, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, cluster_offset(si, ci));
if (found)
goto done;
}
while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[o]))) {
- found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
- 0, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, cluster_offset(si, ci));
if (found)
goto done;
}
@@ -1138,12 +1137,12 @@ static bool get_swap_device_info(struct swap_info_struct *si)
* Fast path try to get swap entries with specified order from current
* CPU's swap entry pool (a cluster).
*/
-static bool swap_alloc_fast(swp_entry_t *entry,
- int order)
+static bool swap_alloc_fast(struct folio *folio)
{
+ unsigned int order = folio_order(folio);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
- unsigned int offset, found = SWAP_ENTRY_INVALID;
+ unsigned int offset;
/*
* Once allocated, swap_info_struct will never be completely freed,
@@ -1158,24 +1157,21 @@ static bool swap_alloc_fast(swp_entry_t *entry,
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
- if (found)
- *entry = swp_entry(si->type, found);
+ alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_unlock_cluster(ci);
}
-
put_swap_device(si);
- return !!found;
+ return folio->swap.val != SWAP_ENTRY_INVALID;
}
/* Rotate the device and switch to a new cluster */
-static bool swap_alloc_slow(swp_entry_t *entry,
- int order)
+static void swap_alloc_slow(struct folio *folio)
{
int node;
unsigned long offset;
struct swap_info_struct *si, *next;
+ unsigned int order = folio_order(folio);
node = numa_node_id();
spin_lock(&swap_avail_lock);
@@ -1185,14 +1181,12 @@ static bool swap_alloc_slow(swp_entry_t *entry,
plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
- offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
+ offset = cluster_alloc_swap_entry(si, folio);
put_swap_device(si);
- if (offset) {
- *entry = swp_entry(si->type, offset);
- return true;
- }
+ if (offset)
+ return;
if (order)
- return false;
+ return;
}
spin_lock(&swap_avail_lock);
@@ -1211,7 +1205,6 @@ static bool swap_alloc_slow(swp_entry_t *entry,
goto start_over;
}
spin_unlock(&swap_avail_lock);
- return false;
}
/*
@@ -1278,10 +1271,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
{
unsigned int order = folio_order(folio);
unsigned int size = 1 << order;
- swp_entry_t entry = {};
-
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
if (order) {
/*
@@ -1302,32 +1291,21 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
}
local_lock(&percpu_swap_cluster.lock);
- if (!swap_alloc_fast(&entry, order))
- swap_alloc_slow(&entry, order);
+ if (!swap_alloc_fast(folio))
+ swap_alloc_slow(folio);
local_unlock(&percpu_swap_cluster.lock);
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
- if (mem_cgroup_try_charge_swap(folio, entry))
- goto out_free;
-
- if (!entry.val)
+ if (mem_cgroup_try_charge_swap(folio, folio->swap)) {
+ folio_free_swap_cache(folio);
return -ENOMEM;
+ }
- if (WARN_ON(swap_cache_add_folio(entry, folio, NULL, false) != folio))
- goto out_free;
-
- /*
- * Allocator should always allocate aligned entries so folio based
- * operations never crossed more than one cluster.
- */
- VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
+ if (!folio->swap.val)
+ return -ENOMEM;
atomic_long_sub(size, &nr_swap_pages);
return 0;
-
-out_free:
- put_swap_folio(folio, entry);
- return -ENOMEM;
}
/*
@@ -1858,7 +1836,7 @@ swp_entry_t get_swap_page_of_type(int type)
/* This is called for allocating swap entry, not cache */
if (get_swap_device_info(si)) {
if (si->flags & SWP_WRITEOK) {
- offset = cluster_alloc_swap_entry(si, 0, 1);
+ offset = cluster_alloc_swap_entry(si, NULL);
if (offset) {
entry = swp_entry(si->type, offset);
atomic_long_dec(&nr_swap_pages);
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 22/28] mm, swap: drop the SWAP_HAS_CACHE flag
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (20 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 21/28] mm, swap: add folio to swap cache directly on allocation Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 23/28] mm, swap: remove no longer needed _swap_info_get Kairui Song
` (5 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now the swap cache is managed with the swap table, and users check whether an
entry is cached by looking at the swap table type directly. SWAP_HAS_CACHE is
only used to pin an entry temporarily so it won't be used by anyone else.
Previous commits have converted all places checking SWAP_HAS_CACHE to check
the swap table directly; now the only place that still sets SWAP_HAS_CACHE is
the folio freeing path.
Freeing a cached entry will set its swap map to SWAP_HAS_CACHE first, keeping
the entry pinned with SWAP_HAS_CACHE, and then free it.
Now that the swap cache has become the mandatory layer and is managed by the
swap table, and all users check the swap table directly, this can be much
simplified: when removing a folio from the swap cache, directly free all of
its entries that have a zero count instead of pinning them temporarily.
After the above change, SWAP_HAS_CACHE no longer has any users; remove all
related logic and helpers.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
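As an illustration only (not part of this patch; names are mock stand-ins), a
toy userspace model of the new removal logic: when a folio leaves the cache,
entries whose count is already zero are freed right away, while entries still
referenced by page tables are left for their final put to free, with no
temporary SWAP_HAS_CACHE pin in between:

#include <stdio.h>
#include <stdbool.h>

#define NR 4

int main(void)
{
	unsigned char count[NR] = { 0, 2, 0, 1 };	/* per-entry swap count */
	bool folio_swapped = false, need_free = false;

	for (int i = 0; i < NR; i++) {
		/* the table slot flips from folio to shadow at this point */
		if (count[i])
			folio_swapped = true;	/* still mapped somewhere */
		else
			need_free = true;	/* cache held the last reference */
	}

	if (!folio_swapped) {
		printf("free the whole range at once\n");
	} else if (need_free) {
		for (int i = 0; i < NR; i++)
			if (!count[i])
				printf("free entry %d now\n", i);
	}
	return 0;
}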
include/linux/swap.h | 1 -
mm/swap.h | 12 ++-
mm/swap_state.c | 22 ++++--
mm/swapfile.c | 184 +++++++++++--------------------------------
4 files changed, 67 insertions(+), 152 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index adac6d51da05..60b126918399 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,7 +224,6 @@ enum {
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
/* Bit flag in swap_map */
-#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
#define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */
/* Special value in first swap_map */
diff --git a/mm/swap.h b/mm/swap.h
index b042609e6eb2..7cbfca39225f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -135,13 +135,6 @@ static inline void swap_unlock_cluster_irq(struct swap_cluster_info *ci)
spin_unlock_irq(&ci->lock);
}
-extern int __swap_cache_set_entry(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long offset);
-extern void __swap_cache_put_entries(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int size);
-
/*
* All swap entries starts getting allocated by folio_alloc_swap(),
* and the folio will be added to swap cache.
@@ -161,6 +154,11 @@ int folio_dup_swap(struct folio *folio, struct page *subpage);
void folio_put_swap(struct folio *folio, struct page *subpage);
void folio_free_swap_cache(struct folio *folio);
+/* For internal use */
+extern void __swap_free_entries(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset, unsigned int nr_pages);
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 9e7d40215958..2b145c0f7773 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -169,7 +169,7 @@ struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
existing = swp_te_folio(exist);
goto out_failed;
}
- if (__swap_cache_set_entry(si, ci, offset))
+ if (!__swap_count(swp_entry(si->type, offset)))
goto out_failed;
if (shadow && swp_te_is_shadow(exist))
*shadow = swp_te_shadow(exist);
@@ -191,10 +191,8 @@ struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
* We may lose shadow here due to raced swapin, which is rare and OK,
* caller better keep the previous returned shadow.
*/
- while (offset-- > start) {
+ while (offset-- > start)
__swap_table_set_shadow(ci, offset, NULL);
- __swap_cache_put_entries(si, ci, swp_entry(si->type, offset), 1);
- }
swap_unlock_cluster(ci);
/*
@@ -219,6 +217,7 @@ void __swap_cache_del_folio(swp_entry_t entry,
pgoff_t offset, start, end;
struct swap_info_struct *si;
struct swap_cluster_info *ci;
+ bool folio_swapped = false, need_free = false;
unsigned long nr_pages = folio_nr_pages(folio);
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -235,13 +234,26 @@ void __swap_cache_del_folio(swp_entry_t entry,
exist = __swap_table_get(ci, offset);
VM_WARN_ON_ONCE(swp_te_folio(exist) != folio);
__swap_table_set_shadow(ci, offset, shadow);
+ if (__swap_count(swp_entry(si->type, offset)))
+ folio_swapped = true;
+ else
+ need_free = true;
} while (++offset < end);
folio->swap.val = 0;
folio_clear_swapcache(folio);
node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
- __swap_cache_put_entries(si, ci, entry, nr_pages);
+
+ if (!folio_swapped) {
+ __swap_free_entries(si, ci, start, nr_pages);
+ } else if (need_free) {
+ offset = start;
+ do {
+ if (!__swap_count(swp_entry(si->type, offset)))
+ __swap_free_entries(si, ci, offset, 1);
+ } while (++offset < end);
+ }
}
/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 91025ba98653..c2154f19c21b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -49,21 +49,18 @@
#include <linux/swap_cgroup.h>
#include "swap_table.h"
#include "internal.h"
+#include "swap_table.h"
#include "swap.h"
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_free_entries(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long start, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry,
- unsigned char usage);
+ swp_entry_t entry);
static bool folio_swapcache_freeable(struct folio *folio);
static DEFINE_SPINLOCK(swap_lock);
@@ -145,11 +142,6 @@ static struct swap_info_struct *swp_get_info(swp_entry_t entry)
return swp_type_get_info(swp_type(entry));
}
-static inline unsigned char swap_count(unsigned char ent)
-{
- return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
-}
-
/*
* Use the second highest bit of inuse_pages counter as the indicator
* if one swap device is on the available plist, so the atomic can
@@ -190,7 +182,7 @@ static bool swap_only_has_cache(struct swap_info_struct *si,
do {
entry = __swap_table_get(ci, offset);
- VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
+ VM_WARN_ON_ONCE(!swp_te_is_folio(entry));
if (*map)
return false;
offset++;
@@ -600,7 +592,6 @@ static void partial_free_cluster(struct swap_info_struct *si,
{
VM_BUG_ON(!ci->count);
VM_BUG_ON(ci->count == SWAPFILE_CLUSTER);
-
lockdep_assert_held(&ci->lock);
if (ci->flags != CLUSTER_FLAG_NONFULL)
@@ -664,7 +655,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
spin_unlock(&ci->lock);
do {
- if (swap_count(READ_ONCE(map[offset])))
+ if (READ_ONCE(map[offset]))
break;
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
if (nr_reclaim > 0)
@@ -696,10 +687,9 @@ static bool cluster_scan_range(struct swap_info_struct *si,
return true;
for (offset = start; offset < end; offset++) {
- if (swap_count(map[offset]))
+ if (map[offset])
return false;
if (swp_te_is_folio(__swap_table_get(ci, offset))) {
- VM_WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE));
if (!vm_swap_full())
return false;
*need_reclaim = true;
@@ -733,7 +723,6 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
if (folio) {
/* from folio_alloc_swap */
__swap_cache_add_folio(entry, ci, folio);
- memset(&si->swap_map[offset], SWAP_HAS_CACHE, nr_pages);
} else {
/* from get_swap_page_of_type */
VM_WARN_ON_ONCE(si->swap_map[offset] || swap_cache_check_folio(entry));
@@ -818,7 +807,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (!swap_count(map[offset]) &&
+ if (!map[offset] &&
swp_te_is_folio(__swap_table_get(ci, offset))) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
@@ -910,7 +899,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
* Scan only one fragment cluster is good enough. Order 0
* allocation will surely success, and mTHP allocation failure
* is not critical, and scanning one cluster still keeps the
- * list rotated and scanned (for reclaiming HAS_CACHE).
+ * list rotated and scanned (for reclaiming swap cache).
*/
ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
if (ci) {
@@ -1226,10 +1215,9 @@ static bool swap_put_entries(struct swap_info_struct *si,
do {
swp_te = __swap_table_get(ci, offset);
count = si->swap_map[offset];
- if (WARN_ON_ONCE(!swap_count(count))) {
+ if (WARN_ON_ONCE(!count)) {
goto skip;
} else if (swp_te_is_folio(swp_te)) {
- VM_WARN_ON_ONCE(!(count & SWAP_HAS_CACHE));
/* Let the swap cache (folio) handle the final free */
has_cache = true;
} else if (count == 1) {
@@ -1237,16 +1225,16 @@ static bool swap_put_entries(struct swap_info_struct *si,
head = head ? head : offset;
continue;
}
- swap_put_entry_locked(si, ci, swp_entry(si->type, offset), 1);
+ swap_put_entry_locked(si, ci, swp_entry(si->type, offset));
skip:
if (head) {
- swap_free_entries(si, ci, head, offset - head);
+ __swap_free_entries(si, ci, head, offset - head);
head = SWAP_ENTRY_INVALID;
}
} while (++offset < cluster_end);
if (head) {
- swap_free_entries(si, ci, head, offset - head);
+ __swap_free_entries(si, ci, head, offset - head);
head = SWAP_ENTRY_INVALID;
}
@@ -1296,12 +1284,10 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
local_unlock(&percpu_swap_cluster.lock);
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
- if (mem_cgroup_try_charge_swap(folio, folio->swap)) {
+ if (mem_cgroup_try_charge_swap(folio, folio->swap))
folio_free_swap_cache(folio);
- return -ENOMEM;
- }
- if (!folio->swap.val)
+ if (unlikely(!folio->swap.val))
return -ENOMEM;
atomic_long_sub(size, &nr_swap_pages);
@@ -1393,13 +1379,8 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
offset = swp_offset(entry);
if (offset >= si->max)
goto bad_offset;
- if (data_race(!si->swap_map[swp_offset(entry)]))
- goto bad_free;
return si;
-bad_free:
- pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
- goto out;
bad_offset:
pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
goto out;
@@ -1414,22 +1395,13 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry,
- unsigned char usage)
+ swp_entry_t entry)
{
unsigned long offset = swp_offset(entry);
unsigned char count;
- unsigned char has_cache;
count = si->swap_map[offset];
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE) {
- VM_BUG_ON(!has_cache);
- has_cache = 0;
- } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
+ if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
if (count == COUNT_CONTINUED) {
if (swap_count_continued(si, offset, count))
count = SWAP_MAP_MAX | COUNT_CONTINUED;
@@ -1439,13 +1411,11 @@ static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
count--;
}
- usage = count | has_cache;
- if (usage)
- WRITE_ONCE(si->swap_map[offset], usage);
- else
- swap_free_entries(si, ci, offset, 1);
+ WRITE_ONCE(si->swap_map[offset], count);
+ if (!count && !swp_te_is_folio(__swap_table_get(ci, offset)))
+ __swap_free_entries(si, ci, offset, 1);
- return usage;
+ return count;
}
/*
@@ -1514,25 +1484,12 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-/*
- * Check if it's the last ref of swap entry in the freeing path.
- */
-static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
-{
- return (count == SWAP_HAS_CACHE) || (count == 1);
-}
-
-/*
- * Drop the last ref of swap entries, caller have to ensure all entries
- * belong to the same cgroup and cluster.
- */
-static void swap_free_entries(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long offset, unsigned int nr_pages)
+void __swap_free_entries(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset, unsigned int nr_pages)
{
swp_entry_t entry = swp_entry(si->type, offset);
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
+ unsigned long end = offset + nr_pages;
/* It should never free entries across different clusters */
VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
@@ -1541,10 +1498,10 @@ static void swap_free_entries(struct swap_info_struct *si,
ci->count -= nr_pages;
do {
- VM_BUG_ON(!swap_is_last_ref(*map));
- *map = 0;
- } while (++map < map_end);
+ si->swap_map[offset] = 0;
+ } while (++offset < end);
+ offset = swp_offset(entry);
mem_cgroup_uncharge_swap(entry, nr_pages);
swap_range_free(si, offset, nr_pages);
@@ -1554,46 +1511,12 @@ static void swap_free_entries(struct swap_info_struct *si,
partial_free_cluster(si, ci);
}
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void __swap_cache_put_entries(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int size)
-{
- if (swap_only_has_cache(si, ci, swp_offset(entry), size))
- swap_free_entries(si, ci, swp_offset(entry), size);
- else
- for (int i = 0; i < size; i++, entry.val++)
- swap_put_entry_locked(si, ci, entry, SWAP_HAS_CACHE);
-}
-
-/*
- * Called after dropping swapcache to decrease refcnt to swap entries.
- */
-void put_swap_folio(struct folio *folio, swp_entry_t entry)
-{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset = swp_offset(entry);
- int size = 1 << swap_entry_order(folio_order(folio));
-
- si = _swap_info_get(entry);
- if (!si)
- return;
-
- ci = swap_lock_cluster(si, offset);
- __swap_cache_put_entries(si, ci, entry, size);
- swap_unlock_cluster(ci);
-}
-
int __swap_count(swp_entry_t entry)
{
struct swap_info_struct *si = swp_info(entry);
pgoff_t offset = swp_offset(entry);
- return swap_count(si->swap_map[offset]);
+ return si->swap_map[offset];
}
/*
@@ -1608,7 +1531,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
int count;
ci = swap_lock_cluster(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
swap_unlock_cluster(ci);
return !!count;
}
@@ -1634,7 +1557,7 @@ int swp_swapcount(swp_entry_t entry)
ci = swap_lock_cluster(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
if (!(count & COUNT_CONTINUED))
goto out;
@@ -1672,12 +1595,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
ci = swap_lock_cluster(si, offset);
if (nr_pages == 1) {
- if (swap_count(map[roffset]))
+ if (map[roffset])
ret = true;
goto unlock_out;
}
for (i = 0; i < nr_pages; i++) {
- if (swap_count(map[offset + i])) {
+ if (map[offset + i]) {
ret = true;
break;
}
@@ -1777,6 +1700,7 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
swp_te_t swp_te;
si = get_swap_device(entry);
+
if (WARN_ON_ONCE(!si))
return;
if (WARN_ON_ONCE(end_offset > si->max))
@@ -1800,7 +1724,7 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
for (offset = start_offset; offset < end_offset; offset += nr) {
nr = 1;
swp_te = __swap_table_get(swp_offset_cluster(si, offset), offset);
- if (!swap_count(si->swap_map[offset]) && swp_te_is_folio(swp_te)) {
+ if (!READ_ONCE(si->swap_map[offset]) && swp_te_is_folio(swp_te)) {
/*
* Folios are always naturally aligned in swap so
* advance forward to the next boundary. Zero means no
@@ -1818,7 +1742,6 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
nr = ALIGN(offset + 1, nr) - offset;
}
}
-
out:
put_swap_device(si);
}
@@ -1860,7 +1783,7 @@ void free_swap_page_of_entry(swp_entry_t entry)
if (!si)
return;
ci = swap_lock_cluster(si, offset);
- WARN_ON(swap_count(swap_put_entry_locked(si, ci, entry, 1)));
+ WARN_ON(swap_put_entry_locked(si, ci, entry));
/* It might got added to swap cache accidentally by read ahead */
__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
swap_unlock_cluster(ci);
@@ -2261,6 +2184,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
unsigned int prev)
{
unsigned int i;
+ swp_te_t swp_te;
unsigned char count;
/*
@@ -2271,7 +2195,10 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
*/
for (i = prev + 1; i < si->max; i++) {
count = READ_ONCE(si->swap_map[i]);
- if (count && swap_count(count) != SWAP_MAP_BAD)
+ swp_te = __swap_table_get(swp_offset_cluster(si, i), i);
+ if (count == SWAP_MAP_BAD)
+ continue;
+ if (count || swp_te_is_folio(swp_te))
break;
if ((i % LATENCY_LIMIT) == 0)
cond_resched();
@@ -3530,7 +3457,7 @@ static int swap_dup_entries(struct swap_info_struct *si,
unsigned char usage, int nr)
{
int i;
- unsigned char count, has_cache;
+ unsigned char count;
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
@@ -3539,31 +3466,16 @@ static int swap_dup_entries(struct swap_info_struct *si,
* swapin_readahead() doesn't check if a swap entry is valid, so the
* swap entry could be SWAP_MAP_BAD. Check here with lock held.
*/
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
+ if (unlikely(count == SWAP_MAP_BAD))
return -ENOENT;
- }
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
- if (!count && !has_cache) {
+ if (!count && !swp_te_is_folio(__swap_table_get(ci, offset)))
return -ENOENT;
- } else if (usage == SWAP_HAS_CACHE) {
- if (has_cache)
- return -EEXIST;
- } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
- return -EINVAL;
- }
}
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE)
- has_cache = SWAP_HAS_CACHE;
- else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
count += usage;
else if (swap_count_continued(si, offset + i, count))
count = COUNT_CONTINUED;
@@ -3575,7 +3487,7 @@ static int swap_dup_entries(struct swap_info_struct *si,
return -ENOMEM;
}
- WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
+ WRITE_ONCE(si->swap_map[offset + i], count);
}
return 0;
@@ -3625,12 +3537,6 @@ int do_dup_swap_entry(swp_entry_t entry)
return err;
}
-int __swap_cache_set_entry(struct swap_info_struct *si,
- struct swap_cluster_info *ci, unsigned long offset)
-{
- return swap_dup_entries(si, ci, offset, SWAP_HAS_CACHE, 1);
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
@@ -3676,7 +3582,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
ci = swap_lock_cluster(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
/*
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 23/28] mm, swap: remove no longer needed _swap_info_get
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (21 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 22/28] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table Kairui Song
` (4 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
There are now only two users of _swap_info_get after consolidating its
callers: folio_try_reclaim_swap and swp_swapcount.
folio_try_reclaim_swap holds the folio lock and the folio is in the swap
cache, so _swap_info_get is redundant.
For swp_swapcount, _swap_info_get is insufficient as the swap entry is not
pinned, so the device could be swapped off at any time; it should use
get_swap_device instead.
After these changes, _swap_info_get is no longer used and we can safely
remove it.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 63 +++++++++++++--------------------------------------
1 file changed, 16 insertions(+), 47 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c2154f19c21b..28bb0a74e4a6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1364,35 +1364,6 @@ void folio_free_swap_cache(struct folio *folio)
folio_ref_sub(folio, folio_nr_pages(folio));
}
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
-{
- struct swap_info_struct *si;
- unsigned long offset;
-
- if (!entry.val)
- goto out;
- si = swp_get_info(entry);
- if (!si)
- goto bad_nofile;
- if (data_race(!(si->flags & SWP_USED)))
- goto bad_device;
- offset = swp_offset(entry);
- if (offset >= si->max)
- goto bad_offset;
- return si;
-
-bad_offset:
- pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
- goto out;
-bad_device:
- pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val);
- goto out;
-bad_nofile:
- pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
-out:
- return NULL;
-}
-
static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
swp_entry_t entry)
@@ -1549,7 +1520,7 @@ int swp_swapcount(swp_entry_t entry)
pgoff_t offset;
unsigned char *map;
- si = _swap_info_get(entry);
+ si = get_swap_device(entry);
if (!si)
return 0;
@@ -1579,6 +1550,7 @@ int swp_swapcount(swp_entry_t entry)
} while (tmp_count & COUNT_CONTINUED);
out:
swap_unlock_cluster(ci);
+ put_swap_device(si);
return count;
}
@@ -1610,26 +1582,10 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
return ret;
}
-static bool folio_swapped(struct folio *folio)
-{
- swp_entry_t entry = folio->swap;
- struct swap_info_struct *si = _swap_info_get(entry);
-
- if (!si)
- return false;
-
- if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
- return swap_entry_swapped(si, entry);
-
- return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
-}
-
static bool folio_swapcache_freeable(struct folio *folio)
{
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- if (!folio_test_swapcache(folio))
- return false;
if (folio_test_writeback(folio))
return false;
@@ -1665,9 +1621,22 @@ static bool folio_swapcache_freeable(struct folio *folio)
*/
bool folio_free_swap(struct folio *folio)
{
+ bool swapped;
+ struct swap_info_struct *si;
+ swp_entry_t entry = folio->swap;
+
+ if (!folio_test_swapcache(folio))
+ return false;
if (!folio_swapcache_freeable(folio))
return false;
- if (folio_swapped(folio))
+
+ si = swp_info(entry);
+ if (!IS_ENABLED(CONFIG_THP_SWAP) || !folio_test_large(folio))
+ swapped = swap_entry_swapped(si, entry);
+ else
+ swapped = swap_page_trans_huge_swapped(si, entry,
+ folio_order(folio));
+ if (swapped)
return false;
folio_free_swap_cache(folio);
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (22 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 23/28] mm, swap: remove no longer needed _swap_info_get Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-15 9:40 ` Klara Modin
2025-05-14 20:17 ` [PATCH 25/28] mm/workingset: leave highest 8 bits empty for anon shadow Kairui Song
` (3 subsequent siblings)
27 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
To prepare for using the swap table as the unified swap layer, introduce
macros and helpers for storing multiple kinds of data in a swap table
entry.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
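For reference, a small standalone demo of the packing described above. The
ENTRY_* values mirror the hunk below; the program itself is only an
illustration and not part of the patch:

#include <stdio.h>

#define BITS_PER_LONG		64
#define ENTRY_COUNT_BITS	8
#define ENTRY_PFN_MARK		0x2UL	/* low bits ...10 tag a folio */
#define ENTRY_PFN_SHIFT		2
#define ENTRY_PFN_MASK		((~0UL) >> ENTRY_COUNT_BITS)
#define ENTRY_COUNT_MASK	(~((~0UL) >> ENTRY_COUNT_BITS))
#define ENTRY_COUNT_SHIFT	(BITS_PER_LONG - ENTRY_COUNT_BITS)

int main(void)
{
	unsigned long pfn = 0x123456;	/* hypothetical folio PFN */
	unsigned char count = 3;	/* hypothetical swap count */

	/* Pack: folio type tag in the low bits, count in the highest byte */
	unsigned long te = (pfn << ENTRY_PFN_SHIFT) | ENTRY_PFN_MARK;
	te = (te & ~ENTRY_COUNT_MASK) | ((unsigned long)count << ENTRY_COUNT_SHIFT);

	/* Unpack both fields again */
	unsigned long got_pfn = (te & ENTRY_PFN_MASK) >> ENTRY_PFN_SHIFT;
	unsigned char got_count = (te & ENTRY_COUNT_MASK) >> ENTRY_COUNT_SHIFT;

	printf("te=%#lx pfn=%#lx count=%u\n", te, got_pfn, (unsigned)got_count);
	return 0;
}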
mm/swap_table.h | 130 ++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 119 insertions(+), 11 deletions(-)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 69a074339444..9356004d211a 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -5,9 +5,41 @@
#include "swap.h"
/*
- * Swap table entry could be a pointer (folio), a XA_VALUE (shadow), or NULL.
+ * Swap table entry type and bit layouts:
+ *
+ * NULL: | ------------ 0 -------------|
+ * Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1|
+ * Folio: | SWAP_COUNT |------ PFN -------|10|
+ * Pointer: |----------- Pointer ----------|100|
+ *
+ * Usage:
+ * - NULL: Swap Entry is unused.
+ *
+ * - Shadow: Swap Entry is used and not cached (swapped out).
+ * It reuses the XA_VALUE format to stay compatible with workingset
+ * shadows. The SHADOW_VAL part could be all 0.
+ *
+ * - Folio: Swap Entry is in cache.
+ *
+ * - Pointer: Unused yet. Only the low three bits of a pointer are
+ * usable for tagging, so `100` is reserved for potential pointer use.
*/
+#define ENTRY_COUNT_BITS BITS_PER_BYTE
+#define ENTRY_SHADOW_MARK 0b1UL
+#define ENTRY_PFN_MARK 0b10UL
+#define ENTRY_PFN_LOW_MASK 0b11UL
+#define ENTRY_PFN_SHIFT 2
+#define ENTRY_PFN_MASK ((~0UL) >> ENTRY_COUNT_BITS)
+#define ENTRY_COUNT_MASK (~((~0UL) >> ENTRY_COUNT_BITS))
+#define ENTRY_COUNT_SHIFT (BITS_PER_LONG - BITS_PER_BYTE)
+#define ENTRY_COUNT_MAX ((1 << ENTRY_COUNT_BITS) - 2)
+#define ENTRY_COUNT_BAD ((1 << ENTRY_COUNT_BITS) - 1) /* ENTRY_BAD */
+#define ENTRY_BAD (~0UL)
+
+/* For shadow offset calculation */
+#define SWAP_COUNT_SHIFT ENTRY_COUNT_BITS
+
/*
* Helpers for casting one type of info into a swap table entry.
*/
@@ -19,17 +51,27 @@ static inline swp_te_t null_swp_te(void)
static inline swp_te_t folio_swp_te(struct folio *folio)
{
- BUILD_BUG_ON(sizeof(swp_te_t) != sizeof(void *));
- swp_te_t swp_te = { .counter = (unsigned long)folio };
+ BUILD_BUG_ON((MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) >
+ (BITS_PER_LONG - ENTRY_PFN_SHIFT - ENTRY_COUNT_BITS));
+ swp_te_t swp_te = {
+ .counter = (folio_pfn(folio) << ENTRY_PFN_SHIFT) | ENTRY_PFN_MARK
+ };
return swp_te;
}
static inline swp_te_t shadow_swp_te(void *shadow)
{
- BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
- BITS_PER_BYTE * sizeof(swp_te_t));
- VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
swp_te_t swp_te = { .counter = ((unsigned long)shadow) };
+ BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) != BITS_PER_BYTE * sizeof(swp_te_t));
+ BUILD_BUG_ON((unsigned long)xa_mk_value(0) != ENTRY_SHADOW_MARK);
+ VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
+ swp_te.counter |= ENTRY_SHADOW_MARK;
+ return swp_te;
+}
+
+static inline swp_te_t bad_swp_te(void)
+{
+ swp_te_t swp_te = { .counter = ENTRY_BAD };
return swp_te;
}
@@ -43,7 +85,7 @@ static inline bool swp_te_is_null(swp_te_t swp_te)
static inline bool swp_te_is_folio(swp_te_t swp_te)
{
- return !xa_is_value((void *)swp_te.counter) && !swp_te_is_null(swp_te);
+ return ((swp_te.counter & ENTRY_PFN_LOW_MASK) == ENTRY_PFN_MARK);
}
static inline bool swp_te_is_shadow(swp_te_t swp_te)
@@ -51,19 +93,63 @@ static inline bool swp_te_is_shadow(swp_te_t swp_te)
return xa_is_value((void *)swp_te.counter);
}
+static inline bool swp_te_is_valid_shadow(swp_te_t swp_te)
+{
+ /* The shadow could be empty, just for holding the swap count */
+ return xa_is_value((void *)swp_te.counter) &&
+ xa_to_value((void *)swp_te.counter);
+}
+
+static inline bool swp_te_is_bad(swp_te_t swp_te)
+{
+ return swp_te.counter == ENTRY_BAD;
+}
+
+static inline bool __swp_te_is_countable(swp_te_t ent)
+{
+ return (swp_te_is_shadow(ent) || swp_te_is_folio(ent) ||
+ swp_te_is_null(ent));
+}
+
/*
* Helpers for retrieving info from swap table.
*/
static inline struct folio *swp_te_folio(swp_te_t swp_te)
{
VM_WARN_ON(!swp_te_is_folio(swp_te));
- return (void *)swp_te.counter;
+ return pfn_folio((swp_te.counter & ENTRY_PFN_MASK) >> ENTRY_PFN_SHIFT);
}
static inline void *swp_te_shadow(swp_te_t swp_te)
{
VM_WARN_ON(!swp_te_is_shadow(swp_te));
- return (void *)swp_te.counter;
+ return (void *)(swp_te.counter & ~ENTRY_COUNT_MASK);
+}
+
+static inline unsigned char swp_te_get_count(swp_te_t swp_te)
+{
+ VM_WARN_ON(!__swp_te_is_countable(swp_te));
+ return ((swp_te.counter & ENTRY_COUNT_MASK) >> ENTRY_COUNT_SHIFT);
+}
+
+static inline unsigned char swp_te_try_get_count(swp_te_t swp_te)
+{
+ if (__swp_te_is_countable(swp_te))
+ return swp_te_get_count(swp_te);
+ return 0;
+}
+
+static inline swp_te_t swp_te_set_count(swp_te_t swp_te,
+ unsigned char count)
+{
+ VM_BUG_ON(!__swp_te_is_countable(swp_te));
+ VM_BUG_ON(count > ENTRY_COUNT_MAX);
+
+ swp_te.counter &= ~ENTRY_COUNT_MASK;
+ swp_te.counter |= ((unsigned long)count) << ENTRY_COUNT_SHIFT;
+ VM_BUG_ON(swp_te_get_count(swp_te) != count);
+
+ return swp_te;
}
/*
@@ -87,17 +173,39 @@ static inline swp_te_t __swap_table_get(struct swap_cluster_info *ci, pgoff_t of
static inline void __swap_table_set_folio(struct swap_cluster_info *ci, pgoff_t off,
struct folio *folio)
{
- __swap_table_set(ci, off, folio_swp_te(folio));
+ swp_te_t swp_te;
+ unsigned char count;
+
+ swp_te = __swap_table_get(ci, off);
+ count = swp_te_get_count(swp_te);
+ swp_te = swp_te_set_count(folio_swp_te(folio), count);
+
+ __swap_table_set(ci, off, swp_te);
}
static inline void __swap_table_set_shadow(struct swap_cluster_info *ci, pgoff_t off,
void *shadow)
{
- __swap_table_set(ci, off, shadow_swp_te(shadow));
+ swp_te_t swp_te;
+ unsigned char count;
+
+ swp_te = __swap_table_get(ci, off);
+ count = swp_te_get_count(swp_te);
+ swp_te = swp_te_set_count(shadow_swp_te(shadow), count);
+
+ __swap_table_set(ci, off, swp_te);
}
static inline void __swap_table_set_null(struct swap_cluster_info *ci, pgoff_t off)
{
__swap_table_set(ci, off, null_swp_te());
}
+
+static inline void __swap_table_set_count(struct swap_cluster_info *ci, pgoff_t off,
+ unsigned char count)
+{
+ swp_te_t swp_te;
+ swp_te = swp_te_set_count(__swap_table_get(ci, off), count);
+ __swap_table_set(ci, off, swp_te);
+}
#endif
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 25/28] mm/workingset: leave highest 8 bits empty for anon shadow
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (23 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 26/28] mm, swap: minor clean up for swapon Kairui Song
` (2 subsequent siblings)
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The swap table entry will need 8 bits reserved for the swap count, so the
anon shadow should keep its highest 8 bits zero.
This should be OK for the foreseeable future. Take a 52-bit physical address
space as an example: with 4K pages, there are at most 40 bits of addressable
pages. Currently we have 36 bits available (with NODES_SHIFT set to 10; this
can be decreased for more bits), so in the worst case the refault distance
comparison is done per 64K-sized bucket.
This commit may increase the bucket size to 16M, which should be fine as the
working set size will be far larger than the bucket size on such large
machines.
For MGLRU, 28 bits can already track a huge number of gens, so there should
be no problem there either.
And the 8 bits can be changed to 6 or even fewer bits later.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
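A quick standalone check of the bucket-size arithmetic above, assuming the
52-bit address space, 4K pages, and 36 usable eviction bits mentioned in this
message (illustration only, not part of the patch):

#include <stdio.h>

int main(void)
{
	unsigned int page_shift = 12;			/* 4K pages */
	unsigned int max_page_bits = 52 - page_shift;	/* 40 bits of pages */
	unsigned int evict_bits_file = 36;		/* before reserving count bits */
	unsigned int evict_bits_anon = 36 - 8;		/* after reserving 8 count bits */

	unsigned int order_file = max_page_bits - evict_bits_file;
	unsigned int order_anon = max_page_bits - evict_bits_anon;

	printf("file bucket: %lu KiB\n", (1UL << (order_file + page_shift)) >> 10);
	printf("anon bucket: %lu MiB\n", (1UL << (order_anon + page_shift)) >> 20);
	return 0;
}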
mm/swap_table.h | 1 +
mm/workingset.c | 39 ++++++++++++++++++++++++++-------------
2 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 9356004d211a..afb2953d408a 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -65,6 +65,7 @@ static inline swp_te_t shadow_swp_te(void *shadow)
BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) != BITS_PER_BYTE * sizeof(swp_te_t));
BUILD_BUG_ON((unsigned long)xa_mk_value(0) != ENTRY_SHADOW_MARK);
VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
+ VM_WARN_ON((unsigned long)shadow & ENTRY_COUNT_MASK);
swp_te.counter |= ENTRY_SHADOW_MARK;
return swp_te;
}
diff --git a/mm/workingset.c b/mm/workingset.c
index 6e7f4cb1b9a7..86a549a17ae1 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -16,6 +16,7 @@
#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/mm.h>
+#include "swap_table.h"
#include "internal.h"
/*
@@ -184,7 +185,9 @@
#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \
WORKINGSET_SHIFT + NODES_SHIFT + \
MEM_CGROUP_ID_SHIFT)
+#define EVICTION_SHIFT_ANON (EVICTION_SHIFT + SWAP_COUNT_SHIFT)
#define EVICTION_MASK (~0UL >> EVICTION_SHIFT)
+#define EVICTION_MASK_ANON (~0UL >> EVICTION_SHIFT_ANON)
/*
* Eviction timestamps need to be able to cover the full range of
@@ -194,12 +197,16 @@
* that case, we have to sacrifice granularity for distance, and group
* evictions into coarser buckets by shaving off lower timestamp bits.
*/
-static unsigned int bucket_order __read_mostly;
+static unsigned int bucket_order[ANON_AND_FILE] __read_mostly;
static void *pack_shadow(int memcgid, pg_data_t *pgdat, unsigned long eviction,
- bool workingset)
+ bool workingset, bool file)
{
- eviction &= EVICTION_MASK;
+ if (file)
+ eviction &= EVICTION_MASK;
+ else
+ eviction &= EVICTION_MASK_ANON;
+
eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
eviction = (eviction << NODES_SHIFT) | pgdat->node_id;
eviction = (eviction << WORKINGSET_SHIFT) | workingset;
@@ -244,7 +251,8 @@ static void *lru_gen_eviction(struct folio *folio)
struct mem_cgroup *memcg = folio_memcg(folio);
struct pglist_data *pgdat = folio_pgdat(folio);
- BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH > BITS_PER_LONG - EVICTION_SHIFT);
+ BUILD_BUG_ON(LRU_GEN_WIDTH + LRU_REFS_WIDTH >
+ BITS_PER_LONG - max(EVICTION_SHIFT, EVICTION_SHIFT_ANON));
lruvec = mem_cgroup_lruvec(memcg, pgdat);
lrugen = &lruvec->lrugen;
@@ -254,7 +262,7 @@ static void *lru_gen_eviction(struct folio *folio)
hist = lru_hist_from_seq(min_seq);
atomic_long_add(delta, &lrugen->evicted[hist][type][tier]);
- return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset);
+ return pack_shadow(mem_cgroup_id(memcg), pgdat, token, workingset, type);
}
/*
@@ -381,6 +389,7 @@ void workingset_age_nonresident(struct lruvec *lruvec, unsigned long nr_pages)
void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
{
struct pglist_data *pgdat = folio_pgdat(folio);
+ int file = folio_is_file_lru(folio);
unsigned long eviction;
struct lruvec *lruvec;
int memcgid;
@@ -397,10 +406,10 @@ void *workingset_eviction(struct folio *folio, struct mem_cgroup *target_memcg)
/* XXX: target_memcg can be NULL, go through lruvec */
memcgid = mem_cgroup_id(lruvec_memcg(lruvec));
eviction = atomic_long_read(&lruvec->nonresident_age);
- eviction >>= bucket_order;
+ eviction >>= bucket_order[file];
workingset_age_nonresident(lruvec, folio_nr_pages(folio));
return pack_shadow(memcgid, pgdat, eviction,
- folio_test_workingset(folio));
+ folio_test_workingset(folio), folio_is_file_lru(folio));
}
/**
@@ -438,7 +447,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset,
rcu_read_lock();
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
- eviction <<= bucket_order;
+ eviction <<= bucket_order[file];
/*
* Look up the memcg associated with the stored ID. It might
@@ -780,8 +789,8 @@ static struct lock_class_key shadow_nodes_key;
static int __init workingset_init(void)
{
+ unsigned int timestamp_bits, timestamp_bits_anon;
struct shrinker *workingset_shadow_shrinker;
- unsigned int timestamp_bits;
unsigned int max_order;
int ret = -ENOMEM;
@@ -794,11 +803,15 @@ static int __init workingset_init(void)
* double the initial memory by using totalram_pages as-is.
*/
timestamp_bits = BITS_PER_LONG - EVICTION_SHIFT;
+ timestamp_bits_anon = BITS_PER_LONG - EVICTION_SHIFT_ANON;
max_order = fls_long(totalram_pages() - 1);
- if (max_order > timestamp_bits)
- bucket_order = max_order - timestamp_bits;
- pr_info("workingset: timestamp_bits=%d max_order=%d bucket_order=%u\n",
- timestamp_bits, max_order, bucket_order);
+ if (max_order > timestamp_bits)
+ bucket_order[WORKINGSET_FILE] = max_order - timestamp_bits;
+ if (max_order > timestamp_bits_anon)
+ bucket_order[WORKINGSET_ANON] = max_order - timestamp_bits_anon;
+ pr_info("workingset: timestamp_bits=%d (anon: %d) max_order=%d bucket_order=%u (anon: %d)\n",
+ timestamp_bits, timestamp_bits_anon, max_order,
+ bucket_order[WORKINGSET_FILE], bucket_order[WORKINGSET_ANON]);
workingset_shadow_shrinker = shrinker_alloc(SHRINKER_NUMA_AWARE |
SHRINKER_MEMCG_AWARE,
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 26/28] mm, swap: minor clean up for swapon
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (24 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 25/28] mm/workingset: leave highest 8 bits empty for anon shadow Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 27/28] mm, swap: use swap table to track swap count Kairui Song
2025-05-14 20:17 ` [PATCH 28/28] mm, swap: implement dynamic allocation of swap table Kairui Song
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Set up the cluster info first, as it is now the most basic info for a
swap device, and clean up how swap_map is set. There is no need to pass
these pointers as arguments through multiple functions; they are never
set to NULL once allocated, so just set them once.
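For reviewers, a condensed sketch of the resulting swapon() ordering; this
is only an illustration of the flow after this patch, with error paths and
unrelated steps omitted:

	/*
	 * swapon(), after this patch:
	 *
	 *   maxpages = read_swap_header(si, swap_header, inode);
	 *   si->max = maxpages;                  <- recorded once, cleared at swapoff
	 *   setup_swap_clusters_info(si, swap_header, maxpages);
	 *   si->swap_map = vzalloc(maxpages);    <- assigned directly, not passed around
	 *   enable_swap_info(si, prio, zeromap); <- no swap_map/cluster_info arguments
	 */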
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 62 +++++++++++++++++++++------------------------------
1 file changed, 26 insertions(+), 36 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 28bb0a74e4a6..c50cbf6578d3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2440,8 +2440,6 @@ static int swap_node(struct swap_info_struct *si)
}
static void setup_swap_info(struct swap_info_struct *si, int prio,
- unsigned char *swap_map,
- struct swap_cluster_info *cluster_info,
unsigned long *zeromap)
{
int i;
@@ -2465,8 +2463,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
si->avail_lists[i].prio = -si->prio;
}
}
- si->swap_map = swap_map;
- si->cluster_info = cluster_info;
si->zeromap = zeromap;
}
@@ -2493,13 +2489,11 @@ static void _enable_swap_info(struct swap_info_struct *si)
}
static void enable_swap_info(struct swap_info_struct *si, int prio,
- unsigned char *swap_map,
- struct swap_cluster_info *cluster_info,
unsigned long *zeromap)
{
spin_lock(&swap_lock);
spin_lock(&si->lock);
- setup_swap_info(si, prio, swap_map, cluster_info, zeromap);
+ setup_swap_info(si, prio, zeromap);
spin_unlock(&si->lock);
spin_unlock(&swap_lock);
/*
@@ -2517,7 +2511,7 @@ static void reinsert_swap_info(struct swap_info_struct *si)
{
spin_lock(&swap_lock);
spin_lock(&si->lock);
- setup_swap_info(si, si->prio, si->swap_map, si->cluster_info, si->zeromap);
+ setup_swap_info(si, si->prio, si->zeromap);
_enable_swap_info(si);
spin_unlock(&si->lock);
spin_unlock(&swap_lock);
@@ -2541,13 +2535,13 @@ static void wait_for_allocation(struct swap_info_struct *si)
}
}
-static void free_cluster_info(struct swap_cluster_info *cluster_info,
- unsigned long maxpages)
+static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
+ unsigned long max)
{
- int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-
+ int i, nr_clusters = DIV_ROUND_UP(max, SWAPFILE_CLUSTER);
if (!cluster_info)
return;
+ VM_WARN_ON(!nr_clusters);
for (i = 0; i < nr_clusters; i++)
cluster_table_free(&cluster_info[i]);
kvfree(cluster_info);
@@ -2580,7 +2574,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
struct swap_info_struct *p = NULL;
unsigned char *swap_map;
unsigned long *zeromap;
- struct swap_cluster_info *cluster_info;
struct file *swap_file, *victim;
struct address_space *mapping;
struct inode *inode;
@@ -2687,14 +2680,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
swap_file = p->swap_file;
p->swap_file = NULL;
- p->max = 0;
swap_map = p->swap_map;
p->swap_map = NULL;
zeromap = p->zeromap;
p->zeromap = NULL;
- cluster_info = p->cluster_info;
- free_cluster_info(cluster_info, p->max);
+ free_swap_cluster_info(p->cluster_info, p->max);
p->cluster_info = NULL;
+ p->max = 0;
spin_unlock(&p->lock);
spin_unlock(&swap_lock);
arch_swap_invalidate_area(p->type);
@@ -3067,7 +3059,6 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
if (nr_good_pages) {
swap_map[0] = SWAP_MAP_BAD;
- si->max = maxpages;
si->pages = nr_good_pages;
nr_extents = setup_swap_extents(si, span);
if (nr_extents < 0)
@@ -3089,13 +3080,12 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
#define SWAP_CLUSTER_COLS \
max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
-static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
- union swap_header *swap_header,
- unsigned long maxpages)
+static int setup_swap_clusters_info(struct swap_info_struct *si,
+ union swap_header *swap_header,
+ unsigned long maxpages)
{
unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
struct swap_cluster_info *cluster_info;
- int err = -ENOMEM;
unsigned long i;
cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
@@ -3151,11 +3141,12 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
list_add_tail(&ci->list, &si->free_clusters);
}
}
- return cluster_info;
+ si->cluster_info = cluster_info;
+ return 0;
err_free:
- free_cluster_info(cluster_info, maxpages);
+ free_swap_cluster_info(cluster_info, maxpages);
err:
- return ERR_PTR(err);
+ return -ENOMEM;
}
SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
@@ -3173,7 +3164,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
unsigned long maxpages;
unsigned char *swap_map = NULL;
unsigned long *zeromap = NULL;
- struct swap_cluster_info *cluster_info = NULL;
struct folio *folio = NULL;
struct inode *inode = NULL;
bool inced_nr_rotate_swap = false;
@@ -3241,13 +3231,19 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
swap_header = kmap_local_folio(folio, 0);
maxpages = read_swap_header(si, swap_header, inode);
+ si->max = maxpages;
if (unlikely(!maxpages)) {
error = -EINVAL;
goto bad_swap_unlock_inode;
}
+ error = setup_swap_clusters_info(si, swap_header, maxpages);
+ if (error)
+ goto bad_swap_unlock_inode;
+
/* OK, set up the swap map and apply the bad block list */
swap_map = vzalloc(maxpages);
+ si->swap_map = swap_map;
if (!swap_map) {
error = -ENOMEM;
goto bad_swap_unlock_inode;
@@ -3288,13 +3284,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
inced_nr_rotate_swap = true;
}
- cluster_info = setup_clusters(si, swap_header, maxpages);
- if (IS_ERR(cluster_info)) {
- error = PTR_ERR(cluster_info);
- cluster_info = NULL;
- goto bad_swap_unlock_inode;
- }
-
if ((swap_flags & SWAP_FLAG_DISCARD) &&
si->bdev && bdev_max_discard_sectors(si->bdev)) {
/*
@@ -3345,7 +3334,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
prio = -1;
if (swap_flags & SWAP_FLAG_PREFER)
prio = swap_flags & SWAP_FLAG_PRIO_MASK;
- enable_swap_info(si, prio, swap_map, cluster_info, zeromap);
+ enable_swap_info(si, prio, zeromap);
pr_info("Adding %uk swap on %s. Priority:%d extents:%d across:%lluk %s%s%s%s\n",
K(si->pages), name->name, si->prio, nr_extents,
@@ -3375,10 +3364,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
si->swap_file = NULL;
si->flags = 0;
spin_unlock(&swap_lock);
- vfree(swap_map);
+ vfree(si->swap_map);
+ si->swap_map = NULL;
+ free_swap_cluster_info(si->cluster_info, si->max);
+ si->cluster_info = NULL;
kvfree(zeromap);
- if (cluster_info)
- free_cluster_info(cluster_info, maxpages);
if (inced_nr_rotate_swap)
atomic_dec(&nr_rotate_swap);
if (swap_file)
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 27/28] mm, swap: use swap table to track swap count
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (25 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 26/28] mm, swap: minor clean up for swapon Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-14 20:17 ` [PATCH 28/28] mm, swap: implement dynamic allocation of swap table Kairui Song
27 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that all the data is in place, switch to using the swap table only.
Introduce a new set of functions based on swap table for manipulating
swap counts:
- swap_dup_entry_locked
Increase the swap count of one swap entry. The entry must be allocated
by folio_alloc_swap.
- swap_dup_entries
Increase the swap count of a set of swap entries. The entries must be
allocated by folio_alloc_swap.
- swap_put_entry_locked (already exists, but rewritten)
Decrease the swap count of one swap entry. The entry must be allocated
by folio_alloc_swap.
This won't free an entry completely if it is still bound to a folio in
the swap cache; such entries are freed when the swap cache is freed.
- swap_put_entries (already exists, but rewritten)
Decrease the swap count of a set of swap entries. The entries must be
allocated by folio_alloc_swap.
This won't free an entry completely if it is still bound to a folio in
the swap cache; such entries are freed when the swap cache is freed.
Use these helpers to replace all existing callers. This simplifies the
count tracking by a lot, and the swap_map array is gone; a rough sketch
of the resulting count lookup follows below.
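Not part of the patch, just for review: a simplified sketch of how a
count lookup resolves once swap_map is gone, assuming the 8-bit in-table
count (ENTRY_COUNT_MAX) introduced earlier in the series and the
per-cluster extend_table added here; it mirrors what swp_swapcount()
does in this patch:

/* Caller holds ci->lock, like swp_swapcount() does. */
static unsigned long sketch_entry_count(struct swap_cluster_info *ci,
					unsigned long offset)
{
	unsigned long count = swp_te_get_count(__swap_table_get(ci, offset));

	/* A saturated in-table count means the real count lives in extend_table */
	if (count == ENTRY_COUNT_MAX && ci->extend_table)
		count = ci->extend_table[offset % SWAPFILE_CLUSTER];
	return count;
}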
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 11 +-
mm/memory.c | 2 +-
mm/swap.h | 9 +-
mm/swap_state.c | 19 +-
mm/swapfile.c | 671 +++++++++++++++----------------------------
5 files changed, 253 insertions(+), 459 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 60b126918399..b9796bb9e7e7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,8 +264,7 @@ struct swap_info_struct {
signed short prio; /* swap priority of this type */
struct plist_node list; /* entry in swap_active_head */
signed char type; /* strange name for an index */
- unsigned int max; /* extent of the swap_map */
- unsigned char *swap_map; /* vmalloc'ed array of usage counts */
+ unsigned int max; /* size of this swap device */
unsigned long *zeromap; /* kvmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
@@ -284,7 +283,7 @@ struct swap_info_struct {
struct completion comp; /* seldom referenced */
spinlock_t lock; /*
* protect map scan related fields like
- * swap_map, lowest_bit, highest_bit,
+ * lowest_bit, highest_bit,
* inuse_pages, cluster_next,
* cluster_nr, lowest_alloc,
* highest_alloc, free/discard cluster
@@ -437,7 +436,6 @@ static inline long get_nr_swap_pages(void)
extern void si_swapinfo(struct sysinfo *);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -502,11 +500,6 @@ static inline void free_swap_cache(struct folio *folio)
{
}
-static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
-{
- return 0;
-}
-
static inline int do_dup_swap_entry(swp_entry_t ent)
{
return 0;
diff --git a/mm/memory.c b/mm/memory.c
index a9a548575e72..a33f860317f5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1207,7 +1207,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
if (ret == -EIO) {
VM_WARN_ON_ONCE(!entry.val);
- if (add_swap_count_continuation(entry, GFP_KERNEL) < 0) {
+ if (swap_retry_table_alloc(entry, GFP_KERNEL) < 0) {
ret = -ENOMEM;
goto out;
}
diff --git a/mm/swap.h b/mm/swap.h
index 7cbfca39225f..228195e54c9d 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -39,6 +39,7 @@ struct swap_cluster_info {
u8 flags;
u8 order;
swp_te_t *table;
+ unsigned long *extend_table; /* Only used for extended swap count */
struct list_head list;
};
@@ -135,6 +136,8 @@ static inline void swap_unlock_cluster_irq(struct swap_cluster_info *ci)
spin_unlock_irq(&ci->lock);
}
+extern int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp);
+
/*
* All swap entries starts getting allocated by folio_alloc_swap(),
* and the folio will be added to swap cache.
@@ -198,7 +201,6 @@ extern int __swap_cache_replace_folio(struct swap_cluster_info *ci,
extern void __swap_cache_override_folio(struct swap_cluster_info *ci,
swp_entry_t entry, struct folio *old,
struct folio *new);
-extern void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
/*
* Return the swap device position of the swap entry.
@@ -364,6 +366,11 @@ static inline int swap_writepage(struct page *p, struct writeback_control *wbc)
return 0;
}
+static inline int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
+{
+ return -EINVAL;
+}
+
static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2b145c0f7773..b08d26e7dda5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -169,7 +169,7 @@ struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
existing = swp_te_folio(exist);
goto out_failed;
}
- if (!__swap_count(swp_entry(si->type, offset)))
+ if (!swp_te_get_count(exist))
goto out_failed;
if (shadow && swp_te_is_shadow(exist))
*shadow = swp_te_shadow(exist);
@@ -234,7 +234,7 @@ void __swap_cache_del_folio(swp_entry_t entry,
exist = __swap_table_get(ci, offset);
VM_WARN_ON_ONCE(swp_te_folio(exist) != folio);
__swap_table_set_shadow(ci, offset, shadow);
- if (__swap_count(swp_entry(si->type, offset)))
+ if (swp_te_get_count(exist))
folio_swapped = true;
else
need_free = true;
@@ -250,7 +250,7 @@ void __swap_cache_del_folio(swp_entry_t entry,
} else if (need_free) {
offset = start;
do {
- if (!__swap_count(swp_entry(si->type, offset)))
+ if (!swp_te_get_count(__swap_table_get(ci, offset)))
__swap_free_entries(si, ci, offset, 1);
} while (++offset < end);
}
@@ -270,19 +270,6 @@ void *swap_cache_get_shadow(swp_entry_t entry)
return swp_te_is_shadow(swp_te) ? swp_te_shadow(swp_te) : NULL;
}
-void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
-{
- struct swap_cluster_info *ci;
- pgoff_t offset = swp_offset(entry), end;
-
- ci = swp_offset_cluster(swp_info(entry), offset);
- end = offset + nr_ents;
- do {
- WARN_ON_ONCE(swp_te_is_folio(__swap_table_get(ci, offset)));
- __swap_table_set_null(ci, offset);
- } while (++offset < end);
-}
-
/*
* Lookup a swap entry in the swap cache. A found folio will be returned
* unlocked and with its refcount incremented.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c50cbf6578d3..17b592e938bc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -52,15 +52,8 @@
#include "swap_table.h"
#include "swap.h"
-static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
- unsigned char);
-static void free_swap_count_continuations(struct swap_info_struct *);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
-static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry);
static bool folio_swapcache_freeable(struct folio *folio);
static DEFINE_SPINLOCK(swap_lock);
@@ -172,21 +165,18 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
/* Reclaim the swap entry if swap is getting full */
#define TTRS_FULL 0x4
-static bool swap_only_has_cache(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long offset, int nr_pages)
+static bool swap_only_has_cache(struct swap_cluster_info *ci,
+ unsigned long start, int nr_pages)
{
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
+ unsigned long offset = start, end = start + nr_pages;
swp_te_t entry;
do {
entry = __swap_table_get(ci, offset);
VM_WARN_ON_ONCE(!swp_te_is_folio(entry));
- if (*map)
+ if (swp_te_get_count(entry))
return false;
- offset++;
- } while (++map < map_end);
+ } while (++offset < end);
return true;
}
@@ -247,7 +237,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* reference or pending writeback, and can't be allocated to others.
*/
ci = swap_lock_cluster(si, offset);
- need_reclaim = swap_only_has_cache(si, ci, offset, nr_pages);
+ need_reclaim = swap_only_has_cache(ci, offset, nr_pages);
swap_unlock_cluster(ci);
if (!need_reclaim)
goto out_unlock;
@@ -434,13 +424,16 @@ static int cluster_table_alloc(struct swap_cluster_info *ci)
static void cluster_table_free(struct swap_cluster_info *ci)
{
+ swp_te_t swp_te;
unsigned int offset;
if (!ci->table)
return;
- for (offset = 0; offset <= SWAPFILE_CLUSTER; offset++)
- WARN_ON(!swp_te_is_null(__swap_table_get(ci, offset)));
+ for (offset = 0; offset < SWAPFILE_CLUSTER; offset++) {
+ swp_te = __swap_table_get(ci, offset);
+ WARN_ON_ONCE(!swp_te_is_null(swp_te) && !swp_te_is_bad(swp_te));
+ }
kfree(ci->table);
ci->table = NULL;
@@ -649,13 +642,14 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
struct swap_cluster_info *ci,
unsigned long start, unsigned long end)
{
- unsigned char *map = si->swap_map;
unsigned long offset = start;
+ swp_te_t entry;
int nr_reclaim;
spin_unlock(&ci->lock);
do {
- if (READ_ONCE(map[offset]))
+ entry = __swap_table_get(ci, offset);
+ if (swp_te_get_count(entry))
break;
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
if (nr_reclaim > 0)
@@ -668,9 +662,11 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
* Recheck the range no matter reclaim succeeded or not, the slot
* could have been be freed while we are not holding the lock.
*/
- for (offset = start; offset < end; offset++)
- if (map[offset] || !swp_te_is_null(__swap_table_get(ci, offset)))
+ for (offset = start; offset < end; offset++) {
+ entry = __swap_table_get(ci, offset);
+ if (!swp_te_is_null(entry))
return false;
+ }
return true;
}
@@ -681,21 +677,27 @@ static bool cluster_scan_range(struct swap_info_struct *si,
bool *need_reclaim)
{
unsigned long offset, end = start + nr_pages;
- unsigned char *map = si->swap_map;
+ swp_te_t entry;
if (cluster_is_empty(ci))
return true;
for (offset = start; offset < end; offset++) {
- if (map[offset])
+ entry = __swap_table_get(ci, offset);
+ if (swp_te_get_count(entry))
return false;
- if (swp_te_is_folio(__swap_table_get(ci, offset))) {
+ if (swp_te_is_folio(entry)) {
if (!vm_swap_full())
return false;
*need_reclaim = true;
+ } else {
+ /*
+ * Something leaked, should never see anything
+ * with zero count other than clean cached folio.
+ */
+ WARN_ON_ONCE(!swp_te_is_null(entry));
}
}
-
return true;
}
@@ -725,8 +727,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
__swap_cache_add_folio(entry, ci, folio);
} else {
/* from get_swap_page_of_type */
- VM_WARN_ON_ONCE(si->swap_map[offset] || swap_cache_check_folio(entry));
- si->swap_map[offset] = 1;
+ VM_WARN_ON_ONCE(swap_cache_check_folio(entry));
+ __swap_table_set(ci, offset, swp_te_set_count(null_swp_te(), 1));
}
return true;
@@ -795,7 +797,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
long to_scan = 1;
unsigned long offset, end;
struct swap_cluster_info *ci;
- unsigned char *map = si->swap_map;
+ swp_te_t entry;
int nr_reclaim;
if (force)
@@ -807,8 +809,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (!map[offset] &&
- swp_te_is_folio(__swap_table_get(ci, offset))) {
+ entry = __swap_table_get(ci, offset);
+ if (swp_te_is_folio(entry) && !swp_te_get_count(entry)) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
@@ -1089,7 +1091,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
si->bdev->bd_disk->fops->swap_slot_free_notify;
else
swap_slot_free_notify = NULL;
- __swap_cache_clear_shadow(swp_entry(si->type, offset), nr_entries);
while (offset <= end) {
arch_swap_invalidate_page(si->type, offset);
if (swap_slot_free_notify)
@@ -1196,6 +1197,95 @@ static void swap_alloc_slow(struct folio *folio)
spin_unlock(&swap_avail_lock);
}
+static int swap_extend_table_alloc(struct swap_info_struct *si,
+ struct swap_cluster_info *ci, gfp_t gfp)
+{
+ void *table;
+ table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, gfp);
+ if (!table)
+ return -ENOMEM;
+
+ spin_lock(&ci->lock);
+ if (!ci->extend_table)
+ ci->extend_table = table;
+ else
+ kfree(table);
+ spin_unlock(&ci->lock);
+ return 0;
+}
+
+int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
+{
+ int ret;
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset = swp_offset(entry);
+
+ si = get_swap_device(entry);
+ if (!si)
+ return 0;
+
+ ci = swp_offset_cluster(si, offset);
+ ret = swap_extend_table_alloc(si, ci, gfp);
+
+ put_swap_device(si);
+ return ret;
+}
+
+static void swap_extend_table_try_free(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
+{
+ unsigned long i;
+ bool can_free = true;
+
+ for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+ if (ci->extend_table[i])
+ can_free = false;
+ }
+
+ if (can_free) {
+ kfree(ci->extend_table);
+ ci->extend_table = NULL;
+ }
+}
+
+static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset)
+{
+ unsigned long count;
+ swp_te_t entry;
+
+ lockdep_assert_held(&ci->lock);
+
+ entry = __swap_table_get(ci, offset);
+ count = swp_te_get_count(entry);
+
+ VM_WARN_ON_ONCE(!count);
+ VM_WARN_ON_ONCE(count > ENTRY_COUNT_MAX);
+
+ if (count == ENTRY_COUNT_MAX) {
+ count = ci->extend_table[offset % SWAPFILE_CLUSTER];
+ VM_WARN_ON_ONCE(count < ENTRY_COUNT_MAX);
+ count--;
+ if (count == (ENTRY_COUNT_MAX - 1)) {
+ ci->extend_table[offset % SWAPFILE_CLUSTER] = 0;
+ __swap_table_set(ci, offset, swp_te_set_count(entry, count));
+ swap_extend_table_try_free(si, ci);
+ } else {
+ ci->extend_table[offset % SWAPFILE_CLUSTER] = count;
+ }
+ } else {
+ count--;
+ __swap_table_set(ci, offset, swp_te_set_count(entry, count));
+ }
+
+ if (!count && !swp_te_is_folio(__swap_table_get(ci, offset)))
+ __swap_free_entries(si, ci, offset, 1);
+
+ return count;
+}
+
/*
* Put the ref count of entries, caller must ensure the entries'
* swap table count are not zero. This won't free up the swap cache.
@@ -1214,7 +1304,7 @@ static bool swap_put_entries(struct swap_info_struct *si,
cluster_end = min(cluster_offset(si, ci) + SWAPFILE_CLUSTER, end);
do {
swp_te = __swap_table_get(ci, offset);
- count = si->swap_map[offset];
+ count = swp_te_get_count(swp_te);
if (WARN_ON_ONCE(!count)) {
goto skip;
} else if (swp_te_is_folio(swp_te)) {
@@ -1225,7 +1315,7 @@ static bool swap_put_entries(struct swap_info_struct *si,
head = head ? head : offset;
continue;
}
- swap_put_entry_locked(si, ci, swp_entry(si->type, offset));
+ swap_put_entry_locked(si, ci, offset);
skip:
if (head) {
__swap_free_entries(si, ci, head, offset - head);
@@ -1245,6 +1335,78 @@ static bool swap_put_entries(struct swap_info_struct *si,
return has_cache;
}
+static int swap_dup_entry_locked(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset)
+{
+ swp_te_t entry = __swap_table_get(ci, offset);
+ unsigned int count = swp_te_get_count(entry);
+
+ lockdep_assert_held(&ci->lock);
+
+ if (WARN_ON_ONCE(count == ENTRY_COUNT_BAD))
+ return -ENOENT;
+ if (WARN_ON_ONCE(!count && !swp_te_is_folio(entry)))
+ return -ENOENT;
+ if (WARN_ON_ONCE(offset > si->max))
+ return -EINVAL;
+
+ if (likely(count < (ENTRY_COUNT_MAX - 1))) {
+ __swap_table_set_count(ci, offset, count + 1);
+ VM_WARN_ON_ONCE(ci->extend_table && ci->extend_table[offset % SWAPFILE_CLUSTER]);
+ } else if (count == (ENTRY_COUNT_MAX - 1)) {
+ if (ci->extend_table) {
+ VM_WARN_ON_ONCE(ci->extend_table[offset % SWAPFILE_CLUSTER]);
+ ci->extend_table[offset % SWAPFILE_CLUSTER] = ENTRY_COUNT_MAX;
+ __swap_table_set_count(ci, offset, ENTRY_COUNT_MAX);
+ } else {
+ return -ENOMEM;
+ }
+ } else if (count == ENTRY_COUNT_MAX) {
+ ++ci->extend_table[offset % SWAPFILE_CLUSTER];
+ } else {
+ /* Never happens unless counting went wrong */
+ WARN_ON_ONCE(1);
+ }
+
+ return 0;
+}
+
+/*
+ * Increase the swap count of each specified entry by one.
+ */
+static int swap_dup_entries(struct swap_info_struct *si,
+ unsigned long start, int nr)
+{
+ int err;
+ struct swap_cluster_info *ci;
+ unsigned long offset = start, end = start + nr;
+
+ ci = swap_lock_cluster(si, offset);
+ VM_WARN_ON_ONCE(ci != swp_offset_cluster(si, start + nr - 1));
+restart:
+ do {
+ err = swap_dup_entry_locked(si, ci, offset);
+ if (unlikely(err)) {
+ if (err == -ENOMEM) {
+ swap_unlock_cluster(ci);
+ err = swap_extend_table_alloc(si, ci, GFP_ATOMIC);
+ ci = swap_lock_cluster(si, offset);
+ if (!err)
+ goto restart;
+ }
+ goto failed;
+ }
+ } while (++offset < end);
+ swap_unlock_cluster(ci);
+ return 0;
+failed:
+ while (offset-- > start)
+ swap_put_entry_locked(si, ci, offset);
+ swap_unlock_cluster(ci);
+ return err;
+}
+
/**
* folio_alloc_swap - allocate swap space for a folio
* @folio: folio we want to move to swap
@@ -1302,7 +1464,6 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
*/
int folio_dup_swap(struct folio *folio, struct page *subpage)
{
- int err = 0;
swp_entry_t entry = folio->swap;
unsigned long nr_pages = folio_nr_pages(folio);
@@ -1314,10 +1475,7 @@ int folio_dup_swap(struct folio *folio, struct page *subpage)
nr_pages = 1;
}
- while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
- err = add_swap_count_continuation(entry, GFP_ATOMIC);
-
- return err;
+ return swap_dup_entries(swp_info(entry), swp_offset(entry), nr_pages);
}
/*
@@ -1364,31 +1522,6 @@ void folio_free_swap_cache(struct folio *folio)
folio_ref_sub(folio, folio_nr_pages(folio));
}
-static unsigned char swap_put_entry_locked(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry)
-{
- unsigned long offset = swp_offset(entry);
- unsigned char count;
-
- count = si->swap_map[offset];
- if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
- if (count == COUNT_CONTINUED) {
- if (swap_count_continued(si, offset, count))
- count = SWAP_MAP_MAX | COUNT_CONTINUED;
- else
- count = SWAP_MAP_MAX;
- } else
- count--;
- }
-
- WRITE_ONCE(si->swap_map[offset], count);
- if (!count && !swp_te_is_folio(__swap_table_get(ci, offset)))
- __swap_free_entries(si, ci, offset, 1);
-
- return count;
-}
-
/*
* When we get a swap entry, if there aren't some other ways to
* prevent swapoff, such as the folio in swap cache is locked, RCU
@@ -1457,10 +1590,10 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
void __swap_free_entries(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset, unsigned int nr_pages)
+ unsigned long start, unsigned int nr_pages)
{
- swp_entry_t entry = swp_entry(si->type, offset);
- unsigned long end = offset + nr_pages;
+ swp_entry_t entry = swp_entry(si->type, start);
+ unsigned long offset = start, end = start + nr_pages;
/* It should never free entries across different clusters */
VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
@@ -1469,12 +1602,13 @@ void __swap_free_entries(struct swap_info_struct *si,
ci->count -= nr_pages;
do {
- si->swap_map[offset] = 0;
+ /* It should be either a real shadow or empty shadow */
+ VM_WARN_ON_ONCE(!swp_te_is_shadow(__swap_table_get(ci, offset)));
+ __swap_table_set_null(ci, offset);
} while (++offset < end);
- offset = swp_offset(entry);
mem_cgroup_uncharge_swap(entry, nr_pages);
- swap_range_free(si, offset, nr_pages);
+ swap_range_free(si, start, nr_pages);
if (!ci->count)
free_cluster(si, ci);
@@ -1485,9 +1619,10 @@ void __swap_free_entries(struct swap_info_struct *si,
int __swap_count(swp_entry_t entry)
{
struct swap_info_struct *si = swp_info(entry);
+ struct swap_cluster_info *ci;
pgoff_t offset = swp_offset(entry);
-
- return si->swap_map[offset];
+ ci = swp_offset_cluster(si, offset);
+ return swp_te_get_count(__swap_table_get(ci, offset));
}
/*
@@ -1499,12 +1634,13 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
{
pgoff_t offset = swp_offset(entry);
struct swap_cluster_info *ci;
- int count;
+ swp_te_t swp_te;
ci = swap_lock_cluster(si, offset);
- count = si->swap_map[offset];
+ swp_te = __swap_table_get(ci, offset);
swap_unlock_cluster(ci);
- return !!count;
+
+ return __swp_te_is_countable(swp_te) && swp_te_get_count(swp_te);
}
/*
@@ -1513,42 +1649,22 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
*/
int swp_swapcount(swp_entry_t entry)
{
- int count, tmp_count, n;
struct swap_info_struct *si;
struct swap_cluster_info *ci;
- struct page *page;
+ swp_te_t ste;
pgoff_t offset;
- unsigned char *map;
+ int count;
si = get_swap_device(entry);
if (!si)
return 0;
offset = swp_offset(entry);
-
ci = swap_lock_cluster(si, offset);
-
- count = si->swap_map[offset];
- if (!(count & COUNT_CONTINUED))
- goto out;
-
- count &= ~COUNT_CONTINUED;
- n = SWAP_MAP_MAX + 1;
-
- page = vmalloc_to_page(si->swap_map + offset);
- offset &= ~PAGE_MASK;
- VM_BUG_ON(page_private(page) != SWP_CONTINUED);
-
- do {
- page = list_next_entry(page, lru);
- map = kmap_local_page(page);
- tmp_count = map[offset];
- kunmap_local(map);
-
- count += (tmp_count & ~COUNT_CONTINUED) * n;
- n *= (SWAP_CONT_MAX + 1);
- } while (tmp_count & COUNT_CONTINUED);
-out:
+ ste = __swap_table_get(ci, offset);
+ count = swp_te_get_count(ste);
+ if (count == ENTRY_COUNT_MAX)
+ count = ci->extend_table[offset % SWAPFILE_CLUSTER];
swap_unlock_cluster(ci);
put_swap_device(si);
return count;
@@ -1558,7 +1674,6 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
swp_entry_t entry, int order)
{
struct swap_cluster_info *ci;
- unsigned char *map = si->swap_map;
unsigned int nr_pages = 1 << order;
unsigned long roffset = swp_offset(entry);
unsigned long offset = round_down(roffset, nr_pages);
@@ -1567,12 +1682,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
ci = swap_lock_cluster(si, offset);
if (nr_pages == 1) {
- if (map[roffset])
+ if (swp_te_get_count(__swap_table_get(ci, roffset)))
ret = true;
goto unlock_out;
}
for (i = 0; i < nr_pages; i++) {
- if (map[offset + i]) {
+ if (swp_te_get_count(__swap_table_get(ci, offset + i))) {
ret = true;
break;
}
@@ -1678,7 +1793,7 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
/*
* First free all entries in the range.
*/
- any_only_cache = swap_put_entries(swp_info(entry), swp_offset(entry), nr);
+ any_only_cache = swap_put_entries(si, start_offset, nr);
/*
* Short-circuit the below loop if none of the entries had their
@@ -1693,7 +1808,7 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
for (offset = start_offset; offset < end_offset; offset += nr) {
nr = 1;
swp_te = __swap_table_get(swp_offset_cluster(si, offset), offset);
- if (!READ_ONCE(si->swap_map[offset]) && swp_te_is_folio(swp_te)) {
+ if (swp_te_is_folio(swp_te) && !swp_te_get_count(swp_te)) {
/*
* Folios are always naturally aligned in swap so
* advance forward to the next boundary. Zero means no
@@ -1752,7 +1867,7 @@ void free_swap_page_of_entry(swp_entry_t entry)
if (!si)
return;
ci = swap_lock_cluster(si, offset);
- WARN_ON(swap_put_entry_locked(si, ci, entry));
+ WARN_ON(swap_put_entry_locked(si, ci, offset));
/* It might got added to swap cache accidentally by read ahead */
__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
swap_unlock_cluster(ci);
@@ -2012,7 +2127,8 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
&vmf);
}
if (!folio) {
- swp_count = READ_ONCE(si->swap_map[offset]);
+ swp_count = swp_te_get_count(__swap_table_get(swp_cluster(entry),
+ swp_offset(entry)));
if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
continue;
return -ENOMEM;
@@ -2144,7 +2260,7 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)
}
/*
- * Scan swap_map from current position to next entry still in use.
+ * Scan swap table from current position to next entry still in use.
* Return 0 if there are no inuse entries after prev till end of
* the map.
*/
@@ -2154,7 +2270,6 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
{
unsigned int i;
swp_te_t swp_te;
- unsigned char count;
/*
* No need for swap_lock here: we're just looking
@@ -2163,11 +2278,8 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
* allocations from this area (while holding swap_lock).
*/
for (i = prev + 1; i < si->max; i++) {
- count = READ_ONCE(si->swap_map[i]);
swp_te = __swap_table_get(swp_offset_cluster(si, i), i);
- if (count == SWAP_MAP_BAD)
- continue;
- if (count || swp_te_is_folio(swp_te))
+ if (!swp_te_is_null(swp_te) && !swp_te_is_bad(swp_te))
break;
if ((i % LATENCY_LIMIT) == 0)
cond_resched();
@@ -2572,7 +2684,6 @@ static void flush_percpu_swap_cluster(struct swap_info_struct *si)
SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
{
struct swap_info_struct *p = NULL;
- unsigned char *swap_map;
unsigned long *zeromap;
struct file *swap_file, *victim;
struct address_space *mapping;
@@ -2668,7 +2779,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
destroy_swap_extents(p);
- if (p->flags & SWP_CONTINUED)
- free_swap_count_continuations(p);
if (!p->bdev || !bdev_nonrot(p->bdev))
atomic_dec(&nr_rotate_swap);
@@ -2680,8 +2790,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
swap_file = p->swap_file;
p->swap_file = NULL;
- swap_map = p->swap_map;
- p->swap_map = NULL;
zeromap = p->zeromap;
p->zeromap = NULL;
free_swap_cluster_info(p->cluster_info, p->max);
@@ -2694,7 +2802,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
mutex_unlock(&swapon_mutex);
kfree(p->global_cluster);
p->global_cluster = NULL;
- vfree(swap_map);
kvfree(zeromap);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
@@ -2754,7 +2861,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
return SEQ_START_TOKEN;
for (type = 0; (si = swp_type_get_info(type)); type++) {
- if (!(si->flags & SWP_USED) || !si->swap_map)
+ if (!(si->flags & SWP_USED))
continue;
if (!--l)
return si;
@@ -2775,7 +2882,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
++(*pos);
for (; (si = swp_type_get_info(type)); type++) {
- if (!(si->flags & SWP_USED) || !si->swap_map)
+ if (!(si->flags & SWP_USED))
continue;
return si;
}
@@ -3037,7 +3144,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
static int setup_swap_map_and_extents(struct swap_info_struct *si,
union swap_header *swap_header,
- unsigned char *swap_map,
unsigned long maxpages,
sector_t *span)
{
@@ -3052,13 +3158,14 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
if (page_nr == 0 || page_nr > swap_header->info.last_page)
return -EINVAL;
if (page_nr < maxpages) {
- swap_map[page_nr] = SWAP_MAP_BAD;
+ __swap_table_set(&si->cluster_info[page_nr / SWAPFILE_CLUSTER],
+ page_nr, bad_swp_te());
nr_good_pages--;
}
}
if (nr_good_pages) {
- swap_map[0] = SWAP_MAP_BAD;
+ __swap_table_set(&si->cluster_info[0], 0, bad_swp_te());
si->pages = nr_good_pages;
nr_extents = setup_swap_extents(si, span);
if (nr_extents < 0)
@@ -3162,7 +3269,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
int nr_extents;
sector_t span;
unsigned long maxpages;
- unsigned char *swap_map = NULL;
unsigned long *zeromap = NULL;
struct folio *folio = NULL;
struct inode *inode = NULL;
@@ -3241,20 +3347,11 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
if (error)
goto bad_swap_unlock_inode;
- /* OK, set up the swap map and apply the bad block list */
- swap_map = vzalloc(maxpages);
- si->swap_map = swap_map;
- if (!swap_map) {
- error = -ENOMEM;
- goto bad_swap_unlock_inode;
- }
-
error = swap_cgroup_swapon(si->type, maxpages);
if (error)
goto bad_swap_unlock_inode;
- nr_extents = setup_swap_map_and_extents(si, swap_header, swap_map,
- maxpages, &span);
+ nr_extents = setup_swap_map_and_extents(si, swap_header, maxpages, &span);
if (unlikely(nr_extents < 0)) {
error = nr_extents;
goto bad_swap_unlock_inode;
@@ -3364,8 +3461,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
si->swap_file = NULL;
si->flags = 0;
spin_unlock(&swap_lock);
- vfree(si->swap_map);
- si->swap_map = NULL;
free_swap_cluster_info(si->cluster_info, si->max);
si->cluster_info = NULL;
kvfree(zeromap);
@@ -3400,80 +3495,6 @@ void si_swapinfo(struct sysinfo *val)
spin_unlock(&swap_lock);
}
-/*
- * Verify that nr swap entries are valid and increment their swap map counts.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int swap_dup_entries(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned char usage, int nr)
-{
- int i;
- unsigned char count;
-
- for (i = 0; i < nr; i++) {
- count = si->swap_map[offset + i];
-
- /*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
- */
- if (unlikely(count == SWAP_MAP_BAD))
- return -ENOENT;
-
- if (!count && !swp_te_is_folio(__swap_table_get(ci, offset)))
- return -ENOENT;
- }
-
- for (i = 0; i < nr; i++) {
- count = si->swap_map[offset + i];
- if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
- count += usage;
- else if (swap_count_continued(si, offset + i, count))
- count = COUNT_CONTINUED;
- else {
- /*
- * Don't need to rollback changes, because if
- * usage == 1, there must be nr == 1.
- */
- return -ENOMEM;
- }
-
- WRITE_ONCE(si->swap_map[offset + i], count);
- }
-
- return 0;
-}
-
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
-{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset;
- int err;
-
- si = swp_get_info(entry);
- if (WARN_ON_ONCE(!si)) {
- pr_err("%s%08lx\n", Bad_file, entry.val);
- return -EINVAL;
- }
-
- offset = swp_offset(entry);
- VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- ci = swap_lock_cluster(si, offset);
- err = swap_dup_entries(si, ci, offset, usage, nr);
- swap_unlock_cluster(ci);
-
- return err;
-}
-
/**
* do_dup_swap_entry() - Increase reference count of a swap entry by one.
*
@@ -3489,233 +3510,19 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* to protect the entry from being freed.
*/
int do_dup_swap_entry(swp_entry_t entry)
-{
- int err = 0;
- while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
- err = add_swap_count_continuation(entry, GFP_ATOMIC);
- return err;
-}
-
-/*
- * add_swap_count_continuation - called when a swap count is duplicated
- * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
- * page of the original vmalloc'ed swap_map, to hold the continuation count
- * (for that entry and for its neighbouring PAGE_SIZE swap entries). Called
- * again when count is duplicated beyond SWAP_MAP_MAX * SWAP_CONT_MAX, etc.
- *
- * These continuation pages are seldom referenced: the common paths all work
- * on the original swap_map, only referring to a continuation page when the
- * low "digit" of a count is incremented or decremented through SWAP_MAP_MAX.
- *
- * add_swap_count_continuation(, GFP_ATOMIC) can be called while holding
- * page table locks; if it fails, add_swap_count_continuation(, GFP_KERNEL)
- * can be called after dropping locks.
- */
-int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
{
struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- struct page *head;
- struct page *page;
- struct page *list_page;
- pgoff_t offset;
- unsigned char count;
- int ret = 0;
-
- /*
- * When debugging, it's easier to use __GFP_ZERO here; but it's better
- * for latency not to zero a page while GFP_ATOMIC and holding locks.
- */
- page = alloc_page(gfp_mask | __GFP_HIGHMEM);
-
- si = get_swap_device(entry);
- if (!si) {
- /*
- * An acceptable race has occurred since the failing
- * __swap_duplicate(): the swap device may be swapoff
- */
- goto outer;
- }
-
- offset = swp_offset(entry);
-
- ci = swap_lock_cluster(si, offset);
-
- count = si->swap_map[offset];
-
- if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
- /*
- * The higher the swap count, the more likely it is that tasks
- * will race to add swap count continuation: we need to avoid
- * over-provisioning.
- */
- goto out;
- }
-
- if (!page) {
- ret = -ENOMEM;
- goto out;
- }
-
- head = vmalloc_to_page(si->swap_map + offset);
- offset &= ~PAGE_MASK;
-
- spin_lock(&si->cont_lock);
- /*
- * Page allocation does not initialize the page's lru field,
- * but it does always reset its private field.
- */
- if (!page_private(head)) {
- BUG_ON(count & COUNT_CONTINUED);
- INIT_LIST_HEAD(&head->lru);
- set_page_private(head, SWP_CONTINUED);
- si->flags |= SWP_CONTINUED;
- }
-
- list_for_each_entry(list_page, &head->lru, lru) {
- unsigned char *map;
-
- /*
- * If the previous map said no continuation, but we've found
- * a continuation page, free our allocation and use this one.
- */
- if (!(count & COUNT_CONTINUED))
- goto out_unlock_cont;
-
- map = kmap_local_page(list_page) + offset;
- count = *map;
- kunmap_local(map);
-
- /*
- * If this continuation count now has some space in it,
- * free our allocation and use this one.
- */
- if ((count & ~COUNT_CONTINUED) != SWAP_CONT_MAX)
- goto out_unlock_cont;
- }
-
- list_add_tail(&page->lru, &head->lru);
- page = NULL; /* now it's attached, don't free it */
-out_unlock_cont:
- spin_unlock(&si->cont_lock);
-out:
- swap_unlock_cluster(ci);
- put_swap_device(si);
-outer:
- if (page)
- __free_page(page);
- return ret;
-}
-
-/*
- * swap_count_continued - when the original swap_map count is incremented
- * from SWAP_MAP_MAX, check if there is already a continuation page to carry
- * into, carry if so, or else fail until a new continuation page is allocated;
- * when the original swap_map count is decremented from 0 with continuation,
- * borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or caller of swap_put_entry_locked()
- * holds cluster lock.
- */
-static bool swap_count_continued(struct swap_info_struct *si,
- pgoff_t offset, unsigned char count)
-{
- struct page *head;
- struct page *page;
- unsigned char *map;
- bool ret;
-
- head = vmalloc_to_page(si->swap_map + offset);
- if (page_private(head) != SWP_CONTINUED) {
- BUG_ON(count & COUNT_CONTINUED);
- return false; /* need to add count continuation */
- }
-
- spin_lock(&si->cont_lock);
- offset &= ~PAGE_MASK;
- page = list_next_entry(head, lru);
- map = kmap_local_page(page) + offset;
-
- if (count == SWAP_MAP_MAX) /* initial increment from swap_map */
- goto init_map; /* jump over SWAP_CONT_MAX checks */
-
- if (count == (SWAP_MAP_MAX | COUNT_CONTINUED)) { /* incrementing */
- /*
- * Think of how you add 1 to 999
- */
- while (*map == (SWAP_CONT_MAX | COUNT_CONTINUED)) {
- kunmap_local(map);
- page = list_next_entry(page, lru);
- BUG_ON(page == head);
- map = kmap_local_page(page) + offset;
- }
- if (*map == SWAP_CONT_MAX) {
- kunmap_local(map);
- page = list_next_entry(page, lru);
- if (page == head) {
- ret = false; /* add count continuation */
- goto out;
- }
- map = kmap_local_page(page) + offset;
-init_map: *map = 0; /* we didn't zero the page */
- }
- *map += 1;
- kunmap_local(map);
- while ((page = list_prev_entry(page, lru)) != head) {
- map = kmap_local_page(page) + offset;
- *map = COUNT_CONTINUED;
- kunmap_local(map);
- }
- ret = true; /* incremented */
+ int err = 0;
- } else { /* decrementing */
- /*
- * Think of how you subtract 1 from 1000
- */
- BUG_ON(count != COUNT_CONTINUED);
- while (*map == COUNT_CONTINUED) {
- kunmap_local(map);
- page = list_next_entry(page, lru);
- BUG_ON(page == head);
- map = kmap_local_page(page) + offset;
- }
- BUG_ON(*map == 0);
- *map -= 1;
- if (*map == 0)
- count = 0;
- kunmap_local(map);
- while ((page = list_prev_entry(page, lru)) != head) {
- map = kmap_local_page(page) + offset;
- *map = SWAP_CONT_MAX | count;
- count = COUNT_CONTINUED;
- kunmap_local(map);
- }
- ret = count == COUNT_CONTINUED;
+ si = swp_get_info(entry);
+ if (WARN_ON_ONCE(!si)) {
+ pr_err("%s%08lx\n", Bad_file, entry.val);
+ return -EINVAL;
}
-out:
- spin_unlock(&si->cont_lock);
- return ret;
-}
-/*
- * free_swap_count_continuations - swapoff free all the continuation pages
- * appended to the swap_map, after swap_map is quiesced, before vfree'ing it.
- */
-static void free_swap_count_continuations(struct swap_info_struct *si)
-{
- pgoff_t offset;
-
- for (offset = 0; offset < si->max; offset += PAGE_SIZE) {
- struct page *head;
- head = vmalloc_to_page(si->swap_map + offset);
- if (page_private(head)) {
- struct page *page, *next;
+ err = swap_dup_entries(si, swp_offset(entry), 1);
- list_for_each_entry_safe(page, next, &head->lru, lru) {
- list_del(&page->lru);
- __free_page(page);
- }
- }
- }
+ return err;
}
#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* [PATCH 28/28] mm, swap: implement dynamic allocation of swap table
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
` (26 preceding siblings ...)
2025-05-14 20:17 ` [PATCH 27/28] mm, swap: use swap table to track swap count Kairui Song
@ 2025-05-14 20:17 ` Kairui Song
2025-05-21 18:36 ` Nhat Pham
27 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-14 20:17 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The swap table is now cluster based, which means a free cluster can
release its table, since no one should be modifying it anymore.
There could still be speculative readers, such as swap cache lookups;
protect them by making the tables RCU safe. A swap table must be filled
with null entries before it is freed, so such readers will either see a
NULL pointer or a null-filled table that is being lazily freed, as in
the sketch below.
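A minimal sketch (not from the patch) of the two sides described above,
written in terms of the helpers this patch introduces:

/* Reader side, e.g. a swap cache lookup; no cluster lock held. */
static swp_te_t sketch_speculative_read(struct swap_cluster_info *ci, pgoff_t off)
{
	/* Returns a null entry if ci->table is already gone */
	return swap_table_try_get(ci, off);
}

/* Free side, with ci->lock held and every entry already set to null. */
static void sketch_free_table(struct swap_cluster_info *ci)
{
	struct swap_table_flat *table;

	table = (void *)rcu_dereference_protected(ci->table, true);
	rcu_assign_pointer(ci->table, NULL);
	/* Readers still inside their RCU section only ever see null entries */
	call_rcu(&table->rcu, swap_table_flat_free_cb);
}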
On the allocation side, a table is allocated as soon as a cluster is put
to use by any order. This way, the memory usage of large swap devices
can be reduced significantly.
The idea of dynamically releasing unused swap cluster data was initially
suggested by Chris Li while proposing the cluster swap allocator, and it
suits the swap table design very well.
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 8 +-
mm/swap_state.c | 11 +--
mm/swap_table.h | 25 ++++-
mm/swapfile.c | 241 +++++++++++++++++++++++++++++++++++-------------
4 files changed, 213 insertions(+), 72 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 228195e54c9d..dfe9fc1552e8 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -24,6 +24,12 @@ extern struct swap_info_struct *swap_info[];
*/
typedef atomic_long_t swp_te_t;
+/* A typical flat array as swap table */
+struct swap_table_flat {
+ swp_te_t entries[SWAPFILE_CLUSTER];
+ struct rcu_head rcu;
+};
+
/*
* We use this to track usage of a cluster. A cluster is a block of swap disk
* space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
@@ -38,7 +44,7 @@ struct swap_cluster_info {
u16 count;
u8 flags;
u8 order;
- swp_te_t *table;
+ swp_te_t __rcu *table;
unsigned long *extend_table; /* Only used for extended swap count */
struct list_head list;
};
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b08d26e7dda5..dd14e110f273 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -163,6 +163,8 @@ struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
existing = NULL;
si = swp_info(entry);
ci = swap_lock_cluster(si, offset);
+ if (!ci->table)
+ goto out_failed;
do {
exist = __swap_table_get(ci, offset);
if (unlikely(swp_te_is_folio(exist))) {
@@ -263,10 +265,8 @@ void __swap_cache_del_folio(swp_entry_t entry,
void *swap_cache_get_shadow(swp_entry_t entry)
{
swp_te_t swp_te;
-
pgoff_t offset = swp_offset(entry);
- swp_te = __swap_table_get(swp_cluster(entry), offset);
-
+ swp_te = swap_table_try_get(swp_cluster(entry), offset);
return swp_te_is_shadow(swp_te) ? swp_te_shadow(swp_te) : NULL;
}
@@ -281,8 +281,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
{
swp_te_t swp_te;
struct folio *folio;
- swp_te = __swap_table_get(swp_cluster(entry), swp_offset(entry));
-
+ swp_te = swap_table_try_get(swp_cluster(entry), swp_offset(entry));
if (!swp_te_is_folio(swp_te))
return NULL;
@@ -300,7 +299,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
bool swap_cache_check_folio(swp_entry_t entry)
{
swp_te_t swp_te;
- swp_te = __swap_table_get(swp_cluster(entry), swp_offset(entry));
+ swp_te = swap_table_try_get(swp_cluster(entry), swp_offset(entry));
return swp_te_is_folio(swp_te);
}
diff --git a/mm/swap_table.h b/mm/swap_table.h
index afb2953d408a..6f0b80fee03c 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -2,6 +2,7 @@
#ifndef _MM_SWAP_TABLE_H
#define _MM_SWAP_TABLE_H
+#include <linux/rcupdate.h>
#include "swap.h"
/*
@@ -161,13 +162,31 @@ static inline void __swap_table_set(struct swap_cluster_info *ci, pgoff_t off,
swp_te_t swp_te)
{
- atomic_long_set(&ci->table[off % SWAPFILE_CLUSTER], swp_te.counter);
+ lockdep_assert_held(&ci->lock);
+ swp_te_t *table = rcu_dereference_protected(ci->table, true);
+ atomic_long_set(&table[off % SWAPFILE_CLUSTER], swp_te.counter);
+}
+
+static inline swp_te_t swap_table_try_get(struct swap_cluster_info *ci, pgoff_t off)
+{
+ swp_te_t swp_te;
+ rcu_read_lock();
+ swp_te_t *table = rcu_dereference_check(ci->table,
+ lockdep_is_held(&ci->lock));
+ if (table)
+ swp_te.counter = atomic_long_read(&table[off % SWAPFILE_CLUSTER]);
+ else
+ swp_te = null_swp_te();
+ rcu_read_unlock();
+ return swp_te;
}
static inline swp_te_t __swap_table_get(struct swap_cluster_info *ci, pgoff_t off)
{
- swp_te_t swp_te = {
- .counter = atomic_long_read(&ci->table[off % SWAPFILE_CLUSTER])
- };
+ swp_te_t swp_te;
+ swp_te_t *table = rcu_dereference_check(ci->table,
+ lockdep_is_held(&ci->lock));
+ swp_te.counter = atomic_long_read(&table[off % SWAPFILE_CLUSTER]);
return swp_te;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 17b592e938bc..b2d2d501ef8e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -101,6 +101,8 @@ static DEFINE_SPINLOCK(swap_avail_lock);
struct swap_info_struct *swap_info[MAX_SWAPFILES];
+static struct kmem_cache *swap_table_cachep;
+
static DEFINE_MUTEX(swapon_mutex);
static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
@@ -373,6 +375,11 @@ static inline bool cluster_is_discard(struct swap_cluster_info *info)
return info->flags == CLUSTER_FLAG_DISCARD;
}
+static inline bool cluster_need_populate(struct swap_cluster_info *ci)
+{
+ return rcu_access_pointer(ci->table) == NULL;
+}
+
static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
{
if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
@@ -412,17 +419,22 @@ static void move_cluster(struct swap_info_struct *si,
ci->flags = new_flags;
}
-static int cluster_table_alloc(struct swap_cluster_info *ci)
+/* Allocate tables for reserved (bad) entries */
+static int cluster_populate_init_table(struct swap_cluster_info *ci)
{
- WARN_ON(ci->table);
- ci->table = kzalloc(sizeof(swp_te_t) * SWAPFILE_CLUSTER,
- GFP_KERNEL);
- if (!ci->table)
- return -ENOMEM;
+ void *table;
+
+ if (!ci->table) {
+ table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+ rcu_assign_pointer(ci->table, table);
+ }
+
return 0;
}
-static void cluster_table_free(struct swap_cluster_info *ci)
+static void cluster_free_init_table(struct swap_cluster_info *ci)
{
swp_te_t swp_te;
unsigned int offset;
@@ -431,12 +443,36 @@ static void cluster_table_free(struct swap_cluster_info *ci)
return;
for (offset = 0; offset <= SWAPFILE_CLUSTER; offset++) {
- swp_te = __swap_table_get(ci, offset);
+ swp_te = swap_table_try_get(ci, offset);
WARN_ON_ONCE(!swp_te_is_null(swp_te) && !swp_te_is_bad(swp_te));
}
- kfree(ci->table);
- ci->table = NULL;
+ kfree(rcu_dereference_protected(ci->table, true));
+ rcu_assign_pointer(ci->table, NULL);
+}
+
+static void cluster_populate(struct swap_cluster_info *ci, void *alloced)
+{
+ VM_WARN_ON_ONCE(!cluster_is_empty(ci));
+ VM_WARN_ON_ONCE(!cluster_need_populate(ci));
+ rcu_assign_pointer(ci->table, alloced);
+}
+
+static void swap_table_flat_free(struct swap_table_flat *table)
+{
+ unsigned int offset;
+
+ for (offset = 0; offset < SWAPFILE_CLUSTER; offset++)
+ WARN_ON_ONCE(!swp_te_is_null(table->entries[offset]));
+
+ kmem_cache_free(swap_table_cachep, table);
+}
+
+static void swap_table_flat_free_cb(struct rcu_head *head)
+{
+ struct swap_table_flat *table;
+ table = container_of(head, struct swap_table_flat, rcu);
+ swap_table_flat_free(table);
}
/* Add a cluster to discard list and schedule it to do discard */
@@ -450,7 +486,13 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ struct swap_table_flat *table;
lockdep_assert_held(&ci->lock);
+
+ table = (void *)rcu_access_pointer(ci->table);
+ rcu_assign_pointer(ci->table, NULL);
+ call_rcu(&table->rcu, swap_table_flat_free_cb);
+
move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
ci->order = 0;
}
@@ -470,10 +512,6 @@ static struct swap_cluster_info *isolate_lock_cluster(
struct swap_cluster_info *ci, *ret = NULL;
spin_lock(&si->lock);
-
- if (unlikely(!(si->flags & SWP_WRITEOK)))
- goto out;
-
list_for_each_entry(ci, list, list) {
if (!spin_trylock(&ci->lock))
continue;
@@ -488,12 +526,73 @@ static struct swap_cluster_info *isolate_lock_cluster(
ret = ci;
break;
}
-out:
spin_unlock(&si->lock);
return ret;
}
+/* Free clusters need to be populated before use. */
+static struct swap_cluster_info *isolate_lock_free_cluster(
+ struct swap_info_struct *si, int order)
+{
+ struct list_head *free_clusters = &si->free_clusters;
+ struct swap_cluster_info *ci, *ret = NULL;
+ void *table;
+
+ if (list_empty(free_clusters))
+ return NULL;
+
+ table = kmem_cache_zalloc(swap_table_cachep, GFP_ATOMIC);
+ if (!table) {
+ if (!(si->flags & SWP_SOLIDSTATE))
+ spin_unlock(&si->global_cluster_lock);
+ local_unlock(&percpu_swap_cluster.lock);
+
+ table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
+
+ local_lock(&percpu_swap_cluster.lock);
+ if (!(si->flags & SWP_SOLIDSTATE))
+ spin_lock(&si->global_cluster_lock);
+
+ /*
+ * If we migrated to a new CPU with a usable local cluster,
+ * use that instead to prevent contention and fragmentation.
+ */
+ if (this_cpu_read(percpu_swap_cluster.offset[order])) {
+ if (table)
+ kmem_cache_free(swap_table_cachep, table);
+ return ERR_PTR(-EAGAIN);
+ }
+ if (!table)
+ return ERR_PTR(-ENOMEM);
+ }
+
+ spin_lock(&si->lock);
+ list_for_each_entry(ci, &si->free_clusters, list) {
+ if (!spin_trylock(&ci->lock))
+ continue;
+ list_del(&ci->list);
+
+ VM_WARN_ON_ONCE(ci->flags != CLUSTER_FLAG_FREE);
+ cluster_populate(ci, table);
+ /*
+ * Set the order here; the cluster will surely be used unless
+ * it raced with swapoff (!SWP_WRITEOK), in which case it will
+ * be freed again by relocate_cluster (which may lead to a
+ * discard on empty space, but that's a really rare case).
+ */
+ ci->order = order;
+ ci->flags = CLUSTER_FLAG_NONE;
+ ret = ci;
+ break;
+ }
+ spin_unlock(&si->lock);
+
+ if (!ret)
+ kmem_cache_free(swap_table_cachep, table);
+ return ret;
+}
+
/*
* Doing discard actually. After a cluster discard is finished, the cluster
* will be added to free cluster list. Discard cluster is a bit special as
@@ -648,7 +747,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
spin_unlock(&ci->lock);
do {
- entry = __swap_table_get(ci, offset);
+ entry = swap_table_try_get(ci, offset);
if (swp_te_get_count(entry))
break;
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
@@ -663,7 +762,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
* could have been freed while we are not holding the lock.
*/
for (offset = start; offset < end; offset++) {
- entry = __swap_table_get(ci, offset);
+ entry = swap_table_try_get(ci, offset);
if (!swp_te_is_null(entry))
return false;
}
@@ -710,14 +809,10 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
swp_entry_t entry = swp_entry(si->type, offset);
unsigned long nr_pages = 1 << order;
+ VM_WARN_ON_ONCE(ci->order != order && order);
+
if (!(si->flags & SWP_WRITEOK))
return false;
- /*
- * The first allocation in a cluster makes the
- * cluster exclusive to this order
- */
- if (cluster_is_empty(ci))
- ci->order = order;
swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
@@ -735,12 +830,12 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
}
/* Try use a new cluster for current CPU and allocate from it. */
-static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- struct folio *folio,
- unsigned long offset)
+static long alloc_swap_scan_cluster(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ struct folio *folio,
+ unsigned long offset)
{
- unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
+ long next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
unsigned int order = folio ? folio_order(folio) : 0;
@@ -759,16 +854,16 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
if (need_reclaim) {
ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
/*
- * Reclaim drops ci->lock and cluster could be used
- * by another order. Not checking flag as off-list
- * cluster has no flag set, and change of list
- * won't cause fragmentation.
+ * Reclaim drops ci->lock and cluster could be modified
+ * by others. Need to check the cluster status.
*/
+ if (cluster_is_empty(ci)) {
+ found = -EAGAIN;
+ goto out;
+ }
if (!cluster_is_usable(ci, order))
goto out;
- if (cluster_is_empty(ci))
- offset = start;
- /* Reclaim failed but cluster is usable, try next */
+ /* Reclaim failed but cluster is still usable, go on */
if (!ret)
continue;
}
@@ -809,7 +904,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- entry = __swap_table_get(ci, offset);
+ entry = swap_table_try_get(ci, offset);
if (swp_te_is_folio(entry) && !swp_te_get_count(entry)) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
@@ -851,7 +946,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
{
struct swap_cluster_info *ci;
unsigned int order = folio ? folio_order(folio) : 0;
- unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
+ unsigned long offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
/*
* Swapfile is not block device so unable
@@ -866,10 +961,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
offset = si->global_cluster->next[order];
ci = swap_lock_cluster(si, offset);
- /* Cluster could have been used by another order */
- if (cluster_is_usable(ci, order)) {
- if (cluster_is_empty(ci))
- offset = cluster_offset(si, ci);
+ /* Cluster could have been modified by another order */
+ if (cluster_is_usable(ci, order) && !cluster_is_empty(ci)) {
found = alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_unlock_cluster(ci);
@@ -879,8 +972,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
}
new_cluster:
- ci = isolate_lock_cluster(si, &si->free_clusters);
- if (ci) {
+ ci = isolate_lock_free_cluster(si, order);
+ if (!IS_ERR_OR_NULL(ci)) {
found = alloc_swap_scan_cluster(si, ci, folio, cluster_offset(si, ci));
if (found)
goto done;
@@ -941,8 +1034,13 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
}
}
done:
+ /* The unlocked reclaim may have released a whole new free cluster */
+ if (found == -EAGAIN)
+ goto new_cluster;
+
if (!(si->flags & SWP_SOLIDSTATE))
spin_unlock(&si->global_cluster_lock);
+
return found;
}
@@ -1143,13 +1241,17 @@ static bool swap_alloc_fast(struct folio *folio)
if (!si || !offset || !get_swap_device_info(si))
return false;
+ /*
+ * Don't use a non-usable cluster, and don't use an empty cluster
+ * either. Empty clusters need to be populated before use.
+ */
ci = swap_lock_cluster(si, offset);
- if (cluster_is_usable(ci, order)) {
- if (cluster_is_empty(ci))
- offset = cluster_offset(si, ci);
+ if (cluster_is_usable(ci, order) && !cluster_is_empty(ci)) {
alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_unlock_cluster(ci);
+ this_cpu_write(percpu_swap_cluster.offset[order],
+ SWAP_ENTRY_INVALID);
}
put_swap_device(si);
return folio->swap.val != SWAP_ENTRY_INVALID;
@@ -1634,10 +1736,11 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
{
pgoff_t offset = swp_offset(entry);
struct swap_cluster_info *ci;
- swp_te_t swp_te;
+ swp_te_t swp_te = null_swp_te();
ci = swap_lock_cluster(si, offset);
- swp_te = __swap_table_get(ci, offset);
+ if (ci->table)
+ swp_te = __swap_table_get(ci, offset);
swap_unlock_cluster(ci);
return __swp_te_is_countable(swp_te) && swp_te_get_count(swp_te);
@@ -1651,7 +1754,7 @@ int swp_swapcount(swp_entry_t entry)
{
struct swap_info_struct *si;
struct swap_cluster_info *ci;
- swp_te_t ste;
+ swp_te_t ste = null_swp_te();
pgoff_t offset;
int count;
@@ -1661,7 +1764,8 @@ int swp_swapcount(swp_entry_t entry)
offset = swp_offset(entry);
ci = swap_lock_cluster(si, offset);
- ste = __swap_table_get(ci, offset);
+ if (ci->table)
+ ste = __swap_table_get(ci, offset);
count = swp_te_get_count(ste);
if (count == ENTRY_COUNT_MAX)
count = ci->extend_table[offset % SWAPFILE_CLUSTER];
@@ -1681,6 +1785,10 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
bool ret = false;
ci = swap_lock_cluster(si, offset);
+ if (!ci->table) {
+ swap_unlock_cluster(ci);
+ return false;
+ }
if (nr_pages == 1) {
if (swp_te_get_count(__swap_table_get(ci, roffset)))
ret = true;
@@ -1807,7 +1913,7 @@ void do_put_swap_entries(swp_entry_t entry, int nr)
*/
for (offset = start_offset; offset < end_offset; offset += nr) {
nr = 1;
- swp_te = __swap_table_get(swp_offset_cluster(si, offset), offset);
+ swp_te = swap_table_try_get(swp_offset_cluster(si, offset), offset);
if (swp_te_is_folio(swp_te) && !swp_te_get_count(swp_te)) {
/*
* Folios are always naturally aligned in swap so
@@ -2127,7 +2233,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
&vmf);
}
if (!folio) {
- swp_count = swp_te_get_count(__swap_table_get(swp_cluster(entry),
+ swp_count = swp_te_get_count(swap_table_try_get(swp_cluster(entry),
swp_offset(entry)));
if (swp_count == 0 || swp_count == SWAP_MAP_BAD)
continue;
@@ -2278,7 +2384,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
* allocations from this area (while holding swap_lock).
*/
for (i = prev + 1; i < si->max; i++) {
- swp_te = __swap_table_get(swp_offset_cluster(si, i), i);
+ swp_te = swap_table_try_get(swp_offset_cluster(si, i), i);
if (!swp_te_is_null(swp_te) && !swp_te_is_bad(swp_te))
break;
if ((i % LATENCY_LIMIT) == 0)
@@ -2651,11 +2757,11 @@ static void free_swap_cluster_info(struct swap_cluster_info *cluster_info,
unsigned long max)
{
int i, nr_clusters = DIV_ROUND_UP(max, SWAPFILE_CLUSTER);
- if (!cluster_info)
+
+ if (!cluster_info || WARN_ON(!nr_clusters))
return;
- VM_WARN_ON(!nr_clusters);
for (i = 0; i < nr_clusters; i++)
- cluster_table_free(&cluster_info[i]);
+ cluster_free_init_table(&cluster_info[i]);
kvfree(cluster_info);
}
@@ -3147,6 +3253,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
unsigned long maxpages,
sector_t *span)
{
+ struct swap_cluster_info *ci;
unsigned int nr_good_pages;
unsigned long i;
int nr_extents;
@@ -3158,14 +3265,23 @@ static int setup_swap_map_and_extents(struct swap_info_struct *si,
if (page_nr == 0 || page_nr > swap_header->info.last_page)
return -EINVAL;
if (page_nr < maxpages) {
- __swap_table_set(&si->cluster_info[page_nr / SWAPFILE_CLUSTER],
- page_nr, bad_swp_te());
+ ci = &si->cluster_info[page_nr / SWAPFILE_CLUSTER];
+ if (cluster_populate_init_table(ci))
+ return -ENOMEM;
+ spin_lock(&ci->lock);
+ __swap_table_set(ci, page_nr, bad_swp_te());
+ spin_unlock(&ci->lock);
nr_good_pages--;
}
}
if (nr_good_pages) {
- __swap_table_set(&si->cluster_info[0], 0, bad_swp_te());
+ ci = &si->cluster_info[0];
+ if (cluster_populate_init_table(ci))
+ return -ENOMEM;
+ spin_lock(&ci->lock);
+ __swap_table_set(ci, 0, bad_swp_te());
+ spin_unlock(&ci->lock);
si->pages = nr_good_pages;
nr_extents = setup_swap_extents(si, span);
if (nr_extents < 0)
@@ -3199,11 +3315,8 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
if (!cluster_info)
goto err;
- for (i = 0; i < nr_clusters; i++) {
+ for (i = 0; i < nr_clusters; i++)
spin_lock_init(&cluster_info[i].lock);
- if (cluster_table_alloc(&cluster_info[i]))
- goto err_free;
- }
if (!(si->flags & SWP_SOLIDSTATE)) {
si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3580,6 +3693,10 @@ static int __init swapfile_init(void)
swapfile_maximum_size = arch_max_swapfile_size();
+ swap_table_cachep = kmem_cache_create("swap_table",
+ sizeof(struct swap_table_flat),
+ 0, SLAB_PANIC, NULL);
+
#ifdef CONFIG_MIGRATION
if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
swap_migration_ad_supported = true;
--
2.49.0
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
@ 2025-05-15 9:31 ` Klara Modin
2025-05-15 9:39 ` Kairui Song
2025-05-19 7:08 ` Barry Song
1 sibling, 1 reply; 56+ messages in thread
From: Klara Modin @ 2025-05-15 9:31 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel
Hi,
On 2025-05-15 04:17:11 +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Move all mTHP swapin checks into can_swapin_thp() and use it for both the
> pre-IO check and the post-IO check. This way the code is more consolidated
> and makes later commits easier to maintain.
From what I can see, can_swapin_thp is gated behind
CONFIG_TRANSPARENT_HUGEPAGE and this fails to build when it's not
enabled.
>
> Also clean up the comments while at it. The current comment on
> non_swapcache_batch is not correct: a swap in that bypasses the swap cache
> won't reach the swap device while the entry is cached, because the bypass
> path still sets the SWAP_HAS_CACHE flag. If the folio is already in the
> swap cache, a raced swap in will either fail with -EEXIST from
> swapcache_prepare, or see the cached folio.
>
> The real reason non_swapcache_batch is needed is that if a smaller
> folio is in the swap cache but not mapped, mTHP swapin would be blocked
> forever: it won't see the folio due to the index offset, nor can it set the
> SWAP_HAS_CACHE bit, so it has to fall back to order 0 swap in.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 90 ++++++++++++++++++++++++-----------------------------
> 1 file changed, 41 insertions(+), 49 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index f2897d9059f2..1b6e192de6ec 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4319,12 +4319,6 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> pgoff_t offset = swp_offset(entry);
> int i;
>
> - /*
> - * While allocating a large folio and doing swap_read_folio, which is
> - * the case the being faulted pte doesn't have swapcache. We need to
> - * ensure all PTEs have no cache as well, otherwise, we might go to
> - * swap devices while the content is in swapcache.
> - */
> for (i = 0; i < max_nr; i++) {
> if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> return i;
> @@ -4334,34 +4328,30 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> }
>
> /*
> - * Check if the PTEs within a range are contiguous swap entries
> - * and have consistent swapcache, zeromap.
> + * Check if the page table is still suitable for large folio swap in.
> + * @vmf: The fault triggering the swap-in.
> + * @ptep: Pointer to the PTE that should be the head of the swap in folio.
> + * @addr: The address corresponding to the PTE.
> + * @nr_pages: Number of pages of the folio that suppose to be swapped in.
> */
> -static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> + unsigned long addr, unsigned int nr_pages)
> {
> - unsigned long addr;
> - swp_entry_t entry;
> - int idx;
> - pte_t pte;
> -
> - addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> - idx = (vmf->address - addr) / PAGE_SIZE;
> - pte = ptep_get(ptep);
> + pte_t pte = ptep_get(ptep);
> + unsigned long addr_end = addr + (PAGE_SIZE * nr_pages);
> + unsigned long pte_offset = (vmf->address - addr) / PAGE_SIZE;
>
> - if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
> + VM_WARN_ON_ONCE(!IS_ALIGNED(addr, PAGE_SIZE) ||
> + addr > vmf->address || addr_end <= vmf->address);
> + if (unlikely(addr < max(addr & PMD_MASK, vmf->vma->vm_start) ||
> + addr_end > pmd_addr_end(addr, vmf->vma->vm_end)))
> return false;
> - entry = pte_to_swp_entry(pte);
> - if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> - return false;
> -
> /*
> - * swap_read_folio() can't handle the case a large folio is hybridly
> - * from different backends. And they are likely corner cases. Similar
> - * things might be added once zswap support large folios.
> + * All swap entries must from the same swap device, in same
> + * cgroup, with same exclusiveness, only differs in offset.
> */
> - if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
> - return false;
> - if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
> + if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -pte_offset)) ||
> + swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
> return false;
>
> return true;
> @@ -4441,13 +4431,24 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> * completely swap entries with contiguous swap offsets.
> */
> order = highest_order(orders);
> - while (orders) {
> - addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> - if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
> - break;
> - order = next_order(&orders, order);
> + for (; orders; order = next_order(&orders, order)) {
> + unsigned long nr_pages = 1 << order;
> + swp_entry_t swap_entry = { .val = ALIGN_DOWN(entry.val, nr_pages) };
> + addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
> + if (!can_swapin_thp(vmf, pte + pte_index(addr), addr, nr_pages))
> + continue;
> + /*
> + * If there is already a smaller folio in cache, it will
> + * conflict with the larger folio in the swap cache layer
> + * and block the swap in.
> + */
> + if (unlikely(non_swapcache_batch(swap_entry, nr_pages) != nr_pages))
> + continue;
> + /* Zero map doesn't work with large folio yet. */
> + if (unlikely(swap_zeromap_batch(swap_entry, nr_pages, NULL) != nr_pages))
> + continue;
> + break;
> }
> -
> pte_unmap_unlock(pte, ptl);
>
> /* Try allocating the highest of the remaining orders. */
> @@ -4731,27 +4732,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> page_idx = 0;
> address = vmf->address;
> ptep = vmf->pte;
> +
> if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> - int nr = folio_nr_pages(folio);
> + unsigned long nr = folio_nr_pages(folio);
> unsigned long idx = folio_page_idx(folio, page);
> - unsigned long folio_start = address - idx * PAGE_SIZE;
> - unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> - pte_t *folio_ptep;
> - pte_t folio_pte;
> + unsigned long folio_address = address - idx * PAGE_SIZE;
> + pte_t *folio_ptep = vmf->pte - idx;
>
> - if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> - goto check_folio;
> - if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
> - goto check_folio;
> -
> - folio_ptep = vmf->pte - idx;
> - folio_pte = ptep_get(folio_ptep);
> - if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> - swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> + if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
> goto check_folio;
At this point we're outside CONFIG_TRANSPARENT_HUGEPAGE.
>
> page_idx = idx;
> - address = folio_start;
> + address = folio_address;
> ptep = folio_ptep;
> nr_pages = nr;
> entry = folio->swap;
> --
> 2.49.0
>
Regards,
Klara Modin
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check
2025-05-15 9:31 ` Klara Modin
@ 2025-05-15 9:39 ` Kairui Song
0 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-15 9:39 UTC (permalink / raw)
To: Klara Modin
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel
On Thu, May 15, 2025 at 5:31 PM Klara Modin <klarasmodin@gmail.com> wrote:
>
> Hi,
>
> On 2025-05-15 04:17:11 +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Move all mTHP swapin checks into can_swapin_thp() and use it for both the
> > pre-IO check and the post-IO check. This way the code is more consolidated
> > and makes later commits easier to maintain.
>
> From what I can see, can_swapin_thp is gated behind
> CONFIG_TRANSPARENT_HUGEPAGE and this fails to build when it's not
> enabled.
Thanks for the review.
Right, I might have to add an empty one for
!CONFIG_TRANSPARENT_HUGEPAGE to pass the build.
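A minimal sketch of such a stub, assuming the can_swapin_thp() signature
introduced in this patch stays as-is:

#ifndef CONFIG_TRANSPARENT_HUGEPAGE
static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
				  unsigned long addr, unsigned int nr_pages)
{
	/* Without THP support there is never a large folio to batch swap in */
	return false;
}
#endif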
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table
2025-05-14 20:17 ` [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table Kairui Song
@ 2025-05-15 9:40 ` Klara Modin
2025-05-16 2:35 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Klara Modin @ 2025-05-15 9:40 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel
Hi,
On 2025-05-15 04:17:24 +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> To prepare for using the swap table as the unified swap layer, introduce
> macros and helpers for storing multiple kinds of data in a swap table
> entry.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap_table.h | 130 ++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 119 insertions(+), 11 deletions(-)
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index 69a074339444..9356004d211a 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -5,9 +5,41 @@
> #include "swap.h"
>
> /*
> - * Swap table entry could be a pointer (folio), a XA_VALUE (shadow), or NULL.
> + * Swap table entry type and bit layouts:
> + *
> + * NULL: | ------------ 0 -------------|
> + * Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1|
> + * Folio: | SWAP_COUNT |------ PFN -------|10|
> + * Pointer: |----------- Pointer ----------|100|
> + *
> + * Usage:
> + * - NULL: Swap Entry is unused.
> + *
> + * - Shadow: Swap Entry is used and not cached (swapped out).
> + * It's reusing XA_VALUE format to be compatible with workingset
> + * shadows. SHADOW_VAL part could be all 0.
> + *
> + * - Folio: Swap Entry is in cache.
> + *
> + * - Pointer: Unused yet. Because only the last three bits of a pointer
> + * are usable, `100` is reserved for potential pointer use.
> */
>
> +#define ENTRY_COUNT_BITS BITS_PER_BYTE
> +#define ENTRY_SHADOW_MARK 0b1UL
> +#define ENTRY_PFN_MARK 0b10UL
> +#define ENTRY_PFN_LOW_MASK 0b11UL
> +#define ENTRY_PFN_SHIFT 2
> +#define ENTRY_PFN_MASK ((~0UL) >> ENTRY_COUNT_BITS)
> +#define ENTRY_COUNT_MASK (~((~0UL) >> ENTRY_COUNT_BITS))
> +#define ENTRY_COUNT_SHIFT (BITS_PER_LONG - BITS_PER_BYTE)
> +#define ENTRY_COUNT_MAX ((1 << ENTRY_COUNT_BITS) - 2)
> +#define ENTRY_COUNT_BAD ((1 << ENTRY_COUNT_BITS) - 1) /* ENTRY_BAD */
> +#define ENTRY_BAD (~0UL)
> +
> +/* For shadow offset calculation */
> +#define SWAP_COUNT_SHIFT ENTRY_COUNT_BITS
> +
> /*
> * Helpers for casting one type of info into a swap table entry.
> */
> @@ -19,17 +51,27 @@ static inline swp_te_t null_swp_te(void)
>
> static inline swp_te_t folio_swp_te(struct folio *folio)
> {
> - BUILD_BUG_ON(sizeof(swp_te_t) != sizeof(void *));
> - swp_te_t swp_te = { .counter = (unsigned long)folio };
> + BUILD_BUG_ON((MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) >
> + (BITS_PER_LONG - ENTRY_PFN_SHIFT - ENTRY_COUNT_BITS));
MAX_POSSIBLE_PHYSMEM_BITS does not seem to be available on all
arches/configs.
E.g. zsmalloc seems to set it to MAX_PHYSMEM_BITS when this is the case
but I don't know if that applies here.
> + swp_te_t swp_te = {
> + .counter = (folio_pfn(folio) << ENTRY_PFN_SHIFT) | ENTRY_PFN_MARK
> + };
> return swp_te;
> }
>
> static inline swp_te_t shadow_swp_te(void *shadow)
> {
> - BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
> - BITS_PER_BYTE * sizeof(swp_te_t));
> - VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
> swp_te_t swp_te = { .counter = ((unsigned long)shadow) };
> + BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) != BITS_PER_BYTE * sizeof(swp_te_t));
> + BUILD_BUG_ON((unsigned long)xa_mk_value(0) != ENTRY_SHADOW_MARK);
> + VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
> + swp_te.counter |= ENTRY_SHADOW_MARK;
> + return swp_te;
> +}
> +
> +static inline swp_te_t bad_swp_te(void)
> +{
> + swp_te_t swp_te = { .counter = ENTRY_BAD };
> return swp_te;
> }
>
> @@ -43,7 +85,7 @@ static inline bool swp_te_is_null(swp_te_t swp_te)
>
> static inline bool swp_te_is_folio(swp_te_t swp_te)
> {
> - return !xa_is_value((void *)swp_te.counter) && !swp_te_is_null(swp_te);
> + return ((swp_te.counter & ENTRY_PFN_LOW_MASK) == ENTRY_PFN_MARK);
> }
>
> static inline bool swp_te_is_shadow(swp_te_t swp_te)
> @@ -51,19 +93,63 @@ static inline bool swp_te_is_shadow(swp_te_t swp_te)
> return xa_is_value((void *)swp_te.counter);
> }
>
> +static inline bool swp_te_is_valid_shadow(swp_te_t swp_te)
> +{
> + /* The shadow could be empty, just for holding the swap count */
> + return xa_is_value((void *)swp_te.counter) &&
> + xa_to_value((void *)swp_te.counter);
> +}
> +
> +static inline bool swp_te_is_bad(swp_te_t swp_te)
> +{
> + return swp_te.counter == ENTRY_BAD;
> +}
> +
> +static inline bool __swp_te_is_countable(swp_te_t ent)
> +{
> + return (swp_te_is_shadow(ent) || swp_te_is_folio(ent) ||
> + swp_te_is_null(ent));
> +}
> +
> /*
> * Helpers for retrieving info from swap table.
> */
> static inline struct folio *swp_te_folio(swp_te_t swp_te)
> {
> VM_WARN_ON(!swp_te_is_folio(swp_te));
> - return (void *)swp_te.counter;
> + return pfn_folio((swp_te.counter & ENTRY_PFN_MASK) >> ENTRY_PFN_SHIFT);
> }
>
> static inline void *swp_te_shadow(swp_te_t swp_te)
> {
> VM_WARN_ON(!swp_te_is_shadow(swp_te));
> - return (void *)swp_te.counter;
> + return (void *)(swp_te.counter & ~ENTRY_COUNT_MASK);
> +}
> +
> +static inline unsigned char swp_te_get_count(swp_te_t swp_te)
> +{
> + VM_WARN_ON(!__swp_te_is_countable(swp_te));
> + return ((swp_te.counter & ENTRY_COUNT_MASK) >> ENTRY_COUNT_SHIFT);
> +}
> +
> +static inline unsigned char swp_te_try_get_count(swp_te_t swp_te)
> +{
> + if (__swp_te_is_countable(swp_te))
> + return swp_te_get_count(swp_te);
> + return 0;
> +}
> +
> +static inline swp_te_t swp_te_set_count(swp_te_t swp_te,
> + unsigned char count)
> +{
> + VM_BUG_ON(!__swp_te_is_countable(swp_te));
> + VM_BUG_ON(count > ENTRY_COUNT_MAX);
> +
> + swp_te.counter &= ~ENTRY_COUNT_MASK;
> + swp_te.counter |= ((unsigned long)count) << ENTRY_COUNT_SHIFT;
> + VM_BUG_ON(swp_te_get_count(swp_te) != count);
> +
> + return swp_te;
> }
>
> /*
> @@ -87,17 +173,39 @@ static inline swp_te_t __swap_table_get(struct swap_cluster_info *ci, pgoff_t of
> static inline void __swap_table_set_folio(struct swap_cluster_info *ci, pgoff_t off,
> struct folio *folio)
> {
> - __swap_table_set(ci, off, folio_swp_te(folio));
> + swp_te_t swp_te;
> + unsigned char count;
> +
> + swp_te = __swap_table_get(ci, off);
> + count = swp_te_get_count(swp_te);
> + swp_te = swp_te_set_count(folio_swp_te(folio), count);
> +
> + __swap_table_set(ci, off, swp_te);
> }
>
> static inline void __swap_table_set_shadow(struct swap_cluster_info *ci, pgoff_t off,
> void *shadow)
> {
> - __swap_table_set(ci, off, shadow_swp_te(shadow));
> + swp_te_t swp_te;
> + unsigned char count;
> +
> + swp_te = __swap_table_get(ci, off);
> + count = swp_te_get_count(swp_te);
> + swp_te = swp_te_set_count(shadow_swp_te(shadow), count);
> +
> + __swap_table_set(ci, off, swp_te);
> }
>
> static inline void __swap_table_set_null(struct swap_cluster_info *ci, pgoff_t off)
> {
> __swap_table_set(ci, off, null_swp_te());
> }
> +
> +static inline void __swap_table_set_count(struct swap_cluster_info *ci, pgoff_t off,
> + unsigned char count)
> +{
> + swp_te_t swp_te;
> + swp_te = swp_te_set_count(__swap_table_get(ci, off), count);
> + __swap_table_set(ci, off, swp_te);
> +}
> #endif
> --
> 2.49.0
>
Regards,
Klara Modin
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table
2025-05-15 9:40 ` Klara Modin
@ 2025-05-16 2:35 ` Kairui Song
0 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-16 2:35 UTC (permalink / raw)
To: Klara Modin
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Baoquan He, Barry Song,
Kalesh Singh, Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel
On Thu, May 15, 2025 at 5:42 PM Klara Modin <klarasmodin@gmail.com> wrote:
>
> Hi,
>
> On 2025-05-15 04:17:24 +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > To prepare for using the swap table as the unified swap layer, introduce
> > macros and helpers for storing multiple kinds of data in a swap table
> > entry.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swap_table.h | 130 ++++++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 119 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/swap_table.h b/mm/swap_table.h
> > index 69a074339444..9356004d211a 100644
> > --- a/mm/swap_table.h
> > +++ b/mm/swap_table.h
> > @@ -5,9 +5,41 @@
> > #include "swap.h"
> >
> > /*
> > - * Swap table entry could be a pointer (folio), a XA_VALUE (shadow), or NULL.
> > + * Swap table entry type and bit layouts:
> > + *
> > + * NULL: | ------------ 0 -------------|
> > + * Shadow: | SWAP_COUNT |---- SHADOW_VAL ---|1|
> > + * Folio: | SWAP_COUNT |------ PFN -------|10|
> > + * Pointer: |----------- Pointer ----------|100|
> > + *
> > + * Usage:
> > + * - NULL: Swap Entry is unused.
> > + *
> > + * - Shadow: Swap Entry is used and not cached (swapped out).
> > + * It's reusing XA_VALUE format to be compatible with workingset
> > + * shadows. SHADOW_VAL part could be all 0.
> > + *
> > + * - Folio: Swap Entry is in cache.
> > + *
> > + * - Pointer: Unused yet. Because only the last three bits of a pointer
> > + * are usable, `100` is reserved for potential pointer use.
> > */
> >
> > +#define ENTRY_COUNT_BITS BITS_PER_BYTE
> > +#define ENTRY_SHADOW_MARK 0b1UL
> > +#define ENTRY_PFN_MARK 0b10UL
> > +#define ENTRY_PFN_LOW_MASK 0b11UL
> > +#define ENTRY_PFN_SHIFT 2
> > +#define ENTRY_PFN_MASK ((~0UL) >> ENTRY_COUNT_BITS)
> > +#define ENTRY_COUNT_MASK (~((~0UL) >> ENTRY_COUNT_BITS))
> > +#define ENTRY_COUNT_SHIFT (BITS_PER_LONG - BITS_PER_BYTE)
> > +#define ENTRY_COUNT_MAX ((1 << ENTRY_COUNT_BITS) - 2)
> > +#define ENTRY_COUNT_BAD ((1 << ENTRY_COUNT_BITS) - 1) /* ENTRY_BAD */
> > +#define ENTRY_BAD (~0UL)
> > +
> > +/* For shadow offset calculation */
> > +#define SWAP_COUNT_SHIFT ENTRY_COUNT_BITS
> > +
> > /*
> > * Helpers for casting one type of info into a swap table entry.
> > */
> > @@ -19,17 +51,27 @@ static inline swp_te_t null_swp_te(void)
> >
> > static inline swp_te_t folio_swp_te(struct folio *folio)
> > {
> > - BUILD_BUG_ON(sizeof(swp_te_t) != sizeof(void *));
> > - swp_te_t swp_te = { .counter = (unsigned long)folio };
>
> > + BUILD_BUG_ON((MAX_POSSIBLE_PHYSMEM_BITS - PAGE_SHIFT) >
> > + (BITS_PER_LONG - ENTRY_PFN_SHIFT - ENTRY_COUNT_BITS));
>
> MAX_POSSIBLE_PHYSMEM_BITS does not seem to be available on all
> arches/configs.
>
> E.g. zsmalloc seems to set it to MAX_PHYSMEM_BITS when this is the case
> but I don't know if that applies here.
>
Thanks, I think I'll just copy the snippet from zsmalloc; it's basically
doing the same check, ensuring there are still enough bits left after
embedding a PFN in an unsigned long.
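For reference, the zsmalloc-style fallback being referred to is roughly the
following guard (mm/zsmalloc.c derives MAX_POSSIBLE_PHYSMEM_BITS from
MAX_PHYSMEM_BITS, or BITS_PER_LONG as a last resort, when the architecture
does not provide it):

#ifndef MAX_POSSIBLE_PHYSMEM_BITS
#ifdef MAX_PHYSMEM_BITS
#define MAX_POSSIBLE_PHYSMEM_BITS MAX_PHYSMEM_BITS
#else
/* The widest value that still fits the PFN-in-long encoding checks */
#define MAX_POSSIBLE_PHYSMEM_BITS BITS_PER_LONG
#endif
#endif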
^ permalink raw reply [flat|nested] 56+ messages in thread
* [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-14 20:17 ` [PATCH 05/28] mm, swap: sanitize swap cache lookup convention Kairui Song
@ 2025-05-19 4:38 ` Barry Song
2025-05-20 3:31 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2025-05-19 4:38 UTC (permalink / raw)
To: ryncsn
Cc: akpm, baohua, baolin.wang, bhe, chrisl, david, hannes, hughd,
kaleshsingh, kasong, linux-kernel, linux-mm, nphamcs,
ryan.roberts, shikemeng, tim.c.chen, willy, ying.huang,
yosryahmed
> From: Kairui Song <kasong@tencent.com>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index e5a0db7f3331..5b4f01aecf35 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> goto retry;
> }
> }
> + if (!folio_swap_contains(src_folio, entry)) {
> + err = -EBUSY;
> + goto out;
> + }
It seems we don't need this. In move_swap_pte(), we have been checking pte pages
are stable:
if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
dst_pmd, dst_pmdval)) {
double_pt_unlock(dst_ptl, src_ptl);
return -EAGAIN;
}
Also, -EBUSY is somehow incorrect error code.
> err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> dst_ptl, src_ptl, src_folio);
>
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers
2025-05-14 20:17 ` [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers Kairui Song
@ 2025-05-19 6:26 ` Barry Song
2025-05-20 3:50 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2025-05-19 6:26 UTC (permalink / raw)
To: ryncsn
Cc: akpm, baohua, baolin.wang, bhe, chrisl, david, hannes, hughd,
kaleshsingh, kasong, linux-kernel, linux-mm, nphamcs,
ryan.roberts, shikemeng, tim.c.chen, willy, ying.huang,
yosryahmed
> From: Kairui Song <kasong@tencent.com>
> @@ -889,10 +849,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> /* Serialize HDD SWAP allocation for each device. */
> spin_lock(&si->global_cluster_lock);
> offset = si->global_cluster->next[order];
> - if (offset == SWAP_ENTRY_INVALID)
> - goto new_cluster;
We are implicitly dropping this. Does it mean the current code is wrong?
Do we need some clarification about this?
>
> - ci = lock_cluster(si, offset);
> + ci = swap_lock_cluster(si, offset);
> /* Cluster could have been used by another order */
> if (cluster_is_usable(ci, order)) {
> if (cluster_is_empty(ci))
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
2025-05-15 9:31 ` Klara Modin
@ 2025-05-19 7:08 ` Barry Song
2025-05-19 11:09 ` Kairui Song
1 sibling, 1 reply; 56+ messages in thread
From: Barry Song @ 2025-05-19 7:08 UTC (permalink / raw)
To: ryncsn
Cc: akpm, baohua, baolin.wang, bhe, chrisl, david, hannes, hughd,
kaleshsingh, kasong, linux-kernel, linux-mm, nphamcs,
ryan.roberts, shikemeng, tim.c.chen, willy, ying.huang,
yosryahmed
> From: Kairui Song <kasong@tencent.com>
> -static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> + unsigned long addr, unsigned int nr_pages)
> + if (unlikely(addr < max(addr & PMD_MASK, vmf->vma->vm_start) ||
> + addr_end > pmd_addr_end(addr, vmf->vma->vm_end)))
> @@ -4731,27 +4732,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> page_idx = 0;
> address = vmf->address;
> ptep = vmf->pte;
> +
> if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> - int nr = folio_nr_pages(folio);
> + unsigned long nr = folio_nr_pages(folio);
> unsigned long idx = folio_page_idx(folio, page);
> - unsigned long folio_start = address - idx * PAGE_SIZE;
> - unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> - pte_t *folio_ptep;
> - pte_t folio_pte;
> + unsigned long folio_address = address - idx * PAGE_SIZE;
> + pte_t *folio_ptep = vmf->pte - idx;
>
> - if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> - goto check_folio;
We are handling a corner case where a large folio is remapped to an unaligned address.
For example,
A 64KiB mTHP at address: XGB + 2MB +4KB,
Its start address will be XGB + 2MB - 60KB which is another PMD.
The previous code will return false; now your can_swapin_thp() will return true
as you are using XGB + 2MB - 60KB as the argument "addr" in can_swapin_thp().
> - if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
> - goto check_folio;
> -
> - folio_ptep = vmf->pte - idx;
> - folio_pte = ptep_get(folio_ptep);
> - if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
> - swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
> + if (!can_swapin_thp(vmf, folio_ptep, folio_address, nr))
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check
2025-05-19 7:08 ` Barry Song
@ 2025-05-19 11:09 ` Kairui Song
2025-05-19 11:57 ` Barry Song
0 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-19 11:09 UTC (permalink / raw)
To: Barry Song
Cc: akpm, baohua, baolin.wang, bhe, chrisl, david, hannes, hughd,
kaleshsingh, linux-kernel, linux-mm, nphamcs, ryan.roberts,
shikemeng, tim.c.chen, willy, ying.huang, yosryahmed
On Mon, May 19, 2025 at 3:08 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > From: Kairui Song <kasong@tencent.com>
>
>
> > -static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> > +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> > + unsigned long addr, unsigned int nr_pages)
>
> > + if (unlikely(addr < max(addr & PMD_MASK, vmf->vma->vm_start) ||
> > + addr_end > pmd_addr_end(addr, vmf->vma->vm_end)))
>
>
> > @@ -4731,27 +4732,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > page_idx = 0;
> > address = vmf->address;
> > ptep = vmf->pte;
> > +
> > if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > - int nr = folio_nr_pages(folio);
> > + unsigned long nr = folio_nr_pages(folio);
> > unsigned long idx = folio_page_idx(folio, page);
> > - unsigned long folio_start = address - idx * PAGE_SIZE;
> > - unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > - pte_t *folio_ptep;
> > - pte_t folio_pte;
> > + unsigned long folio_address = address - idx * PAGE_SIZE;
> > + pte_t *folio_ptep = vmf->pte - idx;
> >
> > - if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> > - goto check_folio;
>
> We are handling a corner case where a large folio is remapped to an unaligned address.
> For example,
>
> A 64KiB mTHP at address: XGB + 2MB +4KB,
>
> Its start address will be XGB + 2MB - 60KB which is another PMD.
>
> The previous code will return false; now your can_swapin_thp() will return true
> as you are using XGB + 2MB - 60KB as the argument "addr" in can_swapin_thp().
Thanks very much for the info and explanation.
You are right, I need to keep using vmf->address in can_swapin_thp:
if (unlikely(addr < max(vmf->address & PMD_MASK, vmf->vma->vm_start) ||
addr_end > pmd_addr_end(vmf->address, vmf->vma->vm_end)))
return false;
But one thing I'm not so sure about is how that happens. And there isn't any
address check in the direct swapin mTHP path above?
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check
2025-05-19 11:09 ` Kairui Song
@ 2025-05-19 11:57 ` Barry Song
0 siblings, 0 replies; 56+ messages in thread
From: Barry Song @ 2025-05-19 11:57 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, baolin.wang, bhe, chrisl, david, hannes, hughd, kaleshsingh,
linux-kernel, linux-mm, nphamcs, ryan.roberts, shikemeng,
tim.c.chen, willy, ying.huang, yosryahmed
On Mon, May 19, 2025 at 7:10 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, May 19, 2025 at 3:08 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > From: Kairui Song <kasong@tencent.com>
> >
> >
> > > -static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
> > > +static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep,
> > > + unsigned long addr, unsigned int nr_pages)
> >
> > > + if (unlikely(addr < max(addr & PMD_MASK, vmf->vma->vm_start) ||
> > > + addr_end > pmd_addr_end(addr, vmf->vma->vm_end)))
> >
> >
> > > @@ -4731,27 +4732,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > page_idx = 0;
> > > address = vmf->address;
> > > ptep = vmf->pte;
> > > +
> > > if (folio_test_large(folio) && folio_test_swapcache(folio)) {
> > > - int nr = folio_nr_pages(folio);
> > > + unsigned long nr = folio_nr_pages(folio);
> > > unsigned long idx = folio_page_idx(folio, page);
> > > - unsigned long folio_start = address - idx * PAGE_SIZE;
> > > - unsigned long folio_end = folio_start + nr * PAGE_SIZE;
> > > - pte_t *folio_ptep;
> > > - pte_t folio_pte;
> > > + unsigned long folio_address = address - idx * PAGE_SIZE;
> > > + pte_t *folio_ptep = vmf->pte - idx;
> > >
> > > - if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> > > - goto check_folio;
> >
> > We are handling a corner case where a large folio is remapped to an unaligned address.
> > For example,
> >
> > A 64KiB mTHP at address: XGB + 2MB +4KB,
> >
> > Its start address will be XGB + 2MB - 60KB which is another PMD.
> >
> > The previous code will return false; now your can_swapin_thp() will return true
> > as you are using XGB + 2MB - 60KB as the argument "addr" in can_swapin_thp().
>
> Thanks very much for the info and explanation.
>
> You are right, I need to keep using vmf->address in can_swapin_thp:
>
> if (unlikely(addr < max(vmf->address & PMD_MASK, vmf->vma->vm_start) ||
> addr_end > pmd_addr_end(vmf->address, vmf->vma->vm_end)))
> return false;
>
> But one thing I'm not so sure about is how that happens. And there isn't any
> address check in the direct swapin mTHP path above?
In page faults, we always make the start address aligned with
PAGE_SIZE * nr_pages, but for mremap, we can't actually control the
dst address. So the original code excluded this case for direct mTHP
swapin with the checks below, which you are dropping:
- if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
- goto check_folio;
- if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
- goto check_folio;
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-19 4:38 ` Barry Song
@ 2025-05-20 3:31 ` Kairui Song
2025-05-20 4:41 ` Barry Song
0 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-20 3:31 UTC (permalink / raw)
To: Barry Song
Cc: akpm, baohua, baolin.wang, bhe, chrisl, david, hannes, hughd,
kaleshsingh, linux-kernel, linux-mm, nphamcs, ryan.roberts,
shikemeng, tim.c.chen, willy, ying.huang, yosryahmed
On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > From: Kairui Song <kasong@tencent.com>
>
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index e5a0db7f3331..5b4f01aecf35 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > goto retry;
> > }
> > }
> > + if (!folio_swap_contains(src_folio, entry)) {
> > + err = -EBUSY;
> > + goto out;
> > + }
>
> It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> are stable:
>
> if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> dst_pmd, dst_pmdval)) {
> double_pt_unlock(dst_ptl, src_ptl);
> return -EAGAIN;
> }
The tricky part is when swap_cache_get_folio returns the folio, both
folio and ptes are unlocked. So is it possible that someone else
swapped in the entries, then swapped them out again using the same
entries?
The folio will be different here but the PTEs still hold the same value, so
they will pass the is_pte_pages_stable check; we previously saw
similar races with anon fault or shmem. I think more strict checking
won't hurt here.
>
> Also, -EBUSY is somehow incorrect error code.
Yes, thanks, I'll use EAGAIN here just like move_swap_pte.
>
> > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > dst_ptl, src_ptl, src_folio);
> >
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers
2025-05-19 6:26 ` Barry Song
@ 2025-05-20 3:50 ` Kairui Song
0 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-20 3:50 UTC (permalink / raw)
To: Barry Song
Cc: akpm, baohua, baolin.wang, bhe, chrisl, david, hannes, hughd,
kaleshsingh, linux-kernel, linux-mm, nphamcs, ryan.roberts,
shikemeng, tim.c.chen, willy, ying.huang, yosryahmed
On Mon, May 19, 2025 at 2:26 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > From: Kairui Song <kasong@tencent.com>
>
> > @@ -889,10 +849,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > /* Serialize HDD SWAP allocation for each device. */
> > spin_lock(&si->global_cluster_lock);
> > offset = si->global_cluster->next[order];
> > - if (offset == SWAP_ENTRY_INVALID)
> > - goto new_cluster;
>
> We are implicitly dropping this. Does it mean the current code is wrong?
> Do we need some clarification about this?
Sorry, my bad. This change has nothing to do with this commit; I'll
drop it in the next version.
>
> >
> > - ci = lock_cluster(si, offset);
> > + ci = swap_lock_cluster(si, offset);
> > /* Cluster could have been used by another order */
> > if (cluster_is_usable(ci, order)) {
> > if (cluster_is_empty(ci))
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-20 3:31 ` Kairui Song
@ 2025-05-20 4:41 ` Barry Song
2025-05-20 19:09 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2025-05-20 4:41 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, baolin.wang, bhe, chrisl, david, hannes, hughd, kaleshsingh,
linux-kernel, linux-mm, nphamcs, ryan.roberts, shikemeng,
tim.c.chen, willy, ying.huang, yosryahmed
On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > From: Kairui Song <kasong@tencent.com>
> >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index e5a0db7f3331..5b4f01aecf35 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > goto retry;
> > > }
> > > }
> > > + if (!folio_swap_contains(src_folio, entry)) {
> > > + err = -EBUSY;
> > > + goto out;
> > > + }
> >
> > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > are stable:
> >
> > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > dst_pmd, dst_pmdval)) {
> > double_pt_unlock(dst_ptl, src_ptl);
> > return -EAGAIN;
> > }
>
> The tricky part is when swap_cache_get_folio returns the folio, both
> folio and ptes are unlocked. So is it possible that someone else
> swapped in the entries, then swapped them out again using the same
> entries?
>
> The folio will be different here but the PTEs still hold the same value, so
> they will pass the is_pte_pages_stable check; we previously saw
> similar races with anon fault or shmem. I think more strict checking
> won't hurt here.
This doesn't seem to be the same case as the one you fixed in
do_swap_page(). Here, we're hitting the swap cache, whereas in that
case, there was no one hitting the swap cache, and you used
swap_prepare() to set up the cache to fix the issue.
By the way, if we're not hitting the swap cache, src_folio will be
NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
not guard against that case either.
But I suspect we won't have a problem, since we're not swapping in —
we didn't read any stale data, right? Swap-in will only occur after we
move the PTEs.
>
> >
> > Also, -EBUSY is somehow incorrect error code.
>
> Yes, thanks, I'll use EAGAIN here just like move_swap_pte.
>
>
> >
> > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > dst_ptl, src_ptl, src_folio);
> > >
> >
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-20 4:41 ` Barry Song
@ 2025-05-20 19:09 ` Kairui Song
2025-05-20 22:33 ` Barry Song
0 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-20 19:09 UTC (permalink / raw)
To: Barry Song
Cc: akpm, baolin.wang, bhe, chrisl, david, hannes, hughd, kaleshsingh,
linux-kernel, linux-mm, nphamcs, ryan.roberts, shikemeng,
tim.c.chen, willy, ying.huang, yosryahmed
On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > From: Kairui Song <kasong@tencent.com>
> > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > goto retry;
> > > > }
> > > > }
> > > > + if (!folio_swap_contains(src_folio, entry)) {
> > > > + err = -EBUSY;
> > > > + goto out;
> > > > + }
> > >
> > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > are stable:
> > >
> > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > dst_pmd, dst_pmdval)) {
> > > double_pt_unlock(dst_ptl, src_ptl);
> > > return -EAGAIN;
> > > }
> >
> > The tricky part is when swap_cache_get_folio returns the folio, both
> > folio and ptes are unlocked. So is it possible that someone else
> > swapped in the entries, then swapped them out again using the same
> > entries?
> >
> > The folio will be different here but the PTEs still hold the same value, so
> > they will pass the is_pte_pages_stable check; we previously saw
> > similar races with anon fault or shmem. I think more strict checking
> > won't hurt here.
>
> This doesn't seem to be the same case as the one you fixed in
> do_swap_page(). Here, we're hitting the swap cache, whereas in that
> case, there was no one hitting the swap cache, and you used
> swap_prepare() to set up the cache to fix the issue.
>
> By the way, if we're not hitting the swap cache, src_folio will be
> NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> not guard against that case either.
Ah, that's true, it should be moved inside the if (folio) {...} block
above. Thanks for catching this!
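A rough sketch of that placement, reusing the folio_swap_contains() helper
from this patch and the -EAGAIN error code agreed on below (the surrounding
move_pages_pte() context is elided):

	if (folio) {
		/* ... swap cache hit, src_folio has been set up from folio ... */
		if (!folio_swap_contains(src_folio, entry)) {
			err = -EAGAIN;
			goto out;
		}
	}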
> But I suspect we won't have a problem, since we're not swapping in —
> we didn't read any stale data, right? Swap-in will only occur after we
> move the PTEs.
My concern is that a parallel swapin / swapout could result in the
folio becoming a completely irrelevant or invalid folio.
It's not about the dst side, but the move src side. Something like:
CPU1                                  CPU2
move_pages_pte
  folio = swap_cache_get_folio(...)
  | Got folio A here
  move_swap_pte
                                      <swapin src_pte, using folio A>
                                      <swapout src_pte, put folio A>
                                      | Now folio A is no longer valid.
                                      | It's very unlikely but here SWAP
                                      | could reuse the same entry as above.
    double_pt_lock
    is_pte_pages_stable
    | Passed because of entry reuse.
    folio_move_anon_rmap(...)
    | Moved invalid folio A.
And could it be possible that swap_cache_get_folio returns NULL
here, but later, right before the double_pt_lock, a folio is added to
the swap cache? Maybe we'd better check the swap cache after clearing and
releasing the dst lock, but before releasing the src lock?
>
> >
> > >
> > > Also, -EBUSY is somehow incorrect error code.
> >
> > Yes, thanks, I'll use EAGAIN here just like move_swap_pte.
> >
> >
> > >
> > > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > dst_ptl, src_ptl, src_folio);
> > > >
> > >
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-20 19:09 ` Kairui Song
@ 2025-05-20 22:33 ` Barry Song
2025-05-21 2:45 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2025-05-20 22:33 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, baolin.wang, bhe, chrisl, david, hannes, hughd, kaleshsingh,
linux-kernel, linux-mm, nphamcs, ryan.roberts, shikemeng,
tim.c.chen, willy, ying.huang, yosryahmed
On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > --- a/mm/userfaultfd.c
> > > > > +++ b/mm/userfaultfd.c
> > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > goto retry;
> > > > > }
> > > > > }
> > > > > + if (!folio_swap_contains(src_folio, entry)) {
> > > > > + err = -EBUSY;
> > > > > + goto out;
> > > > > + }
> > > >
> > > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > > are stable:
> > > >
> > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > dst_pmd, dst_pmdval)) {
> > > > double_pt_unlock(dst_ptl, src_ptl);
> > > > return -EAGAIN;
> > > > }
> > >
> > > The tricky part is when swap_cache_get_folio returns the folio, both
> > > folio and ptes are unlocked. So is it possible that someone else
> > > swapped in the entries, then swapped them out again using the same
> > > entries?
> > >
> > > The folio will be different here but the PTEs still hold the same value, so
> > > they will pass the is_pte_pages_stable check; we previously saw
> > > similar races with anon fault or shmem. I think more strict checking
> > > won't hurt here.
> >
> > This doesn't seem to be the same case as the one you fixed in
> > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > case, there was no one hitting the swap cache, and you used
> > swap_prepare() to set up the cache to fix the issue.
> >
> > By the way, if we're not hitting the swap cache, src_folio will be
> > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > not guard against that case either.
>
> Ah, that's true, it should be moved inside the if (folio) {...} block
> above. Thanks for catching this!
>
> > But I suspect we won't have a problem, since we're not swapping in —
> > we didn't read any stale data, right? Swap-in will only occur after we
> > move the PTEs.
>
> My concern is that a parallel swapin / swapout could result in the
> folio becoming a completely irrelevant or invalid folio.
>
> It's not about the dst side, but the move src side. Something like:
>
> CPU1 CPU2
> move_pages_pte
> folio = swap_cache_get_folio(...)
> | Got folio A here
> move_swap_pte
> <swapin src_pte, using folio A>
> <swapout src_pte, put folio A>
> | Now folio A is no longer valid.
> | It's very unlikely but here SWAP
> | could reuse the same entry as above.
swap_cache_get_folio() does increment the folio's refcount, but it seems this
doesn't prevent do_swap_page() from freeing the swap entry after swapping
in src_pte with folio A, if it's a read fault.
For a write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
will be false:
static inline bool should_try_to_free_swap(struct folio *folio,
struct vm_area_struct *vma,
unsigned int fault_flags)
{
...
/*
* If we want to map a page that's in the swapcache writable, we
* have to detect via the refcount if we're really the exclusive
* user. Try freeing the swapcache to get rid of the swapcache
* reference only in case it's likely that we'll be the exlusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
folio_ref_count(folio) == (1 + folio_nr_pages(folio));
}
And for swapout, __remove_mapping() does check the refcount as well:
static int __remove_mapping(struct address_space *mapping, struct folio *folio,
bool reclaimed, struct mem_cgroup *target_memcg)
{
refcount = 1 + folio_nr_pages(folio);
if (!folio_ref_freeze(folio, refcount))
goto cannot_free;
}
However, since __remove_mapping() occurs after pageout(), it seems
this also doesn't prevent swapout from allocating a new swap entry to
fill src_pte.
It seems your concern is valid—unless I'm missing something.
Do you have a reproducer? If so, this will likely need a separate fix
patch rather than being hidden in this patchset.
> double_pt_lock
> is_pte_pages_stable
> | Passed because of entry reuse.
> folio_move_anon_rmap(...)
> | Moved invalid folio A.
>
> And could it be possible that the swap_cache_get_folio returns NULL
> here, but later right before the double_pt_lock, a folio is added to
> swap cache? Maybe we better check the swap cache after clear and
> releasing dst lock, but before releasing src lock?
It seems you're suggesting that a parallel swap-in allocates and adds
a folio to the swap cache, but the PTE has not yet been updated from
a swap entry to a present mapping?
As long as do_swap_page() adds the folio to the swap cache
before updating the PTE to present, this scenario seems possible.
It seems we need to double-check:
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..976053bd2bf1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (src_folio) {
 		folio_move_anon_rmap(src_folio, dst_vma);
 		src_folio->index = linear_page_index(dst_vma, dst_addr);
+	} else {
+		struct folio *folio = filemap_get_folio(swap_address_space(entry),
+							swap_cache_index(entry));
+		if (!IS_ERR_OR_NULL(folio)) {
+			double_pt_unlock(dst_ptl, src_ptl);
+			return -EAGAIN;
+		}
 	}
-
 	orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
 #ifdef CONFIG_MEM_SOFT_DIRTY
 	orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
Let me run test case [1] to check whether this ever happens. I guess I need to
hack the kernel a bit to always add the folio to the swap cache even for SYNC IO.
[1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gmail.com/
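Something like this rough sketch against mm/memory.c should be enough for
testing (the exact condition in do_swap_page() may differ a bit between
kernel versions, so treat this as a sketch rather than a real patch):

--- a/mm/memory.c
+++ b/mm/memory.c
@@ do_swap_page() (sketch only)
-	if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
-	    __swap_count(entry) == 1) {
+	/* testing hack: never take the swap-cache-bypassing SYNC IO path */
+	if (0) {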
>
>
> >
> > >
> > > >
> > > > Also, -EBUSY is somehow incorrect error code.
> > >
> > > Yes, thanks, I'll use EAGAIN here just like move_swap_pte.
> > >
> > >
> > > >
> > > > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > > dst_ptl, src_ptl, src_folio);
> > > > >
> > > >
> >
Thanks
Barry
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-20 22:33 ` Barry Song
@ 2025-05-21 2:45 ` Kairui Song
2025-05-21 3:24 ` Barry Song
2025-05-23 2:29 ` Barry Song
0 siblings, 2 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-21 2:45 UTC (permalink / raw)
To: Barry Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
>
> On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > > From: Kairui Song <kasong@tencent.com>
> > > > >
> > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > --- a/mm/userfaultfd.c
> > > > > > +++ b/mm/userfaultfd.c
> > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > goto retry;
> > > > > > }
> > > > > > }
> > > > > > + if (!folio_swap_contains(src_folio, entry)) {
> > > > > > + err = -EBUSY;
> > > > > > + goto out;
> > > > > > + }
> > > > >
> > > > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > > > are stable:
> > > > >
> > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > dst_pmd, dst_pmdval)) {
> > > > > double_pt_unlock(dst_ptl, src_ptl);
> > > > > return -EAGAIN;
> > > > > }
> > > >
> > > > The tricky part is when swap_cache_get_folio returns the folio, both
> > > > folio and ptes are unlocked. So is it possible that someone else
> > > > swapped in the entries, then swapped them out again using the same
> > > > entries?
> > > >
> > > > The folio will be different here but PTEs are still the same value to
> > > > they will pass the is_pte_pages_stable check, we previously saw
> > > > similar races with anon fault or shmem. I think more strict checking
> > > > won't hurt here.
> > >
> > > This doesn't seem to be the same case as the one you fixed in
> > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > case, there was no one hitting the swap cache, and you used
> > > swap_prepare() to set up the cache to fix the issue.
> > >
> > > By the way, if we're not hitting the swap cache, src_folio will be
> > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > not guard against that case either.
> >
> > Ah, that's true, it should be moved inside the if (folio) {...} block
> > above. Thanks for catching this!
> >
> > > But I suspect we won't have a problem, since we're not swapping in —
> > > we didn't read any stale data, right? Swap-in will only occur after we
> > > move the PTEs.
> >
> > My concern is that a parallel swapin / swapout could result in the
> > folio to be a completely irrelevant or invalid folio.
> >
> > It's not about the dst, but in the move src side, something like:
> >
> > CPU1 CPU2
> > move_pages_pte
> > folio = swap_cache_get_folio(...)
> > | Got folio A here
> > move_swap_pte
> > <swapin src_pte, using folio A>
> > <swapout src_pte, put folio A>
> > | Now folio A is no longer valid.
> > | It's very unlikely but here SWAP
> > | could reuse the same entry as above.
>
>
> swap_cache_get_folio() does increment the folio's refcount, but it seems this
> doesn't prevent do_swap_page() from freeing the swap entry after swapping
> in src_pte with folio A, if it's a read fault.
> for write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> will be false:
>
> static inline bool should_try_to_free_swap(struct folio *folio,
> struct vm_area_struct *vma,
> unsigned int fault_flags)
> {
> ...
>
> /*
> * If we want to map a page that's in the swapcache writable, we
> * have to detect via the refcount if we're really the exclusive
> * user. Try freeing the swapcache to get rid of the swapcache
> * reference only in case it's likely that we'll be the exlusive user.
> */
> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> }
>
> and for swapout, __removing_mapping does check refcount as well:
>
> static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> bool reclaimed, struct mem_cgroup *target_memcg)
> {
> refcount = 1 + folio_nr_pages(folio);
> if (!folio_ref_freeze(folio, refcount))
> goto cannot_free;
>
> }
>
> However, since __remove_mapping() occurs after pageout(), it seems
> this also doesn't prevent swapout from allocating a new swap entry to
> fill src_pte.
>
> It seems your concern is valid—unless I'm missing something.
> Do you have a reproducer? If so, this will likely need a separate fix
> patch rather than being hidden in this patchset.
Thanks for the analysis. I don't have a reproducer yet. I did some local
experiments and it seems possible, but the race window is so tiny that it's
very difficult to make the swap entry reuse collide with it. I'll try more,
but in theory this seems possible, or at least it looks very fragile.
And yeah, let's patch the kernel first if that's a real issue.
>
> > double_pt_lock
> > is_pte_pages_stable
> > | Passed because of entry reuse.
> > folio_move_anon_rmap(...)
> > | Moved invalid folio A.
> >
> > And could it be possible that the swap_cache_get_folio returns NULL
> > here, but later right before the double_pt_lock, a folio is added to
> > swap cache? Maybe we better check the swap cache after clear and
> > releasing dst lock, but before releasing src lock?
>
> It seems you're suggesting that a parallel swap-in allocates and adds
> a folio to the swap cache, but the PTE has not yet been updated from
> a swap entry to a present mapping?
>
> As long as do_swap_page() adds the folio to the swap cache
> before updating the PTE to present, this scenario seems possible.
Yes, there are two kinds of problems here. I suspected there could be an
ABA problem while working on the series, but wasn't certain. And I just
realised there could be another missed cache read here, thanks to your
review and discussion :)
>
> It seems we need to double-check:
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..976053bd2bf1 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm,
> struct vm_area_struct *dst_vma,
> if (src_folio) {
> folio_move_anon_rmap(src_folio, dst_vma);
> src_folio->index = linear_page_index(dst_vma, dst_addr);
> + } else {
> + struct folio *folio =
> filemap_get_folio(swap_address_space(entry),
> + swap_cache_index(entry));
> + if (!IS_ERR_OR_NULL(folio)) {
> + double_pt_unlock(dst_ptl, src_ptl);
> + return -EAGAIN;
> + }
> }
> -
> orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> #ifdef CONFIG_MEM_SOFT_DIRTY
> orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
Maybe it has to get even dirtier here and call swapcache_prepare() too, to
cover the SYNC_IO case?
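Something like this, just as a sketch (the exact swapcache_prepare()
signature and the matching swapcache_clear() call depend on the kernel
version, so treat the details here as assumptions):

	} else {
		/*
		 * Sketch only: pin the entry so a parallel SWP_SYNCHRONOUS_IO
		 * swapin (which never inserts into the swap cache) backs off
		 * until the move is done. The matching swapcache_clear() after
		 * the PTEs are moved is omitted here.
		 */
		if (swapcache_prepare(entry, 1)) {
			double_pt_unlock(dst_ptl, src_ptl);
			return -EAGAIN;
		}
	}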
>
> Let me run test case [1] to check whether this ever happens. I guess I need to
> hack kernel a bit to always add folio to swapcache even for SYNC IO.
That will cause quite a performance regression, I think. The good thing is,
that's exactly the problem this series solves by dropping the SYNC IO swapin
path and never bypassing the swap cache, while improving performance and
eliminating things like this. One more reason to justify the approach :)
>
> [1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gmail.com/
I'll try this too.
>
> >
> >
> > >
> > > >
> > > > >
> > > > > Also, -EBUSY is somehow incorrect error code.
> > > >
> > > > Yes, thanks, I'll use EAGAIN here just like move_swap_pte.
> > > >
> > > >
> > > > >
> > > > > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > > > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > > > dst_ptl, src_ptl, src_folio);
> > > > > >
> > > > >
> > >
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-21 2:45 ` Kairui Song
@ 2025-05-21 3:24 ` Barry Song
2025-05-23 2:29 ` Barry Song
1 sibling, 0 replies; 56+ messages in thread
From: Barry Song @ 2025-05-21 3:24 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> >
> > On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > goto retry;
> > > > > > > }
> > > > > > > }
> > > > > > > + if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > + err = -EBUSY;
> > > > > > > + goto out;
> > > > > > > + }
> > > > > >
> > > > > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > > > > are stable:
> > > > > >
> > > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > dst_pmd, dst_pmdval)) {
> > > > > > double_pt_unlock(dst_ptl, src_ptl);
> > > > > > return -EAGAIN;
> > > > > > }
> > > > >
> > > > > The tricky part is when swap_cache_get_folio returns the folio, both
> > > > > folio and ptes are unlocked. So is it possible that someone else
> > > > > swapped in the entries, then swapped them out again using the same
> > > > > entries?
> > > > >
> > > > > The folio will be different here but PTEs are still the same value to
> > > > > they will pass the is_pte_pages_stable check, we previously saw
> > > > > similar races with anon fault or shmem. I think more strict checking
> > > > > won't hurt here.
> > > >
> > > > This doesn't seem to be the same case as the one you fixed in
> > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > case, there was no one hitting the swap cache, and you used
> > > > swap_prepare() to set up the cache to fix the issue.
> > > >
> > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > not guard against that case either.
> > >
> > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > above. Thanks for catching this!
> > >
> > > > But I suspect we won't have a problem, since we're not swapping in —
> > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > move the PTEs.
> > >
> > > My concern is that a parallel swapin / swapout could result in the
> > > folio to be a completely irrelevant or invalid folio.
> > >
> > > It's not about the dst, but in the move src side, something like:
> > >
> > > CPU1 CPU2
> > > move_pages_pte
> > > folio = swap_cache_get_folio(...)
> > > | Got folio A here
> > > move_swap_pte
> > > <swapin src_pte, using folio A>
> > > <swapout src_pte, put folio A>
> > > | Now folio A is no longer valid.
> > > | It's very unlikely but here SWAP
> > > | could reuse the same entry as above.
> >
> >
> > swap_cache_get_folio() does increment the folio's refcount, but it seems this
> > doesn't prevent do_swap_page() from freeing the swap entry after swapping
> > in src_pte with folio A, if it's a read fault.
> > for write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > will be false:
> >
> > static inline bool should_try_to_free_swap(struct folio *folio,
> > struct vm_area_struct *vma,
> > unsigned int fault_flags)
> > {
> > ...
> >
> > /*
> > * If we want to map a page that's in the swapcache writable, we
> > * have to detect via the refcount if we're really the exclusive
> > * user. Try freeing the swapcache to get rid of the swapcache
> > * reference only in case it's likely that we'll be the exlusive user.
> > */
> > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > }
> >
> > and for swapout, __removing_mapping does check refcount as well:
> >
> > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > bool reclaimed, struct mem_cgroup *target_memcg)
> > {
> > refcount = 1 + folio_nr_pages(folio);
> > if (!folio_ref_freeze(folio, refcount))
> > goto cannot_free;
> >
> > }
> >
> > However, since __remove_mapping() occurs after pageout(), it seems
> > this also doesn't prevent swapout from allocating a new swap entry to
> > fill src_pte.
> >
> > It seems your concern is valid—unless I'm missing something.
> > Do you have a reproducer? If so, this will likely need a separate fix
> > patch rather than being hidden in this patchset.
>
> Thanks for the analysis. I don't have a reproducer yet, I did some
> local experiments and that seems possible, but the race window is so
> tiny and it's very difficult to make the swap entry reuse to collide
> with that, I'll try more but in theory this seems possible, or at
> least looks very fragile.
I think we don't necessarily need to prove it's the same swap entry. As long
as we can show that the original and the current src PTEs are both swap
entries (regardless of whether they have the same offset), that covers the
possibility that they do share the same offset. This can increase the
likelihood of hitting the case and help validate the conceptual model.
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..f072d4a5bcd4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1090,6 +1090,10 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
 	if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
 				 dst_pmd, dst_pmdval)) {
+		if (src_folio && is_swap_pte(orig_src_pte) &&
+		    is_swap_pte(ptep_get(src_pte)))
+			pr_info("the case is true: both src PTEs are swap entries\n");
+
 		double_pt_unlock(dst_ptl, src_ptl);
 		return -EAGAIN;
 	}
>
> And yeah, let's patch the kernel first if that's a real issue.
>
> >
> > > double_pt_lock
> > > is_pte_pages_stable
> > > | Passed because of entry reuse.
> > > folio_move_anon_rmap(...)
> > > | Moved invalid folio A.
> > >
> > > And could it be possible that the swap_cache_get_folio returns NULL
> > > here, but later right before the double_pt_lock, a folio is added to
> > > swap cache? Maybe we better check the swap cache after clear and
> > > releasing dst lock, but before releasing src lock?
> >
> > It seems you're suggesting that a parallel swap-in allocates and adds
> > a folio to the swap cache, but the PTE has not yet been updated from
> > a swap entry to a present mapping?
> >
> > As long as do_swap_page() adds the folio to the swap cache
> > before updating the PTE to present, this scenario seems possible.
>
> Yes, that's two kinds of problems here. I suspected there could be an
> ABA problem while working on the series, but wasn't certain. And just
> realised there could be another missed cache read here thanks to your
> review and discussion :)
>
> >
> > It seems we need to double-check:
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..976053bd2bf1 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm,
> > struct vm_area_struct *dst_vma,
> > if (src_folio) {
> > folio_move_anon_rmap(src_folio, dst_vma);
> > src_folio->index = linear_page_index(dst_vma, dst_addr);
> > + } else {
> > + struct folio *folio =
> > filemap_get_folio(swap_address_space(entry),
> > + swap_cache_index(entry));
> > + if (!IS_ERR_OR_NULL(folio)) {
> > + double_pt_unlock(dst_ptl, src_ptl);
> > + return -EAGAIN;
> > + }
> > }
> > -
> > orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > #ifdef CONFIG_MEM_SOFT_DIRTY
> > orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
>
> Maybe it has to get even dirtier here to call swapcache_prepare too to
> cover the SYNC_IO case?
>
> >
> > Let me run test case [1] to check whether this ever happens. I guess I need to
> > hack kernel a bit to always add folio to swapcache even for SYNC IO.
>
> That will cause quite a performance regression I think. Good thing is,
> that's exactly the problem this series is solving by dropping the SYNC
> IO swapin path and never bypassing the swap cache, while improving the
> performance, eliminating things like this. One more reason to justify
> the approach :)
>
> >
> > [1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gmail.com/
>
> I'll try this too.
>
> >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Also, -EBUSY is somehow incorrect error code.
> > > > >
> > > > > Yes, thanks, I'll use EAGAIN here just like move_swap_pte.
> > > > >
> > > > >
> > > > > >
> > > > > > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > > > > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > > > > dst_ptl, src_ptl, src_folio);
> > > > > > >
> > > > > >
> > > >
> >
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 28/28] mm, swap: implement dynamic allocation of swap table
2025-05-14 20:17 ` [PATCH 28/28] mm, swap: implement dynamic allocation of swap table Kairui Song
@ 2025-05-21 18:36 ` Nhat Pham
2025-05-22 4:13 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Nhat Pham @ 2025-05-21 18:36 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Johannes Weiner,
Baolin Wang, Baoquan He, Barry Song, Kalesh Singh, Kemeng Shi,
Tim Chen, Ryan Roberts, linux-kernel
On Wed, May 14, 2025 at 1:20 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now swap table is cluster based, which means free clusters can free its
> table since no one should modify it.
>
> There could be speculative readers, like swap cache look up, protect
> them by making them RCU safe. All swap table should be filled with null
> entries before free, so such readers will either see a NULL pointer or
> a null filled table being lazy freed.
>
> On allocation, allocate the table when a cluster is used by any order.
>
> This way, we can reduce the memory usage of large swap device
> significantly.
>
> This idea to dynamically release unused swap cluster data was initially
> suggested by Chris Li while proposing the cluster swap allocator and
> I found it suits the swap table idea very well.
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Nice optimization!
However, please correct me if I'm wrong - but we are only dynamically
allocating the swap table with this patch. What we are getting here is
the dynamic allocation of the swap entries' metadata (through the swap
table), which my virtual swap prototype already provides. The cluster
metadata struct (struct swap_cluster_info) itself is statically
allocated still (at swapon time), correct? That will not work for a
large virtual swap space :( So unfortunately, even with this swap
table series, swap virtualization is still not trivial - definitely
not as trivial as a new swap device type...
Reading your physical swapfile allocator gives me some ideas though -
let me build it into my prototype :) I'll send it out once it's ready.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 28/28] mm, swap: implement dynamic allocation of swap table
2025-05-21 18:36 ` Nhat Pham
@ 2025-05-22 4:13 ` Kairui Song
0 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-22 4:13 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Johannes Weiner,
Baolin Wang, Baoquan He, Barry Song, Kalesh Singh, Kemeng Shi,
Tim Chen, Ryan Roberts, linux-kernel
On Thu, May 22, 2025 at 3:38 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, May 14, 2025 at 1:20 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Now swap table is cluster based, which means free clusters can free its
> > table since no one should modify it.
> >
> > There could be speculative readers, like swap cache look up, protect
> > them by making them RCU safe. All swap table should be filled with null
> > entries before free, so such readers will either see a NULL pointer or
> > a null filled table being lazy freed.
> >
> > On allocation, allocate the table when a cluster is used by any order.
> >
> > This way, we can reduce the memory usage of large swap device
> > significantly.
> >
> > This idea to dynamically release unused swap cluster data was initially
> > suggested by Chris Li while proposing the cluster swap allocator and
> > I found it suits the swap table idea very well.
> >
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Nice optimization!
Thanks!
>
> However, please correct me if I'm wrong - but we are only dynamically
> allocating the swap table with this patch. What we are getting here is
> the dynamic allocation of the swap entries' metadata (through the swap
> table), which my virtual swap prototype already provides. The cluster
> metadata struct (struct swap_cluster_info) itself is statically
> allocated still (at swapon time), correct?
That's true for now, but note that the static data is much smaller and
unified now, and that enables more work in the following ways:
(I didn't include it in the series because it is getting too long already..)
The static data is only 48 bytes per 2M of swap space, so, for example, if
you have a 1TB swap device / space, it's only about 20M in total; previously
it would have been at least 768M (could be much higher, as I'm only counting
the swap_map and cgroup array here).
Now the memory overhead is 0.0019% of the swap space.
And the static data is now only an intermediate cluster table, used in only
one place (si->cluster_info), so reallocating it is doable now: readers of
the actual swap table are protected by RCU and won't modify the cluster
metadata; the only updaters of the cluster metadata are allocation/freeing,
and they can be organized in better ways to allow the cluster data to be
reallocated.
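For example, the speculative reader side could be as simple as something
like this (just a sketch with made-up field and helper names, not the
actual patch):

/* Speculative lookup of one swap table entry in a cluster (sketch only). */
static unsigned long swap_table_try_get(struct swap_cluster_info *ci,
					unsigned int off)
{
	unsigned long swp_te = 0;	/* null entry */
	unsigned long *table;

	rcu_read_lock();
	/* NULL once the cluster and its table have been freed */
	table = rcu_dereference(ci->table);
	if (table)
		/* entries are set to null before the table is lazily freed */
		swp_te = READ_ONCE(table[off]);
	rcu_read_unlock();

	return swp_te;
}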
And due to the low memory overhead of the cluster metadata, it's totally
acceptable to preallocate a much larger space now; for example, we could
always preallocate a 4TB space on boot, that's 80M in total. That might not
seem trivial, but there is another planned series to make the vmalloc space
dynamic too, leveraging the page table directly, so the 20M per TB overhead
can be avoided as well. Not sure if it will be needed though, the overhead
is so tiny already.
So in summary, what I have in mind is that we can either:
- Extend the cluster data when it's not enough (or getting fragmented);
  since the table data is still accessible during the reallocation and the
  copied data is minimal, it shouldn't be a heavyweight operation.
- Preallocate a larger amount of cluster data on swapon; the overhead is
  still very manageable.
- (Once we have a dynamic vmalloc) preallocate a super large space for swap
  and allocate each page when needed.
These ideas can be combined to some extent, or are related to each other.
> That will not work for a
> large virtual swap space :( So unfortunately, even with this swap
> table series, swap virtualization is still not trivial - definitely
> not as trivial as a new swap device type...
>
> Reading your physical swapfile allocator gives me some ideas though -
> let me build it into my prototype :) I'll send it out once it's ready.
>
Yeah, a virtual swap is definitely not trivial; instead it's challenging and
very important, just as you have demonstrated. It requires quite some work
beyond just metadata-level things; I never expected it to be as simple as
"just another swap table entry type" :)
What I meant is that, to be done with minimal overhead and better
flexibility, swap needs better infrastructure, which is what this series is
working on.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-21 2:45 ` Kairui Song
2025-05-21 3:24 ` Barry Song
@ 2025-05-23 2:29 ` Barry Song
2025-05-23 20:01 ` Kairui Song
1 sibling, 1 reply; 56+ messages in thread
From: Barry Song @ 2025-05-23 2:29 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> >
> > On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > >
> > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > --- a/mm/userfaultfd.c
> > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > goto retry;
> > > > > > > }
> > > > > > > }
> > > > > > > + if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > + err = -EBUSY;
> > > > > > > + goto out;
> > > > > > > + }
> > > > > >
> > > > > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > > > > are stable:
> > > > > >
> > > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > dst_pmd, dst_pmdval)) {
> > > > > > double_pt_unlock(dst_ptl, src_ptl);
> > > > > > return -EAGAIN;
> > > > > > }
> > > > >
> > > > > The tricky part is when swap_cache_get_folio returns the folio, both
> > > > > folio and ptes are unlocked. So is it possible that someone else
> > > > > swapped in the entries, then swapped them out again using the same
> > > > > entries?
> > > > >
> > > > > The folio will be different here but PTEs are still the same value to
> > > > > they will pass the is_pte_pages_stable check, we previously saw
> > > > > similar races with anon fault or shmem. I think more strict checking
> > > > > won't hurt here.
> > > >
> > > > This doesn't seem to be the same case as the one you fixed in
> > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > case, there was no one hitting the swap cache, and you used
> > > > swap_prepare() to set up the cache to fix the issue.
> > > >
> > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > not guard against that case either.
> > >
> > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > above. Thanks for catching this!
> > >
> > > > But I suspect we won't have a problem, since we're not swapping in —
> > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > move the PTEs.
> > >
> > > My concern is that a parallel swapin / swapout could result in the
> > > folio to be a completely irrelevant or invalid folio.
> > >
> > > It's not about the dst, but in the move src side, something like:
> > >
> > > CPU1 CPU2
> > > move_pages_pte
> > > folio = swap_cache_get_folio(...)
> > > | Got folio A here
> > > move_swap_pte
> > > <swapin src_pte, using folio A>
> > > <swapout src_pte, put folio A>
> > > | Now folio A is no longer valid.
> > > | It's very unlikely but here SWAP
> > > | could reuse the same entry as above.
> >
> >
> > swap_cache_get_folio() does increment the folio's refcount, but it seems this
> > doesn't prevent do_swap_page() from freeing the swap entry after swapping
> > in src_pte with folio A, if it's a read fault.
> > for write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > will be false:
> >
> > static inline bool should_try_to_free_swap(struct folio *folio,
> > struct vm_area_struct *vma,
> > unsigned int fault_flags)
> > {
> > ...
> >
> > /*
> > * If we want to map a page that's in the swapcache writable, we
> > * have to detect via the refcount if we're really the exclusive
> > * user. Try freeing the swapcache to get rid of the swapcache
> > * reference only in case it's likely that we'll be the exlusive user.
> > */
> > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > }
> >
> > and for swapout, __removing_mapping does check refcount as well:
> >
> > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > bool reclaimed, struct mem_cgroup *target_memcg)
> > {
> > refcount = 1 + folio_nr_pages(folio);
> > if (!folio_ref_freeze(folio, refcount))
> > goto cannot_free;
> >
> > }
> >
> > However, since __remove_mapping() occurs after pageout(), it seems
> > this also doesn't prevent swapout from allocating a new swap entry to
> > fill src_pte.
> >
> > It seems your concern is valid—unless I'm missing something.
> > Do you have a reproducer? If so, this will likely need a separate fix
> > patch rather than being hidden in this patchset.
>
> Thanks for the analysis. I don't have a reproducer yet, I did some
> local experiments and that seems possible, but the race window is so
> tiny and it's very difficult to make the swap entry reuse to collide
> with that, I'll try more but in theory this seems possible, or at
> least looks very fragile.
>
> And yeah, let's patch the kernel first if that's a real issue.
>
> >
> > > double_pt_lock
> > > is_pte_pages_stable
> > > | Passed because of entry reuse.
> > > folio_move_anon_rmap(...)
> > > | Moved invalid folio A.
> > >
> > > And could it be possible that the swap_cache_get_folio returns NULL
> > > here, but later right before the double_pt_lock, a folio is added to
> > > swap cache? Maybe we better check the swap cache after clear and
> > > releasing dst lock, but before releasing src lock?
> >
> > It seems you're suggesting that a parallel swap-in allocates and adds
> > a folio to the swap cache, but the PTE has not yet been updated from
> > a swap entry to a present mapping?
> >
> > As long as do_swap_page() adds the folio to the swap cache
> > before updating the PTE to present, this scenario seems possible.
>
> Yes, that's two kinds of problems here. I suspected there could be an
> ABA problem while working on the series, but wasn't certain. And just
> realised there could be another missed cache read here thanks to your
> review and discussion :)
>
> >
> > It seems we need to double-check:
> >
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index bc473ad21202..976053bd2bf1 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm,
> > struct vm_area_struct *dst_vma,
> > if (src_folio) {
> > folio_move_anon_rmap(src_folio, dst_vma);
> > src_folio->index = linear_page_index(dst_vma, dst_addr);
> > + } else {
> > + struct folio *folio =
> > filemap_get_folio(swap_address_space(entry),
> > + swap_cache_index(entry));
> > + if (!IS_ERR_OR_NULL(folio)) {
> > + double_pt_unlock(dst_ptl, src_ptl);
> > + return -EAGAIN;
> > + }
> > }
> > -
> > orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > #ifdef CONFIG_MEM_SOFT_DIRTY
> > orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
>
> Maybe it has to get even dirtier here to call swapcache_prepare too to
> cover the SYNC_IO case?
>
> >
> > Let me run test case [1] to check whether this ever happens. I guess I need to
> > hack kernel a bit to always add folio to swapcache even for SYNC IO.
>
> That will cause quite a performance regression I think. Good thing is,
> that's exactly the problem this series is solving by dropping the SYNC
> IO swapin path and never bypassing the swap cache, while improving the
> performance, eliminating things like this. One more reason to justify
> the approach :)
I attempted to reproduce the scenario where a folio is added to the swapcache
after filemap_get_folio() returns NULL but before move_swap_pte()
moves the swap PTE
using non-synchronized I/O. Technically, this seems possible; however,
I was unable
to reproduce it, likely because the time window between swapin_readahead and
taking the page table lock within do_swap_page() is too short.
Upon reconsideration, even if this situation occurs, it is not an issue because
move_swap_pte() obtains both the source and destination page table locks,
and *clears* the source PTE. Thus, when do_swap_page() subsequently acquires
the source page table lock for src, it cannot map the new swapcache folio
to the PTE since pte_same will return false.
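For reference, the recheck in do_swap_page() that this relies on is roughly
the following (the exact code may differ slightly between kernel versions):

	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
				       &vmf->ptl);
	if (unlikely(!vmf->pte ||
		     !pte_same(ptep_get(vmf->pte), vmf->orig_pte)))
		goto out_nomap;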
>
> >
> > [1] https://lore.kernel.org/linux-mm/20250219112519.92853-1-21cnbao@gmail.com/
>
> I'll try this too.
>
> >
> > >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > Also, -EBUSY is somehow incorrect error code.
> > > > >
> > > > > Yes, thanks, I'll use EAGAIN here just like move_swap_pte.
> > > > >
> > > > >
> > > > > >
> > > > > > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
> > > > > > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > > > > dst_ptl, src_ptl, src_folio);
> > > > > > >
> > > > > >
> > > >
> >
Thanks
Barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-23 2:29 ` Barry Song
@ 2025-05-23 20:01 ` Kairui Song
2025-05-27 7:58 ` Barry Song
0 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-23 20:01 UTC (permalink / raw)
To: Barry Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> > >
> > > On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > >
> > > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > >
> > > > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > > >
> > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > > --- a/mm/userfaultfd.c
> > > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > > goto retry;
> > > > > > > > }
> > > > > > > > }
> > > > > > > > + if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > > + err = -EBUSY;
> > > > > > > > + goto out;
> > > > > > > > + }
> > > > > > >
> > > > > > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > > > > > are stable:
> > > > > > >
> > > > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > > dst_pmd, dst_pmdval)) {
> > > > > > > double_pt_unlock(dst_ptl, src_ptl);
> > > > > > > return -EAGAIN;
> > > > > > > }
> > > > > >
> > > > > > The tricky part is when swap_cache_get_folio returns the folio, both
> > > > > > folio and ptes are unlocked. So is it possible that someone else
> > > > > > swapped in the entries, then swapped them out again using the same
> > > > > > entries?
> > > > > >
> > > > > > The folio will be different here but PTEs are still the same value to
> > > > > > they will pass the is_pte_pages_stable check, we previously saw
> > > > > > similar races with anon fault or shmem. I think more strict checking
> > > > > > won't hurt here.
> > > > >
> > > > > This doesn't seem to be the same case as the one you fixed in
> > > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > > case, there was no one hitting the swap cache, and you used
> > > > > swap_prepare() to set up the cache to fix the issue.
> > > > >
> > > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > > not guard against that case either.
> > > >
> > > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > > above. Thanks for catching this!
> > > >
> > > > > But I suspect we won't have a problem, since we're not swapping in —
> > > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > > move the PTEs.
> > > >
> > > > My concern is that a parallel swapin / swapout could result in the
> > > > folio to be a completely irrelevant or invalid folio.
> > > >
> > > > It's not about the dst, but in the move src side, something like:
> > > >
> > > > CPU1 CPU2
> > > > move_pages_pte
> > > > folio = swap_cache_get_folio(...)
> > > > | Got folio A here
> > > > move_swap_pte
> > > > <swapin src_pte, using folio A>
> > > > <swapout src_pte, put folio A>
> > > > | Now folio A is no longer valid.
> > > > | It's very unlikely but here SWAP
> > > > | could reuse the same entry as above.
> > >
> > >
> > > swap_cache_get_folio() does increment the folio's refcount, but it seems this
> > > doesn't prevent do_swap_page() from freeing the swap entry after swapping
> > > in src_pte with folio A, if it's a read fault.
> > > for write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > > will be false:
> > >
> > > static inline bool should_try_to_free_swap(struct folio *folio,
> > > struct vm_area_struct *vma,
> > > unsigned int fault_flags)
> > > {
> > > ...
> > >
> > > /*
> > > * If we want to map a page that's in the swapcache writable, we
> > > * have to detect via the refcount if we're really the exclusive
> > > * user. Try freeing the swapcache to get rid of the swapcache
> > > * reference only in case it's likely that we'll be the exlusive user.
> > > */
> > > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > > folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > }
> > >
> > > and for swapout, __removing_mapping does check refcount as well:
> > >
> > > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > > bool reclaimed, struct mem_cgroup *target_memcg)
> > > {
> > > refcount = 1 + folio_nr_pages(folio);
> > > if (!folio_ref_freeze(folio, refcount))
> > > goto cannot_free;
> > >
> > > }
> > >
> > > However, since __remove_mapping() occurs after pageout(), it seems
> > > this also doesn't prevent swapout from allocating a new swap entry to
> > > fill src_pte.
> > >
> > > It seems your concern is valid—unless I'm missing something.
> > > Do you have a reproducer? If so, this will likely need a separate fix
> > > patch rather than being hidden in this patchset.
> >
> > Thanks for the analysis. I don't have a reproducer yet, I did some
> > local experiments and that seems possible, but the race window is so
> > tiny and it's very difficult to make the swap entry reuse to collide
> > with that, I'll try more but in theory this seems possible, or at
> > least looks very fragile.
> >
> > And yeah, let's patch the kernel first if that's a real issue.
> >
> > >
> > > > double_pt_lock
> > > > is_pte_pages_stable
> > > > | Passed because of entry reuse.
> > > > folio_move_anon_rmap(...)
> > > > | Moved invalid folio A.
> > > >
> > > > And could it be possible that the swap_cache_get_folio returns NULL
> > > > here, but later right before the double_pt_lock, a folio is added to
> > > > swap cache? Maybe we better check the swap cache after clear and
> > > > releasing dst lock, but before releasing src lock?
> > >
> > > It seems you're suggesting that a parallel swap-in allocates and adds
> > > a folio to the swap cache, but the PTE has not yet been updated from
> > > a swap entry to a present mapping?
> > >
> > > As long as do_swap_page() adds the folio to the swap cache
> > > before updating the PTE to present, this scenario seems possible.
> >
> > Yes, that's two kinds of problems here. I suspected there could be an
> > ABA problem while working on the series, but wasn't certain. And just
> > realised there could be another missed cache read here thanks to your
> > review and discussion :)
> >
> > >
> > > It seems we need to double-check:
> > >
> > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > index bc473ad21202..976053bd2bf1 100644
> > > --- a/mm/userfaultfd.c
> > > +++ b/mm/userfaultfd.c
> > > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm,
> > > struct vm_area_struct *dst_vma,
> > > if (src_folio) {
> > > folio_move_anon_rmap(src_folio, dst_vma);
> > > src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > + } else {
> > > + struct folio *folio =
> > > filemap_get_folio(swap_address_space(entry),
> > > + swap_cache_index(entry));
> > > + if (!IS_ERR_OR_NULL(folio)) {
> > > + double_pt_unlock(dst_ptl, src_ptl);
> > > + return -EAGAIN;
> > > + }
> > > }
> > > -
> > > orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > #ifdef CONFIG_MEM_SOFT_DIRTY
> > > orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
> >
> > Maybe it has to get even dirtier here to call swapcache_prepare too to
> > cover the SYNC_IO case?
> >
> > >
> > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> >
> > That will cause quite a performance regression I think. Good thing is,
> > that's exactly the problem this series is solving by dropping the SYNC
> > IO swapin path and never bypassing the swap cache, while improving the
> > performance, eliminating things like this. One more reason to justify
> > the approach :)
Hi Barry,
>
> I attempted to reproduce the scenario where a folio is added to the swapcache
> after filemap_get_folio() returns NULL but before move_swap_pte()
> moves the swap PTE
> using non-synchronized I/O. Technically, this seems possible; however,
> I was unable
> to reproduce it, likely because the time window between swapin_readahead and
> taking the page table lock within do_swap_page() is too short.
Thank you so much for trying this!
I have been trying to reproduce it too, and so far I haven't observed any
crash or warning. I added the following debug code:
static __always_inline
bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
@@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 	pmd_t dummy_pmdval;
 	pmd_t dst_pmdval;
 	struct folio *src_folio = NULL;
+	struct folio *tmp_folio = NULL;
 	struct anon_vma *src_anon_vma = NULL;
 	struct mmu_notifier_range range;
 	int err = 0;
@@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 		if (!src_folio)
 			folio = filemap_get_folio(swap_address_space(entry),
 						  swap_cache_index(entry));
+		udelay(get_random_u32_below(1000));
+		tmp_folio = filemap_get_folio(swap_address_space(entry),
+					      swap_cache_index(entry));
+		if (!IS_ERR_OR_NULL(tmp_folio)) {
+			if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
+				pr_err("UFFDIO_MOVE: UNSTABLE folio %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio, tmp_folio->swap.val);
+			}
+			folio_put(tmp_folio);
+		}
 		if (!IS_ERR_OR_NULL(folio)) {
 			if (folio_test_large(folio)) {
 				err = -EBUSY;
@@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
 			err = move_swap_pte(mm, dst_vma, dst_addr, src_addr, dst_pte, src_pte,
 					    orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
 					    dst_ptl, src_ptl, src_folio);
+		if (tmp_folio != folio && !err)
+			pr_err("UFFDIO_MOVE: UNSTABLE folio passed check: %lx -> %lx\n", folio, tmp_folio);
 		}
And I saw these two prints getting triggered like this (not a real issue
though, it just helps to understand the problem):
...
[ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
fffffdffc7ccac80 (51)
[ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
fffffdffc3435e00 (3b)
[ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
fffffdffc34ab8c0 (76)
[ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
fffffdffc34ab8c0 (6d)
[ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
fffffdffc34eda80 (32)
[ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
fffffdffc38cedc0 (b)
[ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
[ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
fffffdffc3465cc0 (21)
...
The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
fffffdffc3853540 (53)" prints worried me at first. On first look it seems the
folio is indeed freed completely from the swap cache after the first lookup,
so another swapout can reuse the entry. But as you mentioned,
__remove_mapping() won't release a folio if the refcount check fails, so
these folios must have been freed by folio_free_swap or
__try_to_reclaim_swap, which can happen in many places. However, these two
helpers won't free a folio from the swap cache if its swap count is not
zero. And the folio will either be swapped out (swap count non-zero) or
mapped (freeing it is fine: the PTE is non-swap, and another swapout will
still use the same folio).
So after more investigation and dumping the pages, it's actually the second
lookup (tmp_folio) that sees the entry being reused by another page table
entry, after the first folio is swapped back in and released. So the page
table check below will always fail, which is fine.
But this also proves the first lookup can see a completely irrelevant folio
too: if the src folio is swapped out, then swapped back in and freed,
another folio B could shortly be added to the swap cache, reusing the src
folio's old swap entry. Folio B is then seen by the lookup here and later
freed from the swap cache, and the src folio gets swapped out again, also
reusing the same entry. Then we have a problem: the PTE indeed looks
untouched, but we grabbed the wrong folio. Seems possible, if I'm not wrong:
Something like this:
CPU1 CPU2
move_pages_pte
entry = pte_to_swp_entry(orig_src_pte);
| Got Swap Entry S1 from src_pte
...
<swapin src_pte, using folio A>
<free folio A from swap cache freeing S1>
<someone else try swap out folio B >
<put folio B to swap cache using S1 >
...
folio = swap_cache_get_folio(S1)
| Got folio B here !!!
move_swap_pte
<free folio B from swap cache>
| Holding a reference doesn't pin the cache
| as we have demonstrated
<Swapout folio A also using S1>
double_pt_lock
is_pte_pages_stable
| Passed because of S1 is reused
folio_move_anon_rmap(...)
| Moved invalid folio B here !!!
But this is extremely hard to reproduce, even when trying to do it
deliberately...
So I think a "folio_swap_contains" or equivalent check here is a good thing
to have, to make this more robust and easier to understand. The check after
locking the folio has a very tiny overhead and can definitely ensure the
folio's swap entry is valid and stable.
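Something like this is what I have in mind, just as a sketch (the final
helper name and details may differ):

/* With the folio locked, check it still backs the swap entry from the PTE. */
static inline bool folio_swap_contains(struct folio *folio, swp_entry_t entry)
{
	pgoff_t offset = swp_offset(entry);
	pgoff_t start;

	if (!folio_test_swapcache(folio) ||
	    swp_type(folio->swap) != swp_type(entry))
		return false;

	start = swp_offset(folio->swap);
	return offset >= start && offset < start + folio_nr_pages(folio);
}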
The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00" print
here might seem problematic, but it's still not a real problem. That's the
case where the swapin in the src region happens after the lookup and before
the PTE lock. It will pass the PTE check without moving the folio. But the
folio is guaranteed to be a completely new folio here, because a folio can't
be added back to the page table without holding the PTE lock, and if that
happens the following PTE check will fail.
So I think we should patch the current kernel by only adding a
"folio_swap_contains" equivalent check here, and maybe more comments.
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-23 20:01 ` Kairui Song
@ 2025-05-27 7:58 ` Barry Song
2025-05-27 15:11 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Barry Song @ 2025-05-27 7:58 UTC (permalink / raw)
To: Kairui Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> > > >
> > > > On Wed, May 21, 2025 at 7:10 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > On Tue, May 20, 2025 at 12:41 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > On Tue, May 20, 2025 at 3:31 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > > >
> > > > > > > On Mon, May 19, 2025 at 12:38 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > From: Kairui Song <kasong@tencent.com>
> > > > > > > >
> > > > > > > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > > > > > > index e5a0db7f3331..5b4f01aecf35 100644
> > > > > > > > > --- a/mm/userfaultfd.c
> > > > > > > > > +++ b/mm/userfaultfd.c
> > > > > > > > > @@ -1409,6 +1409,10 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > > > > > > goto retry;
> > > > > > > > > }
> > > > > > > > > }
> > > > > > > > > + if (!folio_swap_contains(src_folio, entry)) {
> > > > > > > > > + err = -EBUSY;
> > > > > > > > > + goto out;
> > > > > > > > > + }
> > > > > > > >
> > > > > > > > It seems we don't need this. In move_swap_pte(), we have been checking pte pages
> > > > > > > > are stable:
> > > > > > > >
> > > > > > > > if (!is_pte_pages_stable(dst_pte, src_pte, orig_dst_pte, orig_src_pte,
> > > > > > > > dst_pmd, dst_pmdval)) {
> > > > > > > > double_pt_unlock(dst_ptl, src_ptl);
> > > > > > > > return -EAGAIN;
> > > > > > > > }
> > > > > > >
> > > > > > > The tricky part is when swap_cache_get_folio returns the folio, both
> > > > > > > folio and ptes are unlocked. So is it possible that someone else
> > > > > > > swapped in the entries, then swapped them out again using the same
> > > > > > > entries?
> > > > > > >
> > > > > > > The folio will be different here but PTEs are still the same value to
> > > > > > > they will pass the is_pte_pages_stable check, we previously saw
> > > > > > > similar races with anon fault or shmem. I think more strict checking
> > > > > > > won't hurt here.
> > > > > >
> > > > > > This doesn't seem to be the same case as the one you fixed in
> > > > > > do_swap_page(). Here, we're hitting the swap cache, whereas in that
> > > > > > case, there was no one hitting the swap cache, and you used
> > > > > > swap_prepare() to set up the cache to fix the issue.
> > > > > >
> > > > > > By the way, if we're not hitting the swap cache, src_folio will be
> > > > > > NULL. Also, it seems that folio_swap_contains(src_folio, entry) does
> > > > > > not guard against that case either.
> > > > >
> > > > > Ah, that's true, it should be moved inside the if (folio) {...} block
> > > > > above. Thanks for catching this!
> > > > >
> > > > > > But I suspect we won't have a problem, since we're not swapping in —
> > > > > > we didn't read any stale data, right? Swap-in will only occur after we
> > > > > > move the PTEs.
> > > > >
> > > > > My concern is that a parallel swapin / swapout could result in the
> > > > > folio to be a completely irrelevant or invalid folio.
> > > > >
> > > > > It's not about the dst, but in the move src side, something like:
> > > > >
> > > > > CPU1 CPU2
> > > > > move_pages_pte
> > > > > folio = swap_cache_get_folio(...)
> > > > > | Got folio A here
> > > > > move_swap_pte
> > > > > <swapin src_pte, using folio A>
> > > > > <swapout src_pte, put folio A>
> > > > > | Now folio A is no longer valid.
> > > > > | It's very unlikely but here SWAP
> > > > > | could reuse the same entry as above.
> > > >
> > > >
> > > > swap_cache_get_folio() does increment the folio's refcount, but it seems this
> > > > doesn't prevent do_swap_page() from freeing the swap entry after swapping
> > > > in src_pte with folio A, if it's a read fault.
> > > > for write fault, folio_ref_count(folio) == (1 + folio_nr_pages(folio))
> > > > will be false:
> > > >
> > > > static inline bool should_try_to_free_swap(struct folio *folio,
> > > > struct vm_area_struct *vma,
> > > > unsigned int fault_flags)
> > > > {
> > > > ...
> > > >
> > > > /*
> > > > * If we want to map a page that's in the swapcache writable, we
> > > > * have to detect via the refcount if we're really the exclusive
> > > > * user. Try freeing the swapcache to get rid of the swapcache
> > > > * reference only in case it's likely that we'll be the exlusive user.
> > > > */
> > > > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > > > folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > > }
> > > >
> > > > and for swapout, __removing_mapping does check refcount as well:
> > > >
> > > > static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > > > bool reclaimed, struct mem_cgroup *target_memcg)
> > > > {
> > > > refcount = 1 + folio_nr_pages(folio);
> > > > if (!folio_ref_freeze(folio, refcount))
> > > > goto cannot_free;
> > > >
> > > > }
> > > >
> > > > However, since __remove_mapping() occurs after pageout(), it seems
> > > > this also doesn't prevent swapout from allocating a new swap entry to
> > > > fill src_pte.
> > > >
> > > > It seems your concern is valid—unless I'm missing something.
> > > > Do you have a reproducer? If so, this will likely need a separate fix
> > > > patch rather than being hidden in this patchset.
> > >
> > > Thanks for the analysis. I don't have a reproducer yet, I did some
> > > local experiments and that seems possible, but the race window is so
> > > tiny and it's very difficult to make the swap entry reuse to collide
> > > with that, I'll try more but in theory this seems possible, or at
> > > least looks very fragile.
> > >
> > > And yeah, let's patch the kernel first if that's a real issue.
> > >
> > > >
> > > > > double_pt_lock
> > > > > is_pte_pages_stable
> > > > > | Passed because of entry reuse.
> > > > > folio_move_anon_rmap(...)
> > > > > | Moved invalid folio A.
> > > > >
> > > > > And could it be possible that the swap_cache_get_folio returns NULL
> > > > > here, but later right before the double_pt_lock, a folio is added to
> > > > > swap cache? Maybe we better check the swap cache after clear and
> > > > > releasing dst lock, but before releasing src lock?
> > > >
> > > > It seems you're suggesting that a parallel swap-in allocates and adds
> > > > a folio to the swap cache, but the PTE has not yet been updated from
> > > > a swap entry to a present mapping?
> > > >
> > > > As long as do_swap_page() adds the folio to the swap cache
> > > > before updating the PTE to present, this scenario seems possible.
> > >
> > > Yes, that's two kinds of problems here. I suspected there could be an
> > > ABA problem while working on the series, but wasn't certain. And just
> > > realised there could be another missed cache read here thanks to your
> > > review and discussion :)
> > >
> > > >
> > > > It seems we need to double-check:
> > > >
> > > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > > > index bc473ad21202..976053bd2bf1 100644
> > > > --- a/mm/userfaultfd.c
> > > > +++ b/mm/userfaultfd.c
> > > > @@ -1102,8 +1102,14 @@ static int move_swap_pte(struct mm_struct *mm,
> > > > struct vm_area_struct *dst_vma,
> > > > if (src_folio) {
> > > > folio_move_anon_rmap(src_folio, dst_vma);
> > > > src_folio->index = linear_page_index(dst_vma, dst_addr);
> > > > + } else {
> > > > + struct folio *folio =
> > > > filemap_get_folio(swap_address_space(entry),
> > > > + swap_cache_index(entry));
> > > > + if (!IS_ERR_OR_NULL(folio)) {
> > > > + double_pt_unlock(dst_ptl, src_ptl);
> > > > + return -EAGAIN;
> > > > + }
> > > > }
> > > > -
> > > > orig_src_pte = ptep_get_and_clear(mm, src_addr, src_pte);
> > > > #ifdef CONFIG_MEM_SOFT_DIRTY
> > > > orig_src_pte = pte_swp_mksoft_dirty(orig_src_pte);
> > >
> > > Maybe it has to get even dirtier here to call swapcache_prepare too to
> > > cover the SYNC_IO case?
> > >
> > > >
> > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > >
> > > That will cause quite a performance regression I think. Good thing is,
> > > that's exactly the problem this series is solving by dropping the SYNC
> > > IO swapin path and never bypassing the swap cache, while improving the
> > > performance, eliminating things like this. One more reason to justify
> > > the approach :)
>
> Hi Barry,
>
> >
> > I attempted to reproduce the scenario where a folio is added to the swapcache
> > after filemap_get_folio() returns NULL but before move_swap_pte()
> > moves the swap PTE
> > using non-synchronized I/O. Technically, this seems possible; however,
> > I was unable
> > to reproduce it, likely because the time window between swapin_readahead and
> > taking the page table lock within do_swap_page() is too short.
>
> Thank you so much for trying this!
>
> I have been trying to reproduce it too, and so far I didn't observe
> any crash or warn. I added following debug code:
>
> static __always_inline
> bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
> pmd_t dummy_pmdval;
> pmd_t dst_pmdval;
> struct folio *src_folio = NULL;
> + struct folio *tmp_folio = NULL;
> struct anon_vma *src_anon_vma = NULL;
> struct mmu_notifier_range range;
> int err = 0;
> @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
> if (!src_folio)
> folio = filemap_get_folio(swap_address_space(entry),
> swap_cache_index(entry));
> + udelay(get_random_u32_below(1000));
> + tmp_folio = filemap_get_folio(swap_address_space(entry),
> + swap_cache_index(entry));
> + if (!IS_ERR_OR_NULL(tmp_folio)) {
> + if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> + pr_err("UFFDIO_MOVE: UNSTABLE folio
> %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> tmp_folio->swap.val);
> + }
> + folio_put(tmp_folio);
> + }
> if (!IS_ERR_OR_NULL(folio)) {
> if (folio_test_large(folio)) {
> err = -EBUSY;
> @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
> err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> dst_pte, src_pte,
> orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> dst_ptl, src_ptl, src_folio);
> + if (tmp_folio != folio && !err)
> + pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> check: %lx -> %lx\n", folio, tmp_folio);
> }
>
> And I saw these two prints are getting triggered like this (not a real
> issue though, just help to understand the problem)
> ...
> [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> fffffdffc7ccac80 (51)
> [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> fffffdffc3435e00 (3b)
> [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> fffffdffc34ab8c0 (76)
> [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> fffffdffc34ab8c0 (6d)
> [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> fffffdffc34eda80 (32)
> [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> fffffdffc38cedc0 (b)
> [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> fffffdffc3465cc0 (21)
> ...
>
> The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> fffffdffc3853540 (53)" worries me at first. On first look it seems the
> folio is indeed freed completely from the swap cache after the first
> lookup, so another swapout can reuse the entry. But as you mentioned
> __remove_mapping won't release a folio if the refcount check fails, so
> they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> are many places that can happen. But these two helpers won't free a
> folio from swap cache if its swap count is not zero. And the folio
> will either be swapped out (swap count non zero), or mapped (freeing
> it is fine, PTE is non_swap, and another swapout will still use the
> same folio).
>
> So after more investigation and dumping the pages, it's actually the
> second lookup (tmp_folio) seeing the entry being reused by another
> page table entry, after the first folio is swapped back and released.
> So the page table check below will always fail just fine.
>
> But this also proves the first look up can see a completely irrelevant
> folio too: If the src folio is swapped out, but got swapped back and
> freed, then another folio B shortly got added to swap cache reuse the
> src folio's old swap entry, then the folio B got seen by the look up
> here and get freed from swap cache, then src folio got swapped out
> again also reusing the same entry, then we have a problem as PTE seems
> untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> not wrong:
>
> Something like this:
> CPU1 CPU2
> move_pages_pte
> entry = pte_to_swp_entry(orig_src_pte);
> | Got Swap Entry S1 from src_pte
> ...
> <swapin src_pte, using folio A>
I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
sure where folio B comes from in the statement `<someone else tried to
swap out folio B>`.
If that assumption is correct, and folio A is still in the swapcache,
how could someone swap in folio B without hitting folio A? That would
suggest folio A must have been removed from the swapcache earlier—right?
> <free folio A from swap cache freeing S1>
> <someone else try swap out folio B >
> <put folio B to swap cache using S1 >
> ...
> folio = swap_cache_get_folio(S1)
> | Got folio B here !!!
> move_swap_pte
> <free folio B from swap cache>
> | Holding a reference doesn't pin the cache
> | as we have demonstrated
> <Swapout folio A also using S1>
> double_pt_lock
> is_pte_pages_stable
> | Passed because of S1 is reused
> folio_move_anon_rmap(...)
> | Moved invalid folio B here !!!
>
> But this is extremely hard to reproduce though, even if doing it
> deliberately...
>
> So I think a "folio_swap_contains" or equivalent check here is a good
> thing to have, to make it more robust and easier to understand. The
> checking after locking a folio has very tiny overhead and can
> definitely ensure the folio's swap entry is valid and stable.
>
> The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> here might seem problematic, but it's still not a real problem. That's
> the case where the swapin in src region happens after the lookup, and
> before the PTE lock. It will pass the PTE check without moving the
> folio. But the folio is guaranteed to be a completely new folio here
> because the folio can't be added back to the page table without
> holding the PTE lock, and if that happens the following PTE check here
> will fail.
>
> So I think we should patch the current kernel only adding a
> "folio_swap_contains" equivalent check here, and maybe more comments,
> how do you think?
The description appears to have some inconsistencies.
Would you mind rephrasing it?
Thanks
barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-27 7:58 ` Barry Song
@ 2025-05-27 15:11 ` Kairui Song
2025-05-30 8:49 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-27 15:11 UTC (permalink / raw)
To: Barry Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > >
> > > > That will cause quite a performance regression I think. Good thing is,
> > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > performance, eliminating things like this. One more reason to justify
> > > > the approach :)
> >
> > Hi Barry,
> >
> > >
> > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > moves the swap PTE
> > > using non-synchronized I/O. Technically, this seems possible; however,
> > > I was unable
> > > to reproduce it, likely because the time window between swapin_readahead and
> > > taking the page table lock within do_swap_page() is too short.
> >
> > Thank you so much for trying this!
> >
> > I have been trying to reproduce it too, and so far I didn't observe
> > any crash or warn. I added following debug code:
> >
> > static __always_inline
> > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > pmd_t dummy_pmdval;
> > pmd_t dst_pmdval;
> > struct folio *src_folio = NULL;
> > + struct folio *tmp_folio = NULL;
> > struct anon_vma *src_anon_vma = NULL;
> > struct mmu_notifier_range range;
> > int err = 0;
> > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > if (!src_folio)
> > folio = filemap_get_folio(swap_address_space(entry),
> > swap_cache_index(entry));
> > + udelay(get_random_u32_below(1000));
> > + tmp_folio = filemap_get_folio(swap_address_space(entry),
> > + swap_cache_index(entry));
> > + if (!IS_ERR_OR_NULL(tmp_folio)) {
> > + if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > + pr_err("UFFDIO_MOVE: UNSTABLE folio
> > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > tmp_folio->swap.val);
> > + }
> > + folio_put(tmp_folio);
> > + }
> > if (!IS_ERR_OR_NULL(folio)) {
> > if (folio_test_large(folio)) {
> > err = -EBUSY;
> > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > pmd_t *dst_pmd, pmd_t *src_pmd,
> > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > dst_pte, src_pte,
> > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > dst_ptl, src_ptl, src_folio);
> > + if (tmp_folio != folio && !err)
> > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > check: %lx -> %lx\n", folio, tmp_folio);
> > }
> >
> > And I saw these two prints are getting triggered like this (not a real
> > issue though, just help to understand the problem)
> > ...
> > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > fffffdffc7ccac80 (51)
> > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > fffffdffc3435e00 (3b)
> > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > fffffdffc34ab8c0 (76)
> > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > fffffdffc34ab8c0 (6d)
> > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > fffffdffc34eda80 (32)
> > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > fffffdffc38cedc0 (b)
> > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > fffffdffc3465cc0 (21)
> > ...
> >
> > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > folio is indeed freed completely from the swap cache after the first
> > lookup, so another swapout can reuse the entry. But as you mentioned
> > __remove_mapping won't release a folio if the refcount check fails, so
> > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > are many places that can happen. But these two helpers won't free a
> > folio from swap cache if its swap count is not zero. And the folio
> > will either be swapped out (swap count non zero), or mapped (freeing
> > it is fine, PTE is non_swap, and another swapout will still use the
> > same folio).
> >
> > So after more investigation and dumping the pages, it's actually the
> > second lookup (tmp_folio) seeing the entry being reused by another
> > page table entry, after the first folio is swapped back and released.
> > So the page table check below will always fail just fine.
> >
> > But this also proves the first look up can see a completely irrelevant
> > folio too: If the src folio is swapped out, but got swapped back and
> > freed, then another folio B shortly got added to swap cache reuse the
> > src folio's old swap entry, then the folio B got seen by the look up
> > here and get freed from swap cache, then src folio got swapped out
> > again also reusing the same entry, then we have a problem as PTE seems
> > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > not wrong:
> >
> > Something like this:
> > CPU1 CPU2
> > move_pages_pte
> > entry = pte_to_swp_entry(orig_src_pte);
> > | Got Swap Entry S1 from src_pte
> > ...
> > <swapin src_pte, using folio A>
>
> I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> sure where folio B comes from in the statement `<someone else tried to
> swap out folio B>`.
>
> If that assumption is correct, and folio A is still in the swapcache,
> how could someone swap in folio B without hitting folio A? That would
> suggest folio A must have been removed from the swapcache earlier—right?
>
> > <free folio A from swap cache freeing S1>
> > <someone else try swap out folio B >
Sorry, my bad, I think I made people think folio B is related to
src_pte at this point. What I actually mean is: another random
folio B, unrelated to src_pte, could get swapped out, reusing the
swap entry S1.
> > <put folio B to swap cache using S1 >
> > ...
> > folio = swap_cache_get_folio(S1)
> > | Got folio B here !!!
> > move_swap_pte
> > <free folio B from swap cache>
> > | Holding a reference doesn't pin the cache
> > | as we have demonstrated
> > <Swapout folio A also using S1>
> > double_pt_lock
> > is_pte_pages_stable
> > | Passed because of S1 is reused
> > folio_move_anon_rmap(...)
> > | Moved invalid folio B here !!!
> >
> > But this is extremely hard to reproduce though, even if doing it
> > deliberately...
> >
> > So I think a "folio_swap_contains" or equivalent check here is a good
> > thing to have, to make it more robust and easier to understand. The
> > checking after locking a folio has very tiny overhead and can
> > definitely ensure the folio's swap entry is valid and stable.
> >
> > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > here might seem problematic, but it's still not a real problem. That's
> > the case where the swapin in src region happens after the lookup, and
> > before the PTE lock. It will pass the PTE check without moving the
> > folio. But the folio is guaranteed to be a completely new folio here
> > because the folio can't be added back to the page table without
> > holding the PTE lock, and if that happens the following PTE check here
> > will fail.
> >
> > So I think we should patch the current kernel only adding a
> > "folio_swap_contains" equivalent check here, and maybe more comments,
> > how do you think?
>
> The description appears to have some inconsistencies.
> Would you mind rephrasing it?
Yeah, let's set aside the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
fffffdffc385fb00" part first, as both of us have come to the
conclusion that "filemap_get_folio() returns NULL before
move_swap_pte, but a folio was added to swap cache" is OK, and this
output only proves that it happens.
So the problematic race is:
Here move_pages_pte is moving src_pte to dst_pte, and it begins with
src_pte == swap entry S1, and S1 isn't cached.
CPU1 CPU2
move_pages_pte()
entry = pte_to_swp_entry(orig_src_pte);
| src_pte is absent, and got entry == S1
... < Somehow interrupted> ...
<swapin src_pte, using folio A>
| folio A is just a newly allocated folio
| for resolving the swap in fault.
<free folio A from swap cache freeing S1>
| swap in fault is resolved, src_pte
| now points to folio A, so folio A
| can get freed just fine.
| And now S1 is free to be used
| by anyone.
<someone else tries to swap out another folio B>
| Folio B is a completely unrelated
| folio swapped out by random process.
| (has nothing to do with src_pte)
| But S1 is freed so it may use S1
| as its swap entry.
<put folio B to swap cache with index S1 >
...
folio = filemap_get_folio(S1)
| The lookup is using S1, so it
| got folio B here !!!
... < Somehow interrupted> ...
<free folio B from swap cache>
| Folio B could fail to be swapped out
| or get swapped in again, so it can
| be freed by folio_free_swap or
| swap cache reclaim.
| CPU1 is holding a reference but it
| doesn't pin the swap cache folio
| as I have demonstrated with the
| test C program previously.
| Now S1 is free to be used again.
<Swapout src_pte again using S1>
| Nothing blocks this from happening
| The swapout is still using folio A,
| and src_pte == S1.
folio_trylock(folio)
move_swap_pte
double_pt_lock
is_pte_pages_stable
| Passed because S1 is reused, so src_pte == S1.
folio_move_anon_rmap(...)
| Moved invalid folio B here !!!
It's a long and complex one; I don't think it's practically likely
to happen in reality, but in theory it's doable, once in a million maybe...
Still, we have to fix it, or did I get anything wrong here?
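For clarity, the "folio_swap_contains" style check I keep mentioning
would boil down to something like the sketch below, and it is only
meaningful once the folio has been locked (the helper name and exact
form in the series may differ, this is just to illustrate the idea):

/* Illustrative sketch only, not the actual helper from the series */
static inline bool folio_matches_swap_entry(struct folio *folio,
                                            swp_entry_t entry)
{
        /* The folio must be locked so the result is stable. */
        VM_WARN_ON_ONCE(!folio_test_locked(folio));

        /*
         * If the folio was reclaimed from the swap cache and the entry
         * got reused (the ABA case above), the folio is either no longer
         * a swap cache folio or now belongs to a different entry, so
         * this check fails and the caller should retry.
         */
        return folio_test_swapcache(folio) && folio->swap.val == entry.val;
}

With such a check done after locking the folio, combined with the
existing is_pte_pages_stable check under the PTE lock, move_pages_pte
can simply bail out and retry instead of moving a stale folio.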
>
> Thanks
> barry
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-27 15:11 ` Kairui Song
@ 2025-05-30 8:49 ` Kairui Song
2025-05-30 19:24 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-05-30 8:49 UTC (permalink / raw)
To: Barry Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
On Tue, May 27, 2025 at 11:11 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > >
> > > > > Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> > > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > > >
> > > > > That will cause quite a performance regression I think. Good thing is,
> > > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > > performance, eliminating things like this. One more reason to justify
> > > > > the approach :)
> > >
> > > Hi Barry,
> > >
> > > >
> > > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > > moves the swap PTE
> > > > using non-synchronized I/O. Technically, this seems possible; however,
> > > > I was unable
> > > > to reproduce it, likely because the time window between swapin_readahead and
> > > > taking the page table lock within do_swap_page() is too short.
> > >
> > > Thank you so much for trying this!
> > >
> > > I have been trying to reproduce it too, and so far I didn't observe
> > > any crash or warn. I added following debug code:
> > >
> > > static __always_inline
> > > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > pmd_t dummy_pmdval;
> > > pmd_t dst_pmdval;
> > > struct folio *src_folio = NULL;
> > > + struct folio *tmp_folio = NULL;
> > > struct anon_vma *src_anon_vma = NULL;
> > > struct mmu_notifier_range range;
> > > int err = 0;
> > > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > if (!src_folio)
> > > folio = filemap_get_folio(swap_address_space(entry),
> > > swap_cache_index(entry));
> > > + udelay(get_random_u32_below(1000));
> > > + tmp_folio = filemap_get_folio(swap_address_space(entry),
> > > + swap_cache_index(entry));
> > > + if (!IS_ERR_OR_NULL(tmp_folio)) {
> > > + if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > > + pr_err("UFFDIO_MOVE: UNSTABLE folio
> > > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > > tmp_folio->swap.val);
> > > + }
> > > + folio_put(tmp_folio);
> > > + }
> > > if (!IS_ERR_OR_NULL(folio)) {
> > > if (folio_test_large(folio)) {
> > > err = -EBUSY;
> > > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > > dst_pte, src_pte,
> > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > dst_ptl, src_ptl, src_folio);
> > > + if (tmp_folio != folio && !err)
> > > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > > check: %lx -> %lx\n", folio, tmp_folio);
> > > }
> > >
> > > And I saw these two prints are getting triggered like this (not a real
> > > issue though, just help to understand the problem)
> > > ...
> > > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > > fffffdffc7ccac80 (51)
> > > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > > fffffdffc3435e00 (3b)
> > > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > > fffffdffc34ab8c0 (76)
> > > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > > fffffdffc34ab8c0 (6d)
> > > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > > fffffdffc34eda80 (32)
> > > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > > fffffdffc38cedc0 (b)
> > > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > > fffffdffc3465cc0 (21)
> > > ...
> > >
> > > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > > folio is indeed freed completely from the swap cache after the first
> > > lookup, so another swapout can reuse the entry. But as you mentioned
> > > __remove_mapping won't release a folio if the refcount check fails, so
> > > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > > are many places that can happen. But these two helpers won't free a
> > > folio from swap cache if its swap count is not zero. And the folio
> > > will either be swapped out (swap count non zero), or mapped (freeing
> > > it is fine, PTE is non_swap, and another swapout will still use the
> > > same folio).
> > >
> > > So after more investigation and dumping the pages, it's actually the
> > > second lookup (tmp_folio) seeing the entry being reused by another
> > > page table entry, after the first folio is swapped back and released.
> > > So the page table check below will always fail just fine.
> > >
> > > But this also proves the first look up can see a completely irrelevant
> > > folio too: If the src folio is swapped out, but got swapped back and
> > > freed, then another folio B shortly got added to swap cache reuse the
> > > src folio's old swap entry, then the folio B got seen by the look up
> > > here and get freed from swap cache, then src folio got swapped out
> > > again also reusing the same entry, then we have a problem as PTE seems
> > > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > > not wrong:
> > >
> > > Something like this:
> > > CPU1 CPU2
> > > move_pages_pte
> > > entry = pte_to_swp_entry(orig_src_pte);
> > > | Got Swap Entry S1 from src_pte
> > > ...
> > > <swapin src_pte, using folio A>
> >
> > I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> > sure where folio B comes from in the statement `<someone else tried to
> > swap out folio B>`.
> >
> > If that assumption is correct, and folio A is still in the swapcache,
> > how could someone swap in folio B without hitting folio A? That would
> > suggest folio A must have been removed from the swapcache earlier—right?
> >
> > > <free folio A from swap cache freeing S1>
> > > <someone else try swap out folio B >
>
> Sorry my bad, I think I made people think folio B is related to
> src_pte at this point. What I actually mean is that: Another random
> folio B, unrelated to src_pte, could got swapped out, and using the
> swap entry S1.
>
> > > <put folio B to swap cache using S1 >
> > > ...
> > > folio = swap_cache_get_folio(S1)
> > > | Got folio B here !!!
> > > move_swap_pte
> > > <free folio B from swap cache>
> > > | Holding a reference doesn't pin the cache
> > > | as we have demonstrated
> > > <Swapout folio A also using S1>
> > > double_pt_lock
> > > is_pte_pages_stable
> > > | Passed because of S1 is reused
> > > folio_move_anon_rmap(...)
> > > | Moved invalid folio B here !!!
> > >
> > > But this is extremely hard to reproduce though, even if doing it
> > > deliberately...
> > >
> > > So I think a "folio_swap_contains" or equivalent check here is a good
> > > thing to have, to make it more robust and easier to understand. The
> > > checking after locking a folio has very tiny overhead and can
> > > definitely ensure the folio's swap entry is valid and stable.
> > >
> > > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > > here might seem problematic, but it's still not a real problem. That's
> > > the case where the swapin in src region happens after the lookup, and
> > > before the PTE lock. It will pass the PTE check without moving the
> > > folio. But the folio is guaranteed to be a completely new folio here
> > > because the folio can't be added back to the page table without
> > > holding the PTE lock, and if that happens the following PTE check here
> > > will fail.
> > >
> > > So I think we should patch the current kernel only adding a
> > > "folio_swap_contains" equivalent check here, and maybe more comments,
> > > how do you think?
> >
> > The description appears to have some inconsistencies.
> > Would you mind rephrasing it?
>
> Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
> fffffdffc385fb00" part first, as both you and me have come into a
> conclusion that "filemap_get_folio() returns NULL before
> move_swap_pte, but a folio was added to swap cache" is OK, and this
> output only proves that happens.
>
> So the problematic race is:
>
> Here move_pages_pte is moving src_pte to dst_pte, and it begins with
> src_pte == swap entry S1, and S1 isn't cached.
>
> CPU1 CPU2
> move_pages_pte()
> entry = pte_to_swp_entry(orig_src_pte);
> | src_pte is absent, and got entry == S1
> ... < Somehow interrupted> ...
> <swapin src_pte, using folio A>
> | folio A is just a new allocated folio
> | for resolving the swap in fault.
> <free folio A from swap cache freeing S1>
> | swap in fault is resolved, src_pte
> | now points to folio A, so folio A
> | can get freed just fine.
> | And now S1 is free to be used
> | by anyone.
> <someone else try swap out another folio B >
> | Folio B is a completely unrelated
> | folio swapped out by random process.
> | (has nothing to do with src_pte)
> | But S1 is freed so it may use S1
> | as its swap entry.
> <put folio B to swap cache with index S1 >
> ...
> folio = filemap_get_folio(S1)
> | The lookup is using S1, so it
> | got folio B here !!!
> ... < Somehow interrupted> ...
> <free folio B from swap cache>
> | Folio B could fail to be swapped out
> | or got swapped in again, so it can
> | be freed by folio_free_swap or
> | swap cache reclaim.
> | CPU1 is holding a reference but it
> | doesn't pin the swap cache folio
> | as I have demonstrated with the
> | test C program previously.
> | New S1 is free to be used again.
> <Swapout src_pte again using S1>
> | No thing blocks this from happening
> | The swapout is still using folio A,
> | and src_pte == S1.
> folio_trylock(folio)
> move_swap_pte
> double_pt_lock
> is_pte_pages_stable
> | Passed because of S1 is reused so src_pte == S1.
> folio_move_anon_rmap(...)
> | Moved invalid folio B here !!!
>
> It's a long and complex one, I don't think it's practically possible
> to happen in reality but in theory doable, once in a million maybe...
> Still we have to fix it, or did I got anything wrong here?
Hi Barry,
I managed to reproduce this issue by hacking the kernel a bit (only
adding a delay to increase the race window, and adding a WARN to
indicate the problem):
1. Apply the following patch to the kernel:
===
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bc473ad21202..1d710adf9839 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -15,6 +15,7 @@
#include <linux/mmu_notifier.h>
#include <linux/hugetlb.h>
#include <linux/shmem_fs.h>
+#include <linux/delay.h>
#include <asm/tlbflush.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -1100,6 +1101,10 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
* occur and hit the swapcache after moving the PTE.
*/
if (src_folio) {
+ if (WARN_ON(src_folio->swap.val != pte_to_swp_entry(orig_src_pte).val))
+ pr_err("Moving folio %lx (folio->swap = %lx), orig_src_pte = %lx\n",
+ (unsigned long)src_folio, src_folio->swap.val,
+ pte_to_swp_entry(orig_src_pte).val);
folio_move_anon_rmap(src_folio, dst_vma);
src_folio->index = linear_page_index(dst_vma, dst_addr);
}
@@ -1388,9 +1393,13 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
* folios in the swapcache. This issue needs to be resolved
* separately to allow proper handling.
*/
+ pr_err("DEBUG: Will do the lookup using entry %lx,
wait 3s...\n", entry.val);
+ mdelay(1000 * 3);
if (!src_folio)
folio = filemap_get_folio(swap_address_space(entry),
swap_cache_index(entry));
+ pr_err("DEBUG: Got folio value %lx, wait 3s...\n",
(unsigned long)folio);
+ mdelay(1000 * 3);
if (!IS_ERR_OR_NULL(folio)) {
if (folio_test_large(folio)) {
err = -EBUSY;
2. Save the following program in userspace (error checking omitted to
keep the code simple):
===
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <poll.h>
#include <errno.h>
#define PAGE_SIZE 4096
/* Need to consume all slots so define the swap device size here */
#define SWAP_DEVICE_SIZE (PAGE_SIZE * 1024 - 1)
char *src, *race, *dst, *place_holder;
int uffd;
void read_in(char *p) {
/* This test program initializes memory with 0xAA to bypass zeromap */
while (*((volatile char*)p) != 0xAA);
}
void *reader_thread(void *arg) {
/* Test requires kernel to wait upon uffd move */
read_in(dst);
return NULL;
}
void *fault_handler_thread(void *arg) {
int ret;
struct uffd_msg msg;
struct uffdio_move move;
struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
while (1) {
poll(&pollfd, 1, -1);
read(uffd, &msg, sizeof(msg));
move.src = (unsigned long)src + (msg.arg.pagefault.address -
(unsigned long)dst);
move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
move.len = PAGE_SIZE;
move.mode = 0;
ioctl(uffd, UFFDIO_MOVE, &move);
}
return NULL;
}
int main() {
pthread_t fault_handler_thr, reader_thr;
struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
struct uffdio_register uffdio_register;
src = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
MAP_ANONYMOUS, -1, 0);
dst = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
MAP_ANONYMOUS, -1, 0);
memset(src, 0xAA, PAGE_SIZE);
madvise(src, PAGE_SIZE, MADV_PAGEOUT);
/* Consume all slots on the swap device, leaving only one entry (S1) */
place_holder = mmap(NULL, SWAP_DEVICE_SIZE - 1, PROT_READ |
PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
memset(place_holder, 0xAA, SWAP_DEVICE_SIZE - 1);
madvise(place_holder, SWAP_DEVICE_SIZE - 1, MADV_PAGEOUT);
/* Setup uffd handler and dst reader */
uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
ioctl(uffd, UFFDIO_API, &uffdio_api);
uffdio_register.range.start = (unsigned long)dst;
uffdio_register.range.len = PAGE_SIZE;
uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
pthread_create(&fault_handler_thr, NULL, fault_handler_thread, NULL);
pthread_create(&reader_thr, NULL, reader_thread, NULL);
/* Wait for UFFDIO to start */
sleep(1);
/* Release src folio (A) from swap, freeing the entry S1 */
read_in(src);
/* Swapout another race folio (B) using S1 */
race = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED |
MAP_ANONYMOUS, -1, 0);
memset(race, 0xAA, PAGE_SIZE);
madvise(race, PAGE_SIZE, MADV_PAGEOUT);
/* Wait for UFFDIO swap lookup to see the race folio (B) */
sleep(3);
/* Free the race folio (B) from swap */
read_in(race);
/* And swap out src folio (A) again, using S1 */
madvise(src, PAGE_SIZE, MADV_PAGEOUT);
/* Kernel should have moved a wrong folio by now */
pthread_join(reader_thr, NULL);
pthread_cancel(fault_handler_thr);
pthread_join(fault_handler_thr, NULL);
munmap(race, PAGE_SIZE);
munmap(src, PAGE_SIZE);
munmap(dst, PAGE_SIZE);
close(uffd);
return 0;
}
3. Run the test as follows (ensure no other swap device is mounted and
the current dir is on a block device):
===
dd if=/dev/zero of=swap.img bs=1M count=1; mkswap swap.img; swapon
swap.img; gcc test-uffd.c && ./a.out
Then we get the WARN:
[ 348.200587] ------------[ cut here ]------------
[ 348.200599] WARNING: CPU: 7 PID: 1856 at mm/userfaultfd.c:1104
move_pages_pte+0xdb8/0x11a0
[ 348.207544] Modules linked in: loop
[ 348.209401] CPU: 7 UID: 0 PID: 1856 Comm: a.out Kdump: loaded Not
tainted 6.15.0-rc6ptch-00381-g99f00d7c6c6f-dirty #304
PREEMPT(voluntary)
[ 348.214579] Hardware name: QEMU QEMU Virtual Machine, BIOS
edk2-stable202408-prebuilt.qemu.org 08/13/2024
[ 348.218656] pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
[ 348.222013] pc : move_pages_pte+0xdb8/0x11a0
[ 348.224062] lr : move_pages_pte+0x928/0x11a0
[ 348.225881] sp : ffff800088b2b8f0
[ 348.227360] x29: ffff800088b2b970 x28: 0000000000000000 x27: 0000ffffbc920000
[ 348.230228] x26: fffffdffc335e4a8 x25: 0000000000000001 x24: fffffdffc3e4dd40
[ 348.233159] x23: 080000010d792403 x22: ffff0000cd792900 x21: ffff0000c5a6d2c0
[ 348.236339] x20: fffffdffc335e4a8 x19: 0000000000001004 x18: 0000000000000006
[ 348.239269] x17: 0000ffffbc920000 x16: 0000ffffbc922fff x15: 0000000000000003
[ 348.242703] x14: ffff8000812c3b68 x13: 0000000000000003 x12: 0000000000000003
[ 348.245947] x11: 0000000000000000 x10: ffff800081e4feb8 x9 : 0000000000000001
[ 348.249284] x8 : 0000000000000000 x7 : 6f696c6f6620746f x6 : 47203a4755424544
[ 348.252071] x5 : ffff8000815789e3 x4 : ffff8000815789e5 x3 : 0000000000000000
[ 348.255358] x2 : ffff0001fed2aef0 x1 : 0000000000000000 x0 : fffffdffc335e4a8
[ 348.258134] Call trace:
[ 348.259468] move_pages_pte+0xdb8/0x11a0 (P)
[ 348.261348] move_pages+0x3c0/0x738
[ 348.262987] userfaultfd_ioctl+0x3d8/0x1f98
[ 348.264916] __arm64_sys_ioctl+0x88/0xd0
[ 348.266779] invoke_syscall+0x64/0xec
[ 348.268347] el0_svc_common+0xa8/0xd8
[ 348.269967] do_el0_svc+0x1c/0x28
[ 348.271711] el0_svc+0x40/0xe0
[ 348.273345] el0t_64_sync_handler+0x78/0x108
[ 348.274821] el0t_64_sync+0x19c/0x1a0
[ 348.276117] ---[ end trace 0000000000000000 ]---
[ 348.278638] Moving folio fffffdffc3e4dd40 (folio->swap = 0), orig_src_pte = 1
That's the newly added WARN, but the test program also hung in D state
forever, and other tests reported errors like:
[ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_ANONPAGES val:-1
[ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
type:MM_SHMEMPAGES val:1
That's because the kernel just moved the wrong folio, so unmap takes forever
looking for the missing folio, and the counters went wrong too.
So this race is real. It's extremely unlikely to happen because it
requires multiple collisions of multiple tiny race windows, but it's
not impossible.
I'll post a fix very soon.
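For reference, the fix will most likely just re-check the folio against
orig_src_pte under the PTE lock, right where the WARN above fires, and
return -EAGAIN so the caller retries, similar to what was suggested
earlier in this thread. A rough sketch of the idea in move_swap_pte
(not the final patch):

        if (src_folio) {
                /*
                 * The PTE value alone can't tell that the entry was
                 * freed and reused, so also require the locked
                 * src_folio to still be the swap cache folio for this
                 * exact entry before moving it.
                 */
                if (!folio_test_swapcache(src_folio) ||
                    src_folio->swap.val != pte_to_swp_entry(orig_src_pte).val) {
                        double_pt_unlock(dst_ptl, src_ptl);
                        return -EAGAIN;
                }
                folio_move_anon_rmap(src_folio, dst_vma);
                src_folio->index = linear_page_index(dst_vma, dst_addr);
        }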
^ permalink raw reply related [flat|nested] 56+ messages in thread
* Re: [PATCH 05/28] mm, swap: sanitize swap cache lookup convention
2025-05-30 8:49 ` Kairui Song
@ 2025-05-30 19:24 ` Kairui Song
0 siblings, 0 replies; 56+ messages in thread
From: Kairui Song @ 2025-05-30 19:24 UTC (permalink / raw)
To: Barry Song
Cc: akpm, Baolin Wang, Baoquan He, Chris Li, David Hildenbrand,
Johannes Weiner, Hugh Dickins, Kalesh Singh, LKML, linux-mm,
Nhat Pham, Ryan Roberts, Kemeng Shi, Tim Chen, Matthew Wilcox,
Huang, Ying, Yosry Ahmed
On Fri, May 30, 2025 at 4:49 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, May 27, 2025 at 11:11 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, May 27, 2025 at 3:59 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, May 24, 2025 at 8:01 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Fri, May 23, 2025 at 10:30 AM Barry Song <21cnbao@gmail.com> wrote:
> > > > >
> > > > > On Wed, May 21, 2025 at 2:45 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > > > >
> > > > > > Barry Song <21cnbao@gmail.com> 于 2025年5月21日周三 06:33写道:
> > > > > > > Let me run test case [1] to check whether this ever happens. I guess I need to
> > > > > > > hack kernel a bit to always add folio to swapcache even for SYNC IO.
> > > > > >
> > > > > > That will cause quite a performance regression I think. Good thing is,
> > > > > > that's exactly the problem this series is solving by dropping the SYNC
> > > > > > IO swapin path and never bypassing the swap cache, while improving the
> > > > > > performance, eliminating things like this. One more reason to justify
> > > > > > the approach :)
> > > >
> > > > Hi Barry,
> > > >
> > > > >
> > > > > I attempted to reproduce the scenario where a folio is added to the swapcache
> > > > > after filemap_get_folio() returns NULL but before move_swap_pte()
> > > > > moves the swap PTE
> > > > > using non-synchronized I/O. Technically, this seems possible; however,
> > > > > I was unable
> > > > > to reproduce it, likely because the time window between swapin_readahead and
> > > > > taking the page table lock within do_swap_page() is too short.
> > > >
> > > > Thank you so much for trying this!
> > > >
> > > > I have been trying to reproduce it too, and so far I didn't observe
> > > > any crash or warn. I added following debug code:
> > > >
> > > > static __always_inline
> > > > bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
> > > > @@ -1163,6 +1167,7 @@ static int move_pages_pte(struct mm_struct *mm,
> > > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > pmd_t dummy_pmdval;
> > > > pmd_t dst_pmdval;
> > > > struct folio *src_folio = NULL;
> > > > + struct folio *tmp_folio = NULL;
> > > > struct anon_vma *src_anon_vma = NULL;
> > > > struct mmu_notifier_range range;
> > > > int err = 0;
> > > > @@ -1391,6 +1396,15 @@ static int move_pages_pte(struct mm_struct *mm,
> > > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > if (!src_folio)
> > > > folio = filemap_get_folio(swap_address_space(entry),
> > > > swap_cache_index(entry));
> > > > + udelay(get_random_u32_below(1000));
> > > > + tmp_folio = filemap_get_folio(swap_address_space(entry),
> > > > + swap_cache_index(entry));
> > > > + if (!IS_ERR_OR_NULL(tmp_folio)) {
> > > > + if (!IS_ERR_OR_NULL(folio) && tmp_folio != folio) {
> > > > + pr_err("UFFDIO_MOVE: UNSTABLE folio
> > > > %lx (%lx) -> %lx (%lx)\n", folio, folio->swap.val, tmp_folio,
> > > > tmp_folio->swap.val);
> > > > + }
> > > > + folio_put(tmp_folio);
> > > > + }
> > > > if (!IS_ERR_OR_NULL(folio)) {
> > > > if (folio_test_large(folio)) {
> > > > err = -EBUSY;
> > > > @@ -1413,6 +1427,8 @@ static int move_pages_pte(struct mm_struct *mm,
> > > > pmd_t *dst_pmd, pmd_t *src_pmd,
> > > > err = move_swap_pte(mm, dst_vma, dst_addr, src_addr,
> > > > dst_pte, src_pte,
> > > > orig_dst_pte, orig_src_pte, dst_pmd, dst_pmdval,
> > > > dst_ptl, src_ptl, src_folio);
> > > > + if (tmp_folio != folio && !err)
> > > > + pr_err("UFFDIO_MOVE: UNSTABLE folio passed
> > > > check: %lx -> %lx\n", folio, tmp_folio);
> > > > }
> > > >
> > > > And I saw these two prints are getting triggered like this (not a real
> > > > issue though, just help to understand the problem)
> > > > ...
> > > > [ 3127.632791] UFFDIO_MOVE: UNSTABLE folio fffffdffc334cd00 (0) ->
> > > > fffffdffc7ccac80 (51)
> > > > [ 3172.033269] UFFDIO_MOVE: UNSTABLE folio fffffdffc343bb40 (0) ->
> > > > fffffdffc3435e00 (3b)
> > > > [ 3194.425213] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d481c0 (0) ->
> > > > fffffdffc34ab8c0 (76)
> > > > [ 3194.991318] UFFDIO_MOVE: UNSTABLE folio fffffdffc34f95c0 (0) ->
> > > > fffffdffc34ab8c0 (6d)
> > > > [ 3203.467212] UFFDIO_MOVE: UNSTABLE folio fffffdffc34b13c0 (0) ->
> > > > fffffdffc34eda80 (32)
> > > > [ 3206.217820] UFFDIO_MOVE: UNSTABLE folio fffffdffc7d297c0 (0) ->
> > > > fffffdffc38cedc0 (b)
> > > > [ 3214.913039] UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc34db140
> > > > [ 3217.066972] UFFDIO_MOVE: UNSTABLE folio fffffdffc342b5c0 (0) ->
> > > > fffffdffc3465cc0 (21)
> > > > ...
> > > >
> > > > The "UFFDIO_MOVE: UNSTABLE folio fffffdffc3435180 (0) ->
> > > > fffffdffc3853540 (53)" worries me at first. On first look it seems the
> > > > folio is indeed freed completely from the swap cache after the first
> > > > lookup, so another swapout can reuse the entry. But as you mentioned
> > > > __remove_mapping won't release a folio if the refcount check fails, so
> > > > they must be freed by folio_free_swap or __try_to_reclaim_swap, there
> > > > are many places that can happen. But these two helpers won't free a
> > > > folio from swap cache if its swap count is not zero. And the folio
> > > > will either be swapped out (swap count non zero), or mapped (freeing
> > > > it is fine, PTE is non_swap, and another swapout will still use the
> > > > same folio).
> > > >
> > > > So after more investigation and dumping the pages, it's actually the
> > > > second lookup (tmp_folio) seeing the entry being reused by another
> > > > page table entry, after the first folio is swapped back and released.
> > > > So the page table check below will always fail just fine.
> > > >
> > > > But this also proves the first look up can see a completely irrelevant
> > > > folio too: If the src folio is swapped out, but got swapped back and
> > > > freed, then another folio B shortly got added to swap cache reuse the
> > > > src folio's old swap entry, then the folio B got seen by the look up
> > > > here and get freed from swap cache, then src folio got swapped out
> > > > again also reusing the same entry, then we have a problem as PTE seems
> > > > untouched indeed but we grabbed a wrong folio. Seems possible if I'm
> > > > not wrong:
> > > >
> > > > Something like this:
> > > > CPU1 CPU2
> > > > move_pages_pte
> > > > entry = pte_to_swp_entry(orig_src_pte);
> > > > | Got Swap Entry S1 from src_pte
> > > > ...
> > > > <swapin src_pte, using folio A>
> > >
> > > I’m assuming you mean `<swapin src_pte, using folio B>`, since I’m not
> > > sure where folio B comes from in the statement `<someone else tried to
> > > swap out folio B>`.
> > >
> > > If that assumption is correct, and folio A is still in the swapcache,
> > > how could someone swap in folio B without hitting folio A? That would
> > > suggest folio A must have been removed from the swapcache earlier—right?
> > >
> > > > <free folio A from swap cache freeing S1>
> > > > <someone else try swap out folio B >
> >
> > Sorry my bad, I think I made people think folio B is related to
> > src_pte at this point. What I actually mean is that: Another random
> > folio B, unrelated to src_pte, could got swapped out, and using the
> > swap entry S1.
> >
> > > > <put folio B to swap cache using S1 >
> > > > ...
> > > > folio = swap_cache_get_folio(S1)
> > > > | Got folio B here !!!
> > > > move_swap_pte
> > > > <free folio B from swap cache>
> > > > | Holding a reference doesn't pin the cache
> > > > | as we have demonstrated
> > > > <Swapout folio A also using S1>
> > > > double_pt_lock
> > > > is_pte_pages_stable
> > > > | Passed because of S1 is reused
> > > > folio_move_anon_rmap(...)
> > > > | Moved invalid folio B here !!!
> > > >
> > > > But this is extremely hard to reproduce though, even if doing it
> > > > deliberately...
> > > >
> > > > So I think a "folio_swap_contains" or equivalent check here is a good
> > > > thing to have, to make it more robust and easier to understand. The
> > > > checking after locking a folio has very tiny overhead and can
> > > > definitely ensure the folio's swap entry is valid and stable.
> > > >
> > > > The "UFFDIO_MOVE: UNSTABLE folio passed check: 0 -> fffffdffc385fb00"
> > > > here might seem problematic, but it's still not a real problem. That's
> > > > the case where the swapin in src region happens after the lookup, and
> > > > before the PTE lock. It will pass the PTE check without moving the
> > > > folio. But the folio is guaranteed to be a completely new folio here
> > > > because the folio can't be added back to the page table without
> > > > holding the PTE lock, and if that happens the following PTE check here
> > > > will fail.
> > > >
> > > > So I think we should patch the current kernel only adding a
> > > > "folio_swap_contains" equivalent check here, and maybe more comments,
> > > > how do you think?
> > >
> > > The description appears to have some inconsistencies.
> > > Would you mind rephrasing it?
> >
> > Yeah, let's ignore the "UFFDIO_MOVE: UNSTABLE folio passed check: 0 ->
> > fffffdffc385fb00" part first, as both you and me have come into a
> > conclusion that "filemap_get_folio() returns NULL before
> > move_swap_pte, but a folio was added to swap cache" is OK, and this
> > output only proves that happens.
> >
> > So the problematic race is:
> >
> > Here move_pages_pte is moving src_pte to dst_pte, and it begins with
> > src_pte == swap entry S1, and S1 isn't cached.
> >
> > CPU1 CPU2
> > move_pages_pte()
> > entry = pte_to_swp_entry(orig_src_pte);
> > | src_pte is absent, and got entry == S1
> > ... < Somehow interrupted> ...
> > <swapin src_pte, using folio A>
> > | folio A is just a new allocated folio
> > | for resolving the swap in fault.
> > <free folio A from swap cache freeing S1>
> > | swap in fault is resolved, src_pte
> > | now points to folio A, so folio A
> > | can get freed just fine.
> > | And now S1 is free to be used
> > | by anyone.
> > <someone else try swap out another folio B >
> > | Folio B is a completely unrelated
> > | folio swapped out by random process.
> > | (has nothing to do with src_pte)
> > | But S1 is freed so it may use S1
> > | as its swap entry.
> > <put folio B to swap cache with index S1 >
> > ...
> > folio = filemap_get_folio(S1)
> > | The lookup is using S1, so it
> > | got folio B here !!!
> > ... < Somehow interrupted> ...
> > <free folio B from swap cache>
> > | Folio B could fail to be swapped out
> > | or got swapped in again, so it can
> > | be freed by folio_free_swap or
> > | swap cache reclaim.
> > | CPU1 is holding a reference but it
> > | doesn't pin the swap cache folio
> > | as I have demonstrated with the
> > | test C program previously.
> > | New S1 is free to be used again.
> > <Swapout src_pte again using S1>
> > | No thing blocks this from happening
> > | The swapout is still using folio A,
> > | and src_pte == S1.
> > folio_trylock(folio)
> > move_swap_pte
> > double_pt_lock
> > is_pte_pages_stable
> > | Passed because of S1 is reused so src_pte == S1.
> > folio_move_anon_rmap(...)
> > | Moved invalid folio B here !!!
> >
> > It's a long and complex one, I don't think it's practically possible
> > to happen in reality but in theory doable, once in a million maybe...
> > Still we have to fix it, or did I got anything wrong here?
>
> Hi Barry,
>
> I managed to reproduce this issue, by hacking the kernel a bit (Only
> adding only delay to increase the race window, and adding a WARN to
> indicate the problem)
>
> 1. Applying following patch for kernel:
> ===
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index bc473ad21202..1d710adf9839 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -15,6 +15,7 @@
> #include <linux/mmu_notifier.h>
> #include <linux/hugetlb.h>
> #include <linux/shmem_fs.h>
> +#include <linux/delay.h>
> #include <asm/tlbflush.h>
> #include <asm/tlb.h>
> #include "internal.h"
> @@ -1100,6 +1101,10 @@ static int move_swap_pte(struct mm_struct *mm,
> struct vm_area_struct *dst_vma,
> * occur and hit the swapcache after moving the PTE.
> */
> if (src_folio) {
> + if (WARN_ON(src_folio->swap.val !=
> pte_to_swp_entry(orig_src_pte).val))
> + pr_err("Moving folio %lx (folio->swap = %lx),
> orig_src_pte = %lx\n",
> + (unsigned long)src_folio, src_folio->swap.val,
> + pte_to_swp_entry(orig_src_pte).val);
> folio_move_anon_rmap(src_folio, dst_vma);
> src_folio->index = linear_page_index(dst_vma, dst_addr);
> }
> @@ -1388,9 +1393,13 @@ static int move_pages_pte(struct mm_struct *mm,
> pmd_t *dst_pmd, pmd_t *src_pmd,
> * folios in the swapcache. This issue needs to be resolved
> * separately to allow proper handling.
> */
> + pr_err("DEBUG: Will do the lookup using entry %lx,
> wait 3s...\n", entry.val);
> + mdelay(1000 * 3);
> if (!src_folio)
> folio = filemap_get_folio(swap_address_space(entry),
> swap_cache_index(entry));
> + pr_err("DEBUG: Got folio value %lx, wait 3s...\n",
> (unsigned long)folio);
> + mdelay(1000 * 3);
> if (!IS_ERR_OR_NULL(folio)) {
> if (folio_test_large(folio)) {
> err = -EBUSY;
>
> 2. Save following program in userspace (didn't bother with error check
> for simpler code):
> ===
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <sys/ioctl.h>
> #include <sys/syscall.h>
> #include <linux/userfaultfd.h>
> #include <fcntl.h>
> #include <pthread.h>
> #include <unistd.h>
> #include <poll.h>
> #include <errno.h>
>
> #define PAGE_SIZE 4096
> /* Need to consume all slots so define the swap device size here */
> #define SWAP_DEVICE_SIZE (PAGE_SIZE * 1024 - 1)
>
> char *src, *race, *dst, *place_holder;
> int uffd;
>
> void read_in(char *p) {
> /* This test program initials memory with 0xAA to bypass zeromap */
> while (*((volatile char*)p) != 0xAA);
> }
>
> void *reader_thread(void *arg) {
> /* Test requires kernel to wait upon uffd move */
> read_in(dst);
> return NULL;
> }
>
> void *fault_handler_thread(void *arg) {
> int ret;
> struct uffd_msg msg;
> struct uffdio_move move;
> struct pollfd pollfd = { .fd = uffd, .events = POLLIN };
> pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, NULL);
> pthread_setcanceltype(PTHREAD_CANCEL_DEFERRED, NULL);
>
> while (1) {
> poll(&pollfd, 1, -1);
> read(uffd, &msg, sizeof(msg));
>
> move.src = (unsigned long)src + (msg.arg.pagefault.address -
> (unsigned long)dst);
> move.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
> move.len = PAGE_SIZE;
> move.mode = 0;
>
> ioctl(uffd, UFFDIO_MOVE, &move);
> }
> return NULL;
> }
>
> int main() {
> pthread_t fault_handler_thr, reader_thr;
> struct uffdio_api uffdio_api = { .api = UFFD_API, .features = 0 };
> struct uffdio_register uffdio_register;
>
> src = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANONYMOUS, -1, 0);
> dst = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANONYMOUS, -1, 0);
> memset(src, 0xAA, PAGE_SIZE);
> madvise(src, PAGE_SIZE, MADV_PAGEOUT);
>
> /* Consume all slots on swap device left only one entry (S1) */
> place_holder = mmap(NULL, SWAP_DEVICE_SIZE - 1, PROT_READ |
> PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> memset(place_holder, 0xAA, SWAP_DEVICE_SIZE - 1);
> madvise(place_holder, SWAP_DEVICE_SIZE - 1, MADV_PAGEOUT);
>
> /* Setup uffd handler and dst reader */
> uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
> ioctl(uffd, UFFDIO_API, &uffdio_api);
> uffdio_register.range.start = (unsigned long)dst;
> uffdio_register.range.len = PAGE_SIZE;
> uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
> ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
> pthread_create(&fault_handler_thr, NULL, fault_handler_thread, NULL);
> pthread_create(&reader_thr, NULL, reader_thread, NULL);
>
> /* Wait for UFFDIO to start */
> sleep(1);
>
> /* Release src folio (A) from swap, freeing the entry S1 */
> read_in(src);
>
> /* Swapout another race folio (B) using S1 */
> race = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
> memset(race, 0xAA, PAGE_SIZE);
> madvise(race, PAGE_SIZE, MADV_PAGEOUT);
>
> /* Wait for UFFDIO swap lookup to see the race folio (B) */
> sleep(3);
>
> /* Free the race folio (B) from swap */
> read_in(race);
> /* And swap out src folio (A) again, using S1 */
> madvise(src, PAGE_SIZE, MADV_PAGEOUT);
>
> /* Kernel should have moved a wrong folio by now */
>
> pthread_join(reader_thr, NULL);
> pthread_cancel(fault_handler_thr);
> pthread_join(fault_handler_thr, NULL);
> munmap(race, PAGE_SIZE);
> munmap(src, PAGE_SIZE);
> munmap(dst, PAGE_SIZE);
> close(uffd);
>
> return 0;
> }
>
> 3. Run the test with the following (ensure no other swap device is mounted and the current dir is on a block device):
> ===
> dd if=/dev/zero of=swap.img bs=1M count=1; mkswap swap.img; swapon swap.img; gcc test-uffd.c && ./a.out
>
> Then we get the WARN:
> [ 348.200587] ------------[ cut here ]------------
> [ 348.200599] WARNING: CPU: 7 PID: 1856 at mm/userfaultfd.c:1104
> move_pages_pte+0xdb8/0x11a0
> [ 348.207544] Modules linked in: loop
> [ 348.209401] CPU: 7 UID: 0 PID: 1856 Comm: a.out Kdump: loaded Not
> tainted 6.15.0-rc6ptch-00381-g99f00d7c6c6f-dirty #304
> PREEMPT(voluntary)
> [ 348.214579] Hardware name: QEMU QEMU Virtual Machine, BIOS
> edk2-stable202408-prebuilt.qemu.org 08/13/2024
> [ 348.218656] pstate: 81400005 (Nzcv daif +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> [ 348.222013] pc : move_pages_pte+0xdb8/0x11a0
> [ 348.224062] lr : move_pages_pte+0x928/0x11a0
> [ 348.225881] sp : ffff800088b2b8f0
> [ 348.227360] x29: ffff800088b2b970 x28: 0000000000000000 x27: 0000ffffbc920000
> [ 348.230228] x26: fffffdffc335e4a8 x25: 0000000000000001 x24: fffffdffc3e4dd40
> [ 348.233159] x23: 080000010d792403 x22: ffff0000cd792900 x21: ffff0000c5a6d2c0
> [ 348.236339] x20: fffffdffc335e4a8 x19: 0000000000001004 x18: 0000000000000006
> [ 348.239269] x17: 0000ffffbc920000 x16: 0000ffffbc922fff x15: 0000000000000003
> [ 348.242703] x14: ffff8000812c3b68 x13: 0000000000000003 x12: 0000000000000003
> [ 348.245947] x11: 0000000000000000 x10: ffff800081e4feb8 x9 : 0000000000000001
> [ 348.249284] x8 : 0000000000000000 x7 : 6f696c6f6620746f x6 : 47203a4755424544
> [ 348.252071] x5 : ffff8000815789e3 x4 : ffff8000815789e5 x3 : 0000000000000000
> [ 348.255358] x2 : ffff0001fed2aef0 x1 : 0000000000000000 x0 : fffffdffc335e4a8
> [ 348.258134] Call trace:
> [ 348.259468] move_pages_pte+0xdb8/0x11a0 (P)
> [ 348.261348] move_pages+0x3c0/0x738
> [ 348.262987] userfaultfd_ioctl+0x3d8/0x1f98
> [ 348.264916] __arm64_sys_ioctl+0x88/0xd0
> [ 348.266779] invoke_syscall+0x64/0xec
> [ 348.268347] el0_svc_common+0xa8/0xd8
> [ 348.269967] do_el0_svc+0x1c/0x28
> [ 348.271711] el0_svc+0x40/0xe0
> [ 348.273345] el0t_64_sync_handler+0x78/0x108
> [ 348.274821] el0t_64_sync+0x19c/0x1a0
> [ 348.276117] ---[ end trace 0000000000000000 ]---
> [ 348.278638] Moving folio fffffdffc3e4dd40 (folio->swap = 0), orig_src_pte = 1
>
> That's the newly added WARN, but the test program also hung in D state
> forever, and errors showed up in other tests, like:
> [ 406.893936] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
> type:MM_ANONPAGES val:-1
> [ 406.894071] BUG: Bad rss-counter state mm:ffff0000c5a9ddc0
> type:MM_SHMEMPAGES val:1
>
> That's because the kernel just moved the wrong folio: unmap takes forever
> looking for the missing folio, and the counters went wrong too.
>
> So this race is real. It's extremely unlikely to happen because it
> requires several tiny race windows to line up, but it's not impossible.
>
> I'll post a fix very soon.
On second thought, the "filemap_get_folio() returns NULL before
move_swap_pte, but a folio was added to the swap cache" case is also
buggy. It can also be reproduced with the program above with a slight
modification:
--- test-uffd.c 2025-05-30 08:34:00.485206529 +0000
+++ test-uffd-same-folio.c 2025-05-30 19:04:13.826078271 +0000
@@ -83,20 +83,20 @@
/* Release src folio (A) from swap, freeing the entry S1 */
read_in(src);
- /* Swapout another race folio (B) using S1 */
+ /* Swapout and free another race folio (B) forcing reclaiming S1 and folio (A) */
race = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
memset(race, 0xAA, PAGE_SIZE);
madvise(race, PAGE_SIZE, MADV_PAGEOUT);
+ read_in(race);
+ printf("RECLAMING A?\n");
- /* Wait for UFFDIO swap lookup to see the race folio (B) */
+ /* Wait for UFFDIO swap lookup to see NULL */
sleep(3);
- /* Free the race folio (B) from swap */
- read_in(race);
/* And swap out src folio (A) again, using S1 */
madvise(src, PAGE_SIZE, MADV_PAGEOUT);
- /* Kernel should have moved a wrong folio by now */
+ /* Kernel should have moved folio (A) but it didn't */
pthread_join(reader_thr, NULL);
pthread_cancel(fault_handler_thr);
I'll fix them together.
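
For readers following the two cases above, here is a rough sketch of the
kind of re-validation they both point at. This is an illustration under
assumptions, not the fix that was actually posted: swap_pte_still_matches()
is a hypothetical helper, and it simply re-applies, under the page table
lock, the same comparison the debug WARN above performs.

/*
 * Hypothetical helper, illustration only: after the swap cache lookup
 * and once the page table lock is held again, the swap entry in the
 * source PTE must still match the folio that was looked up, otherwise
 * the slot was freed and reused and the move has to be retried.
 */
static bool swap_pte_still_matches(pte_t *src_pte, pte_t orig_src_pte,
				   struct folio *src_folio)
{
	if (!pte_same(ptep_get(src_pte), orig_src_pte))
		return false;	/* source PTE changed under us */
	if (src_folio &&
	    src_folio->swap.val != pte_to_swp_entry(orig_src_pte).val)
		return false;	/* entry was reused by another folio */
	return true;
}

A caller in move_swap_pte() would then bail out with something like -EAGAIN
when this returns false, so the outer loop redoes the lookup; the
"filemap_get_folio() returned NULL but a folio was added later" case needs
the same treatment, by re-checking the swap cache under the same locks
before moving a bare swap PTE.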
* Re: [PATCH 20/28] mm, swap: check swap table directly for checking cache
2025-05-14 20:17 ` [PATCH 20/28] mm, swap: check swap table directly for checking cache Kairui Song
@ 2025-06-19 10:38 ` Baoquan He
2025-06-19 10:50 ` Kairui Song
0 siblings, 1 reply; 56+ messages in thread
From: Baoquan He @ 2025-06-19 10:38 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Barry Song, Kalesh Singh,
Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel
On 05/15/25 at 04:17am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Instead of looking at the swap map, check swap table directly to tell if
> a swap entry has cache. Prepare for remove SWAP_HAS_CACHE.
But you actually check both the swap table entry and swap map entry in
this patch, or do I miss anything?
E.g
if (!swap_count(si->swap_map[offset]) && swp_te_is_folio(swp_te))
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 12 +++++------
> mm/swap.h | 6 ++++++
> mm/swap_state.c | 11 ++++++++++
> mm/swapfile.c | 54 +++++++++++++++++++++++--------------------------
> 4 files changed, 48 insertions(+), 35 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a70624a55aa2..a9a548575e72 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4314,15 +4314,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> +static inline int non_swapcache_batch(swp_entry_t entry, unsigned int max_nr)
> {
> - struct swap_info_struct *si = swp_info(entry);
> - pgoff_t offset = swp_offset(entry);
> - int i;
> + unsigned int i;
>
> for (i = 0; i < max_nr; i++) {
> - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> - return i;
> + /* Page table lock pins the swap entries / swap device */
> + if (swap_cache_check_folio(entry))
> + break;
> + entry.val++;
> }
>
> return i;
> diff --git a/mm/swap.h b/mm/swap.h
> index 467996dafbae..2ae4624a0e48 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -186,6 +186,7 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
> extern struct folio *swap_cache_get_folio(swp_entry_t entry);
> extern struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
> void **shadow, bool swapin);
> +extern bool swap_cache_check_folio(swp_entry_t entry);
> extern void *swap_cache_get_shadow(swp_entry_t entry);
> /* Below helpers requires the caller to lock the swap cluster. */
> extern void __swap_cache_del_folio(swp_entry_t entry,
> @@ -395,6 +396,11 @@ static inline void *swap_cache_get_shadow(swp_entry_t end)
> return NULL;
> }
>
> +static inline bool swap_cache_check_folio(swp_entry_t entry)
> +{
> + return false;
> +}
> +
> static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> return 0;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index c8bb16835612..ea6a1741db5c 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -266,6 +266,17 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
> return folio;
> }
>
> +/*
> + * Check if a swap entry has folio cached, may return false positive.
> + * Caller must hold a reference of the swap device or pin it in other ways.
> + */
> +bool swap_cache_check_folio(swp_entry_t entry)
> +{
> + swp_te_t swp_te;
> + swp_te = __swap_table_get(swp_cluster(entry), swp_offset(entry));
> + return swp_te_is_folio(swp_te);
> +}
> +
> /*
> * If we are the only user, then try to free up the swap cache.
> *
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ef233466725e..0f2a499ff2c9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -181,15 +181,19 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
> #define TTRS_FULL 0x4
>
> static bool swap_only_has_cache(struct swap_info_struct *si,
> - unsigned long offset, int nr_pages)
> + struct swap_cluster_info *ci,
> + unsigned long offset, int nr_pages)
> {
> unsigned char *map = si->swap_map + offset;
> unsigned char *map_end = map + nr_pages;
> + swp_te_t entry;
>
> do {
> + entry = __swap_table_get(ci, offset);
entry is not used in swap_only_has_cache() in this patch.
> VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> - if (*map != SWAP_HAS_CACHE)
> + if (*map)
> return false;
> + offset++;
> } while (++map < map_end);
>
> return true;
......snip...
* Re: [PATCH 20/28] mm, swap: check swap table directly for checking cache
2025-06-19 10:38 ` Baoquan He
@ 2025-06-19 10:50 ` Kairui Song
2025-06-20 8:04 ` Baoquan He
0 siblings, 1 reply; 56+ messages in thread
From: Kairui Song @ 2025-06-19 10:50 UTC (permalink / raw)
To: Baoquan He
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Barry Song, Kalesh Singh,
Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel
On Thu, Jun 19, 2025 at 6:38 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 05/15/25 at 04:17am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Instead of looking at the swap map, check swap table directly to tell if
> > a swap entry has cache. Prepare for remove SWAP_HAS_CACHE.
>
> But you actually check both the swap table entry and swap map entry in
> this patch, or do I miss anything?
Hi, Baoquan
>
> E.g
>
> if (!swap_count(si->swap_map[offset]) && swp_te_is_folio(swp_te))
Yes, the count info is still in the swap_map for now; I'm only converting
the HAS_CACHE check to use swp_te_t here. We'll remove swap_map in
later patches and use swp_te_t alone to hold both pieces of info.
The reason some swap_count checks are added is that, before this patch,
`swap_map[offset] == SWAP_HAS_CACHE` implied the count was zero too. So
once HAS_CACHE moves to swp_te_t, we still need to check the count
separately. That overhead will be gone very soon in a later patch.
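
To spell that out with a condensed before/after sketch (illustrative only,
reusing the names from the quoted patch; `si`, `ci` and `offset` stand for
the usual swap device, cluster and offset arguments, this is not a literal
hunk from the series):

	/* Before this patch: one swap_map byte carries both the count and
	 * the cache flag, so a single compare answers "cache only, no owners". */
	bool cache_only_before = (si->swap_map[offset] == SWAP_HAS_CACHE);

	/* After this patch: the cache state comes from the swap table entry,
	 * while the count still lives in swap_map, hence the extra check. */
	swp_te_t swp_te = __swap_table_get(ci, offset);
	bool cache_only_after = !swap_count(si->swap_map[offset]) &&
				swp_te_is_folio(swp_te);

Once the count also moves into the swap table in the later patches, the two
reads collapse back into one.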
>
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 12 +++++------
> > mm/swap.h | 6 ++++++
> > mm/swap_state.c | 11 ++++++++++
> > mm/swapfile.c | 54 +++++++++++++++++++++++--------------------------
> > 4 files changed, 48 insertions(+), 35 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index a70624a55aa2..a9a548575e72 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4314,15 +4314,15 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
> > }
> >
> > #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > -static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > +static inline int non_swapcache_batch(swp_entry_t entry, unsigned int max_nr)
> > {
> > - struct swap_info_struct *si = swp_info(entry);
> > - pgoff_t offset = swp_offset(entry);
> > - int i;
> > + unsigned int i;
> >
> > for (i = 0; i < max_nr; i++) {
> > - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> > - return i;
> > + /* Page table lock pins the swap entries / swap device */
> > + if (swap_cache_check_folio(entry))
> > + break;
> > + entry.val++;
> > }
> >
> > return i;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 467996dafbae..2ae4624a0e48 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -186,6 +186,7 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
> > extern struct folio *swap_cache_get_folio(swp_entry_t entry);
> > extern struct folio *swap_cache_add_folio(swp_entry_t entry, struct folio *folio,
> > void **shadow, bool swapin);
> > +extern bool swap_cache_check_folio(swp_entry_t entry);
> > extern void *swap_cache_get_shadow(swp_entry_t entry);
> > /* Below helpers requires the caller to lock the swap cluster. */
> > extern void __swap_cache_del_folio(swp_entry_t entry,
> > @@ -395,6 +396,11 @@ static inline void *swap_cache_get_shadow(swp_entry_t end)
> > return NULL;
> > }
> >
> > +static inline bool swap_cache_check_folio(swp_entry_t entry)
> > +{
> > + return false;
> > +}
> > +
> > static inline unsigned int folio_swap_flags(struct folio *folio)
> > {
> > return 0;
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index c8bb16835612..ea6a1741db5c 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -266,6 +266,17 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
> > return folio;
> > }
> >
> > +/*
> > + * Check if a swap entry has folio cached, may return false positive.
> > + * Caller must hold a reference of the swap device or pin it in other ways.
> > + */
> > +bool swap_cache_check_folio(swp_entry_t entry)
> > +{
> > + swp_te_t swp_te;
> > + swp_te = __swap_table_get(swp_cluster(entry), swp_offset(entry));
> > + return swp_te_is_folio(swp_te);
> > +}
> > +
> > /*
> > * If we are the only user, then try to free up the swap cache.
> > *
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index ef233466725e..0f2a499ff2c9 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -181,15 +181,19 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
> > #define TTRS_FULL 0x4
> >
> > static bool swap_only_has_cache(struct swap_info_struct *si,
> > - unsigned long offset, int nr_pages)
> > + struct swap_cluster_info *ci,
> > + unsigned long offset, int nr_pages)
> > {
> > unsigned char *map = si->swap_map + offset;
> > unsigned char *map_end = map + nr_pages;
> > + swp_te_t entry;
> >
> > do {
> > + entry = __swap_table_get(ci, offset);
>
> entry is not used in swap_only_has_cache() in this patch.
Thanks, it's used in a later patch, so I must have moved it here
accidentally during a rebase. I'll defer this change to that later patch.
>
> > VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> > - if (*map != SWAP_HAS_CACHE)
> > + if (*map)
> > return false;
> > + offset++;
> > } while (++map < map_end);
> >
> > return true;
> ......snip...
>
>
* Re: [PATCH 20/28] mm, swap: check swap table directly for checking cache
2025-06-19 10:50 ` Kairui Song
@ 2025-06-20 8:04 ` Baoquan He
0 siblings, 0 replies; 56+ messages in thread
From: Baoquan He @ 2025-06-20 8:04 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
David Hildenbrand, Yosry Ahmed, Huang, Ying, Nhat Pham,
Johannes Weiner, Baolin Wang, Barry Song, Kalesh Singh,
Kemeng Shi, Tim Chen, Ryan Roberts, linux-kernel
On 06/19/25 at 06:50pm, Kairui Song wrote:
> On Thu, Jun 19, 2025 at 6:38 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 05/15/25 at 04:17am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Instead of looking at the swap map, check swap table directly to tell if
> > > a swap entry has cache. Prepare for remove SWAP_HAS_CACHE.
> >
> > But you actually check both the swap table entry and swap map entry in
> > this patch, or do I miss anything?
>
> Hi, Baoquan
>
> >
> > E.g
> >
> > if (!swap_count(si->swap_map[offset]) && swp_te_is_folio(swp_te))
>
> Yes, the count info is still in the swap_map now, I'm only converting
> the HAS_CACHE check to use swp_te_t here. We'll remove swap_map in
> later patches and use the swp_te_t solely to get both info.
Ah, I see it now. That's why the subject says it checks the swap table
for the cache state. Then it's fine by me, even though it's a little
confusing.
>
> The reason some checks are added to check the swap_count is that:
> Before this patch, `swap_map[offset] == SWAP_HAS_CACHE` implies the
> count is zero too. So if HAS_CACHE is moved to swp_te_t, we still need
> to check the count separately. The overhead will be gone very soon in
> a later patch.
Got it, that sounds great, thanks.
Thread overview: 56+ messages
2025-05-14 20:17 [PATCH 00/28] mm, swap: introduce swap table Kairui Song
2025-05-14 20:17 ` [PATCH 01/28] mm, swap: don't scan every fragment cluster Kairui Song
2025-05-14 20:17 ` [PATCH 02/28] mm, swap: consolidate the helper for mincore Kairui Song
2025-05-14 20:17 ` [PATCH 03/28] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
2025-05-14 20:17 ` [PATCH 04/28] mm, swap: split readahead update out of swap cache lookup Kairui Song
2025-05-14 20:17 ` [PATCH 05/28] mm, swap: sanitize swap cache lookup convention Kairui Song
2025-05-19 4:38 ` Barry Song
2025-05-20 3:31 ` Kairui Song
2025-05-20 4:41 ` Barry Song
2025-05-20 19:09 ` Kairui Song
2025-05-20 22:33 ` Barry Song
2025-05-21 2:45 ` Kairui Song
2025-05-21 3:24 ` Barry Song
2025-05-23 2:29 ` Barry Song
2025-05-23 20:01 ` Kairui Song
2025-05-27 7:58 ` Barry Song
2025-05-27 15:11 ` Kairui Song
2025-05-30 8:49 ` Kairui Song
2025-05-30 19:24 ` Kairui Song
2025-05-14 20:17 ` [PATCH 06/28] mm, swap: rearrange swap cluster definition and helpers Kairui Song
2025-05-19 6:26 ` Barry Song
2025-05-20 3:50 ` Kairui Song
2025-05-14 20:17 ` [PATCH 07/28] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-05-14 20:17 ` [PATCH 08/28] mm, swap: use swap table for the swap cache and switch API Kairui Song
2025-05-14 20:17 ` [PATCH 09/28] mm/swap: rename __read_swap_cache_async to __swapin_cache_alloc Kairui Song
2025-05-14 20:17 ` [PATCH 10/28] mm, swap: add a swap helper for bypassing only read ahead Kairui Song
2025-05-14 20:17 ` [PATCH 11/28] mm, swap: clean up and consolidate helper for mTHP swapin check Kairui Song
2025-05-15 9:31 ` Klara Modin
2025-05-15 9:39 ` Kairui Song
2025-05-19 7:08 ` Barry Song
2025-05-19 11:09 ` Kairui Song
2025-05-19 11:57 ` Barry Song
2025-05-14 20:17 ` [PATCH 12/28] mm, swap: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 13/28] mm/shmem, swap: avoid redundant Xarray lookup during swapin Kairui Song
2025-05-14 20:17 ` [PATCH 14/28] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
2025-05-14 20:17 ` [PATCH 15/28] mm, swap: split locked entry freeing into a standalone helper Kairui Song
2025-05-14 20:17 ` [PATCH 16/28] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
2025-05-14 20:17 ` [PATCH 17/28] mm, swap: sanitize swap entry management workflow Kairui Song
2025-05-14 20:17 ` [PATCH 18/28] mm, swap: rename and introduce folio_free_swap_cache Kairui Song
2025-05-14 20:17 ` [PATCH 19/28] mm, swap: clean up and improve swap entries batch freeing Kairui Song
2025-05-14 20:17 ` [PATCH 20/28] mm, swap: check swap table directly for checking cache Kairui Song
2025-06-19 10:38 ` Baoquan He
2025-06-19 10:50 ` Kairui Song
2025-06-20 8:04 ` Baoquan He
2025-05-14 20:17 ` [PATCH 21/28] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-05-14 20:17 ` [PATCH 22/28] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
2025-05-14 20:17 ` [PATCH 23/28] mm, swap: remove no longer needed _swap_info_get Kairui Song
2025-05-14 20:17 ` [PATCH 24/28] mm, swap: implement helpers for reserving data in swap table Kairui Song
2025-05-15 9:40 ` Klara Modin
2025-05-16 2:35 ` Kairui Song
2025-05-14 20:17 ` [PATCH 25/28] mm/workingset: leave highest 8 bits empty for anon shadow Kairui Song
2025-05-14 20:17 ` [PATCH 26/28] mm, swap: minor clean up for swapon Kairui Song
2025-05-14 20:17 ` [PATCH 27/28] mm, swap: use swap table to track swap count Kairui Song
2025-05-14 20:17 ` [PATCH 28/28] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-05-21 18:36 ` Nhat Pham
2025-05-22 4:13 ` Kairui Song