* [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-30 22:53 ` Yosry Ahmed
2025-10-29 15:58 ` [PATCH 02/19] mm, swap: split swap cache preparation loop into a standalone helper Kairui Song
` (19 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
__read_swap_cache_async is widely used to allocate a folio and ensure it is
in the swap cache, or to return the folio if one is already there.
It's not async, and it's not doing any read. Rename it to better reflect its
usage, and prepare for it to be reworked as part of the new swap cache APIs.
Also add some comments for the function. Worth noting that the
skip_if_exists argument is a long-existing workaround that will be
dropped soon.
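For illustration, a minimal sketch of the typical caller pattern after the
rename, condensed from the callers updated in the hunks below (folio is a
struct folio * and page_allocated a bool, exactly as in those callers):
	/* Returns the cached folio, or a newly allocated one bound to @entry. */
	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
				       &page_allocated, false);
	if (folio && page_allocated)
		swap_read_folio(folio, NULL);	/* caller still issues the read */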
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 6 +++---
mm/swap_state.c | 49 ++++++++++++++++++++++++++++++++-----------------
mm/swapfile.c | 2 +-
mm/zswap.c | 4 ++--
4 files changed, 38 insertions(+), 23 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index d034c13d8dd2..0fff92e42cfe 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -249,6 +249,9 @@ struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
void swap_cache_del_folio(struct folio *folio);
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool *alloced, bool skip_if_exists);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_del_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry, void *shadow);
@@ -261,9 +264,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists);
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b13e9c4baa90..7765b9474632 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,9 +402,28 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
}
}
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists)
+/**
+ * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
+ * @entry: the swapped out swap entry to be binded to the folio.
+ * @gfp_mask: memory allocation flags
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @new_page_allocated: sets true if allocation happened, false otherwise
+ * @skip_if_exists: if the slot is a partially cached state, return NULL.
+ * This is a workaround that would be removed shortly.
+ *
+ * Allocate a folio in the swap cache for one swap slot, typically before
+ * doing IO (swap in or swap out). The swap slot indicated by @entry must
+ * have a non-zero swap count (swapped out). Currently only supports order 0.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the existing folio if @entry is cached already. Returns
+ * NULL if failed due to -ENOMEM or @entry have a swap count < 1.
+ */
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool *new_page_allocated,
+ bool skip_if_exists)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
@@ -452,12 +471,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
goto put_and_return;
/*
- * Protect against a recursive call to __read_swap_cache_async()
+ * Protect against a recursive call to swap_cache_alloc_folio()
* on the same entry waiting forever here because SWAP_HAS_CACHE
* is set but the folio is not the swap cache yet. This can
* happen today if mem_cgroup_swapin_charge_folio() below
* triggers reclaim through zswap, which may call
- * __read_swap_cache_async() in the writeback path.
+ * swap_cache_alloc_folio() in the writeback path.
*/
if (skip_if_exists)
goto put_and_return;
@@ -466,7 +485,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* We might race against __swap_cache_del_folio(), and
* stumble across a swap_map entry whose SWAP_HAS_CACHE
* has not yet been cleared. Or race against another
- * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+ * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
* in swap_map, but not yet added its folio to swap cache.
*/
schedule_timeout_uninterruptible(1);
@@ -509,10 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* and reading the disk if it is not already cached.
* A failure return means that either the page allocation failed or that
* the swap entry is no longer in use.
- *
- * get/put_swap_device() aren't needed to call this function, because
- * __read_swap_cache_async() call them and swap_read_folio() holds the
- * swap cache folio lock.
*/
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
@@ -529,7 +544,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
return NULL;
mpol = get_vma_policy(vma, addr, 0, &ilx);
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
mpol_cond_put(mpol);
@@ -647,9 +662,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
- folio = __read_swap_cache_async(
- swp_entry(swp_type(entry), offset),
- gfp_mask, mpol, ilx, &page_allocated, false);
+ folio = swap_cache_alloc_folio(
+ swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
+ &page_allocated, false);
if (!folio)
continue;
if (page_allocated) {
@@ -666,7 +681,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
@@ -761,7 +776,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
continue;
pte_unmap(pte);
pte = NULL;
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (!folio)
continue;
@@ -781,7 +796,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
lru_add_drain();
skip:
/* The folio was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+ folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c35bb8593f50..849be32377d9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1573,7 +1573,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* CPU1 CPU2
* do_swap_page()
* ... swapoff+swapon
- * __read_swap_cache_async()
+ * swap_cache_alloc_folio()
* swapcache_prepare()
* __swap_duplicate()
* // check swap_map
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..a7a2443912f4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
return -EEXIST;
mpol = get_task_policy(current);
- folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+ folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+ NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
put_swap_device(si);
if (!folio)
return -ENOMEM;
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
2025-10-29 15:58 ` [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
@ 2025-10-30 22:53 ` Yosry Ahmed
[not found] ` <CAGsJ_4x1P0ypm70De7qDcDxqvY93GEPW6X2sBS_xfSUem5_S2w@mail.gmail.com>
0 siblings, 1 reply; 50+ messages in thread
From: Yosry Ahmed @ 2025-10-30 22:53 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:27PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> __read_swap_cache_async is widely used to allocate and ensure a folio is
> in swapcache, or get the folio if a folio is already there.
>
> It's not async, and it's not doing any read. Rename it to better present
> its usage, and prepare to be reworked as part of new swap cache APIs.
>
> Also, add some comments for the function. Worth noting that the
> skip_if_exists argument is an long existing workaround that will be
> dropped soon.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap.h | 6 +++---
> mm/swap_state.c | 49 ++++++++++++++++++++++++++++++++-----------------
> mm/swapfile.c | 2 +-
> mm/zswap.c | 4 ++--
> 4 files changed, 38 insertions(+), 23 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index d034c13d8dd2..0fff92e42cfe 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -249,6 +249,9 @@ struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
> void swap_cache_del_folio(struct folio *folio);
> +struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> + struct mempolicy *mpol, pgoff_t ilx,
> + bool *alloced, bool skip_if_exists);
> /* Below helpers require the caller to lock and pass in the swap cluster. */
> void __swap_cache_del_folio(struct swap_cluster_info *ci,
> struct folio *folio, swp_entry_t entry, void *shadow);
> @@ -261,9 +264,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> struct swap_iocb **plug);
> -struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
> - struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
> - bool skip_if_exists);
> struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> struct mempolicy *mpol, pgoff_t ilx);
> struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index b13e9c4baa90..7765b9474632 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -402,9 +402,28 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> }
> }
>
> -struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> - struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
> - bool skip_if_exists)
> +/**
> + * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
> + * @entry: the swapped out swap entry to be binded to the folio.
> + * @gfp_mask: memory allocation flags
> + * @mpol: NUMA memory allocation policy to be applied
> + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
> + * @new_page_allocated: sets true if allocation happened, false otherwise
> + * @skip_if_exists: if the slot is a partially cached state, return NULL.
> + * This is a workaround that would be removed shortly.
> + *
> + * Allocate a folio in the swap cache for one swap slot, typically before
> + * doing IO (swap in or swap out). The swap slot indicated by @entry must
> + * have a non-zero swap count (swapped out). Currently only supports order 0.
Is it used for swap in? That's confusing because the next sentence
mentions that it needs to be already swapped out.
I suspect you're referring to the zswap writeback use case, but in this
case we're still "swapping-in" the folio from zswap to swap it out to
disk. I'd avoid mentioning swap in here because it's confusing.
Otherwise LGTM:
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> + *
> + * Context: Caller must protect the swap device with reference count or locks.
> + * Return: Returns the existing folio if @entry is cached already. Returns
> + * NULL if failed due to -ENOMEM or @entry have a swap count < 1.
> + */
> +struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
> + struct mempolicy *mpol, pgoff_t ilx,
> + bool *new_page_allocated,
> + bool skip_if_exists)
> {
> struct swap_info_struct *si = __swap_entry_to_info(entry);
> struct folio *folio;
> @@ -452,12 +471,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> goto put_and_return;
>
> /*
> - * Protect against a recursive call to __read_swap_cache_async()
> + * Protect against a recursive call to swap_cache_alloc_folio()
> * on the same entry waiting forever here because SWAP_HAS_CACHE
> * is set but the folio is not the swap cache yet. This can
> * happen today if mem_cgroup_swapin_charge_folio() below
> * triggers reclaim through zswap, which may call
> - * __read_swap_cache_async() in the writeback path.
> + * swap_cache_alloc_folio() in the writeback path.
> */
> if (skip_if_exists)
> goto put_and_return;
> @@ -466,7 +485,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * We might race against __swap_cache_del_folio(), and
> * stumble across a swap_map entry whose SWAP_HAS_CACHE
> * has not yet been cleared. Or race against another
> - * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> + * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
> * in swap_map, but not yet added its folio to swap cache.
> */
> schedule_timeout_uninterruptible(1);
> @@ -509,10 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * and reading the disk if it is not already cached.
> * A failure return means that either the page allocation failed or that
> * the swap entry is no longer in use.
> - *
> - * get/put_swap_device() aren't needed to call this function, because
> - * __read_swap_cache_async() call them and swap_read_folio() holds the
> - * swap cache folio lock.
> */
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> @@ -529,7 +544,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> return NULL;
>
> mpol = get_vma_policy(vma, addr, 0, &ilx);
> - folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> + folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
> &page_allocated, false);
> mpol_cond_put(mpol);
>
> @@ -647,9 +662,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> blk_start_plug(&plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> - folio = __read_swap_cache_async(
> - swp_entry(swp_type(entry), offset),
> - gfp_mask, mpol, ilx, &page_allocated, false);
> + folio = swap_cache_alloc_folio(
> + swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
> + &page_allocated, false);
> if (!folio)
> continue;
> if (page_allocated) {
> @@ -666,7 +681,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> lru_add_drain(); /* Push any new pages onto the LRU now */
> skip:
> /* The page was likely read above, so no need for plugging here */
> - folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> + folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
> &page_allocated, false);
> if (unlikely(page_allocated))
> swap_read_folio(folio, NULL);
> @@ -761,7 +776,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> continue;
> pte_unmap(pte);
> pte = NULL;
> - folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> + folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
> &page_allocated, false);
> if (!folio)
> continue;
> @@ -781,7 +796,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> lru_add_drain();
> skip:
> /* The folio was likely read above, so no need for plugging here */
> - folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
> + folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
> &page_allocated, false);
> if (unlikely(page_allocated))
> swap_read_folio(folio, NULL);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c35bb8593f50..849be32377d9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1573,7 +1573,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
> * CPU1 CPU2
> * do_swap_page()
> * ... swapoff+swapon
> - * __read_swap_cache_async()
> + * swap_cache_alloc_folio()
> * swapcache_prepare()
> * __swap_duplicate()
> * // check swap_map
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 5d0f8b13a958..a7a2443912f4 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> return -EEXIST;
>
> mpol = get_task_policy(current);
> - folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> - NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> + folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
> + NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> put_swap_device(si);
> if (!folio)
> return -ENOMEM;
>
> --
> 2.51.1
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 02/19] mm, swap: split swap cache preparation loop into a standalone helper
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
2025-10-29 15:58 ` [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
` (18 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
To prepare for the removal of swap-cache-bypassing swapin, introduce a new
helper that accepts a freshly allocated and charged folio, prepares the
folio and the swap map, and then adds the folio to the swap cache.
This doesn't change how the swap cache works yet; we still depend on
SWAP_HAS_CACHE in the swap map for synchronization. But all the
synchronization hacks are now contained in this single helper.
No feature change.
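For reference, a condensed sketch of what swap_cache_alloc_folio() becomes
after this split, taken directly from the mm/swap_state.c hunk below; all
the SWAP_HAS_CACHE pinning and the retry loop now live in the new helper:
	/* Allocate a new folio to be added into the swap cache. */
	folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
	if (!folio)
		return NULL;
	/* Try to add it; returns an existing folio or NULL on failure. */
	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
					      false, skip_if_exists);
	if (result == folio)
		*new_page_allocated = true;
	else
		folio_put(folio);
	return result;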
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap_state.c | 197 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 109 insertions(+), 88 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7765b9474632..d18ca765c04f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,6 +402,97 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
}
}
+/**
+ * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cache.
+ * @entry: swap entry to be bound to the folio.
+ * @folio: folio to be added.
+ * @gfp: memory allocation flags for charge, can be 0 if @charged if true.
+ * @charged: if the folio is already charged.
+ * @skip_if_exists: if the slot is in a cached state, return NULL.
+ * This is an old workaround that will be removed shortly.
+ *
+ * Update the swap_map and add folio as swap cache, typically before swapin.
+ * All swap slots covered by the folio must have a non-zero swap count.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the folio being added on success. Returns the existing
+ * folio if @entry is cached. Returns NULL if raced with swapin or swapoff.
+ */
+static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
+ struct folio *folio,
+ gfp_t gfp, bool charged,
+ bool skip_if_exists)
+{
+ struct folio *swapcache;
+ void *shadow;
+ int ret;
+
+ /*
+ * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
+ * into the swap cache. Loop with a schedule delay if raced with
+ * another process setting SWAP_HAS_CACHE. This hackish loop will
+ * be fixed very soon.
+ */
+ for (;;) {
+ ret = swapcache_prepare(entry, folio_nr_pages(folio));
+ if (!ret)
+ break;
+
+ /*
+ * The skip_if_exists is for protecting against a recursive
+ * call to this helper on the same entry waiting forever
+ * here because SWAP_HAS_CACHE is set but the folio is not
+ * in the swap cache yet. This can happen today if
+ * mem_cgroup_swapin_charge_folio() below triggers reclaim
+ * through zswap, which may call this helper again in the
+ * writeback path.
+ *
+ * Large order allocation also needs special handling on
+ * race: if a smaller folio exists in cache, swapin needs
+ * to fallback to order 0, and doing a swap cache lookup
+ * might return a folio that is irrelevant to the faulting
+ * entry because @entry is aligned down. Just return NULL.
+ */
+ if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+ return NULL;
+
+ /*
+ * Check the swap cache again, we can only arrive
+ * here because swapcache_prepare returns -EEXIST.
+ */
+ swapcache = swap_cache_get_folio(entry);
+ if (swapcache)
+ return swapcache;
+
+ /*
+ * We might race against __swap_cache_del_folio(), and
+ * stumble across a swap_map entry whose SWAP_HAS_CACHE
+ * has not yet been cleared. Or race against another
+ * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
+ * in swap_map, but not yet added its folio to swap cache.
+ */
+ schedule_timeout_uninterruptible(1);
+ }
+
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
+
+ if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
+ put_swap_folio(folio, entry);
+ folio_unlock(folio);
+ return NULL;
+ }
+
+ swap_cache_add_folio(folio, entry, &shadow);
+ memcg1_swapin(entry, folio_nr_pages(folio));
+ if (shadow)
+ workingset_refault(folio, shadow);
+
+ /* Caller will initiate read into locked folio */
+ folio_add_lru(folio);
+ return folio;
+}
+
/**
* swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
* @entry: the swapped out swap entry to be binded to the folio.
@@ -427,99 +518,29 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
- struct folio *new_folio = NULL;
struct folio *result = NULL;
- void *shadow = NULL;
*new_page_allocated = false;
- for (;;) {
- int err;
-
- /*
- * Check the swap cache first, if a cached folio is found,
- * return it unlocked. The caller will lock and check it.
- */
- folio = swap_cache_get_folio(entry);
- if (folio)
- goto got_folio;
-
- /*
- * Just skip read ahead for unused swap slot.
- */
- if (!swap_entry_swapped(si, entry))
- goto put_and_return;
-
- /*
- * Get a new folio to read into from swap. Allocate it now if
- * new_folio not exist, before marking swap_map SWAP_HAS_CACHE,
- * when -EEXIST will cause any racers to loop around until we
- * add it to cache.
- */
- if (!new_folio) {
- new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
- if (!new_folio)
- goto put_and_return;
- }
-
- /*
- * Swap entry may have been freed since our caller observed it.
- */
- err = swapcache_prepare(entry, 1);
- if (!err)
- break;
- else if (err != -EEXIST)
- goto put_and_return;
-
- /*
- * Protect against a recursive call to swap_cache_alloc_folio()
- * on the same entry waiting forever here because SWAP_HAS_CACHE
- * is set but the folio is not the swap cache yet. This can
- * happen today if mem_cgroup_swapin_charge_folio() below
- * triggers reclaim through zswap, which may call
- * swap_cache_alloc_folio() in the writeback path.
- */
- if (skip_if_exists)
- goto put_and_return;
+ /* Check the swap cache again for readahead path. */
+ folio = swap_cache_get_folio(entry);
+ if (folio)
+ return folio;
- /*
- * We might race against __swap_cache_del_folio(), and
- * stumble across a swap_map entry whose SWAP_HAS_CACHE
- * has not yet been cleared. Or race against another
- * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
- * in swap_map, but not yet added its folio to swap cache.
- */
- schedule_timeout_uninterruptible(1);
- }
-
- /*
- * The swap entry is ours to swap in. Prepare the new folio.
- */
- __folio_set_locked(new_folio);
- __folio_set_swapbacked(new_folio);
-
- if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
- goto fail_unlock;
-
- swap_cache_add_folio(new_folio, entry, &shadow);
- memcg1_swapin(entry, 1);
+ /* Skip allocation for unused swap slot for readahead path. */
+ if (!swap_entry_swapped(si, entry))
+ return NULL;
- if (shadow)
- workingset_refault(new_folio, shadow);
-
- /* Caller will initiate read into locked new_folio */
- folio_add_lru(new_folio);
- *new_page_allocated = true;
- folio = new_folio;
-got_folio:
- result = folio;
- goto put_and_return;
-
-fail_unlock:
- put_swap_folio(new_folio, entry);
- folio_unlock(new_folio);
-put_and_return:
- if (!(*new_page_allocated) && new_folio)
- folio_put(new_folio);
+ /* Allocate a new folio to be added into the swap cache. */
+ folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+ if (!folio)
+ return NULL;
+ /* Try add the new folio, returns existing folio or NULL on failure. */
+ result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
+ false, skip_if_exists);
+ if (result == folio)
+ *new_page_allocated = true;
+ else
+ folio_put(folio);
return result;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
2025-10-29 15:58 ` [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
2025-10-29 15:58 ` [PATCH 02/19] mm, swap: split swap cache preparation loop into a standalone helper Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-04 3:47 ` Barry Song
2025-10-29 15:58 ` [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
` (17 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the overhead of the swap cache is trivial, bypassing the swap
cache is no longer a valid optimization, so unify the swapin path to use
the swap cache. This changes the swap-in behavior in multiple ways:
We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
the indicator to bypass both the swap cache and readahead. The swap
count check is not a good indicator for readahead. It existed because
the previous swap design strictly coupled readahead with swap cache
bypassing. We actually want to always bypass readahead for
SWP_SYNCHRONOUS_IO devices even when the swap count is > 1, but
bypassing the swap cache would cause redundant IO.
Now that limitation is gone: with the newly introduced helpers and design,
we always use the swap cache, so this check can be simplified to check
SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
SWP_SYNCHRONOUS_IO cases. This is a huge win for many workloads.
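In do_swap_page(), the swapin decision then collapses to roughly the
following (condensed from the mm/memory.c hunk below, no new names
introduced):
	if (!folio) {
		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
			/* No readahead: allocate (possibly large) and swap in via the cache. */
			folio = alloc_swap_folio(vmf);
			if (folio) {
				swapcache = swapin_folio(entry, folio);
				if (swapcache != folio)
					folio_put(folio);
				folio = swapcache;
			}
		} else {
			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
		}
	}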
The second change is that this enables large folio swap-in for all swap
entries on SWP_SYNCHRONOUS_IO devices. Previously, large swap-in was also
coupled with swap cache bypassing, so the count-check side effect also
made large swap-in less effective. Now this is fixed as well: large
swap-in is always supported for all SWP_SYNCHRONOUS_IO cases.
To catch potential issues with large swap-in, especially around page
exclusiveness and the swap cache, more debug sanity checks and comments
are added. But overall, the code is simpler. The new helper and routines
will be used by other components in later commits too, and it's now
possible to rely on the swap cache layer for resolving synchronization
issues, which will also be done in a later commit.
Worth mentioning that for a large folio workload, this may cause more
serious thrashing. That isn't a problem with this commit, but a generic
large folio issue. For a 4K workload, this commit improves performance.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 136 +++++++++++++++++++++-----------------------------------
mm/swap.h | 6 +++
mm/swap_state.c | 27 +++++++++++
3 files changed, 84 insertions(+), 85 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 4c3a7e09a159..9a43d4811781 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+/* Sanity check that a folio is fully exclusive */
+static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
+ unsigned int nr_pages)
+{
+ do {
+ VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
+ entry.val++;
+ } while (--nr_pages);
+}
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
@@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct folio *swapcache, *folio = NULL;
- DECLARE_WAITQUEUE(wait, current);
+ struct folio *swapcache = NULL, *folio;
struct page *page;
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
- bool need_clear_cache = false;
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
vm_fault_t ret = 0;
- void *shadow = NULL;
int nr_pages;
unsigned long page_idx;
unsigned long address;
@@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio = swap_cache_get_folio(entry);
if (folio)
swap_update_readahead(folio, vma, vmf->address);
- swapcache = folio;
-
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
- __swap_count(entry) == 1) {
- /* skip swapcache */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
folio = alloc_swap_folio(vmf);
if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
- nr_pages = folio_nr_pages(folio);
- if (folio_test_large(folio))
- entry.val = ALIGN_DOWN(entry.val, nr_pages);
/*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread
- * may finish swapin first, free the entry, and
- * swapout reusing the same entry. It's
- * undetectable as pte_same() returns true due
- * to entry reuse.
+ * folio is charged, so swapin can only fail due
+ * to raced swapin and return NULL.
*/
- if (swapcache_prepare(entry, nr_pages)) {
- /*
- * Relax a bit to prevent rapid
- * repeated page faults.
- */
- add_wait_queue(&swapcache_wq, &wait);
- schedule_timeout_uninterruptible(1);
- remove_wait_queue(&swapcache_wq, &wait);
- goto out_page;
- }
- need_clear_cache = true;
-
- memcg1_swapin(entry, nr_pages);
-
- shadow = swap_cache_get_shadow(entry);
- if (shadow)
- workingset_refault(folio, shadow);
-
- folio_add_lru(folio);
-
- /* To provide entry to swap_read_folio() */
- folio->swap = entry;
- swap_read_folio(folio, NULL);
- folio->private = NULL;
+ swapcache = swapin_folio(entry, folio);
+ if (swapcache != folio)
+ folio_put(folio);
+ folio = swapcache;
}
} else {
- folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf);
- swapcache = folio;
+ folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
}
if (!folio) {
@@ -4779,6 +4748,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
}
+ swapcache = folio;
ret |= folio_lock_or_retry(folio, vmf);
if (ret & VM_FAULT_RETRY)
goto out_release;
@@ -4848,24 +4818,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}
- /* allocated large folios for SWP_SYNCHRONOUS_IO */
- if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
- unsigned long nr = folio_nr_pages(folio);
- unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
- unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
- pte_t *folio_ptep = vmf->pte - idx;
- pte_t folio_pte = ptep_get(folio_ptep);
-
- if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
- swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
- goto out_nomap;
-
- page_idx = idx;
- address = folio_start;
- ptep = folio_ptep;
- goto check_folio;
- }
-
nr_pages = 1;
page_idx = 0;
address = vmf->address;
@@ -4909,12 +4861,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
+ /*
+ * If a large folio already belongs to anon mapping, then we
+ * can just go on and map it partially.
+ * If not, with the large swapin check above failing, the page table
+ * have changed, so sub pages might got charged to the wrong cgroup,
+ * or even should be shmem. So we have to free it and fallback.
+ * Nothing should have touched it, both anon and shmem checks if a
+ * large folio is fully appliable before use.
+ *
+ * This will be removed once we unify folio allocation in the swap cache
+ * layer, where allocation of a folio stabilizes the swap entries.
+ */
+ if (!folio_test_anon(folio) && folio_test_large(folio) &&
+ nr_pages != folio_nr_pages(folio)) {
+ if (!WARN_ON_ONCE(folio_test_dirty(folio)))
+ swap_cache_del_folio(folio);
+ goto out_nomap;
+ }
+
/*
* Check under PT lock (to protect against concurrent fork() sharing
* the swap entry concurrently) for certainly exclusive pages.
*/
if (!folio_test_ksm(folio)) {
+ /*
+ * The can_swapin_thp check above ensures all PTE have
+ * same exclusivenss, only check one PTE is fine.
+ */
exclusive = pte_swp_exclusive(vmf->orig_pte);
+ if (exclusive)
+ check_swap_exclusive(folio, entry, nr_pages);
if (folio != swapcache) {
/*
* We have a fresh page that is not exposed to the
@@ -4992,18 +4969,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->orig_pte = pte_advance_pfn(pte, page_idx);
/* ksm created a completely new copy */
- if (unlikely(folio != swapcache && swapcache)) {
+ if (unlikely(folio != swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
/*
- * We currently only expect small !anon folios which are either
- * fully exclusive or fully shared, or new allocated large
- * folios which are fully exclusive. If we ever get large
- * folios within swapcache here, we have to be careful.
+ * We currently only expect !anon folios that are fully
+ * mappable. See the comment after can_swapin_thp above.
*/
- VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
- VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -5043,12 +5018,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
- /* Clear the swap cache pin for direct swapin after PTL unlock */
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
@@ -5056,6 +5025,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
+ if (folio_test_swapcache(folio))
+ folio_free_swap(folio);
folio_unlock(folio);
out_release:
folio_put(folio);
@@ -5063,11 +5034,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(swapcache);
folio_put(swapcache);
}
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
diff --git a/mm/swap.h b/mm/swap.h
index 0fff92e42cfe..214e7d041030 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -268,6 +268,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
+struct folio *swapin_folio(swp_entry_t entry, struct folio *folio);
void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr);
@@ -386,6 +387,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}
+static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+{
+ return NULL;
+}
+
static inline void swap_update_readahead(struct folio *folio,
struct vm_area_struct *vma, unsigned long addr)
{
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d18ca765c04f..b3737c60aad9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -544,6 +544,33 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
return result;
}
+/**
+ * swapin_folio - swap-in one or multiple entries skipping readahead.
+ * @entry: starting swap entry to swap in
+ * @folio: a new allocated and charged folio
+ *
+ * Reads @entry into @folio, @folio will be added to the swap cache.
+ * If @folio is a large folio, the @entry will be rounded down to align
+ * with the folio size.
+ *
+ * Return: returns pointer to @folio on success. If folio is a large folio
+ * and this raced with another swapin, NULL will be returned. Else, if
+ * another folio was already added to the swap cache, return that swap
+ * cache folio instead.
+ */
+struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+{
+ struct folio *swapcache;
+ pgoff_t offset = swp_offset(entry);
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
+ swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
+ if (swapcache == folio)
+ swap_read_folio(folio, NULL);
+ return swapcache;
+}
+
/*
* Locate a page of swap in physical memory, reserving swap cache space
* and reading the disk if it is not already cached.
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
2025-10-29 15:58 ` [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-11-04 3:47 ` Barry Song
2025-11-04 10:44 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 3:47 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now the overhead of the swap cache is trivial, bypassing the swap
> cache is no longer a valid optimization. So unify the swapin path using
> the swap cache. This changes the swap in behavior in multiple ways:
>
> We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
> the indicator to bypass both the swap cache and readahead. The swap
> count check is not a good indicator for readahead. It existed because
> the previously swap design made readahead strictly coupled with swap
> cache bypassing. We actually want to always bypass readahead for
> SWP_SYNCHRONOUS_IO devices even if swap count > 1, But bypassing the
> swap cache will cause redundant IO.
I suppose it's not only redundant I/O; it also causes additional memory
copies, since each swap-in allocates a new folio. Using the swapcache
allows the folio to be shared instead?
>
> Now that limitation is gone, with the new introduced helpers and design,
> we will always swap cache, so this check can be simplified to check
> SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
> SWP_SYNCHRONOUS_IO cases, this is a huge win for many workloads.
>
> The second thing here is that this enabled a large swap for all swap
> entries on SWP_SYNCHRONOUS_IO devices. Previously, the large swap in is
> also coupled with swap cache bypassing, and so the count checking side
> effect also makes large swap in less effective. Now this is also fixed.
> We will always have a large swap in support for all SWP_SYNCHRONOUS_IO
> cases.
>
In your cover letter, you mentioned: “it’s especially better for workloads
with swap count > 1 on SYNC_IO devices, about ~20% gain in the above test.”
Is this improvement mainly from mTHP swap-in?
> And to catch potential issues with large swap in, especially with page
> exclusiveness and swap cache, more debug sanity checks and comments are
> added. But overall, the code is simpler. And new helper and routines
> will be used by other components in later commits too. And now it's
> possible to rely on the swap cache layer for resolving synchronization
> issues, which will also be done by a later commit.
>
> Worth mentioning that for a large folio workload, this may cause more
> serious thrashing. This isn't a problem with this commit, but a generic
> large folio issue. For a 4K workload, this commit increases the
> performance.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 136 +++++++++++++++++++++-----------------------------------
> mm/swap.h | 6 +++
> mm/swap_state.c | 27 +++++++++++
> 3 files changed, 84 insertions(+), 85 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..9a43d4811781 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> +/* Sanity check that a folio is fully exclusive */
> +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> + unsigned int nr_pages)
> +{
> + do {
> + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
> + entry.val++;
> + } while (--nr_pages);
> +}
>
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> @@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> - struct folio *swapcache, *folio = NULL;
> - DECLARE_WAITQUEUE(wait, current);
> + struct folio *swapcache = NULL, *folio;
> struct page *page;
> struct swap_info_struct *si = NULL;
> rmap_t rmap_flags = RMAP_NONE;
> - bool need_clear_cache = false;
> bool exclusive = false;
> swp_entry_t entry;
> pte_t pte;
> vm_fault_t ret = 0;
> - void *shadow = NULL;
> int nr_pages;
> unsigned long page_idx;
> unsigned long address;
> @@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio = swap_cache_get_folio(entry);
> if (folio)
> swap_update_readahead(folio, vma, vmf->address);
> - swapcache = folio;
> -
I wonder if we should move swap_update_readahead() elsewhere. Since for
sync IO you’ve completely dropped readahead, why do we still need to call
update_readahead()?
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread* Re: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
2025-11-04 3:47 ` Barry Song
@ 2025-11-04 10:44 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-11-04 10:44 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 11:47 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Now the overhead of the swap cache is trivial, bypassing the swap
> > cache is no longer a valid optimization. So unify the swapin path using
> > the swap cache. This changes the swap in behavior in multiple ways:
> >
> > We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
> > the indicator to bypass both the swap cache and readahead. The swap
> > count check is not a good indicator for readahead. It existed because
> > the previously swap design made readahead strictly coupled with swap
> > cache bypassing. We actually want to always bypass readahead for
> > SWP_SYNCHRONOUS_IO devices even if swap count > 1, But bypassing the
> > swap cache will cause redundant IO.
>
> I suppose it’s not only redundant I/O, but also causes additional memory
> copies, as each swap-in allocates a new folio. Using swapcache allows the
> folio to be shared instead?
Thanks for the review!
Right, one thing I forgot to mention is that after this change, workloads
involving mTHP swapin are less likely to OOM; that's related.
>
> >
> > Now that limitation is gone, with the new introduced helpers and design,
> > we will always swap cache, so this check can be simplified to check
> > SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
> > SWP_SYNCHRONOUS_IO cases, this is a huge win for many workloads.
> >
> > The second thing here is that this enabled a large swap for all swap
> > entries on SWP_SYNCHRONOUS_IO devices. Previously, the large swap in is
> > also coupled with swap cache bypassing, and so the count checking side
> > effect also makes large swap in less effective. Now this is also fixed.
> > We will always have a large swap in support for all SWP_SYNCHRONOUS_IO
> > cases.
> >
>
> In your cover letter, you mentioned: “it’s especially better for workloads
> with swap count > 1 on SYNC_IO devices, about ~20% gain in the above test.”
> Is this improvement mainly from mTHP swap-in?
Mainly from bypassing readahead I think. mTHP swap-in might also help though.
> > And to catch potential issues with large swap in, especially with page
> > exclusiveness and swap cache, more debug sanity checks and comments are
> > added. But overall, the code is simpler. And new helper and routines
> > will be used by other components in later commits too. And now it's
> > possible to rely on the swap cache layer for resolving synchronization
> > issues, which will also be done by a later commit.
> >
> > Worth mentioning that for a large folio workload, this may cause more
> > serious thrashing. This isn't a problem with this commit, but a generic
> > large folio issue. For a 4K workload, this commit increases the
> > performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 136 +++++++++++++++++++++-----------------------------------
> > mm/swap.h | 6 +++
> > mm/swap_state.c | 27 +++++++++++
> > 3 files changed, 84 insertions(+), 85 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4c3a7e09a159..9a43d4811781 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > }
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > +/* Sanity check that a folio is fully exclusive */
> > +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> > + unsigned int nr_pages)
> > +{
> > + do {
> > + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
> > + entry.val++;
> > + } while (--nr_pages);
> > +}
> >
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > @@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > vm_fault_t do_swap_page(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > - struct folio *swapcache, *folio = NULL;
> > - DECLARE_WAITQUEUE(wait, current);
> > + struct folio *swapcache = NULL, *folio;
> > struct page *page;
> > struct swap_info_struct *si = NULL;
> > rmap_t rmap_flags = RMAP_NONE;
> > - bool need_clear_cache = false;
> > bool exclusive = false;
> > swp_entry_t entry;
> > pte_t pte;
> > vm_fault_t ret = 0;
> > - void *shadow = NULL;
> > int nr_pages;
> > unsigned long page_idx;
> > unsigned long address;
> > @@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio = swap_cache_get_folio(entry);
> > if (folio)
> > swap_update_readahead(folio, vma, vmf->address);
> > - swapcache = folio;
> > -
>
> I wonder if we should move swap_update_readahead() elsewhere. Since for
> sync IO you’ve completely dropped readahead, why do we still need to call
> update_readahead()?
That's a very good suggestion; the overhead will be smaller too.
I'm not sure whether the code will get messy if we move this right now; let
me try, or maybe this optimization can be done later.
I do plan to defer the swap cache lookup inside swapin_readahead /
swapin_folio. We can do that now because swapin_folio requires the
caller to alloc a folio for THP swapin, so doing swap cache lookup
early helps to reduce memory overhead.
Once we unify swapin folio allocation for shmem / anon and always do
folio allocation with swap_cache_alloc_folio, everything will be
arranged in a nice way I think.
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (2 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-04 4:19 ` Barry Song
2025-10-29 15:58 ` [PATCH 05/19] mm, swap: simplify the code and reduce indention Kairui Song
` (16 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now SWP_SYNCHRONOUS_IO devices also use the swap cache. One side
effect is that a folio may stay in the swap cache for a longer time due to
lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
are being swapped out again very frequently right after swapin, hence
improving performance. But the long pinning of swap slots also increases
the fragmentation rate of the swap device significantly, and currently
all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
causes the backing memory to be pinned, increasing memory pressure.
So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
after swapin finishes. By then the swap cache has served its role as a
synchronization layer preventing parallel swapins from wasting
CPU or memory allocations, and the redundant IO is not a major concern
for SWP_SYNCHRONOUS_IO devices.
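Concretely, this is just an early return added to should_try_to_free_swap(),
as in the mm/memory.c hunk below:
	if (!folio_test_swapcache(folio))
		return false;
	/* Always drop the swap cache right after swapin for SWP_SYNCHRONOUS_IO. */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
		return true;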
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 9a43d4811781..78457347ae60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
-static inline bool should_try_to_free_swap(struct folio *folio,
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+ struct folio *folio,
struct vm_area_struct *vma,
unsigned int fault_flags)
{
if (!folio_test_swapcache(folio))
return false;
+ /*
+ * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
+ * Redundant IO is unlikely to be an issue for them, but a
+ * slot being pinned by swap cache may cause more fragmentation
+ * and delayed freeing of swap metadata.
+ */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+ return true;
if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
folio_test_mlocked(folio))
return true;
@@ -4935,7 +4944,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* yet.
*/
swap_free_nr(entry, nr_pages);
- if (should_try_to_free_swap(folio, vma, vmf->flags))
+ if (should_try_to_free_swap(si, folio, vma, vmf->flags))
folio_free_swap(folio);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-10-29 15:58 ` [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
@ 2025-11-04 4:19 ` Barry Song
2025-11-04 8:26 ` Barry Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 4:19 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> effect is that a folio may stay in swap cache for a longer time due to
> lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> are being swapped out very frequently right after swapin, hence improving
> the performance. But the long pinning of swap slots also increases the
> fragmentation rate of the swap device significantly, and currently,
> all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> causes the backing memory to be pinned, increasing the memory pressure.
>
> So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> after swapin finishes. Swap cache has served its role as a
> synchronization layer to prevent any parallel swapin from wasting
> CPU or memory allocation, and the redundant IO is not a major concern
> for SWP_SYNCHRONOUS_IO devices.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9a43d4811781..78457347ae60 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> return 0;
> }
>
> -static inline bool should_try_to_free_swap(struct folio *folio,
> +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> + struct folio *folio,
> struct vm_area_struct *vma,
> unsigned int fault_flags)
> {
> if (!folio_test_swapcache(folio))
> return false;
> + /*
> + * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> + * Redundant IO is unlikely to be an issue for them, but a
> + * slot being pinned by swap cache may cause more fragmentation
> + * and delayed freeing of swap metadata.
> + */
I don’t like the claim about “redundant I/O” — it sounds misleading. Those
I/Os are not redundant; they are simply saved by the swapcache, which
prevents some swap-out I/O when a recently swapped-in folio is swapped out
again.
So, could we make it a bit more specific in both the comment and the commit
message?
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-11-04 4:19 ` Barry Song
@ 2025-11-04 8:26 ` Barry Song
2025-11-04 10:55 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 8:26 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Tue, Nov 4, 2025 at 12:19 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> > effect is that a folio may stay in swap cache for a longer time due to
> > lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> > are being swapped out very frequently right after swapin, hence improving
> > the performance. But the long pinning of swap slots also increases the
> > fragmentation rate of the swap device significantly, and currently,
> > all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> > causes the backing memory to be pinned, increasing the memory pressure.
> >
> > So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> > after swapin finishes. Swap cache has served its role as a
> > synchronization layer to prevent any parallel swapin from wasting
> > CPU or memory allocation, and the redundant IO is not a major concern
> > for SWP_SYNCHRONOUS_IO devices.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 13 +++++++++++--
> > 1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9a43d4811781..78457347ae60 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > return 0;
> > }
> >
> > -static inline bool should_try_to_free_swap(struct folio *folio,
> > +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > + struct folio *folio,
> > struct vm_area_struct *vma,
> > unsigned int fault_flags)
> > {
> > if (!folio_test_swapcache(folio))
> > return false;
> > + /*
> > + * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> > + * Redundant IO is unlikely to be an issue for them, but a
> > + * slot being pinned by swap cache may cause more fragmentation
> > + * and delayed freeing of swap metadata.
> > + */
>
> I don’t like the claim about “redundant I/O” — it sounds misleading. Those
> I/Os are not redundant; they are simply saved by swapcache, which prevents
> some swap-out I/O when a recently swap-in folio is swapped out again.
>
> So, could we make it a bit more specific in both the comment and the commit
> message?
Sorry, on second thought—consider a case where process A mmaps 100 MB and writes
to it to populate memory, then forks process B. If that 100 MB gets swapped out,
and A and B later swap it in separately for reading, with this change it seems
they would each get their own 100 MB copy (total 2 × 100 MB), whereas previously
they could share the same 100 MB?
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-11-04 8:26 ` Barry Song
@ 2025-11-04 10:55 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-11-04 10:55 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 4:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Nov 4, 2025 at 12:19 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> > > effect is that a folio may stay in swap cache for a longer time due to
> > > lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> > > are being swapped out very frequently right after swapin, hence improving
> > > the performance. But the long pinning of swap slots also increases the
> > > fragmentation rate of the swap device significantly, and currently,
> > > all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> > > causes the backing memory to be pinned, increasing the memory pressure.
> > >
> > > So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> > > after swapin finishes. Swap cache has served its role as a
> > > synchronization layer to prevent any parallel swapin from wasting
> > > CPU or memory allocation, and the redundant IO is not a major concern
> > > for SWP_SYNCHRONOUS_IO devices.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/memory.c | 13 +++++++++++--
> > > 1 file changed, 11 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 9a43d4811781..78457347ae60 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > return 0;
> > > }
> > >
> > > -static inline bool should_try_to_free_swap(struct folio *folio,
> > > +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > + struct folio *folio,
> > > struct vm_area_struct *vma,
> > > unsigned int fault_flags)
> > > {
> > > if (!folio_test_swapcache(folio))
> > > return false;
> > > + /*
> > > + * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> > > + * Redundant IO is unlikely to be an issue for them, but a
> > > + * slot being pinned by swap cache may cause more fragmentation
> > > + * and delayed freeing of swap metadata.
> > > + */
> >
> > I don’t like the claim about “redundant I/O” — it sounds misleading. Those
> > I/Os are not redundant; they are simply saved by swapcache, which prevents
> > some swap-out I/O when a recently swap-in folio is swapped out again.
> >
> > So, could we make it a bit more specific in both the comment and the commit
> > message?
>
> Sorry, on second thought—consider a case where process A mmaps 100 MB and writes
> to it to populate memory, then forks process B. If that 100 MB gets swapped out,
> and A and B later swap it in separately for reading, with this change it seems
> they would each get their own 100 MB copy (total 2 × 100 MB), whereas previously
> they could share the same 100 MB?
It's a bit tricky here: folio_free_swap only frees the swap cache if a
folio's swap count is 0, so if A swaps these folios in first, the swap
cache won't be freed until B has also mapped these folios and dropped
the swap count.
And this function is called should_try_to_free_swap: it only tries to
free the swap cache, and the actual free happens only when the swap
count is 0. I think I can add some comments on that.
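For illustration, here is a minimal sketch of the rule being described
(a simplified, hypothetical helper, not the exact kernel code):

/*
 * Hypothetical condensation of the behaviour: the swap cache copy is
 * only dropped once no swap PTE still references the entry, i.e. the
 * swap count has already gone down to zero.
 */
static bool try_drop_swap_cache(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return false;
	/* B has not swapped in yet -> swap count > 0 -> keep the cache */
	if (folio_swapped(folio))
		return false;
	swap_cache_del_folio(folio);	/* nobody else can need this copy */
	return true;
}

So in the fork example, A and B keep sharing the same folio through the
swap cache for as long as either of them still holds a swap entry.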
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 05/19] mm, swap: simplify the code and reduce indention
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (3 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
` (15 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the swap cache is always used, multiple swap cache checks are no
longer useful. Remove them and reduce the code indention.
No behavior change.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 89 +++++++++++++++++++++++++++++--------------------------------
1 file changed, 43 insertions(+), 46 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 78457347ae60..6c5cd86c4a66 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4763,55 +4763,52 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_release;
page = folio_file_page(folio, swp_offset(entry));
- if (swapcache) {
- /*
- * Make sure folio_free_swap() or swapoff did not release the
- * swapcache from under us. The page pin, and pte_same test
- * below, are not enough to exclude that. Even if it is still
- * swapcache, we need to check that the page's swap has not
- * changed.
- */
- if (unlikely(!folio_matches_swap_entry(folio, entry)))
- goto out_page;
-
- if (unlikely(PageHWPoison(page))) {
- /*
- * hwpoisoned dirty swapcache pages are kept for killing
- * owner processes (which may be unknown at hwpoison time)
- */
- ret = VM_FAULT_HWPOISON;
- goto out_page;
- }
-
- /*
- * KSM sometimes has to copy on read faults, for example, if
- * folio->index of non-ksm folios would be nonlinear inside the
- * anon VMA -- the ksm flag is lost on actual swapout.
- */
- folio = ksm_might_need_to_copy(folio, vma, vmf->address);
- if (unlikely(!folio)) {
- ret = VM_FAULT_OOM;
- folio = swapcache;
- goto out_page;
- } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
- ret = VM_FAULT_HWPOISON;
- folio = swapcache;
- goto out_page;
- }
- if (folio != swapcache)
- page = folio_page(folio, 0);
+ /*
+ * Make sure folio_free_swap() or swapoff did not release the
+ * swapcache from under us. The page pin, and pte_same test
+ * below, are not enough to exclude that. Even if it is still
+ * swapcache, we need to check that the page's swap has not
+ * changed.
+ */
+ if (unlikely(!folio_matches_swap_entry(folio, entry)))
+ goto out_page;
+ if (unlikely(PageHWPoison(page))) {
/*
- * If we want to map a page that's in the swapcache writable, we
- * have to detect via the refcount if we're really the exclusive
- * owner. Try removing the extra reference from the local LRU
- * caches if required.
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
*/
- if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
- !folio_test_ksm(folio) && !folio_test_lru(folio))
- lru_add_drain();
+ ret = VM_FAULT_HWPOISON;
+ goto out_page;
}
+ /*
+ * KSM sometimes has to copy on read faults, for example, if
+ * folio->index of non-ksm folios would be nonlinear inside the
+ * anon VMA -- the ksm flag is lost on actual swapout.
+ */
+ folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+ if (unlikely(!folio)) {
+ ret = VM_FAULT_OOM;
+ folio = swapcache;
+ goto out_page;
+ } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
+ ret = VM_FAULT_HWPOISON;
+ folio = swapcache;
+ goto out_page;
+ } else if (folio != swapcache)
+ page = folio_page(folio, 0);
+
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * owner. Try removing the extra reference from the local LRU
+ * caches if required.
+ */
+ if ((vmf->flags & FAULT_FLAG_WRITE) &&
+ !folio_test_ksm(folio) && !folio_test_lru(folio))
+ lru_add_drain();
+
folio_throttle_swaprate(folio, GFP_KERNEL);
/*
@@ -5001,7 +4998,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte, pte, nr_pages);
folio_unlock(folio);
- if (folio != swapcache && swapcache) {
+ if (unlikely(folio != swapcache)) {
/*
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
@@ -5039,7 +5036,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(folio);
out_release:
folio_put(folio);
- if (folio != swapcache && swapcache) {
+ if (folio != swapcache) {
folio_unlock(swapcache);
folio_put(swapcache);
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (4 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 05/19] mm, swap: simplify the code and reduce indention Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-04 9:14 ` Barry Song
2025-10-29 15:58 ` [PATCH 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
` (14 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
To prevent repeated faults of parallel swapin of the same PTE, remove
the folio from the swap cache after the folio is mapped. So any user
faulting from the swap PTE should see the folio in the swap cache and
wait on it.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 6c5cd86c4a66..589d6fc3d424 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
static inline bool should_try_to_free_swap(struct swap_info_struct *si,
struct folio *folio,
struct vm_area_struct *vma,
+ unsigned int extra_refs,
unsigned int fault_flags)
{
if (!folio_test_swapcache(folio))
@@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
* reference only in case it's likely that we'll be the exclusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == (1 + folio_nr_pages(folio));
+ folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
}
static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
@@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
arch_swap_restore(folio_swap(entry, folio), folio);
- /*
- * Remove the swap entry and conditionally try to free up the swapcache.
- * We're already holding a reference on the page but haven't mapped it
- * yet.
- */
- swap_free_nr(entry, nr_pages);
- if (should_try_to_free_swap(si, folio, vma, vmf->flags))
- folio_free_swap(folio);
-
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);
@@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
arch_do_swap_page_nr(vma->vm_mm, vma, address,
pte, pte, nr_pages);
+ /*
+ * Remove the swap entry and conditionally try to free up the
+ * swapcache. Do it after mapping so any raced page fault will
+ * see the folio in swap cache and wait for us.
+ */
+ swap_free_nr(entry, nr_pages);
+ if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
+ folio_free_swap(folio);
+
folio_unlock(folio);
if (unlikely(folio != swapcache)) {
/*
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-10-29 15:58 ` [PATCH 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
@ 2025-11-04 9:14 ` Barry Song
2025-11-04 10:50 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 9:14 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> To prevent repeated faults of parallel swapin of the same PTE, remove
> the folio from the swap cache after the folio is mapped. So any user
> faulting from the swap PTE should see the folio in the swap cache and
> wait on it.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6c5cd86c4a66..589d6fc3d424 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> struct folio *folio,
> struct vm_area_struct *vma,
> + unsigned int extra_refs,
> unsigned int fault_flags)
> {
> if (!folio_test_swapcache(folio))
> @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> * reference only in case it's likely that we'll be the exclusive user.
> */
> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> - folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> + folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
> }
>
> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> arch_swap_restore(folio_swap(entry, folio), folio);
>
> - /*
> - * Remove the swap entry and conditionally try to free up the swapcache.
> - * We're already holding a reference on the page but haven't mapped it
> - * yet.
> - */
> - swap_free_nr(entry, nr_pages);
> - if (should_try_to_free_swap(si, folio, vma, vmf->flags))
> - folio_free_swap(folio);
> -
> add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> pte = mk_pte(page, vma->vm_page_prot);
> @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> arch_do_swap_page_nr(vma->vm_mm, vma, address,
> pte, pte, nr_pages);
>
> + /*
> + * Remove the swap entry and conditionally try to free up the
> + * swapcache. Do it after mapping so any raced page fault will
> + * see the folio in swap cache and wait for us.
This seems like the right optimization—it reduces the race window where
we might allocate a folio, perform the read, and then attempt to map it,
only to find after taking the PTL that the PTE has already changed.
Although I am not entirely sure that “any raced page fault will see the folio in
swapcache,” it seems there could still be cases where a fault occurs after
folio_free_swap(), and thus can’t see the swapcache entry.
T1:
swap in PF, allocate and add swapcache, map PTE, delete swapcache
T2:
swap in PF before PTE is changed;
...........................................................;
check swapcache after T1 deletes swapcache -> no swapcache found.
> + */
> + swap_free_nr(entry, nr_pages);
> + if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
> + folio_free_swap(folio);
> +
> folio_unlock(folio);
> if (unlikely(folio != swapcache)) {
> /*
>
> --
> 2.51.1
>
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-11-04 9:14 ` Barry Song
@ 2025-11-04 10:50 ` Kairui Song
2025-11-04 19:52 ` Barry Song
0 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-11-04 10:50 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > To prevent repeated faults of parallel swapin of the same PTE, remove
> > the folio from the swap cache after the folio is mapped. So any user
> > faulting from the swap PTE should see the folio in the swap cache and
> > wait on it.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 21 +++++++++++----------
> > 1 file changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6c5cd86c4a66..589d6fc3d424 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > struct folio *folio,
> > struct vm_area_struct *vma,
> > + unsigned int extra_refs,
> > unsigned int fault_flags)
> > {
> > if (!folio_test_swapcache(folio))
> > @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > * reference only in case it's likely that we'll be the exclusive user.
> > */
> > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > - folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > + folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
> > }
> >
> > static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> > @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > arch_swap_restore(folio_swap(entry, folio), folio);
> >
> > - /*
> > - * Remove the swap entry and conditionally try to free up the swapcache.
> > - * We're already holding a reference on the page but haven't mapped it
> > - * yet.
> > - */
> > - swap_free_nr(entry, nr_pages);
> > - if (should_try_to_free_swap(si, folio, vma, vmf->flags))
> > - folio_free_swap(folio);
> > -
> > add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > pte = mk_pte(page, vma->vm_page_prot);
> > @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > arch_do_swap_page_nr(vma->vm_mm, vma, address,
> > pte, pte, nr_pages);
> >
> > + /*
> > + * Remove the swap entry and conditionally try to free up the
> > + * swapcache. Do it after mapping so any raced page fault will
> > + * see the folio in swap cache and wait for us.
>
> This seems like the right optimization—it reduces the race window where we might
> allocate a folio, perform the read, and then attempt to map it, only
> to find after
> taking the PTL that the PTE has already changed.
>
> Although I am not entirely sure that “any raced page fault will see the folio in
> swapcache,” it seems there could still be cases where a fault occurs after
> folio_free_swap(), and thus can’t see the swapcache entry.
>
> T1:
> swap in PF, allocate and add swapcache, map PTE, delete swapcache
>
> T2:
> swap in PF before PTE is changed;
> ...........................................................;
> check swapcache after T1 deletes swapcache -> no swapcache found.
Right, that's true. But we will have at most one repeated fault,
and the time window is much smaller. T2 will see PTE != orig_pte and
then return just fine.
So this patch is only reducing the race time window for potentially
better performance, and this race is basically harmless anyway. I
think it's good enough.
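For reference, a rough sketch of why the repeated fault in T2 is
harmless (simplified from the fault path, error handling omitted):

/* T2, after it fails to find the folio in the swap cache */
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
			       vmf->address, &vmf->ptl);
if (vmf->pte && !pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
	/* T1 already installed a present PTE, nothing left to do */
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return 0;
}

So the worst case is one extra trip through do_swap_page() that bails
out at the pte_same() check.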
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-11-04 10:50 ` Kairui Song
@ 2025-11-04 19:52 ` Barry Song
0 siblings, 0 replies; 50+ messages in thread
From: Barry Song @ 2025-11-04 19:52 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 6:51 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Nov 4, 2025 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > To prevent repeated faults of parallel swapin of the same PTE, remove
> > > the folio from the swap cache after the folio is mapped. So any user
> > > faulting from the swap PTE should see the folio in the swap cache and
> > > wait on it.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/memory.c | 21 +++++++++++----------
> > > 1 file changed, 11 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6c5cd86c4a66..589d6fc3d424 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > struct folio *folio,
> > > struct vm_area_struct *vma,
> > > + unsigned int extra_refs,
> > > unsigned int fault_flags)
> > > {
> > > if (!folio_test_swapcache(folio))
> > > @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > * reference only in case it's likely that we'll be the exclusive user.
> > > */
> > > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > > - folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > + folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
> > > }
> > >
> > > static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> > > @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > */
> > > arch_swap_restore(folio_swap(entry, folio), folio);
> > >
> > > - /*
> > > - * Remove the swap entry and conditionally try to free up the swapcache.
> > > - * We're already holding a reference on the page but haven't mapped it
> > > - * yet.
> > > - */
> > > - swap_free_nr(entry, nr_pages);
> > > - if (should_try_to_free_swap(si, folio, vma, vmf->flags))
> > > - folio_free_swap(folio);
> > > -
> > > add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > > pte = mk_pte(page, vma->vm_page_prot);
> > > @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > arch_do_swap_page_nr(vma->vm_mm, vma, address,
> > > pte, pte, nr_pages);
> > >
> > > + /*
> > > + * Remove the swap entry and conditionally try to free up the
> > > + * swapcache. Do it after mapping so any raced page fault will
> > > + * see the folio in swap cache and wait for us.
> >
> > This seems like the right optimization—it reduces the race window where we might
> > allocate a folio, perform the read, and then attempt to map it, only
> > to find after
> > taking the PTL that the PTE has already changed.
> >
> > Although I am not entirely sure that “any raced page fault will see the folio in
> > swapcache,” it seems there could still be cases where a fault occurs after
> > folio_free_swap(), and thus can’t see the swapcache entry.
> >
> > T1:
> > swap in PF, allocate and add swapcache, map PTE, delete swapcache
> >
> > T2:
> > swap in PF before PTE is changed;
> > ...........................................................;
> > check swapcache after T1 deletes swapcache -> no swapcache found.
>
> Right, that's true. But we will at most only have one repeated fault,
> and the time window is much smaller. T2 will PTE != orig_pte and then
> return just fine.
>
> So this patch is only reducing the race time window for a potentially
> better performance, and this race is basically harmless anyway. I
> think it's good enough.
Right. What I really disagree with is "Do it after mapping so any raced
page fault will see the folio in swap cache and wait for us". It sounds
like it guarantees no race at all, so I’d rather we change it to
something like "reduced race window".
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (5 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
` (13 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the overhead of the swap cache is trivial to none, bypassing
the swap cache is no longer a valid optimization.
We have removed the cache bypass swapin for anon memory, now do the same
for shmem. Many helpers and functions can be dropped now.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 65 +++++++++++++++++------------------------------------------
mm/swap.h | 4 ----
mm/swapfile.c | 35 +++++++++-----------------------
3 files changed, 27 insertions(+), 77 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 6580f3cd24bb..759981435953 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2012,10 +2012,9 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
swp_entry_t entry, int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ struct folio *new, *swapcache;
int nr_pages = 1 << order;
- struct folio *new;
gfp_t alloc_gfp;
- void *shadow;
/*
* We have arrived here because our zones are constrained, so don't
@@ -2055,34 +2054,19 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
goto fallback;
}
- /*
- * Prevent parallel swapin from proceeding with the swap cache flag.
- *
- * Of course there is another possible concurrent scenario as well,
- * that is to say, the swap cache flag of a large folio has already
- * been set by swapcache_prepare(), while another thread may have
- * already split the large swap entry stored in the shmem mapping.
- * In this case, shmem_add_to_page_cache() will help identify the
- * concurrent swapin and return -EEXIST.
- */
- if (swapcache_prepare(entry, nr_pages)) {
+ swapcache = swapin_folio(entry, new);
+ if (swapcache != new) {
folio_put(new);
- new = ERR_PTR(-EEXIST);
- /* Try smaller folio to avoid cache conflict */
- goto fallback;
+ if (!swapcache) {
+ /*
+ * The new folio is charged already, swapin can
+ * only fail due to another raced swapin.
+ */
+ new = ERR_PTR(-EEXIST);
+ goto fallback;
+ }
}
-
- __folio_set_locked(new);
- __folio_set_swapbacked(new);
- new->swap = entry;
-
- memcg1_swapin(entry, nr_pages);
- shadow = swap_cache_get_shadow(entry);
- if (shadow)
- workingset_refault(new, shadow);
- folio_add_lru(new);
- swap_read_folio(new, NULL);
- return new;
+ return swapcache;
fallback:
/* Order 0 swapin failed, nothing to fallback to, abort */
if (!order)
@@ -2172,8 +2156,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
}
static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
- struct folio *folio, swp_entry_t swap,
- bool skip_swapcache)
+ struct folio *folio, swp_entry_t swap)
{
struct address_space *mapping = inode->i_mapping;
swp_entry_t swapin_error;
@@ -2189,8 +2172,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
- if (!skip_swapcache)
- swap_cache_del_folio(folio);
+ swap_cache_del_folio(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
* won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2289,7 +2271,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swp_entry_t swap, index_entry;
struct swap_info_struct *si;
struct folio *folio = NULL;
- bool skip_swapcache = false;
int error, nr_pages, order;
pgoff_t offset;
@@ -2332,7 +2313,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio = NULL;
goto failed;
}
- skip_swapcache = true;
} else {
/* Cached swapin only supports order 0 folio */
folio = shmem_swapin_cluster(swap, gfp, info, index);
@@ -2388,9 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
* and swap cache folios are never partially freed.
*/
folio_lock(folio);
- if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- shmem_confirm_swap(mapping, index, swap) < 0 ||
- folio->swap.val != swap.val) {
+ if (!folio_matches_swap_entry(folio, swap) ||
+ shmem_confirm_swap(mapping, index, swap) < 0) {
error = -EEXIST;
goto unlock;
}
@@ -2422,12 +2401,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
- if (skip_swapcache) {
- folio->swap.val = 0;
- swapcache_clear(si, swap, nr_pages);
- } else {
- swap_cache_del_folio(folio);
- }
+ swap_cache_del_folio(folio);
folio_mark_dirty(folio);
swap_free_nr(swap, nr_pages);
put_swap_device(si);
@@ -2438,14 +2412,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (shmem_confirm_swap(mapping, index, swap) < 0)
error = -EEXIST;
if (error == -EIO)
- shmem_set_folio_swapin_error(inode, index, folio, swap,
- skip_swapcache);
+ shmem_set_folio_swapin_error(inode, index, folio, swap);
unlock:
if (folio)
folio_unlock(folio);
failed_nolock:
- if (skip_swapcache)
- swapcache_clear(si, folio->swap, folio_nr_pages(folio));
if (folio)
folio_put(folio);
put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index 214e7d041030..e0f05babe13a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -403,10 +403,6 @@ static inline int swap_writeout(struct folio *folio,
return 0;
}
-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
-}
-
static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 849be32377d9..3898c3a2be62 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1613,22 +1613,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-static void swap_entries_put_cache(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- unsigned long offset = swp_offset(entry);
- struct swap_cluster_info *ci;
-
- ci = swap_cluster_lock(si, offset);
- if (swap_only_has_cache(si, offset, nr)) {
- swap_entries_free(si, ci, entry, nr);
- } else {
- for (int i = 0; i < nr; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
- }
- swap_cluster_unlock(ci);
-}
-
static bool swap_entries_put_map(struct swap_info_struct *si,
swp_entry_t entry, int nr)
{
@@ -1764,13 +1748,21 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset = swp_offset(entry);
int size = 1 << swap_entry_order(folio_order(folio));
si = _swap_info_get(entry);
if (!si)
return;
- swap_entries_put_cache(si, entry, size);
+ ci = swap_cluster_lock(si, offset);
+ if (swap_only_has_cache(si, offset, size))
+ swap_entries_free(si, ci, entry, size);
+ else
+ for (int i = 0; i < size; i++, entry.val++)
+ swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ swap_cluster_unlock(ci);
}
int __swap_count(swp_entry_t entry)
@@ -3778,15 +3770,6 @@ int swapcache_prepare(swp_entry_t entry, int nr)
return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
}
-/*
- * Caller should ensure entries belong to the same folio so
- * the entries won't span cross cluster boundary.
- */
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
- swap_entries_put_cache(si, entry, nr);
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (6 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Kairui Song
` (12 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Nhat Pham <nphamcs@gmail.com>
The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a
("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry
belongs to shmem during swapoff.
However, swapoff has since been rewritten in the commit b56a2d8af914
("mm: rid swapoff of quadratic complexity"). Now having swap count ==
SWAP_MAP_SHMEM value is basically the same as having swap count == 1,
and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only
difference of note is that swap_shmem_alloc() does not check for
-ENOMEM returned from __swap_duplicate(), but it is OK because shmem
never re-duplicates any swap entry it owns. This will still be safe if we
use (batched) swap_duplicate() instead.
This commit adds swap_duplicate_nr(), the batched variant of
swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the
associated swap_shmem_alloc() helper to simplify the state machine (both
mentally and in terms of actual code). We will also have an extra
state/special value that can be repurposed (for swap entries that never
get re-duplicated).
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 15 +++++++--------
mm/shmem.c | 2 +-
mm/swapfile.c | 42 +++++++++++++++++-------------------------
3 files changed, 25 insertions(+), 34 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38ca3df68716..bf72b548a96d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -230,7 +230,6 @@ enum {
/* Special value in first swap_map */
#define SWAP_MAP_MAX 0x3e /* Max count */
#define SWAP_MAP_BAD 0x3f /* Note page is bad */
-#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */
/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX 0x7f /* Max count */
@@ -458,8 +457,7 @@ bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t, int);
-extern int swap_duplicate(swp_entry_t);
+extern int swap_duplicate_nr(swp_entry_t entry, int nr);
extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -514,11 +512,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
-{
-}
-
-static inline int swap_duplicate(swp_entry_t swp)
+static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
{
return 0;
}
@@ -569,6 +563,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
}
#endif /* CONFIG_SWAP */
+static inline int swap_duplicate(swp_entry_t entry)
+{
+ return swap_duplicate_nr(entry, 1);
+}
+
static inline void free_swap_and_cache(swp_entry_t entry)
{
free_swap_and_cache_nr(entry, 1);
diff --git a/mm/shmem.c b/mm/shmem.c
index 759981435953..46d54a1288fd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1665,7 +1665,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
spin_unlock(&shmem_swaplist_lock);
}
- swap_shmem_alloc(folio->swap, nr_pages);
+ swap_duplicate_nr(folio->swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
BUG_ON(folio_mapped(folio));
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3898c3a2be62..55362bb2a781 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -201,7 +201,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
unsigned char *map_end = map + nr_pages;
unsigned char count = *map;
- if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
+ if (swap_count(count) != 1)
return false;
while (++map < map_end) {
@@ -1522,12 +1522,6 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
if (usage == SWAP_HAS_CACHE) {
VM_BUG_ON(!has_cache);
has_cache = 0;
- } else if (count == SWAP_MAP_SHMEM) {
- /*
- * Or we could insist on shmem.c using a special
- * swap_shmem_free() and free_shmem_swap_and_cache()...
- */
- count = 0;
} else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
if (count == COUNT_CONTINUED) {
if (swap_count_continued(si, offset, count))
@@ -1625,7 +1619,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
if (nr <= 1)
goto fallback;
count = swap_count(data_race(si->swap_map[offset]));
- if (count != 1 && count != SWAP_MAP_SHMEM)
+ if (count != 1)
goto fallback;
ci = swap_cluster_lock(si, offset);
@@ -1679,12 +1673,10 @@ static bool swap_entries_put_map_nr(struct swap_info_struct *si,
/*
* Check if it's the last ref of swap entry in the freeing path.
- * Qualified value includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM.
*/
static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
{
- return (count == SWAP_HAS_CACHE) || (count == 1) ||
- (count == SWAP_MAP_SHMEM);
+ return (count == SWAP_HAS_CACHE) || (count == 1);
}
/*
@@ -3672,7 +3664,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
offset = swp_offset(entry);
VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- VM_WARN_ON(usage == 1 && nr > 1);
ci = swap_cluster_lock(si, offset);
err = 0;
@@ -3732,27 +3723,28 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
return err;
}
-/*
- * Help swapoff by noting that swap entry belongs to shmem/tmpfs
- * (in which case its reference count is never incremented).
- */
-void swap_shmem_alloc(swp_entry_t entry, int nr)
-{
- __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
-}
-
-/*
- * Increase reference count of swap entry by 1.
+/**
+ * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
+ * by 1.
+ *
+ * @entry: first swap entry from which we want to increase the refcount.
+ * @nr: Number of entries in range.
+ *
* Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
* but could not be atomically allocated. Returns 0, just as if it succeeded,
* if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
* might occur if a page table entry has got corrupted.
+ *
+ * Note that we are currently not handling the case where nr > 1 and we need to
+ * add swap count continuation. This is OK, because no such user exists - shmem
+ * is the only user that can pass nr > 1, and it never re-duplicates any swap
+ * entry it owns.
*/
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate_nr(swp_entry_t entry, int nr)
{
int err = 0;
- while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
+ while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (7 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic Kairui Song
` (11 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
When checking if a swap entry is swapped out, we simply check if the
bitwise result of the count value is larger than 0. But SWAP_MAP_BAD
will also be considered as a swap count value larger than 0.
SWAP_MAP_BAD being considered as a count value larger than 0 is useful
for the swap allocator: such slots are seen as used, so the
allocator will skip them. But for the swapped out check, this
isn't correct.
There is currently no observable issue. The swapped out check is only
useful for readahead and folio swapped-out status check. For readahead,
the swap cache layer will abort upon checking and updating the swap map.
For the folio swapped out status check, the swap allocator will never
allocate an entry of bad slots to folio, so that part is fine too. The
worst that could happen now is redundant allocation/freeing of folios
and waste CPU time.
This also makes it easier to get rid of swap map checking and update
during folio insertion in the swap cache layer.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 6 ++++--
mm/swap_state.c | 4 ++--
mm/swapfile.c | 22 +++++++++++-----------
3 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bf72b548a96d..936fa8f9e5f3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -466,7 +466,8 @@ int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
extern sector_t swapdev_block(int, pgoff_t);
extern int __swap_count(swp_entry_t entry);
-extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
+extern bool swap_entry_swapped(struct swap_info_struct *si,
+ unsigned long offset);
extern int swp_swapcount(swp_entry_t entry);
struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
@@ -535,7 +536,8 @@ static inline int __swap_count(swp_entry_t entry)
return 0;
}
-static inline bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
+static inline bool swap_entry_swapped(struct swap_info_struct *si,
+ unsigned long offset)
{
return false;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b3737c60aad9..aaf8d202434d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -526,8 +526,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
if (folio)
return folio;
- /* Skip allocation for unused swap slot for readahead path. */
- if (!swap_entry_swapped(si, entry))
+ /* Skip allocation for unused and bad swap slot for readahead. */
+ if (!swap_entry_swapped(si, swp_offset(entry)))
return NULL;
/* Allocate a new folio to be added into the swap cache. */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 55362bb2a781..d66141f1c452 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1765,21 +1765,21 @@ int __swap_count(swp_entry_t entry)
return swap_count(si->swap_map[offset]);
}
-/*
- * How many references to @entry are currently swapped out?
- * This does not give an exact answer when swap count is continued,
- * but does include the high COUNT_CONTINUED flag to allow for that.
+/**
+ * swap_entry_swapped - Check if the swap entry at @offset is swapped.
+ * @si: the swap device.
+ * @offset: offset of the swap entry.
*/
-bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
+bool swap_entry_swapped(struct swap_info_struct *si, unsigned long offset)
{
- pgoff_t offset = swp_offset(entry);
struct swap_cluster_info *ci;
int count;
ci = swap_cluster_lock(si, offset);
count = swap_count(si->swap_map[offset]);
swap_cluster_unlock(ci);
- return !!count;
+
+ return count && count != SWAP_MAP_BAD;
}
/*
@@ -1865,7 +1865,7 @@ static bool folio_swapped(struct folio *folio)
return false;
if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
- return swap_entry_swapped(si, entry);
+ return swap_entry_swapped(si, swp_offset(entry));
return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
}
@@ -3671,10 +3671,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
count = si->swap_map[offset + i];
/*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+ * Allocator never allocates bad slots, and readahead is guarded
+ * by swap_entry_swapped.
*/
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
+ if (WARN_ON(swap_count(count) == SWAP_MAP_BAD)) {
err = -ENOENT;
goto unlock_out;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (8 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-31 5:25 ` YoungJun Park
2025-10-29 15:58 ` [PATCH 11/19] mm, swap: split locked entry duplicating into a standalone helper Kairui Song
` (10 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Swap cluster cache reclaim requires releasing the lock, so some extra
checks are needed after the reclaim. To prepare for checking swap cache
using the swap table directly, consolidate the swap cluster reclaim and
check the logic.
Also, adjust it very slightly. By moving the cluster empty and usable
check into the reclaim helper, it will avoid a redundant scan of the
slots if the cluster is empty.
And always scan the whole region during reclaim, don't skip slots
covered by a reclaimed folio. Because the reclaim is lockless, it's
possible that new cache lands at any time. And for allocation, we want
all caches to be reclaimed to avoid fragmentation. And besides, if the
scan offset is not aligned with the size of the reclaimed folio, we are
skipping some existing caches.
There should be no observable behavior change, which might slightly
improve the fragmentation issue or performance.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d66141f1c452..e4c521528817 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
return 0;
}
-static bool cluster_reclaim_range(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long start, unsigned long end)
+static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long start, unsigned int order)
{
+ unsigned int nr_pages = 1 << order;
+ unsigned long offset = start, end = start + nr_pages;
unsigned char *map = si->swap_map;
- unsigned long offset = start;
int nr_reclaim;
spin_unlock(&ci->lock);
do {
switch (READ_ONCE(map[offset])) {
case 0:
- offset++;
break;
case SWAP_HAS_CACHE:
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
- if (nr_reclaim > 0)
- offset += nr_reclaim;
- else
+ if (nr_reclaim < 0)
goto out;
break;
default:
goto out;
}
- } while (offset < end);
+ } while (++offset < end);
out:
spin_lock(&ci->lock);
+
+ /*
+ * We just dropped ci->lock so cluster could be used by another
+ * order or got freed, check if it's still usable or empty.
+ */
+ if (!cluster_is_usable(ci, order))
+ return SWAP_ENTRY_INVALID;
+ if (cluster_is_empty(ci))
+ return cluster_offset(si, ci);
+
/*
* Recheck the range no matter reclaim succeeded or not, the slot
* could have been be freed while we are not holding the lock.
*/
for (offset = start; offset < end; offset++)
if (READ_ONCE(map[offset]))
- return false;
+ return SWAP_ENTRY_INVALID;
- return true;
+ return start;
}
static bool cluster_scan_range(struct swap_info_struct *si,
@@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
unsigned int nr_pages = 1 << order;
- bool need_reclaim, ret;
+ bool need_reclaim;
lockdep_assert_held(&ci->lock);
@@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
continue;
if (need_reclaim) {
- ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
- /*
- * Reclaim drops ci->lock and cluster could be used
- * by another order. Not checking flag as off-list
- * cluster has no flag set, and change of list
- * won't cause fragmentation.
- */
- if (!cluster_is_usable(ci, order))
- goto out;
- if (cluster_is_empty(ci))
- offset = start;
+ found = cluster_reclaim_range(si, ci, offset, order);
/* Reclaim failed but cluster is usable, try next */
- if (!ret)
+ if (!found)
continue;
+ offset = found;
}
if (!cluster_alloc_range(si, ci, offset, usage, order))
break;
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic
2025-10-29 15:58 ` [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic Kairui Song
@ 2025-10-31 5:25 ` YoungJun Park
2025-10-31 7:11 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: YoungJun Park @ 2025-10-31 5:25 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:36PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
Hello Kairui, great work on your patchwork. :)
> Swap cluster cache reclaim requires releasing the lock, so some extra
> checks are needed after the reclaim. To prepare for checking swap cache
> using the swap table directly, consolidate the swap cluster reclaim and
> check the logic.
>
> Also, adjust it very slightly. By moving the cluster empty and usable
> check into the reclaim helper, it will avoid a redundant scan of the
> slots if the cluster is empty.
This is Change 1
> And always scan the whole region during reclaim, don't skip slots
> covered by a reclaimed folio. Because the reclaim is lockless, it's
> possible that new cache lands at any time. And for allocation, we want
> all caches to be reclaimed to avoid fragmentation. And besides, if the
> scan offset is not aligned with the size of the reclaimed folio, we are
> skipping some existing caches.
This is Change 2
> There should be no observable behavior change, which might slightly
> improve the fragmentation issue or performance.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
> 1 file changed, 23 insertions(+), 24 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index d66141f1c452..e4c521528817 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> return 0;
> }
>
> -static bool cluster_reclaim_range(struct swap_info_struct *si,
> - struct swap_cluster_info *ci,
> - unsigned long start, unsigned long end)
> +static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
> + struct swap_cluster_info *ci,
> + unsigned long start, unsigned int order)
> {
> + unsigned int nr_pages = 1 << order;
> + unsigned long offset = start, end = start + nr_pages;
> unsigned char *map = si->swap_map;
> - unsigned long offset = start;
> int nr_reclaim;
>
> spin_unlock(&ci->lock);
> do {
> switch (READ_ONCE(map[offset])) {
> case 0:
> - offset++;
> break;
> case SWAP_HAS_CACHE:
> nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> - if (nr_reclaim > 0)
> - offset += nr_reclaim;
> - else
> + if (nr_reclaim < 0)
> goto out;
> break;
> default:
> goto out;
> }
> - } while (offset < end);
> + } while (++offset < end);
Change 2
> out:
> spin_lock(&ci->lock);
> +
> + /*
> + * We just dropped ci->lock so cluster could be used by another
> + * order or got freed, check if it's still usable or empty.
> + */
> + if (!cluster_is_usable(ci, order))
> + return SWAP_ENTRY_INVALID;
> + if (cluster_is_empty(ci))
> + return cluster_offset(si, ci);
> +
Change 1
> /*
> * Recheck the range no matter reclaim succeeded or not, the slot
> * could have been be freed while we are not holding the lock.
> */
> for (offset = start; offset < end; offset++)
> if (READ_ONCE(map[offset]))
> - return false;
> + return SWAP_ENTRY_INVALID;
>
> - return true;
> + return start;
> }
>
> static bool cluster_scan_range(struct swap_info_struct *si,
> @@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> unsigned int nr_pages = 1 << order;
> - bool need_reclaim, ret;
> + bool need_reclaim;
>
> lockdep_assert_held(&ci->lock);
>
> @@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
> continue;
> if (need_reclaim) {
> - ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> - /*
> - * Reclaim drops ci->lock and cluster could be used
> - * by another order. Not checking flag as off-list
> - * cluster has no flag set, and change of list
> - * won't cause fragmentation.
> - */
> - if (!cluster_is_usable(ci, order))
> - goto out;
> - if (cluster_is_empty(ci))
> - offset = start;
> + found = cluster_reclaim_range(si, ci, offset, order);
> /* Reclaim failed but cluster is usable, try next */
> - if (!ret)
Part of Change 1 (apply return value change)
As I understand it, Change 1 just removes a redundant check.
But I think another part changed as well.
(Maybe I don't fully understand the comment or something.)
cluster_reclaim_range can return SWAP_ENTRY_INVALID
if the cluster becomes unusable for the requested order
(!cluster_is_usable returns SWAP_ENTRY_INVALID),
and the caller then continues the loop to the next offset for another reclaim try.
Is this the intended behavior?
If this is the intended behavior, the comment:
/* Reclaim failed but cluster is usable, try next */
might be a bit misleading, as the cluster could be unusable in this
failure case. Perhaps it could be updated to reflect this?
Or does something else need to be changed?
(a cluster_is_usable function name change, etc.)
Thanks.
Youngjun Park
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic
2025-10-31 5:25 ` YoungJun Park
@ 2025-10-31 7:11 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-31 7:11 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Oct 31, 2025 at 1:25 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:36PM +0800, Kairui Song wrote:
>
> > From: Kairui Song <kasong@tencent.com>
> >
>
> Hello Kairui, great work on your patch series. :)
> > Swap cluster cache reclaim requires releasing the lock, so some extra
> > checks are needed after the reclaim. To prepare for checking swap cache
> > using the swap table directly, consolidate the swap cluster reclaim and
> > check the logic.
> >
> > Also, adjust it very slightly. By moving the cluster empty and usable
> > check into the reclaim helper, it will avoid a redundant scan of the
> > slots if the cluster is empty.
>
> This is Change 1
>
> > And always scan the whole region during reclaim, don't skip slots
> > covered by a reclaimed folio. Because the reclaim is lockless, it's
> > possible that new cache lands at any time. And for allocation, we want
> > all caches to be reclaimed to avoid fragmentation. And besides, if the
> > scan offset is not aligned with the size of the reclaimed folio, we are
> > skipping some existing caches.
>
> This is Change 2
>
> > There should be no observable behavior change, which might slightly
> > improve the fragmentation issue or performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
> > 1 file changed, 23 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index d66141f1c452..e4c521528817 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> > return 0;
> > }
> >
> > -static bool cluster_reclaim_range(struct swap_info_struct *si,
> > - struct swap_cluster_info *ci,
> > - unsigned long start, unsigned long end)
> > +static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
> > + struct swap_cluster_info *ci,
> > + unsigned long start, unsigned int order)
> > {
> > + unsigned int nr_pages = 1 << order;
> > + unsigned long offset = start, end = start + nr_pages;
> > unsigned char *map = si->swap_map;
> > - unsigned long offset = start;
> > int nr_reclaim;
> >
> > spin_unlock(&ci->lock);
> > do {
> > switch (READ_ONCE(map[offset])) {
> > case 0:
> > - offset++;
> > break;
> > case SWAP_HAS_CACHE:
> > nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> > - if (nr_reclaim > 0)
> > - offset += nr_reclaim;
> > - else
> > + if (nr_reclaim < 0)
> > goto out;
> > break;
> > default:
> > goto out;
> > }
> > - } while (offset < end);
> > + } while (++offset < end);
>
> Change 2
>
> > out:
> > spin_lock(&ci->lock);
> > +
> > + /*
> > + * We just dropped ci->lock so cluster could be used by another
> > + * order or got freed, check if it's still usable or empty.
> > + */
> > + if (!cluster_is_usable(ci, order))
> > + return SWAP_ENTRY_INVALID;
> > + if (cluster_is_empty(ci))
> > + return cluster_offset(si, ci);
> > +
>
> Change 1
>
> > /*
> > * Recheck the range no matter reclaim succeeded or not, the slot
> > * could have been be freed while we are not holding the lock.
> > */
> > for (offset = start; offset < end; offset++)
> > if (READ_ONCE(map[offset]))
> > - return false;
> > + return SWAP_ENTRY_INVALID;
> >
> > - return true;
> > + return start;
> > }
> >
> > static bool cluster_scan_range(struct swap_info_struct *si,
> > @@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> > unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> > unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> > unsigned int nr_pages = 1 << order;
> > - bool need_reclaim, ret;
> > + bool need_reclaim;
> >
> > lockdep_assert_held(&ci->lock);
> >
> > @@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> > if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
> > continue;
> > if (need_reclaim) {
> > - ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> > - /*
> > - * Reclaim drops ci->lock and cluster could be used
> > - * by another order. Not checking flag as off-list
> > - * cluster has no flag set, and change of list
> > - * won't cause fragmentation.
> > - */
> > - if (!cluster_is_usable(ci, order))
> > - goto out;
> > - if (cluster_is_empty(ci))
> > - offset = start;
> > + found = cluster_reclaim_range(si, ci, offset, order);
> > /* Reclaim failed but cluster is usable, try next */
> > - if (!ret)
>
> Part of Change 1 (apply return value change)
>
> As I understand it, Change 1 just removes a redundant check.
> But I think another part changed as well.
> (Maybe I don't fully understand the comment or something.)
>
> cluster_reclaim_range can return SWAP_ENTRY_INVALID
> if the cluster becomes unusable for the requested order
> (!cluster_is_usable returns SWAP_ENTRY_INVALID),
> and the caller then continues the loop to the next offset for another reclaim try.
> Is this the intended behavior?
Thanks for the very careful review! I should keep the
cluster_is_usable check or abort in other ways to avoid touching an
unusable cluster, will fix it.
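Roughly something like this, keeping the usability check in the caller so
an unusable cluster aborts the scan instead of being retried (just a
sketch, not the final fix):

		if (need_reclaim) {
			found = cluster_reclaim_range(si, ci, offset, order);
			if (!found) {
				/* Cluster taken by another order or freed. */
				if (!cluster_is_usable(ci, order))
					goto out;
				/* Reclaim failed but cluster still usable, try next. */
				continue;
			}
			offset = found;
		}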
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 11/19] mm, swap: split locked entry duplicating into a standalone helper
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (9 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
` (9 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
No feature change; split the common logic into a standalone helper to
be reused later.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 62 +++++++++++++++++++++++++++++------------------------------
1 file changed, 31 insertions(+), 31 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e4c521528817..56054af12afd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3646,26 +3646,14 @@ void si_swapinfo(struct sysinfo *val)
* - swap-cache reference is requested but the entry is not used. -> ENOENT
* - swap-mapped reference requested but needs continued swap count. -> ENOMEM
*/
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int swap_dup_entries(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset,
+ unsigned char usage, int nr)
{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset;
- unsigned char count;
- unsigned char has_cache;
- int err, i;
-
- si = swap_entry_to_info(entry);
- if (WARN_ON_ONCE(!si)) {
- pr_err("%s%08lx\n", Bad_file, entry.val);
- return -EINVAL;
- }
-
- offset = swp_offset(entry);
- VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- ci = swap_cluster_lock(si, offset);
+ int i;
+ unsigned char count, has_cache;
- err = 0;
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
@@ -3673,25 +3661,20 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* Allocator never allocates bad slots, and readahead is guarded
* by swap_entry_swapped.
*/
- if (WARN_ON(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
- }
+ if (WARN_ON(swap_count(count) == SWAP_MAP_BAD))
+ return -ENOENT;
has_cache = count & SWAP_HAS_CACHE;
count &= ~SWAP_HAS_CACHE;
if (!count && !has_cache) {
- err = -ENOENT;
+ return -ENOENT;
} else if (usage == SWAP_HAS_CACHE) {
if (has_cache)
- err = -EEXIST;
+ return -EEXIST;
} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
- err = -EINVAL;
+ return -EINVAL;
}
-
- if (err)
- goto unlock_out;
}
for (i = 0; i < nr; i++) {
@@ -3710,14 +3693,31 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* Don't need to rollback changes, because if
* usage == 1, there must be nr == 1.
*/
- err = -ENOMEM;
- goto unlock_out;
+ return -ENOMEM;
}
WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
}
-unlock_out:
+ return 0;
+}
+
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+{
+ int err;
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset = swp_offset(entry);
+
+ si = swap_entry_to_info(entry);
+ if (WARN_ON_ONCE(!si)) {
+ pr_err("%s%08lx\n", Bad_file, entry.val);
+ return -EINVAL;
+ }
+
+ VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+ ci = swap_cluster_lock(si, offset);
+ err = swap_dup_entries(si, ci, offset, usage, nr);
swap_cluster_unlock(ci);
return err;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (10 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 11/19] mm, swap: split locked entry duplicating into a standalone helper Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 19:25 ` kernel test robot
2025-10-29 15:58 ` [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
` (8 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The current swap-in synchronization mostly uses the swap_map's
SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual
work to swap in a folio.
This has been causing many issues, as it's just a poor implementation
of a bit lock. Raced users have no idea what is pinning a slot, so
they have to loop with a schedule_timeout_uninterruptible(1), which is
ugly and causes long-tail latency and other performance issues. Besides,
the abuse of SWAP_HAS_CACHE has been causing many other troubles for
synchronization and maintenance.
This is the first step toward removing this bit completely. It will also
free up one bit in the 8-bit swap count field.
We have just removed all swap-in paths that bypass the swap cache, and
now both the swap cache and the swap map are protected by the cluster
lock. So we can now resolve swap-in synchronization in the swap cache
layer directly, using the cluster lock. Whoever inserts a folio into the
swap cache first does the swap-in work. And because folios are locked
during swap operations, other raced users will simply wait on the folio
lock.
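To illustrate, swapin now roughly works like this (a simplified sketch,
with error handling and memcg charging omitted):

	lookup:
		folio = swap_cache_get_folio(entry);
		if (folio) {
			/* Raced users block on the folio lock until the
			 * winner finishes the read, no 1-tick sleep loop. */
			folio_lock(folio);
			return folio;
		}
		folio = folio_alloc(gfp, 0);
		__folio_set_locked(folio);
		__folio_set_swapbacked(folio);
		err = swap_cache_add_folio(folio, entry, &shadow, false);
		if (err) {
			folio_unlock(folio);
			folio_put(folio);
			if (err == -EEXIST)
				goto lookup;	/* lost the race, winner has it */
			return NULL;		/* entry was freed under us */
		}
		/* The winner does the read with the folio still locked. */
		swap_read_folio(folio, NULL);
		return folio;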
SWAP_HAS_CACHE will be removed in a later commit. For now, we still set
it for some remaining users, but the bit setting and the swap cache folio
insertion now happen in the same critical section, after the swap cache
is ready. No one has to spin on the SWAP_HAS_CACHE bit anymore.
This both simplifies the logic and should improve the performance,
eliminating issues like the one solved in commit 01626a1823024
("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"),
or the "skip_if_exists" from commit a65b0e7607ccb
("zswap: make shrinking memcg-aware"), which will be removed very soon.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 6 ---
mm/swap.h | 14 ++++++-
mm/swap_state.c | 103 +++++++++++++++++++++++++++++----------------------
mm/swapfile.c | 39 ++++++++++++-------
mm/vmscan.c | 1 -
5 files changed, 95 insertions(+), 68 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 936fa8f9e5f3..69025b473472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
@@ -518,11 +517,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
return 0;
}
-static inline int swapcache_prepare(swp_entry_t swp, int nr)
-{
- return 0;
-}
-
static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}
diff --git a/mm/swap.h b/mm/swap.h
index e0f05babe13a..3cd99850bbaf 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -234,6 +234,14 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
return folio_entry.val == round_down(entry.val, nr_pages);
}
+/* Temporary internal helpers */
+void __swapcache_set_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry);
+void __swapcache_clear_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int nr);
+
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stablize the device by any of the following ways:
@@ -247,7 +255,8 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
*/
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadow, bool alloc);
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx,
@@ -413,7 +422,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
-static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadow, bool alloc)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index aaf8d202434d..2d53e3b5e8e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -128,34 +128,66 @@ void *swap_cache_get_shadow(swp_entry_t entry)
* @entry: The swap entry corresponding to the folio.
* @gfp: gfp_mask for XArray node allocation.
* @shadowp: If a shadow is found, return the shadow.
+ * @alloc: If it's the allocator that is trying to insert a folio. Allocator
+ * sets SWAP_HAS_CACHE to pin slots before insert so skip map update.
*
* Context: Caller must ensure @entry is valid and protect the swap device
* with reference count or locks.
* The caller also needs to update the corresponding swap_map slots with
* SWAP_HAS_CACHE bit to avoid race or conflict.
*/
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadowp, bool alloc)
{
+ int err;
void *shadow = NULL;
+ struct swap_info_struct *si;
unsigned long old_tb, new_tb;
struct swap_cluster_info *ci;
- unsigned int ci_start, ci_off, ci_end;
+ unsigned int ci_start, ci_off, ci_end, offset;
unsigned long nr_pages = folio_nr_pages(folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+ si = __swap_entry_to_info(entry);
new_tb = folio_to_swp_tb(folio);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
ci_off = ci_start;
- ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+ offset = swp_offset(entry);
+ ci = swap_cluster_lock(si, swp_offset(entry));
+ if (unlikely(!ci->table)) {
+ err = -ENOENT;
+ goto failed;
+ }
do {
- old_tb = __swap_table_xchg(ci, ci_off, new_tb);
- WARN_ON_ONCE(swp_tb_is_folio(old_tb));
+ old_tb = __swap_table_get(ci, ci_off);
+ if (unlikely(swp_tb_is_folio(old_tb))) {
+ err = -EEXIST;
+ goto failed;
+ }
+ if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
+ err = -ENOENT;
+ goto failed;
+ }
if (swp_tb_is_shadow(old_tb))
shadow = swp_tb_to_shadow(old_tb);
+ offset++;
+ } while (++ci_off < ci_end);
+
+ ci_off = ci_start;
+ offset = swp_offset(entry);
+ do {
+ /*
+ * Still need to pin the slots with SWAP_HAS_CACHE since
+ * swap allocator depends on that.
+ */
+ if (!alloc)
+ __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
+ __swap_table_set(ci, ci_off, new_tb);
+ offset++;
} while (++ci_off < ci_end);
folio_ref_add(folio, nr_pages);
@@ -168,6 +200,11 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
if (shadowp)
*shadowp = shadow;
+ return 0;
+
+failed:
+ swap_cluster_unlock(ci);
+ return err;
}
/**
@@ -186,6 +223,7 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
swp_entry_t entry, void *shadow)
{
+ struct swap_info_struct *si;
unsigned long old_tb, new_tb;
unsigned int ci_start, ci_off, ci_end;
unsigned long nr_pages = folio_nr_pages(folio);
@@ -195,6 +233,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+ si = __swap_entry_to_info(entry);
new_tb = shadow_swp_to_tb(shadow);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
@@ -210,6 +249,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
folio_clear_swapcache(folio);
node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
+ __swapcache_clear_cached(si, ci, entry, nr_pages);
}
/**
@@ -231,7 +271,6 @@ void swap_cache_del_folio(struct folio *folio)
__swap_cache_del_folio(ci, folio, entry, NULL);
swap_cluster_unlock(ci);
- put_swap_folio(folio, entry);
folio_ref_sub(folio, folio_nr_pages(folio));
}
@@ -423,67 +462,37 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
gfp_t gfp, bool charged,
bool skip_if_exists)
{
- struct folio *swapcache;
+ struct folio *swapcache = NULL;
void *shadow;
int ret;
- /*
- * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
- * into the swap cache. Loop with a schedule delay if raced with
- * another process setting SWAP_HAS_CACHE. This hackish loop will
- * be fixed very soon.
- */
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
for (;;) {
- ret = swapcache_prepare(entry, folio_nr_pages(folio));
+ ret = swap_cache_add_folio(folio, entry, &shadow, false);
if (!ret)
break;
/*
- * The skip_if_exists is for protecting against a recursive
- * call to this helper on the same entry waiting forever
- * here because SWAP_HAS_CACHE is set but the folio is not
- * in the swap cache yet. This can happen today if
- * mem_cgroup_swapin_charge_folio() below triggers reclaim
- * through zswap, which may call this helper again in the
- * writeback path.
- *
- * Large order allocation also needs special handling on
+ * Large order allocation needs special handling on
* race: if a smaller folio exists in cache, swapin needs
* to fallback to order 0, and doing a swap cache lookup
* might return a folio that is irrelevant to the faulting
* entry because @entry is aligned down. Just return NULL.
*/
if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
- return NULL;
+ goto failed;
- /*
- * Check the swap cache again, we can only arrive
- * here because swapcache_prepare returns -EEXIST.
- */
swapcache = swap_cache_get_folio(entry);
if (swapcache)
- return swapcache;
-
- /*
- * We might race against __swap_cache_del_folio(), and
- * stumble across a swap_map entry whose SWAP_HAS_CACHE
- * has not yet been cleared. Or race against another
- * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
- * in swap_map, but not yet added its folio to swap cache.
- */
- schedule_timeout_uninterruptible(1);
+ goto failed;
}
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
- put_swap_folio(folio, entry);
- folio_unlock(folio);
- return NULL;
+ swap_cache_del_folio(folio);
+ goto failed;
}
- swap_cache_add_folio(folio, entry, &shadow);
memcg1_swapin(entry, folio_nr_pages(folio));
if (shadow)
workingset_refault(folio, shadow);
@@ -491,6 +500,10 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
/* Caller will initiate read into locked folio */
folio_add_lru(folio);
return folio;
+
+failed:
+ folio_unlock(folio);
+ return swapcache;
}
/**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 56054af12afd..415db36d85d3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1461,7 +1461,11 @@ int folio_alloc_swap(struct folio *folio)
if (!entry.val)
return -ENOMEM;
- swap_cache_add_folio(folio, entry, NULL);
+ /*
+ * Allocator has pinned the slots with SWAP_HAS_CACHE
+ * so it should never fail
+ */
+ WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
return 0;
@@ -1567,9 +1571,8 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* do_swap_page()
* ... swapoff+swapon
* swap_cache_alloc_folio()
- * swapcache_prepare()
- * __swap_duplicate()
- * // check swap_map
+ * swap_cache_add_folio()
+ * // check swap_map
* // verify PTE not changed
*
* In __swap_duplicate(), the swap_map need to be checked before
@@ -3748,17 +3751,25 @@ int swap_duplicate_nr(swp_entry_t entry, int nr)
return err;
}
-/*
- * @entry: first swap entry from which we allocate nr swap cache.
- *
- * Called when allocating swap cache for existing swap entries,
- * This can return error codes. Returns 0 at success.
- * -EEXIST means there is a swap cache.
- * Note: return code is different from swap_duplicate().
- */
-int swapcache_prepare(swp_entry_t entry, int nr)
+/* Mark the swap map as HAS_CACHE, caller needs to hold the cluster lock */
+void __swapcache_set_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry)
+{
+ WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1));
+}
+
+/* Clear the swap map as !HAS_CACHE, caller needs to hold the cluster lock */
+void __swapcache_clear_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int nr)
{
- return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
+ if (swap_only_has_cache(si, swp_offset(entry), nr)) {
+ swap_entries_free(si, ci, entry, nr);
+ } else {
+ for (int i = 0; i < nr; i++, entry.val++)
+ swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ }
}
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e74a2807930..76b9c21a7fe2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -762,7 +762,6 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
__swap_cache_del_folio(ci, folio, swap, shadow);
memcg1_swapout(folio, swap);
swap_cluster_unlock_irq(ci);
- put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer
2025-10-29 15:58 ` [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
@ 2025-10-29 19:25 ` kernel test robot
0 siblings, 0 replies; 50+ messages in thread
From: kernel test robot @ 2025-10-29 19:25 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner,
Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins,
Baolin Wang, Huang, Ying, Kemeng Shi, Lorenzo Stoakes,
Matthew Wilcox (Oracle), linux-kernel, Kairui Song
Hi Kairui,
kernel test robot noticed the following build warnings:
[auto build test WARNING on f30d294530d939fa4b77d61bc60f25c4284841fa]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
base: f30d294530d939fa4b77d61bc60f25c4284841fa
patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-12-3d43f3b6ec32%40tencent.com
patch subject: [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300338.GvcdaiCz-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project d1c086e82af239b245fe8d7832f2753436634990)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300338.GvcdaiCz-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510300338.GvcdaiCz-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from mm/filemap.c:66:
>> mm/swap.h:428:1: warning: non-void function does not return a value [-Wreturn-type]
428 | }
| ^
1 warning generated.
--
In file included from mm/gup.c:29:
>> mm/swap.h:428:1: warning: non-void function does not return a value [-Wreturn-type]
428 | }
| ^
mm/gup.c:74:29: warning: unused function 'try_get_folio' [-Wunused-function]
74 | static inline struct folio *try_get_folio(struct page *page, int refs)
| ^~~~~~~~~~~~~
2 warnings generated.
vim +428 mm/swap.h
014bb1de4fc17d5 NeilBrown 2022-05-09 424
2eaa2d7ed6e0caa Kairui Song 2025-10-29 425 static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
2eaa2d7ed6e0caa Kairui Song 2025-10-29 426 void **shadow, bool alloc)
014bb1de4fc17d5 NeilBrown 2022-05-09 427 {
014bb1de4fc17d5 NeilBrown 2022-05-09 @428 }
014bb1de4fc17d5 NeilBrown 2022-05-09 429
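The warning points at the new !CONFIG_SWAP stub of swap_cache_add_folio(),
which is now declared to return int but still has an empty body; presumably
the stub just needs to return a value, for example:

static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
				       void **shadow, bool alloc)
{
	return -EINVAL;
}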
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (11 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-07 3:07 ` Barry Song
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
` (7 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap:
make shrinking memcg-aware"). It was needed because there used to be a
tiny time window between setting the SWAP_HAS_CACHE bit and actually
adding the folio to the swap cache. If one user tried to add a folio to
the swap cache while another user had set SWAP_HAS_CACHE but had not yet
added its folio to the swap cache, it could lead to a deadlock.
We have moved the bit setting to the same critical section as adding the
folio, so this is no longer needed. Remove it and clean it up.
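In other words, the ordering changed roughly like this (simplified):

  Before:                               After:
    swapcache_prepare()                   swap_cluster_lock(ci)
      set SWAP_HAS_CACHE                    check swap count / swap table
    <window: bit set, no folio yet>         set SWAP_HAS_CACHE
    swap_cache_add_folio()                  install folio in the swap table
      folio becomes visible               swap_cluster_unlock(ci)

A zswap writeback recursing into swapin during that old window could wait
forever; now a racer sees either no cache at all or a locked folio.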
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 2 +-
mm/swap_state.c | 27 ++++++++++-----------------
mm/zswap.c | 2 +-
3 files changed, 12 insertions(+), 19 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 3cd99850bbaf..a3c5f2dca0d5 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -260,7 +260,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx,
- bool *alloced, bool skip_if_exists);
+ bool *alloced);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_del_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry, void *shadow);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2d53e3b5e8e9..d2bcca92b6e0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -447,8 +447,6 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
* @folio: folio to be added.
* @gfp: memory allocation flags for charge, can be 0 if @charged if true.
* @charged: if the folio is already charged.
- * @skip_if_exists: if the slot is in a cached state, return NULL.
- * This is an old workaround that will be removed shortly.
*
* Update the swap_map and add folio as swap cache, typically before swapin.
* All swap slots covered by the folio must have a non-zero swap count.
@@ -459,8 +457,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
*/
static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
struct folio *folio,
- gfp_t gfp, bool charged,
- bool skip_if_exists)
+ gfp_t gfp, bool charged)
{
struct folio *swapcache = NULL;
void *shadow;
@@ -480,7 +477,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
* might return a folio that is irrelevant to the faulting
* entry because @entry is aligned down. Just return NULL.
*/
- if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+ if (ret != -EEXIST || folio_test_large(folio))
goto failed;
swapcache = swap_cache_get_folio(entry);
@@ -513,8 +510,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
* @mpol: NUMA memory allocation policy to be applied
* @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
* @new_page_allocated: sets true if allocation happened, false otherwise
- * @skip_if_exists: if the slot is a partially cached state, return NULL.
- * This is a workaround that would be removed shortly.
*
* Allocate a folio in the swap cache for one swap slot, typically before
* doing IO (swap in or swap out). The swap slot indicated by @entry must
@@ -526,8 +521,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
*/
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx,
- bool *new_page_allocated,
- bool skip_if_exists)
+ bool *new_page_allocated)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
@@ -548,8 +542,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
if (!folio)
return NULL;
/* Try add the new folio, returns existing folio or NULL on failure. */
- result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
- false, skip_if_exists);
+ result = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
if (result == folio)
*new_page_allocated = true;
else
@@ -578,7 +571,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
unsigned long nr_pages = folio_nr_pages(folio);
entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
- swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
+ swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true);
if (swapcache == folio)
swap_read_folio(folio, NULL);
return swapcache;
@@ -606,7 +599,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
mpol = get_vma_policy(vma, addr, 0, &ilx);
folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
mpol_cond_put(mpol);
if (page_allocated)
@@ -725,7 +718,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
/* Ok, do the async read-ahead now */
folio = swap_cache_alloc_folio(
swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (!folio)
continue;
if (page_allocated) {
@@ -743,7 +736,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
skip:
/* The page was likely read above, so no need for plugging here */
folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
return folio;
@@ -838,7 +831,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pte_unmap(pte);
pte = NULL;
folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (!folio)
continue;
if (page_allocated) {
@@ -858,7 +851,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
skip:
/* The folio was likely read above, so no need for plugging here */
folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
- &page_allocated, false);
+ &page_allocated);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
return folio;
diff --git a/mm/zswap.c b/mm/zswap.c
index a7a2443912f4..d8a33db9d3cc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1015,7 +1015,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
mpol = get_task_policy(current);
folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+ NO_INTERLEAVE_INDEX, &folio_was_allocated);
put_swap_device(si);
if (!folio)
return -ENOMEM;
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state
2025-10-29 15:58 ` [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
@ 2025-11-07 3:07 ` Barry Song
0 siblings, 0 replies; 50+ messages in thread
From: Barry Song @ 2025-11-07 3:07 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
> struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
> struct mempolicy *mpol, pgoff_t ilx,
> - bool *new_page_allocated,
> - bool skip_if_exists)
> + bool *new_page_allocated)
> {
> struct swap_info_struct *si = __swap_entry_to_info(entry);
> struct folio *folio;
> @@ -548,8 +542,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
> if (!folio)
> return NULL;
> /* Try add the new folio, returns existing folio or NULL on failure. */
> - result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
> - false, skip_if_exists);
> + result = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
> if (result == folio)
> *new_page_allocated = true;
> else
> @@ -578,7 +571,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
> unsigned long nr_pages = folio_nr_pages(folio);
>
> entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
> - swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
> + swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true);
> if (swapcache == folio)
> swap_read_folio(folio, NULL);
> return swapcache;
I wonder if we could also drop the "charged" argument; it doesn't seem
difficult to move the charging step before
__swap_cache_prepare_and_add(), even for swap_cache_alloc_folio()?
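Something like this, maybe (untested, just to illustrate the idea of
charging up front so the flag can go away):

	/* in swap_cache_alloc_folio(), right after allocating the folio: */
	if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry)) {
		folio_put(folio);
		return NULL;
	}
	/* ... __swap_cache_prepare_and_add() then no longer needs @charged */
	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask);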
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (12 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 19:25 ` kernel test robot
` (2 more replies)
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
` (6 subsequent siblings)
20 siblings, 3 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.
This commit introduces a proper definition of how swap entries are
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with the newly added sanity checks. This also makes more
optimizations possible.
Swap entries will mostly be allocated and freed with a folio bound to them.
The folio lock is useful for resolving many swap related races.
Now swap allocation (except hibernation) always starts with a folio in
the swap cache, and gets duped/freed protected by the folio lock:
- folio_alloc_swap() - The only allocation entry point now.
Context: The folio must be locked.
This allocates one or a set of continuous swap slots for a folio and
binds them to the folio by adding the folio to the swap cache. The
swap slots' swap count start with zero value.
- folio_dup_swap() - Increase the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This increases the ref count of swap entries allocated to a folio.
The count of newly allocated swap slots has to be increased by this
helper as the folio gets unmapped (and swap entries get installed).
- folio_put_swap() - Decrease the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This decreases the ref count of swap entries allocated to a folio.
Typically, swapin decreases the swap count as the folio gets
installed back and the swap entry gets uninstalled.
This won't remove the folio from the swap cache or free the
slot. Lazy freeing of swap cache is helpful for reducing IO.
There is already a folio_free_swap() for immediate cache reclaim.
This part could be further optimized later.
The above locking constraints could be further relaxed once the swap
table is fully implemented. Currently, dup still needs the caller
to lock the swap entry container (e.g. PTL), or a concurrent zap
may underflow the swap count.
Some swap users need to interact with the swap count without involving a
folio (e.g. forking/zapping the page table, or truncating a mapping
without swapin). In such cases, the caller has to ensure there is no race
condition on whatever owns the swap count and use the helpers below:
- swap_put_entries_direct() - Decrease the swap count directly.
Context: The caller must lock whatever is referencing the slots to
avoid a race.
Typically, page table zapping or shmem mapping truncation needs
to free swap slots directly. If a slot is cached (has a folio bound),
this will also try to release the swap cache.
- swap_dup_entry_direct() - Increase the swap count directly.
Context: The caller must lock whatever is referencing the entries to
avoid race, and the entries must already have a swap count > 1.
Typically, forking will need to copy the page table and hence needs to
increase the swap count of the entries in the table. The page table is
locked while referencing the swap entries, so the entries all have a
swap count > 1 and can't be freed.
The hibernation subsystem is a bit different, so there are two special wrappers:
- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
helper.
All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
By separating the workflows, it will be possible to bind folios more
tightly to the swap cache and get rid of SWAP_HAS_CACHE as a temporary
pin.
This commit should not introduce any behavior change.
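A rough sketch of how these helpers fit together (simplified; locking,
error handling and the swap cache details omitted):

	/* Swap-out: folio locked throughout. */
	folio_alloc_swap(folio);		/* slots allocated, pinned by swap cache, count == 0 */
	folio_dup_swap(folio, subpage);		/* per unmapped PTE: count 0 -> 1 (under PTL) */

	/* Swap-in (e.g. do_swap_page): folio locked and in swap cache. */
	folio_put_swap(folio, subpage);		/* PTE restored: count 1 -> 0, cache still pins slot */
	folio_free_swap(folio);			/* optional: drop swap cache and free the slots */

	/* No folio at hand (fork / zap / shmem truncate), caller holds PTL etc. */
	swap_dup_entry_direct(entry);		/* fork: copy a swap PTE */
	swap_put_entries_direct(entry, nr);	/* zap/truncate: drop refs, reclaim cache if any */

	/* Hibernation keeps its own exclusive entries. */
	entry = swap_alloc_hibernation_slot(type);
	swap_free_hibernation_slot(entry);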
Signed-off-by: Kairui Song <kasong@tencent.com>
---
arch/s390/mm/pgtable.c | 2 +-
include/linux/swap.h | 58 +++++++++----------
kernel/power/swap.c | 10 ++--
mm/madvise.c | 2 +-
mm/memory.c | 15 +++--
mm/rmap.c | 7 ++-
mm/shmem.c | 10 ++--
mm/swap.h | 37 +++++++++++++
mm/swapfile.c | 148 ++++++++++++++++++++++++++++++++++---------------
9 files changed, 192 insertions(+), 97 deletions(-)
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 0fde20bbc50b..c51304a4418e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -692,7 +692,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
dec_mm_counter(mm, mm_counter(folio));
}
- free_swap_and_cache(entry);
+ swap_put_entries_direct(entry, 1);
}
void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 69025b473472..ac3caa4c6999 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio);
-bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -472,6 +466,29 @@ struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
+/*
+ * If there is an existing swap slot reference (swap entry) and the caller
+ * guarantees that there is no race modification of it (e.g., PTL
+ * protecting the swap entry in page table; shmem's cmpxchg protects t
+ * he swap entry in shmem mapping), these two helpers below can be used
+ * to put/dup the entries directly.
+ *
+ * All entries must be allocated by folio_alloc_swap(). And they must have
+ * a swap count > 1. See comments of folio_*_swap helpers for more info.
+ */
+int swap_dup_entry_direct(swp_entry_t entry);
+void swap_put_entries_direct(swp_entry_t entry, int nr);
+
+/*
+ * folio_free_swap tries to free the swap entries pinned by a swap cache
+ * folio, it has to be here to be called by other components.
+ */
+bool folio_free_swap(struct folio *folio);
+
+/* Allocate / free (hibernation) exclusive entries */
+swp_entry_t swap_alloc_hibernation_slot(int type);
+void swap_free_hibernation_slot(swp_entry_t entry);
+
static inline void put_swap_device(struct swap_info_struct *si)
{
percpu_ref_put(&si->users);
@@ -499,10 +516,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
#define free_pages_and_swap_cache(pages, nr) \
release_pages((pages), (nr));
-static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
-}
-
static inline void free_swap_cache(struct folio *folio)
{
}
@@ -512,12 +525,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
+static inline int swap_dup_entry_direct(swp_entry_t ent)
{
return 0;
}
-static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
{
}
@@ -541,11 +554,6 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}
-static inline int folio_alloc_swap(struct folio *folio)
-{
- return -EINVAL;
-}
-
static inline bool folio_free_swap(struct folio *folio)
{
return false;
@@ -558,22 +566,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
return -EINVAL;
}
#endif /* CONFIG_SWAP */
-
-static inline int swap_duplicate(swp_entry_t entry)
-{
- return swap_duplicate_nr(entry, 1);
-}
-
-static inline void free_swap_and_cache(swp_entry_t entry)
-{
- free_swap_and_cache_nr(entry, 1);
-}
-
-static inline void swap_free(swp_entry_t entry)
-{
- swap_free_nr(entry, 1);
-}
-
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 0beff7eeaaba..546a0c701970 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -179,10 +179,10 @@ sector_t alloc_swapdev_block(int swap)
{
unsigned long offset;
- offset = swp_offset(get_swap_page_of_type(swap));
+ offset = swp_offset(swap_alloc_hibernation_slot(swap));
if (offset) {
if (swsusp_extents_insert(offset))
- swap_free(swp_entry(swap, offset));
+ swap_free_hibernation_slot(swp_entry(swap, offset));
else
return swapdev_block(swap, offset);
}
@@ -197,6 +197,7 @@ sector_t alloc_swapdev_block(int swap)
void free_all_swap_pages(int swap)
{
+ unsigned long offset;
struct rb_node *node;
while ((node = swsusp_extents.rb_node)) {
@@ -204,8 +205,9 @@ void free_all_swap_pages(int swap)
ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
- swap_free_nr(swp_entry(swap, ext->start),
- ext->end - ext->start + 1);
+
+ for (offset = ext->start; offset < ext->end; offset++)
+ swap_free_hibernation_slot(swp_entry(swap, offset));
kfree(ext);
}
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..3cf2097d2085 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -697,7 +697,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
max_nr = (end - addr) / PAGE_SIZE;
nr = swap_pte_batch(pte, max_nr, ptent);
nr_swap -= nr;
- free_swap_and_cache_nr(entry, nr);
+ swap_put_entries_direct(entry, nr);
clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
} else if (is_hwpoison_entry(entry) ||
is_poisoned_swp_entry(entry)) {
diff --git a/mm/memory.c b/mm/memory.c
index 589d6fc3d424..27d91ae3648a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -933,7 +933,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pte_to_swp_entry(orig_pte);
if (likely(!non_swap_entry(entry))) {
- if (swap_duplicate(entry) < 0)
+ if (swap_dup_entry_direct(entry) < 0)
return -EIO;
/* make sure dst_mm is on swapoff's mmlist. */
@@ -1746,7 +1746,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
nr = swap_pte_batch(pte, max_nr, ptent);
rss[MM_SWAPENTS] -= nr;
- free_swap_and_cache_nr(entry, nr);
+ swap_put_entries_direct(entry, nr);
} else if (is_migration_entry(entry)) {
struct folio *folio = pfn_swap_entry_folio(entry);
@@ -4932,7 +4932,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
- * so this must be called before swap_free().
+ * so this must be called before folio_put_swap().
*/
arch_swap_restore(folio_swap(entry, folio), folio);
@@ -4970,6 +4970,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(folio != swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
+ folio_put_swap(swapcache, NULL);
} else if (!folio_test_anon(folio)) {
/*
* We currently only expect !anon folios that are fully
@@ -4978,9 +4979,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+ folio_put_swap(folio, NULL);
} else {
+ VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
- rmap_flags);
+ rmap_flags);
+ folio_put_swap(folio, nr_pages == 1 ? page : NULL);
}
VM_BUG_ON(!folio_test_anon(folio) ||
@@ -4994,7 +4998,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* swapcache. Do it after mapping so any raced page fault will
* see the folio in swap cache and wait for us.
*/
- swap_free_nr(entry, nr_pages);
if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
folio_free_swap(folio);
@@ -5004,7 +5007,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
* (to avoid false positives from pte_same). For
- * further safety release the lock after the swap_free
+ * further safety release the lock after the folio_put_swap
* so that the swap count won't change under a
* parallel locked swapcache.
*/
diff --git a/mm/rmap.c b/mm/rmap.c
index 1954c538a991..844864831797 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -82,6 +82,7 @@
#include <trace/events/migrate.h>
#include "internal.h"
+#include "swap.h"
static struct kmem_cache *anon_vma_cachep;
static struct kmem_cache *anon_vma_chain_cachep;
@@ -2146,7 +2147,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto discard;
}
- if (swap_duplicate(entry) < 0) {
+ if (folio_dup_swap(folio, subpage) < 0) {
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
@@ -2157,7 +2158,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* so we'll not check/care.
*/
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- swap_free(entry);
+ folio_put_swap(folio, subpage);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
@@ -2165,7 +2166,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* See folio_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
folio_try_share_anon_rmap_pte(folio, subpage)) {
- swap_free(entry);
+ folio_put_swap(folio, subpage);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 46d54a1288fd..5e6cb763d945 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -982,7 +982,7 @@ static long shmem_free_swap(struct address_space *mapping,
old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
if (old != radswap)
return 0;
- free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+ swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
return 1 << order;
}
@@ -1665,7 +1665,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
spin_unlock(&shmem_swaplist_lock);
}
- swap_duplicate_nr(folio->swap, nr_pages);
+ folio_dup_swap(folio, NULL);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
BUG_ON(folio_mapped(folio));
@@ -1686,7 +1686,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
/* Swap entry might be erased by racing shmem_free_swap() */
if (!error) {
shmem_recalc_inode(inode, 0, -nr_pages);
- swap_free_nr(folio->swap, nr_pages);
+ folio_put_swap(folio, NULL);
}
/*
@@ -2172,6 +2172,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
+ folio_put_swap(folio, NULL);
swap_cache_del_folio(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2179,7 +2180,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
* in shmem_evict_inode().
*/
shmem_recalc_inode(inode, -nr_pages, -nr_pages);
- swap_free_nr(swap, nr_pages);
}
static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
@@ -2401,9 +2401,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
+ folio_put_swap(folio, NULL);
swap_cache_del_folio(folio);
folio_mark_dirty(folio);
- swap_free_nr(swap, nr_pages);
put_swap_device(si);
*foliop = folio;
diff --git a/mm/swap.h b/mm/swap.h
index a3c5f2dca0d5..74c61129d7b7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
spin_unlock_irq(&ci->lock);
}
+/*
+ * Below are the core routines for doing swap for a folio.
+ * All helpers require the folio to be locked, and a locked folio
+ * in the swap cache pins the swap entries / slots allocated to the
+ * folio, swap relies heavily on the swap cache and folio lock for
+ * synchronization.
+ *
+ * folio_alloc_swap(): the entry point for a folio to be swapped
+ * out. It allocates swap slots and pins the slots with swap cache.
+ * The slots start with a swap count of zero.
+ *
+ * folio_dup_swap(): increases the swap count of a folio, usually
+ * as it gets unmapped and a swap entry is installed to replace
+ * it (e.g., swap entry in page table). A swap slot with swap
+ * count == 0 should only be increased by this helper.
+ *
+ * folio_put_swap(): does the opposite thing of folio_dup_swap().
+ */
+int folio_alloc_swap(struct folio *folio);
+int folio_dup_swap(struct folio *folio, struct page *subpage);
+void folio_put_swap(struct folio *folio, struct page *subpage);
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
return NULL;
}
+static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
+{
+ return -EINVAL;
+}
+
+static inline int folio_dup_swap(struct folio *folio, struct page *page)
+{
+ return -EINVAL;
+}
+
+static inline void folio_put_swap(struct folio *folio, struct page *page)
+{
+}
+
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
}
+
static inline void swap_write_unplug(struct swap_iocb *sio)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 415db36d85d3..426b0b6d583f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si,
swp_entry_t entry, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
+static bool swap_entries_put_map(struct swap_info_struct *si,
+ swp_entry_t entry, int nr);
static bool folio_swapcache_freeable(struct folio *folio);
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
@@ -1467,6 +1470,12 @@ int folio_alloc_swap(struct folio *folio)
*/
WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
+ /*
+ * The allocator should always allocate aligned entries so folio-based
+ * operations never cross more than one cluster.
+ */
+ VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
+
return 0;
out_free:
@@ -1474,6 +1483,62 @@ int folio_alloc_swap(struct folio *folio)
return -ENOMEM;
}
+/**
+ * folio_dup_swap() - Increase swap count of swap entries of a folio.
+ * @folio: folio with swap entries bound to it.
+ * @subpage: if not NULL, only increase the swap count of this subpage.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * The caller also has to ensure there is no racing call to
+ * swap_put_entries_direct before this helper returns, or the swap
+ * map may underflow (TODO: maybe we should allow or avoid underflow to
+ * make swap refcount lockless).
+ */
+int folio_dup_swap(struct folio *folio, struct page *subpage)
+{
+ int err = 0;
+ swp_entry_t entry = folio->swap;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (subpage) {
+ entry.val += folio_page_idx(folio, subpage);
+ nr_pages = 1;
+ }
+
+ while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
+ err = add_swap_count_continuation(entry, GFP_ATOMIC);
+
+ return err;
+}
+
+/**
+ * folio_put_swap() - Decrease swap count of swap entries of a folio.
+ * @folio: folio with swap entries bound to it, must be in the swap cache and locked.
+ * @subpage: if not NULL, only decrease the swap count of this subpage.
+ *
+ * This won't free the swap slots even if the swap count drops to zero; they
+ * are still pinned by the swap cache. The user may call folio_free_swap() to
+ * free them.
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ */
+void folio_put_swap(struct folio *folio, struct page *subpage)
+{
+ swp_entry_t entry = folio->swap;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (subpage) {
+ entry.val += folio_page_idx(folio, subpage);
+ nr_pages = 1;
+ }
+
+ swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages);
+}
+
static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
{
struct swap_info_struct *si;
@@ -1714,28 +1779,6 @@ static void swap_entries_free(struct swap_info_struct *si,
partial_free_cluster(si, ci);
}
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free_nr(swp_entry_t entry, int nr_pages)
-{
- int nr;
- struct swap_info_struct *sis;
- unsigned long offset = swp_offset(entry);
-
- sis = _swap_info_get(entry);
- if (!sis)
- return;
-
- while (nr_pages) {
- nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- swap_entries_put_map(sis, swp_entry(sis->type, offset), nr);
- offset += nr;
- nr_pages -= nr;
- }
-}
-
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
@@ -1924,16 +1967,19 @@ bool folio_free_swap(struct folio *folio)
}
/**
- * free_swap_and_cache_nr() - Release reference on range of swap entries and
- * reclaim their cache if no more references remain.
+ * swap_put_entries_direct() - Release reference on range of swap entries and
+ * reclaim their cache if no more references remain.
* @entry: First entry of range.
* @nr: Number of entries in range.
*
* For each swap entry in the contiguous range, release a reference. If any swap
* entries become free, try to reclaim their underlying folios, if present. The
* offset range is defined by [entry.offset, entry.offset + nr).
+ *
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner, e.g., by locking the PTL of a PTE containing the entry being released.
*/
-void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+void swap_put_entries_direct(swp_entry_t entry, int nr)
{
const unsigned long start_offset = swp_offset(entry);
const unsigned long end_offset = start_offset + nr;
@@ -1942,10 +1988,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
unsigned long offset;
si = get_swap_device(entry);
- if (!si)
+ if (WARN_ON_ONCE(!si))
return;
-
- if (WARN_ON(end_offset > si->max))
+ if (WARN_ON_ONCE(end_offset > si->max))
goto out;
/*
@@ -1989,8 +2034,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
}
#ifdef CONFIG_HIBERNATION
-
-swp_entry_t get_swap_page_of_type(int type)
+/* Allocate a slot for hibernation */
+swp_entry_t swap_alloc_hibernation_slot(int type)
{
struct swap_info_struct *si = swap_type_to_info(type);
unsigned long offset;
@@ -2020,6 +2065,27 @@ swp_entry_t get_swap_page_of_type(int type)
return entry;
}
+/* Free a slot allocated by swap_alloc_hibernation_slot */
+void swap_free_hibernation_slot(swp_entry_t entry)
+{
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ pgoff_t offset = swp_offset(entry);
+
+ si = get_swap_device(entry);
+ if (WARN_ON(!si))
+ return;
+
+ ci = swap_cluster_lock(si, offset);
+ swap_entry_put_locked(si, ci, entry, 1);
+ WARN_ON(swap_entry_swapped(si, offset));
+ swap_cluster_unlock(ci);
+
+ /* In theory readahead might add it to the swap cache by accident */
+ __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+ put_swap_device(si);
+}
+
/*
* Find the swap type that corresponds to given device (if any).
*
@@ -2181,7 +2247,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
- * so this must be called before swap_free().
+ * so this must be called before folio_put_swap().
*/
arch_swap_restore(folio_swap(entry, folio), folio);
@@ -2222,7 +2288,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mkuffd_wp(new_pte);
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
- swap_free(entry);
+ folio_put_swap(folio, page);
out:
if (pte)
pte_unmap_unlock(pte, ptl);
@@ -3725,28 +3791,22 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
return err;
}
-/**
- * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
- * by 1.
- *
+/*
+ * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
* @entry: first swap entry from which we want to increase the refcount.
- * @nr: Number of entries in range.
*
* Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
* but could not be atomically allocated. Returns 0, just as if it succeeded,
* if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
* might occur if a page table entry has got corrupted.
*
- * Note that we are currently not handling the case where nr > 1 and we need to
- * add swap count continuation. This is OK, because no such user exists - shmem
- * is the only user that can pass nr > 1, and it never re-duplicates any swap
- * entry it owns.
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner, e.g., by locking the PTL of a PTE containing the entry being increased.
*/
-int swap_duplicate_nr(swp_entry_t entry, int nr)
+int swap_dup_entry_direct(swp_entry_t entry)
{
int err = 0;
-
- while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
+ while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
--
2.51.1
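A minimal usage sketch of the three helpers documented in the mm/swap.h hunk above, not part of the patch: swap_out_sketch() is a hypothetical caller, error handling and the real reclaim context are elided, and a NULL subpage argument means the whole folio (real callers in this series are e.g. shrink_folio_list() and try_to_unmap_one()).

/*
 * Sketch only: expected ordering of the folio swap helpers.
 */
#include <linux/mm.h>
#include "swap.h"	/* mm-internal header carrying the declarations */

static void swap_out_sketch(struct folio *folio)
{
	/* The folio must be locked; allocation pins the slots via the swap cache. */
	if (folio_alloc_swap(folio))
		return;		/* no swap slots available */

	/*
	 * Raise the swap count when a mapping is replaced by a swap entry.
	 * NULL covers the whole folio; try_to_unmap_one() passes the subpage
	 * backing the individual PTE instead.
	 */
	if (folio_dup_swap(folio, NULL))
		return;		/* swap count continuation could not be allocated */

	/*
	 * Drop the count again, e.g. on swapin. The slots stay pinned by the
	 * swap cache until folio_free_swap() or swap_cache_del_folio().
	 */
	folio_put_swap(folio, NULL);
}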
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
@ 2025-10-29 19:25 ` kernel test robot
2025-10-30 5:25 ` Kairui Song
2025-10-29 19:25 ` kernel test robot
2025-11-01 4:51 ` YoungJun Park
2 siblings, 1 reply; 50+ messages in thread
From: kernel test robot @ 2025-10-29 19:25 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner,
Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins,
Baolin Wang, Huang, Ying, Kemeng Shi, Lorenzo Stoakes,
Matthew Wilcox (Oracle), linux-kernel, Kairui Song
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on f30d294530d939fa4b77d61bc60f25c4284841fa]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
base: f30d294530d939fa4b77d61bc60f25c4284841fa
patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-14-3d43f3b6ec32%40tencent.com
patch subject: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510300316.UL4gxAlC-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from mm/vmscan.c:70:
mm/swap.h: In function 'swap_cache_add_folio':
mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
465 | }
| ^
mm/vmscan.c: In function 'shrink_folio_list':
>> mm/vmscan.c:1298:37: error: too few arguments to function 'folio_alloc_swap'
1298 | if (folio_alloc_swap(folio)) {
| ^~~~~~~~~~~~~~~~
mm/swap.h:388:19: note: declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^~~~~~~~~~~~~~~~
mm/vmscan.c:1314:45: error: too few arguments to function 'folio_alloc_swap'
1314 | if (folio_alloc_swap(folio))
| ^~~~~~~~~~~~~~~~
mm/swap.h:388:19: note: declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^~~~~~~~~~~~~~~~
--
In file included from mm/shmem.c:44:
mm/swap.h: In function 'swap_cache_add_folio':
mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
465 | }
| ^
mm/shmem.c: In function 'shmem_writeout':
>> mm/shmem.c:1649:14: error: too few arguments to function 'folio_alloc_swap'
1649 | if (!folio_alloc_swap(folio)) {
| ^~~~~~~~~~~~~~~~
mm/swap.h:388:19: note: declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^~~~~~~~~~~~~~~~
vim +/folio_alloc_swap +1298 mm/vmscan.c
d791ea676b6648 NeilBrown 2022-05-09 1072
^1da177e4c3f41 Linus Torvalds 2005-04-16 1073 /*
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1074) * shrink_folio_list() returns the number of reclaimed pages
^1da177e4c3f41 Linus Torvalds 2005-04-16 1075 */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1076) static unsigned int shrink_folio_list(struct list_head *folio_list,
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1077) struct pglist_data *pgdat, struct scan_control *sc,
7d709f49babc28 Gregory Price 2025-04-24 1078 struct reclaim_stat *stat, bool ignore_references,
7d709f49babc28 Gregory Price 2025-04-24 1079 struct mem_cgroup *memcg)
^1da177e4c3f41 Linus Torvalds 2005-04-16 1080 {
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1081) struct folio_batch free_folios;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1082) LIST_HEAD(ret_folios);
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1083) LIST_HEAD(demote_folios);
a479b078fddb0a Li Zhijian 2025-01-10 1084 unsigned int nr_reclaimed = 0, nr_demoted = 0;
730ec8c01a2bd6 Maninder Singh 2020-06-03 1085 unsigned int pgactivate = 0;
26aa2d199d6f2c Dave Hansen 2021-09-02 1086 bool do_demote_pass;
2282679fb20bf0 NeilBrown 2022-05-09 1087 struct swap_iocb *plug = NULL;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1088
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1089) folio_batch_init(&free_folios);
060f005f074791 Kirill Tkhai 2019-03-05 1090 memset(stat, 0, sizeof(*stat));
^1da177e4c3f41 Linus Torvalds 2005-04-16 1091 cond_resched();
7d709f49babc28 Gregory Price 2025-04-24 1092 do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1093
26aa2d199d6f2c Dave Hansen 2021-09-02 1094 retry:
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1095) while (!list_empty(folio_list)) {
^1da177e4c3f41 Linus Torvalds 2005-04-16 1096 struct address_space *mapping;
be7c07d60e13ac Matthew Wilcox (Oracle 2021-12-23 1097) struct folio *folio;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1098) enum folio_references references = FOLIOREF_RECLAIM;
d791ea676b6648 NeilBrown 2022-05-09 1099 bool dirty, writeback;
98879b3b9edc16 Yang Shi 2019-07-11 1100 unsigned int nr_pages;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1101
^1da177e4c3f41 Linus Torvalds 2005-04-16 1102 cond_resched();
^1da177e4c3f41 Linus Torvalds 2005-04-16 1103
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1104) folio = lru_to_folio(folio_list);
be7c07d60e13ac Matthew Wilcox (Oracle 2021-12-23 1105) list_del(&folio->lru);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1106
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1107) if (!folio_trylock(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1108 goto keep;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1109
1b0449544c6482 Jinjiang Tu 2025-03-18 1110 if (folio_contain_hwpoisoned_page(folio)) {
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1111 /*
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1112 * unmap_poisoned_folio() can't handle large
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1113 * folio, just skip it. memory_failure() will
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1114 * handle it if the UCE is triggered again.
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1115 */
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1116 if (folio_test_large(folio))
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1117 goto keep_locked;
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1118
1b0449544c6482 Jinjiang Tu 2025-03-18 1119 unmap_poisoned_folio(folio, folio_pfn(folio), false);
1b0449544c6482 Jinjiang Tu 2025-03-18 1120 folio_unlock(folio);
1b0449544c6482 Jinjiang Tu 2025-03-18 1121 folio_put(folio);
1b0449544c6482 Jinjiang Tu 2025-03-18 1122 continue;
1b0449544c6482 Jinjiang Tu 2025-03-18 1123 }
1b0449544c6482 Jinjiang Tu 2025-03-18 1124
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1125) VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1126
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1127) nr_pages = folio_nr_pages(folio);
98879b3b9edc16 Yang Shi 2019-07-11 1128
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1129) /* Account the number of base pages */
98879b3b9edc16 Yang Shi 2019-07-11 1130 sc->nr_scanned += nr_pages;
80e4342601abfa Christoph Lameter 2006-02-11 1131
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1132) if (unlikely(!folio_evictable(folio)))
ad6b67041a4549 Minchan Kim 2017-05-03 1133 goto activate_locked;
894bc310419ac9 Lee Schermerhorn 2008-10-18 1134
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1135) if (!sc->may_unmap && folio_mapped(folio))
80e4342601abfa Christoph Lameter 2006-02-11 1136 goto keep_locked;
80e4342601abfa Christoph Lameter 2006-02-11 1137
e2be15f6c3eece Mel Gorman 2013-07-03 1138 /*
894befec4d70b1 Andrey Ryabinin 2018-04-10 1139 * The number of dirty pages determines if a node is marked
8cd7c588decf47 Mel Gorman 2021-11-05 1140 * reclaim_congested. kswapd will stall and start writing
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1141) * folios if the tail of the LRU is all dirty unqueued folios.
e2be15f6c3eece Mel Gorman 2013-07-03 1142 */
e20c41b1091a24 Matthew Wilcox (Oracle 2022-01-17 1143) folio_check_dirty_writeback(folio, &dirty, &writeback);
e2be15f6c3eece Mel Gorman 2013-07-03 1144 if (dirty || writeback)
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1145) stat->nr_dirty += nr_pages;
e2be15f6c3eece Mel Gorman 2013-07-03 1146
e2be15f6c3eece Mel Gorman 2013-07-03 1147 if (dirty && !writeback)
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1148) stat->nr_unqueued_dirty += nr_pages;
e2be15f6c3eece Mel Gorman 2013-07-03 1149
d04e8acd03e5c3 Mel Gorman 2013-07-03 1150 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1151) * Treat this folio as congested if folios are cycling
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1152) * through the LRU so quickly that the folios marked
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1153) * for immediate reclaim are making it to the end of
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1154) * the LRU a second time.
d04e8acd03e5c3 Mel Gorman 2013-07-03 1155 */
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1156) if (writeback && folio_test_reclaim(folio))
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1157) stat->nr_congested += nr_pages;
e2be15f6c3eece Mel Gorman 2013-07-03 1158
e62e384e9da8d9 Michal Hocko 2012-07-31 1159 /*
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1160) * If a folio at the tail of the LRU is under writeback, there
283aba9f9e0e48 Mel Gorman 2013-07-03 1161 * are three cases to consider.
283aba9f9e0e48 Mel Gorman 2013-07-03 1162 *
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1163) * 1) If reclaim is encountering an excessive number
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1164) * of folios under writeback and this folio has both
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1165) * the writeback and reclaim flags set, then it
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1166) * indicates that folios are being queued for I/O but
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1167) * are being recycled through the LRU before the I/O
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1168) * can complete. Waiting on the folio itself risks an
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1169) * indefinite stall if it is impossible to writeback
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1170) * the folio due to I/O error or disconnected storage
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1171) * so instead note that the LRU is being scanned too
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1172) * quickly and the caller can stall after the folio
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1173) * list has been processed.
283aba9f9e0e48 Mel Gorman 2013-07-03 1174 *
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1175) * 2) Global or new memcg reclaim encounters a folio that is
ecf5fc6e9654cd Michal Hocko 2015-08-04 1176 * not marked for immediate reclaim, or the caller does not
ecf5fc6e9654cd Michal Hocko 2015-08-04 1177 * have __GFP_FS (or __GFP_IO if it's simply going to swap,
0c4f8ed498cea1 Joanne Koong 2025-04-14 1178 * not to fs), or the folio belongs to a mapping where
0c4f8ed498cea1 Joanne Koong 2025-04-14 1179 * waiting on writeback during reclaim may lead to a deadlock.
0c4f8ed498cea1 Joanne Koong 2025-04-14 1180 * In this case mark the folio for immediate reclaim and
0c4f8ed498cea1 Joanne Koong 2025-04-14 1181 * continue scanning.
283aba9f9e0e48 Mel Gorman 2013-07-03 1182 *
d791ea676b6648 NeilBrown 2022-05-09 1183 * Require may_enter_fs() because we would wait on fs, which
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1184) * may not have submitted I/O yet. And the loop driver might
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1185) * enter reclaim, and deadlock if it waits on a folio for
283aba9f9e0e48 Mel Gorman 2013-07-03 1186 * which it is needed to do the write (loop masks off
283aba9f9e0e48 Mel Gorman 2013-07-03 1187 * __GFP_IO|__GFP_FS for this reason); but more thought
283aba9f9e0e48 Mel Gorman 2013-07-03 1188 * would probably show more reasons.
283aba9f9e0e48 Mel Gorman 2013-07-03 1189 *
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1190) * 3) Legacy memcg encounters a folio that already has the
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1191) * reclaim flag set. memcg does not have any dirty folio
283aba9f9e0e48 Mel Gorman 2013-07-03 1192 * throttling so we could easily OOM just because too many
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1193) * folios are in writeback and there is nothing else to
283aba9f9e0e48 Mel Gorman 2013-07-03 1194 * reclaim. Wait for the writeback to complete.
c55e8d035b28b2 Johannes Weiner 2017-02-24 1195 *
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1196) * In cases 1) and 2) we activate the folios to get them out of
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1197) * the way while we continue scanning for clean folios on the
c55e8d035b28b2 Johannes Weiner 2017-02-24 1198 * inactive list and refilling from the active list. The
c55e8d035b28b2 Johannes Weiner 2017-02-24 1199 * observation here is that waiting for disk writes is more
c55e8d035b28b2 Johannes Weiner 2017-02-24 1200 * expensive than potentially causing reloads down the line.
c55e8d035b28b2 Johannes Weiner 2017-02-24 1201 * Since they're marked for immediate reclaim, they won't put
c55e8d035b28b2 Johannes Weiner 2017-02-24 1202 * memory pressure on the cache working set any longer than it
c55e8d035b28b2 Johannes Weiner 2017-02-24 1203 * takes to write them to disk.
e62e384e9da8d9 Michal Hocko 2012-07-31 1204 */
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1205) if (folio_test_writeback(folio)) {
0c4f8ed498cea1 Joanne Koong 2025-04-14 1206 mapping = folio_mapping(folio);
0c4f8ed498cea1 Joanne Koong 2025-04-14 1207
283aba9f9e0e48 Mel Gorman 2013-07-03 1208 /* Case 1 above */
283aba9f9e0e48 Mel Gorman 2013-07-03 1209 if (current_is_kswapd() &&
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1210) folio_test_reclaim(folio) &&
599d0c954f91d0 Mel Gorman 2016-07-28 1211 test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1212) stat->nr_immediate += nr_pages;
c55e8d035b28b2 Johannes Weiner 2017-02-24 1213 goto activate_locked;
283aba9f9e0e48 Mel Gorman 2013-07-03 1214
283aba9f9e0e48 Mel Gorman 2013-07-03 1215 /* Case 2 above */
b5ead35e7e1d34 Johannes Weiner 2019-11-30 1216 } else if (writeback_throttling_sane(sc) ||
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1217) !folio_test_reclaim(folio) ||
0c4f8ed498cea1 Joanne Koong 2025-04-14 1218 !may_enter_fs(folio, sc->gfp_mask) ||
0c4f8ed498cea1 Joanne Koong 2025-04-14 1219 (mapping &&
0c4f8ed498cea1 Joanne Koong 2025-04-14 1220 mapping_writeback_may_deadlock_on_reclaim(mapping))) {
c3b94f44fcb072 Hugh Dickins 2012-07-31 1221 /*
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1222) * This is slightly racy -
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1223) * folio_end_writeback() might have
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1224) * just cleared the reclaim flag, then
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1225) * setting the reclaim flag here ends up
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1226) * interpreted as the readahead flag - but
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1227) * that does not matter enough to care.
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1228) * What we do want is for this folio to
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1229) * have the reclaim flag set next time
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1230) * memcg reclaim reaches the tests above,
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1231) * so it will then wait for writeback to
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1232) * avoid OOM; and it's also appropriate
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1233) * in global reclaim.
c3b94f44fcb072 Hugh Dickins 2012-07-31 1234 */
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1235) folio_set_reclaim(folio);
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1236) stat->nr_writeback += nr_pages;
c55e8d035b28b2 Johannes Weiner 2017-02-24 1237 goto activate_locked;
283aba9f9e0e48 Mel Gorman 2013-07-03 1238
283aba9f9e0e48 Mel Gorman 2013-07-03 1239 /* Case 3 above */
283aba9f9e0e48 Mel Gorman 2013-07-03 1240 } else {
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1241) folio_unlock(folio);
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1242) folio_wait_writeback(folio);
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1243) /* then go back and try same folio again */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1244) list_add_tail(&folio->lru, folio_list);
7fadc820222497 Hugh Dickins 2015-09-08 1245 continue;
e62e384e9da8d9 Michal Hocko 2012-07-31 1246 }
283aba9f9e0e48 Mel Gorman 2013-07-03 1247 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1248
8940b34a4e082a Minchan Kim 2019-09-25 1249 if (!ignore_references)
d92013d1e5e47f Matthew Wilcox (Oracle 2022-02-15 1250) references = folio_check_references(folio, sc);
02c6de8d757cb3 Minchan Kim 2012-10-08 1251
dfc8d636cdb95f Johannes Weiner 2010-03-05 1252 switch (references) {
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1253) case FOLIOREF_ACTIVATE:
^1da177e4c3f41 Linus Torvalds 2005-04-16 1254 goto activate_locked;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1255) case FOLIOREF_KEEP:
98879b3b9edc16 Yang Shi 2019-07-11 1256 stat->nr_ref_keep += nr_pages;
645747462435d8 Johannes Weiner 2010-03-05 1257 goto keep_locked;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1258) case FOLIOREF_RECLAIM:
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1259) case FOLIOREF_RECLAIM_CLEAN:
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1260) ; /* try to reclaim the folio below */
dfc8d636cdb95f Johannes Weiner 2010-03-05 1261 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1262
26aa2d199d6f2c Dave Hansen 2021-09-02 1263 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1264) * Before reclaiming the folio, try to relocate
26aa2d199d6f2c Dave Hansen 2021-09-02 1265 * its contents to another node.
26aa2d199d6f2c Dave Hansen 2021-09-02 1266 */
26aa2d199d6f2c Dave Hansen 2021-09-02 1267 if (do_demote_pass &&
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1268) (thp_migration_supported() || !folio_test_large(folio))) {
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1269) list_add(&folio->lru, &demote_folios);
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1270) folio_unlock(folio);
26aa2d199d6f2c Dave Hansen 2021-09-02 1271 continue;
26aa2d199d6f2c Dave Hansen 2021-09-02 1272 }
26aa2d199d6f2c Dave Hansen 2021-09-02 1273
^1da177e4c3f41 Linus Torvalds 2005-04-16 1274 /*
^1da177e4c3f41 Linus Torvalds 2005-04-16 1275 * Anonymous process memory has backing store?
^1da177e4c3f41 Linus Torvalds 2005-04-16 1276 * Try to allocate it some swap space here.
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1277) * Lazyfree folio could be freed directly
^1da177e4c3f41 Linus Torvalds 2005-04-16 1278 */
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1279) if (folio_test_anon(folio) && folio_test_swapbacked(folio)) {
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1280) if (!folio_test_swapcache(folio)) {
63eb6b93ce725e Hugh Dickins 2008-11-19 1281 if (!(sc->gfp_mask & __GFP_IO))
63eb6b93ce725e Hugh Dickins 2008-11-19 1282 goto keep_locked;
d4b4084ac3154c Matthew Wilcox (Oracle 2022-02-04 1283) if (folio_maybe_dma_pinned(folio))
feb889fb40fafc Linus Torvalds 2021-01-16 1284 goto keep_locked;
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1285) if (folio_test_large(folio)) {
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1286) /* cannot split folio, skip it */
8710f6ed34e7bc David Hildenbrand 2024-08-02 1287 if (!can_split_folio(folio, 1, NULL))
b8f593cd0896b8 Ying Huang 2017-07-06 1288 goto activate_locked;
747552b1e71b40 Ying Huang 2017-07-06 1289 /*
5ed890ce514785 Ryan Roberts 2024-04-08 1290 * Split partially mapped folios right away.
5ed890ce514785 Ryan Roberts 2024-04-08 1291 * We can free the unmapped pages without IO.
747552b1e71b40 Ying Huang 2017-07-06 1292 */
8422acdc97ed58 Usama Arif 2024-08-30 1293 if (data_race(!list_empty(&folio->_deferred_list) &&
8422acdc97ed58 Usama Arif 2024-08-30 1294 folio_test_partially_mapped(folio)) &&
5ed890ce514785 Ryan Roberts 2024-04-08 1295 split_folio_to_list(folio, folio_list))
747552b1e71b40 Ying Huang 2017-07-06 1296 goto activate_locked;
747552b1e71b40 Ying Huang 2017-07-06 1297 }
7d14492199f93c Kairui Song 2025-10-24 @1298 if (folio_alloc_swap(folio)) {
d0f048ac39f6a7 Barry Song 2024-04-12 1299 int __maybe_unused order = folio_order(folio);
d0f048ac39f6a7 Barry Song 2024-04-12 1300
09c02e56327bda Matthew Wilcox (Oracle 2022-05-12 1301) if (!folio_test_large(folio))
98879b3b9edc16 Yang Shi 2019-07-11 1302 goto activate_locked_split;
bd4c82c22c367e Ying Huang 2017-09-06 1303 /* Fallback to swap normal pages */
5ed890ce514785 Ryan Roberts 2024-04-08 1304 if (split_folio_to_list(folio, folio_list))
0f0746589e4be0 Minchan Kim 2017-07-06 1305 goto activate_locked;
fe490cc0fe9e6e Ying Huang 2017-09-06 1306 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
5ed890ce514785 Ryan Roberts 2024-04-08 1307 if (nr_pages >= HPAGE_PMD_NR) {
5ed890ce514785 Ryan Roberts 2024-04-08 1308 count_memcg_folio_events(folio,
5ed890ce514785 Ryan Roberts 2024-04-08 1309 THP_SWPOUT_FALLBACK, 1);
fe490cc0fe9e6e Ying Huang 2017-09-06 1310 count_vm_event(THP_SWPOUT_FALLBACK);
5ed890ce514785 Ryan Roberts 2024-04-08 1311 }
fe490cc0fe9e6e Ying Huang 2017-09-06 1312 #endif
e26060d1fbd31a Kanchana P Sridhar 2024-10-02 1313 count_mthp_stat(order, MTHP_STAT_SWPOUT_FALLBACK);
7d14492199f93c Kairui Song 2025-10-24 1314 if (folio_alloc_swap(folio))
98879b3b9edc16 Yang Shi 2019-07-11 1315 goto activate_locked_split;
0f0746589e4be0 Minchan Kim 2017-07-06 1316 }
b487a2da3575b6 Kairui Song 2025-03-14 1317 /*
b487a2da3575b6 Kairui Song 2025-03-14 1318 * Normally the folio will be dirtied in unmap because its
b487a2da3575b6 Kairui Song 2025-03-14 1319 * pte should be dirty. A special case is MADV_FREE page. The
b487a2da3575b6 Kairui Song 2025-03-14 1320 * page's pte could have dirty bit cleared but the folio's
b487a2da3575b6 Kairui Song 2025-03-14 1321 * SwapBacked flag is still set because clearing the dirty bit
b487a2da3575b6 Kairui Song 2025-03-14 1322 * and SwapBacked flag has no lock protected. For such folio,
b487a2da3575b6 Kairui Song 2025-03-14 1323 * unmap will not set dirty bit for it, so folio reclaim will
b487a2da3575b6 Kairui Song 2025-03-14 1324 * not write the folio out. This can cause data corruption when
b487a2da3575b6 Kairui Song 2025-03-14 1325 * the folio is swapped in later. Always setting the dirty flag
b487a2da3575b6 Kairui Song 2025-03-14 1326 * for the folio solves the problem.
b487a2da3575b6 Kairui Song 2025-03-14 1327 */
b487a2da3575b6 Kairui Song 2025-03-14 1328 folio_mark_dirty(folio);
bd4c82c22c367e Ying Huang 2017-09-06 1329 }
e2be15f6c3eece Mel Gorman 2013-07-03 1330 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1331
98879b3b9edc16 Yang Shi 2019-07-11 1332 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1333) * If the folio was split above, the tail pages will make
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1334) * their own pass through this function and be accounted
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1335) * then.
98879b3b9edc16 Yang Shi 2019-07-11 1336 */
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1337) if ((nr_pages > 1) && !folio_test_large(folio)) {
98879b3b9edc16 Yang Shi 2019-07-11 1338 sc->nr_scanned -= (nr_pages - 1);
98879b3b9edc16 Yang Shi 2019-07-11 1339 nr_pages = 1;
98879b3b9edc16 Yang Shi 2019-07-11 1340 }
98879b3b9edc16 Yang Shi 2019-07-11 1341
^1da177e4c3f41 Linus Torvalds 2005-04-16 1342 /*
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1343) * The folio is mapped into the page tables of one or more
^1da177e4c3f41 Linus Torvalds 2005-04-16 1344 * processes. Try to unmap it here.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1345 */
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1346) if (folio_mapped(folio)) {
013339df116c2e Shakeel Butt 2020-12-14 1347 enum ttu_flags flags = TTU_BATCH_FLUSH;
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1348) bool was_swapbacked = folio_test_swapbacked(folio);
bd4c82c22c367e Ying Huang 2017-09-06 1349
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1350) if (folio_test_pmd_mappable(folio))
bd4c82c22c367e Ying Huang 2017-09-06 1351 flags |= TTU_SPLIT_HUGE_PMD;
73bc32875ee9b1 Barry Song 2024-03-06 1352 /*
73bc32875ee9b1 Barry Song 2024-03-06 1353 * Without TTU_SYNC, try_to_unmap will only begin to
73bc32875ee9b1 Barry Song 2024-03-06 1354 * hold PTL from the first present PTE within a large
73bc32875ee9b1 Barry Song 2024-03-06 1355 * folio. Some initial PTEs might be skipped due to
73bc32875ee9b1 Barry Song 2024-03-06 1356 * races with parallel PTE writes in which PTEs can be
73bc32875ee9b1 Barry Song 2024-03-06 1357 * cleared temporarily before being written new present
73bc32875ee9b1 Barry Song 2024-03-06 1358 * values. This will lead to a large folio is still
73bc32875ee9b1 Barry Song 2024-03-06 1359 * mapped while some subpages have been partially
73bc32875ee9b1 Barry Song 2024-03-06 1360 * unmapped after try_to_unmap; TTU_SYNC helps
73bc32875ee9b1 Barry Song 2024-03-06 1361 * try_to_unmap acquire PTL from the first PTE,
73bc32875ee9b1 Barry Song 2024-03-06 1362 * eliminating the influence of temporary PTE values.
73bc32875ee9b1 Barry Song 2024-03-06 1363 */
e5a119c4a6835a Barry Song 2024-06-30 1364 if (folio_test_large(folio))
73bc32875ee9b1 Barry Song 2024-03-06 1365 flags |= TTU_SYNC;
1f318a9b0dc399 Jaewon Kim 2020-06-03 1366
869f7ee6f64773 Matthew Wilcox (Oracle 2022-02-15 1367) try_to_unmap(folio, flags);
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1368) if (folio_mapped(folio)) {
98879b3b9edc16 Yang Shi 2019-07-11 1369 stat->nr_unmap_fail += nr_pages;
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1370) if (!was_swapbacked &&
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1371) folio_test_swapbacked(folio))
1f318a9b0dc399 Jaewon Kim 2020-06-03 1372 stat->nr_lazyfree_fail += nr_pages;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1373 goto activate_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1374 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1375 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1376
d824ec2a154677 Jan Kara 2023-04-28 1377 /*
d824ec2a154677 Jan Kara 2023-04-28 1378 * Folio is unmapped now so it cannot be newly pinned anymore.
d824ec2a154677 Jan Kara 2023-04-28 1379 * No point in trying to reclaim folio if it is pinned.
d824ec2a154677 Jan Kara 2023-04-28 1380 * Furthermore we don't want to reclaim underlying fs metadata
d824ec2a154677 Jan Kara 2023-04-28 1381 * if the folio is pinned and thus potentially modified by the
d824ec2a154677 Jan Kara 2023-04-28 1382 * pinning process as that may upset the filesystem.
d824ec2a154677 Jan Kara 2023-04-28 1383 */
d824ec2a154677 Jan Kara 2023-04-28 1384 if (folio_maybe_dma_pinned(folio))
d824ec2a154677 Jan Kara 2023-04-28 1385 goto activate_locked;
d824ec2a154677 Jan Kara 2023-04-28 1386
5441d4902f9692 Matthew Wilcox (Oracle 2022-05-12 1387) mapping = folio_mapping(folio);
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1388) if (folio_test_dirty(folio)) {
e2a80749555d73 Baolin Wang 2025-10-17 1389 if (folio_is_file_lru(folio)) {
49ea7eb65e7c50 Mel Gorman 2011-10-31 1390 /*
49ea7eb65e7c50 Mel Gorman 2011-10-31 1391 * Immediately reclaim when written back.
5a9e34747c9f73 Vishal Moola (Oracle 2022-12-21 1392) * Similar in principle to folio_deactivate()
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1393) * except we already have the folio isolated
49ea7eb65e7c50 Mel Gorman 2011-10-31 1394 * and know it's dirty
49ea7eb65e7c50 Mel Gorman 2011-10-31 1395 */
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1396) node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1397) nr_pages);
e2a80749555d73 Baolin Wang 2025-10-17 1398 if (!folio_test_reclaim(folio))
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1399) folio_set_reclaim(folio);
49ea7eb65e7c50 Mel Gorman 2011-10-31 1400
c55e8d035b28b2 Johannes Weiner 2017-02-24 1401 goto activate_locked;
ee72886d8ed5d9 Mel Gorman 2011-10-31 1402 }
ee72886d8ed5d9 Mel Gorman 2011-10-31 1403
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1404) if (references == FOLIOREF_RECLAIM_CLEAN)
^1da177e4c3f41 Linus Torvalds 2005-04-16 1405 goto keep_locked;
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1406) if (!may_enter_fs(folio, sc->gfp_mask))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1407 goto keep_locked;
52a8363eae3872 Christoph Lameter 2006-02-01 1408 if (!sc->may_writepage)
^1da177e4c3f41 Linus Torvalds 2005-04-16 1409 goto keep_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1410
d950c9477d51f0 Mel Gorman 2015-09-04 1411 /*
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1412) * Folio is dirty. Flush the TLB if a writable entry
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1413) * potentially exists to avoid CPU writes after I/O
d950c9477d51f0 Mel Gorman 2015-09-04 1414 * starts and then write it out here.
d950c9477d51f0 Mel Gorman 2015-09-04 1415 */
d950c9477d51f0 Mel Gorman 2015-09-04 1416 try_to_unmap_flush_dirty();
809bc86517cc40 Baolin Wang 2024-08-12 1417 switch (pageout(folio, mapping, &plug, folio_list)) {
^1da177e4c3f41 Linus Torvalds 2005-04-16 1418 case PAGE_KEEP:
^1da177e4c3f41 Linus Torvalds 2005-04-16 1419 goto keep_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1420 case PAGE_ACTIVATE:
809bc86517cc40 Baolin Wang 2024-08-12 1421 /*
809bc86517cc40 Baolin Wang 2024-08-12 1422 * If shmem folio is split when writeback to swap,
809bc86517cc40 Baolin Wang 2024-08-12 1423 * the tail pages will make their own pass through
809bc86517cc40 Baolin Wang 2024-08-12 1424 * this function and be accounted then.
809bc86517cc40 Baolin Wang 2024-08-12 1425 */
809bc86517cc40 Baolin Wang 2024-08-12 1426 if (nr_pages > 1 && !folio_test_large(folio)) {
809bc86517cc40 Baolin Wang 2024-08-12 1427 sc->nr_scanned -= (nr_pages - 1);
809bc86517cc40 Baolin Wang 2024-08-12 1428 nr_pages = 1;
809bc86517cc40 Baolin Wang 2024-08-12 1429 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1430 goto activate_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1431 case PAGE_SUCCESS:
809bc86517cc40 Baolin Wang 2024-08-12 1432 if (nr_pages > 1 && !folio_test_large(folio)) {
809bc86517cc40 Baolin Wang 2024-08-12 1433 sc->nr_scanned -= (nr_pages - 1);
809bc86517cc40 Baolin Wang 2024-08-12 1434 nr_pages = 1;
809bc86517cc40 Baolin Wang 2024-08-12 1435 }
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1436) stat->nr_pageout += nr_pages;
96f8bf4fb1dd26 Johannes Weiner 2020-06-03 1437
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1438) if (folio_test_writeback(folio))
41ac1999c3e356 Mel Gorman 2012-05-29 1439 goto keep;
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1440) if (folio_test_dirty(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1441 goto keep;
7d3579e8e61937 KOSAKI Motohiro 2010-10-26 1442
^1da177e4c3f41 Linus Torvalds 2005-04-16 1443 /*
^1da177e4c3f41 Linus Torvalds 2005-04-16 1444 * A synchronous write - probably a ramdisk. Go
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1445) * ahead and try to reclaim the folio.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1446 */
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1447) if (!folio_trylock(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1448 goto keep;
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1449) if (folio_test_dirty(folio) ||
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1450) folio_test_writeback(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1451 goto keep_locked;
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1452) mapping = folio_mapping(folio);
01359eb2013b4b Gustavo A. R. Silva 2020-12-14 1453 fallthrough;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1454 case PAGE_CLEAN:
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1455) ; /* try to free the folio below */
^1da177e4c3f41 Linus Torvalds 2005-04-16 1456 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1457 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1458
^1da177e4c3f41 Linus Torvalds 2005-04-16 1459 /*
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1460) * If the folio has buffers, try to free the buffer
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1461) * mappings associated with this folio. If we succeed
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1462) * we try to free the folio as well.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1463 *
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1464) * We do this even if the folio is dirty.
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1465) * filemap_release_folio() does not perform I/O, but it
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1466) * is possible for a folio to have the dirty flag set,
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1467) * but it is actually clean (all its buffers are clean).
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1468) * This happens if the buffers were written out directly,
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1469) * with submit_bh(). ext3 will do this, as well as
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1470) * the blockdev mapping. filemap_release_folio() will
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1471) * discover that cleanness and will drop the buffers
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1472) * and mark the folio clean - it can be freed.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1473 *
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1474) * Rarely, folios can have buffers and no ->mapping.
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1475) * These are the folios which were not successfully
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1476) * invalidated in truncate_cleanup_folio(). We try to
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1477) * drop those buffers here and if that worked, and the
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1478) * folio is no longer mapped into process address space
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1479) * (refcount == 1) it can be freed. Otherwise, leave
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1480) * the folio on the LRU so it is swappable.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1481 */
0201ebf274a306 David Howells 2023-06-28 1482 if (folio_needs_release(folio)) {
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1483) if (!filemap_release_folio(folio, sc->gfp_mask))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1484 goto activate_locked;
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1485) if (!mapping && folio_ref_count(folio) == 1) {
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1486) folio_unlock(folio);
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1487) if (folio_put_testzero(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1488 goto free_it;
e286781d5f2e9c Nicholas Piggin 2008-07-25 1489 else {
e286781d5f2e9c Nicholas Piggin 2008-07-25 1490 /*
e286781d5f2e9c Nicholas Piggin 2008-07-25 1491 * rare race with speculative reference.
e286781d5f2e9c Nicholas Piggin 2008-07-25 1492 * the speculative reference will free
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1493) * this folio shortly, so we may
e286781d5f2e9c Nicholas Piggin 2008-07-25 1494 * increment nr_reclaimed here (and
e286781d5f2e9c Nicholas Piggin 2008-07-25 1495 * leave it off the LRU).
e286781d5f2e9c Nicholas Piggin 2008-07-25 1496 */
9aafcffc18785f Miaohe Lin 2022-05-12 1497 nr_reclaimed += nr_pages;
e286781d5f2e9c Nicholas Piggin 2008-07-25 1498 continue;
e286781d5f2e9c Nicholas Piggin 2008-07-25 1499 }
e286781d5f2e9c Nicholas Piggin 2008-07-25 1500 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1501 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1502
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1503) if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
802a3a92ad7ac0 Shaohua Li 2017-05-03 1504 /* follow __remove_mapping for reference */
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1505) if (!folio_ref_freeze(folio, 1))
49d2e9cc454436 Christoph Lameter 2006-01-08 1506 goto keep_locked;
d17be2d9ff6c68 Miaohe Lin 2021-09-02 1507 /*
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1508) * The folio has only one reference left, which is
d17be2d9ff6c68 Miaohe Lin 2021-09-02 1509 * from the isolation. After the caller puts the
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1510) * folio back on the lru and drops the reference, the
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1511) * folio will be freed anyway. It doesn't matter
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1512) * which lru it goes on. So we don't bother checking
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1513) * the dirty flag here.
d17be2d9ff6c68 Miaohe Lin 2021-09-02 1514 */
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1515) count_vm_events(PGLAZYFREED, nr_pages);
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1516) count_memcg_folio_events(folio, PGLAZYFREED, nr_pages);
be7c07d60e13ac Matthew Wilcox (Oracle 2021-12-23 1517) } else if (!mapping || !__remove_mapping(mapping, folio, true,
b910718a948a91 Johannes Weiner 2019-11-30 1518 sc->target_mem_cgroup))
802a3a92ad7ac0 Shaohua Li 2017-05-03 1519 goto keep_locked;
9a1ea439b16b92 Hugh Dickins 2018-12-28 1520
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1521) folio_unlock(folio);
e286781d5f2e9c Nicholas Piggin 2008-07-25 1522 free_it:
98879b3b9edc16 Yang Shi 2019-07-11 1523 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1524) * Folio may get swapped out as a whole, need to account
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1525) * all pages in it.
98879b3b9edc16 Yang Shi 2019-07-11 1526 */
98879b3b9edc16 Yang Shi 2019-07-11 1527 nr_reclaimed += nr_pages;
abe4c3b50c3f25 Mel Gorman 2010-08-09 1528
f8f931bba0f920 Hugh Dickins 2024-10-27 1529 folio_unqueue_deferred_split(folio);
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1530) if (folio_batch_add(&free_folios, folio) == 0) {
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1531) mem_cgroup_uncharge_folios(&free_folios);
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1532) try_to_unmap_flush();
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1533) free_unref_folios(&free_folios);
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1534) }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1535 continue;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1536
98879b3b9edc16 Yang Shi 2019-07-11 1537 activate_locked_split:
98879b3b9edc16 Yang Shi 2019-07-11 1538 /*
98879b3b9edc16 Yang Shi 2019-07-11 1539 * The tail pages that are failed to add into swap cache
98879b3b9edc16 Yang Shi 2019-07-11 1540 * reach here. Fixup nr_scanned and nr_pages.
98879b3b9edc16 Yang Shi 2019-07-11 1541 */
98879b3b9edc16 Yang Shi 2019-07-11 1542 if (nr_pages > 1) {
98879b3b9edc16 Yang Shi 2019-07-11 1543 sc->nr_scanned -= (nr_pages - 1);
98879b3b9edc16 Yang Shi 2019-07-11 1544 nr_pages = 1;
98879b3b9edc16 Yang Shi 2019-07-11 1545 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1546 activate_locked:
68a22394c286a2 Rik van Riel 2008-10-18 1547 /* Not a candidate for swapping, so reclaim swap space. */
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1548) if (folio_test_swapcache(folio) &&
9202d527b715f6 Matthew Wilcox (Oracle 2022-09-02 1549) (mem_cgroup_swap_full(folio) || folio_test_mlocked(folio)))
bdb0ed54a4768d Matthew Wilcox (Oracle 2022-09-02 1550) folio_free_swap(folio);
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1551) VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1552) if (!folio_test_mlocked(folio)) {
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1553) int type = folio_is_file_lru(folio);
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1554) folio_set_active(folio);
98879b3b9edc16 Yang Shi 2019-07-11 1555 stat->nr_activate[type] += nr_pages;
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1556) count_memcg_folio_events(folio, PGACTIVATE, nr_pages);
ad6b67041a4549 Minchan Kim 2017-05-03 1557 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1558 keep_locked:
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1559) folio_unlock(folio);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1560 keep:
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1561) list_add(&folio->lru, &ret_folios);
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1562) VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1563) folio_test_unevictable(folio), folio);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1564 }
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1565) /* 'folio_list' is always empty here */
26aa2d199d6f2c Dave Hansen 2021-09-02 1566
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1567) /* Migrate folios selected for demotion */
a479b078fddb0a Li Zhijian 2025-01-10 1568 nr_demoted = demote_folio_list(&demote_folios, pgdat);
a479b078fddb0a Li Zhijian 2025-01-10 1569 nr_reclaimed += nr_demoted;
a479b078fddb0a Li Zhijian 2025-01-10 1570 stat->nr_demoted += nr_demoted;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1571) /* Folios that could not be demoted are still in @demote_folios */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1572) if (!list_empty(&demote_folios)) {
6b426d071419a4 Mina Almasry 2022-12-01 1573 /* Folios which weren't demoted go back on @folio_list */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1574) list_splice_init(&demote_folios, folio_list);
6b426d071419a4 Mina Almasry 2022-12-01 1575
6b426d071419a4 Mina Almasry 2022-12-01 1576 /*
6b426d071419a4 Mina Almasry 2022-12-01 1577 * goto retry to reclaim the undemoted folios in folio_list if
6b426d071419a4 Mina Almasry 2022-12-01 1578 * desired.
6b426d071419a4 Mina Almasry 2022-12-01 1579 *
6b426d071419a4 Mina Almasry 2022-12-01 1580 * Reclaiming directly from top tier nodes is not often desired
6b426d071419a4 Mina Almasry 2022-12-01 1581 * due to it breaking the LRU ordering: in general memory
6b426d071419a4 Mina Almasry 2022-12-01 1582 * should be reclaimed from lower tier nodes and demoted from
6b426d071419a4 Mina Almasry 2022-12-01 1583 * top tier nodes.
6b426d071419a4 Mina Almasry 2022-12-01 1584 *
6b426d071419a4 Mina Almasry 2022-12-01 1585 * However, disabling reclaim from top tier nodes entirely
6b426d071419a4 Mina Almasry 2022-12-01 1586 * would cause ooms in edge scenarios where lower tier memory
6b426d071419a4 Mina Almasry 2022-12-01 1587 * is unreclaimable for whatever reason, eg memory being
6b426d071419a4 Mina Almasry 2022-12-01 1588 * mlocked or too hot to reclaim. We can disable reclaim
6b426d071419a4 Mina Almasry 2022-12-01 1589 * from top tier nodes in proactive reclaim though as that is
6b426d071419a4 Mina Almasry 2022-12-01 1590 * not real memory pressure.
6b426d071419a4 Mina Almasry 2022-12-01 1591 */
6b426d071419a4 Mina Almasry 2022-12-01 1592 if (!sc->proactive) {
26aa2d199d6f2c Dave Hansen 2021-09-02 1593 do_demote_pass = false;
26aa2d199d6f2c Dave Hansen 2021-09-02 1594 goto retry;
26aa2d199d6f2c Dave Hansen 2021-09-02 1595 }
6b426d071419a4 Mina Almasry 2022-12-01 1596 }
abe4c3b50c3f25 Mel Gorman 2010-08-09 1597
98879b3b9edc16 Yang Shi 2019-07-11 1598 pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
98879b3b9edc16 Yang Shi 2019-07-11 1599
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1600) mem_cgroup_uncharge_folios(&free_folios);
72b252aed506b8 Mel Gorman 2015-09-04 1601 try_to_unmap_flush();
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1602) free_unref_folios(&free_folios);
abe4c3b50c3f25 Mel Gorman 2010-08-09 1603
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1604) list_splice(&ret_folios, folio_list);
886cf1901db962 Kirill Tkhai 2019-05-13 1605 count_vm_events(PGACTIVATE, pgactivate);
060f005f074791 Kirill Tkhai 2019-03-05 1606
2282679fb20bf0 NeilBrown 2022-05-09 1607 if (plug)
2282679fb20bf0 NeilBrown 2022-05-09 1608 swap_write_unplug(plug);
05ff51376f01fd Andrew Morton 2006-03-22 1609 return nr_reclaimed;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1610 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1611
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 19:25 ` kernel test robot
@ 2025-10-30 5:25 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-30 5:25 UTC (permalink / raw)
To: kernel test robot
Cc: linux-mm, oe-kbuild-all, Andrew Morton, Baoquan He, Barry Song,
Chris Li, Nhat Pham, Johannes Weiner, Yosry Ahmed,
David Hildenbrand, Youngjun Park, Hugh Dickins, Baolin Wang,
Huang, Ying, Kemeng Shi, Lorenzo Stoakes, Matthew Wilcox (Oracle),
linux-kernel
On Thu, Oct 30, 2025 at 3:30 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Kairui,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on f30d294530d939fa4b77d61bc60f25c4284841fa]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
> base: f30d294530d939fa4b77d61bc60f25c4284841fa
> patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-14-3d43f3b6ec32%40tencent.com
> patch subject: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
> config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/config)
> compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202510300316.UL4gxAlC-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> In file included from mm/vmscan.c:70:
> mm/swap.h: In function 'swap_cache_add_folio':
> mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
> 465 | }
> | ^
> mm/vmscan.c: In function 'shrink_folio_list':
> >> mm/vmscan.c:1298:37: error: too few arguments to function 'folio_alloc_swap'
> 1298 | if (folio_alloc_swap(folio)) {
> | ^~~~~~~~~~~~~~~~
> mm/swap.h:388:19: note: declared here
> 388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> | ^~~~~~~~~~~~~~~~
> mm/vmscan.c:1314:45: error: too few arguments to function 'folio_alloc_swap'
> 1314 | if (folio_alloc_swap(folio))
> | ^~~~~~~~~~~~~~~~
> mm/swap.h:388:19: note: declared here
> 388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> | ^~~~~~~~~~~~~~~~
> --
> In file included from mm/shmem.c:44:
> mm/swap.h: In function 'swap_cache_add_folio':
> mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
> 465 | }
> | ^
> mm/shmem.c: In function 'shmem_writeout':
> >> mm/shmem.c:1649:14: error: too few arguments to function 'folio_alloc_swap'
> 1649 | if (!folio_alloc_swap(folio)) {
> | ^~~~~~~~~~~~~~~~
> mm/swap.h:388:19: note: declared here
> 388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> | ^~~~~~~~~~~~~~~~
>
Thanks, I forgot to update the empty placeholder for folio_alloc_swap
during rebase:
diff --git a/mm/swap.h b/mm/swap.h
index 74c61129d7b7..9aa99061573a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -385,7 +385,7 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
return NULL;
}
-static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
+static inline int folio_alloc_swap(struct folio *folio)
{
return -EINVAL;
}
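Put together with the original hunk, the !CONFIG_SWAP stubs then read roughly as follows (a reconstruction for reference only; the fixup changes just the folio_alloc_swap() signature):

/* mm/swap.h, !CONFIG_SWAP stubs after the fixup (reconstructed sketch) */
static inline int folio_alloc_swap(struct folio *folio)
{
	return -EINVAL;
}

static inline int folio_dup_swap(struct folio *folio, struct page *page)
{
	return -EINVAL;
}

static inline void folio_put_swap(struct folio *folio, struct page *page)
{
}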
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
2025-10-29 19:25 ` kernel test robot
@ 2025-10-29 19:25 ` kernel test robot
2025-11-01 4:51 ` YoungJun Park
2 siblings, 0 replies; 50+ messages in thread
From: kernel test robot @ 2025-10-29 19:25 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner,
Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins,
Baolin Wang, Huang, Ying, Kemeng Shi, Lorenzo Stoakes,
Matthew Wilcox (Oracle), linux-kernel, Kairui Song
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on f30d294530d939fa4b77d61bc60f25c4284841fa]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
base: f30d294530d939fa4b77d61bc60f25c4284841fa
patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-14-3d43f3b6ec32%40tencent.com
patch subject: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300341.cOYqY4ki-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300341.cOYqY4ki-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510300341.cOYqY4ki-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from mm/shmem.c:44:
mm/swap.h:465:1: warning: non-void function does not return a value [-Wreturn-type]
465 | }
| ^
>> mm/shmem.c:1649:29: error: too few arguments to function call, expected 2, have 1
1649 | if (!folio_alloc_swap(folio)) {
| ~~~~~~~~~~~~~~~~ ^
mm/swap.h:388:19: note: 'folio_alloc_swap' declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.
vim +1649 mm/shmem.c
^1da177e4c3f41 Linus Torvalds 2005-04-16 1563
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1564) /**
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1565) * shmem_writeout - Write the folio to swap
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1566) * @folio: The folio to write
44b1b073eb3614 Christoph Hellwig 2025-06-10 1567 * @plug: swap plug
44b1b073eb3614 Christoph Hellwig 2025-06-10 1568 * @folio_list: list to put back folios on split
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1569) *
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1570) * Move the folio from the page cache to the swap cache.
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1571) */
44b1b073eb3614 Christoph Hellwig 2025-06-10 1572 int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
44b1b073eb3614 Christoph Hellwig 2025-06-10 1573 struct list_head *folio_list)
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1574) {
8ccee8c19c605a Luis Chamberlain 2023-03-09 1575 struct address_space *mapping = folio->mapping;
8ccee8c19c605a Luis Chamberlain 2023-03-09 1576 struct inode *inode = mapping->host;
8ccee8c19c605a Luis Chamberlain 2023-03-09 1577 struct shmem_inode_info *info = SHMEM_I(inode);
2c6efe9cf2d784 Luis Chamberlain 2023-03-09 1578 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
6922c0c7abd387 Hugh Dickins 2011-08-03 1579 pgoff_t index;
650180760be6bb Baolin Wang 2024-08-12 1580 int nr_pages;
809bc86517cc40 Baolin Wang 2024-08-12 1581 bool split = false;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1582
adae46ac1e38a2 Ricardo Cañuelo Navarro 2025-02-26 1583 if ((info->flags & VM_LOCKED) || sbinfo->noswap)
9a976f0c847b67 Luis Chamberlain 2023-03-09 1584 goto redirty;
9a976f0c847b67 Luis Chamberlain 2023-03-09 1585
9a976f0c847b67 Luis Chamberlain 2023-03-09 1586 if (!total_swap_pages)
9a976f0c847b67 Luis Chamberlain 2023-03-09 1587 goto redirty;
9a976f0c847b67 Luis Chamberlain 2023-03-09 1588
1e6decf30af5c5 Hugh Dickins 2021-09-02 1589 /*
809bc86517cc40 Baolin Wang 2024-08-12 1590 * If CONFIG_THP_SWAP is not enabled, the large folio should be
809bc86517cc40 Baolin Wang 2024-08-12 1591 * split when swapping.
809bc86517cc40 Baolin Wang 2024-08-12 1592 *
809bc86517cc40 Baolin Wang 2024-08-12 1593 * And shrinkage of pages beyond i_size does not split swap, so
809bc86517cc40 Baolin Wang 2024-08-12 1594 * swapout of a large folio crossing i_size needs to split too
809bc86517cc40 Baolin Wang 2024-08-12 1595 * (unless fallocate has been used to preallocate beyond EOF).
1e6decf30af5c5 Hugh Dickins 2021-09-02 1596 */
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1597) if (folio_test_large(folio)) {
809bc86517cc40 Baolin Wang 2024-08-12 1598 index = shmem_fallocend(inode,
809bc86517cc40 Baolin Wang 2024-08-12 1599 DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
809bc86517cc40 Baolin Wang 2024-08-12 1600 if ((index > folio->index && index < folio_next_index(folio)) ||
809bc86517cc40 Baolin Wang 2024-08-12 1601 !IS_ENABLED(CONFIG_THP_SWAP))
809bc86517cc40 Baolin Wang 2024-08-12 1602 split = true;
809bc86517cc40 Baolin Wang 2024-08-12 1603 }
809bc86517cc40 Baolin Wang 2024-08-12 1604
809bc86517cc40 Baolin Wang 2024-08-12 1605 if (split) {
809bc86517cc40 Baolin Wang 2024-08-12 1606 try_split:
1e6decf30af5c5 Hugh Dickins 2021-09-02 1607 /* Ensure the subpages are still dirty */
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1608) folio_test_set_dirty(folio);
44b1b073eb3614 Christoph Hellwig 2025-06-10 1609 if (split_folio_to_list(folio, folio_list))
1e6decf30af5c5 Hugh Dickins 2021-09-02 1610 goto redirty;
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1611) folio_clear_dirty(folio);
1e6decf30af5c5 Hugh Dickins 2021-09-02 1612 }
1e6decf30af5c5 Hugh Dickins 2021-09-02 1613
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1614) index = folio->index;
650180760be6bb Baolin Wang 2024-08-12 1615 nr_pages = folio_nr_pages(folio);
1635f6a74152f1 Hugh Dickins 2012-05-29 1616
1635f6a74152f1 Hugh Dickins 2012-05-29 1617 /*
1635f6a74152f1 Hugh Dickins 2012-05-29 1618 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
1635f6a74152f1 Hugh Dickins 2012-05-29 1619 * value into swapfile.c, the only way we can correctly account for a
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1620) * fallocated folio arriving here is now to initialize it and write it.
1aac1400319d30 Hugh Dickins 2012-05-29 1621 *
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1622) * That's okay for a folio already fallocated earlier, but if we have
1aac1400319d30 Hugh Dickins 2012-05-29 1623 * not yet completed the fallocation, then (a) we want to keep track
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1624) * of this folio in case we have to undo it, and (b) it may not be a
1aac1400319d30 Hugh Dickins 2012-05-29 1625 * good idea to continue anyway, once we're pushing into swap. So
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1626) * reactivate the folio, and let shmem_fallocate() quit when too many.
1635f6a74152f1 Hugh Dickins 2012-05-29 1627 */
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1628) if (!folio_test_uptodate(folio)) {
1aac1400319d30 Hugh Dickins 2012-05-29 1629 if (inode->i_private) {
1aac1400319d30 Hugh Dickins 2012-05-29 1630 struct shmem_falloc *shmem_falloc;
1aac1400319d30 Hugh Dickins 2012-05-29 1631 spin_lock(&inode->i_lock);
1aac1400319d30 Hugh Dickins 2012-05-29 1632 shmem_falloc = inode->i_private;
1aac1400319d30 Hugh Dickins 2012-05-29 1633 if (shmem_falloc &&
8e205f779d1443 Hugh Dickins 2014-07-23 1634 !shmem_falloc->waitq &&
1aac1400319d30 Hugh Dickins 2012-05-29 1635 index >= shmem_falloc->start &&
1aac1400319d30 Hugh Dickins 2012-05-29 1636 index < shmem_falloc->next)
d77b90d2b26426 Baolin Wang 2024-12-19 1637 shmem_falloc->nr_unswapped += nr_pages;
1aac1400319d30 Hugh Dickins 2012-05-29 1638 else
1aac1400319d30 Hugh Dickins 2012-05-29 1639 shmem_falloc = NULL;
1aac1400319d30 Hugh Dickins 2012-05-29 1640 spin_unlock(&inode->i_lock);
1aac1400319d30 Hugh Dickins 2012-05-29 1641 if (shmem_falloc)
1aac1400319d30 Hugh Dickins 2012-05-29 1642 goto redirty;
1aac1400319d30 Hugh Dickins 2012-05-29 1643 }
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1644) folio_zero_range(folio, 0, folio_size(folio));
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1645) flush_dcache_folio(folio);
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1646) folio_mark_uptodate(folio);
1635f6a74152f1 Hugh Dickins 2012-05-29 1647 }
1635f6a74152f1 Hugh Dickins 2012-05-29 1648
7d14492199f93c Kairui Song 2025-10-24 @1649 if (!folio_alloc_swap(folio)) {
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1650 bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1651 int error;
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1652
b1dea800ac3959 Hugh Dickins 2011-05-11 1653 /*
b1dea800ac3959 Hugh Dickins 2011-05-11 1654 * Add inode to shmem_unuse()'s list of swapped-out inodes,
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1655) * if it's not already there. Do it now before the folio is
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1656 * removed from page cache, when its pagelock no longer
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1657 * protects the inode from eviction. And do it now, after
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1658 * we've incremented swapped, because shmem_unuse() will
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1659 * prune a !swapped inode from the swaplist.
b1dea800ac3959 Hugh Dickins 2011-05-11 1660 */
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1661 if (first_swapped) {
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1662 spin_lock(&shmem_swaplist_lock);
05bf86b4ccfd0f Hugh Dickins 2011-05-14 1663 if (list_empty(&info->swaplist))
b56a2d8af9147a Vineeth Remanan Pillai 2019-03-05 1664 list_add(&info->swaplist, &shmem_swaplist);
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1665 spin_unlock(&shmem_swaplist_lock);
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1666 }
b1dea800ac3959 Hugh Dickins 2011-05-11 1667
80d6ed40156385 Kairui Song 2025-10-29 1668 folio_dup_swap(folio, NULL);
b487a2da3575b6 Kairui Song 2025-03-14 1669 shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
267a4c76bbdb95 Hugh Dickins 2015-12-11 1670
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1671) BUG_ON(folio_mapped(folio));
6344a6d9ce13ae Hugh Dickins 2025-07-16 1672 error = swap_writeout(folio, plug);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1673 if (error != AOP_WRITEPAGE_ACTIVATE) {
6344a6d9ce13ae Hugh Dickins 2025-07-16 1674 /* folio has been unlocked */
6344a6d9ce13ae Hugh Dickins 2025-07-16 1675 return error;
6344a6d9ce13ae Hugh Dickins 2025-07-16 1676 }
6344a6d9ce13ae Hugh Dickins 2025-07-16 1677
6344a6d9ce13ae Hugh Dickins 2025-07-16 1678 /*
6344a6d9ce13ae Hugh Dickins 2025-07-16 1679 * The intention here is to avoid holding on to the swap when
6344a6d9ce13ae Hugh Dickins 2025-07-16 1680 * zswap was unable to compress and unable to writeback; but
6344a6d9ce13ae Hugh Dickins 2025-07-16 1681 * it will be appropriate if other reactivate cases are added.
6344a6d9ce13ae Hugh Dickins 2025-07-16 1682 */
6344a6d9ce13ae Hugh Dickins 2025-07-16 1683 error = shmem_add_to_page_cache(folio, mapping, index,
6344a6d9ce13ae Hugh Dickins 2025-07-16 1684 swp_to_radix_entry(folio->swap),
6344a6d9ce13ae Hugh Dickins 2025-07-16 1685 __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1686 /* Swap entry might be erased by racing shmem_free_swap() */
6344a6d9ce13ae Hugh Dickins 2025-07-16 1687 if (!error) {
6344a6d9ce13ae Hugh Dickins 2025-07-16 1688 shmem_recalc_inode(inode, 0, -nr_pages);
80d6ed40156385 Kairui Song 2025-10-29 1689 folio_put_swap(folio, NULL);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1690 }
6344a6d9ce13ae Hugh Dickins 2025-07-16 1691
6344a6d9ce13ae Hugh Dickins 2025-07-16 1692 /*
fd8d4f862f8c27 Kairui Song 2025-09-17 1693 * The swap_cache_del_folio() below could be left for
6344a6d9ce13ae Hugh Dickins 2025-07-16 1694 * shrink_folio_list()'s folio_free_swap() to dispose of;
6344a6d9ce13ae Hugh Dickins 2025-07-16 1695 * but I'm a little nervous about letting this folio out of
6344a6d9ce13ae Hugh Dickins 2025-07-16 1696 * shmem_writeout() in a hybrid half-tmpfs-half-swap state
6344a6d9ce13ae Hugh Dickins 2025-07-16 1697 * e.g. folio_mapping(folio) might give an unexpected answer.
6344a6d9ce13ae Hugh Dickins 2025-07-16 1698 */
fd8d4f862f8c27 Kairui Song 2025-09-17 1699 swap_cache_del_folio(folio);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1700 goto redirty;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1701 }
b487a2da3575b6 Kairui Song 2025-03-14 1702 if (nr_pages > 1)
b487a2da3575b6 Kairui Song 2025-03-14 1703 goto try_split;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1704 redirty:
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1705) folio_mark_dirty(folio);
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1706) return AOP_WRITEPAGE_ACTIVATE; /* Return with folio locked */
^1da177e4c3f41 Linus Torvalds 2005-04-16 1707 }
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1708) EXPORT_SYMBOL_GPL(shmem_writeout);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1709
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
2025-10-29 19:25 ` kernel test robot
2025-10-29 19:25 ` kernel test robot
@ 2025-11-01 4:51 ` YoungJun Park
2025-11-01 8:59 ` Kairui Song
2 siblings, 1 reply; 50+ messages in thread
From: YoungJun Park @ 2025-11-01 4:51 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:40PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
Hello Kairui!
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
>
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.
>
> Swap entry will be mostly allocated and free with a folio bound.
> The folio lock will be useful for resolving many swap ralated races.
>
> Now swap allocation (except hibernation) always starts with a folio in
> the swap cache, and gets duped/freed protected by the folio lock:
>
> - folio_alloc_swap() - The only allocation entry point now.
> Context: The folio must be locked.
> This allocates one or a set of continuous swap slots for a folio and
> binds them to the folio by adding the folio to the swap cache. The
> swap slots' swap count start with zero value.
>
> - folio_dup_swap() - Increase the swap count of one or more entries.
> Context: The folio must be locked and in the swap cache. For now, the
> caller still has to lock the new swap entry owner (e.g., PTL).
> This increases the ref count of swap entries allocated to a folio.
> Newly allocated swap slots' count has to be increased by this helper
> as the folio got unmapped (and swap entries got installed).
>
> - folio_put_swap() - Decrease the swap count of one or more entries.
> Context: The folio must be locked and in the swap cache. For now, the
> caller still has to lock the new swap entry owner (e.g., PTL).
> This decreases the ref count of swap entries allocated to a folio.
> Typically, swapin will decrease the swap count as the folio got
> installed back and the swap entry got uninstalled
>
> This won't remove the folio from the swap cache and free the
> slot. Lazy freeing of swap cache is helpful for reducing IO.
> There is already a folio_free_swap() for immediate cache reclaim.
> This part could be further optimized later.
>
> The above locking constraints could be further relaxed when the swap
> table if fully implemented. Currently dup still needs the caller
> to lock the swap entry container (e.g. PTL), or a concurrent zap
> may underflow the swap count.
>
> Some swap users need to interact with swap count without involving folio
> (e.g. forking/zapping the page table or mapping truncate without swapin).
> In such cases, the caller has to ensure there is no race condition on
> whatever owns the swap count and use the below helpers:
>
> - swap_put_entries_direct() - Decrease the swap count directly.
> Context: The caller must lock whatever is referencing the slots to
> avoid a race.
>
> Typically the page table zapping or shmem mapping truncate will need
> to free swap slots directly. If a slot is cached (has a folio bound),
> this will also try to release the swap cache.
>
> - swap_dup_entry_direct() - Increase the swap count directly.
> Context: The caller must lock whatever is referencing the entries to
> avoid race, and the entries must already have a swap count > 1.
>
> Typically, forking will need to copy the page table and hence needs to
> increase the swap count of the entries in the table. The page table is
> locked while referencing the swap entries, so the entries all have a
> swap count > 1 and can't be freed.
>
> Hibernation subsystem is a bit different, so two special wrappers are here:
>
> - swap_alloc_hibernation_slot() - Allocate one entry from one device.
> - swap_free_hibernation_slot() - Free one entry allocated by the above
> helper.
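Before my question, let me restate my reading of the workflow quoted
above as a small sketch (hypothetical callers, argument lists as used
in the shmem hunks of this series; not literal kernel code):

/* Swap-out: folio locked, allocation binds it to the swap cache */
static int swapout_side_sketch(struct folio *folio)
{
	if (folio_alloc_swap(folio))	/* slots allocated, count == 0 */
		return -ENOMEM;
	folio_dup_swap(folio, NULL);	/* count++ as swap entries get installed */
	return 0;
}

/* Swap-in: folio locked and still in the swap cache */
static void swapin_side_sketch(struct folio *folio)
{
	folio_put_swap(folio, NULL);	/* count-- as the folio is mapped back */
	/* the cache itself is freed lazily, or via folio_free_swap() */
}

/* No folio involved (zap/truncate, fork); caller holds e.g. the PTL */
static void direct_side_sketch(swp_entry_t entry, int nr)
{
	swap_put_entries_direct(entry, nr);
	/* swap_dup_entry_direct() is the fork-side counterpart */
}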
During the code review, I found something that needs to be verified.
It is not directly relevant to your patch, but I'm sending this email
to confirm it and to discuss a possible fix in this patch.
In the swap_alloc_hibernation_slot() function, nr_swap_pages is
decreased, but I think it is already decreased in swap_range_alloc().
nr_swap_pages is decremented along the call flow below:
cluster_alloc_swap_entry -> alloc_swap_scan_cluster
-> cluster_alloc_range -> swap_range_alloc
Introduced on
4f78252da887ee7e9d1875dd6e07d9baa936c04f
mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
#ifdef CONFIG_HIBERNATION
/* Allocate a slot for hibernation */
swp_entry_t swap_alloc_hibernation_slot(int type)
{
....
local_unlock(&percpu_swap_cluster.lock);
if (offset) {
entry = swp_entry(si->type, offset);
atomic_long_dec(&nr_swap_pages); // here
Thank you,
Youngjun Park
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-11-01 4:51 ` YoungJun Park
@ 2025-11-01 8:59 ` Kairui Song
2025-11-01 9:08 ` YoungJun Park
0 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-11-01 8:59 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Sat, Nov 1, 2025 at 12:51 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:40PM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
>
> Hello Kairui!
>
> > The current swap entry allocation/freeing workflow has never had a clear
> > definition. This makes it hard to debug or add new optimizations.
> >
> > This commit introduces a proper definition of how swap entries would be
> > allocated and freed. Now, most operations are folio based, so they will
> > never exceed one swap cluster, and we now have a cleaner border between
> > swap and the rest of mm, making it much easier to follow and debug,
> > especially with new added sanity checks. Also making more optimization
> > possible.
> >
> > Swap entry will be mostly allocated and free with a folio bound.
> > The folio lock will be useful for resolving many swap ralated races.
> >
> > Now swap allocation (except hibernation) always starts with a folio in
> > the swap cache, and gets duped/freed protected by the folio lock:
> >
> > - folio_alloc_swap() - The only allocation entry point now.
> > Context: The folio must be locked.
> > This allocates one or a set of continuous swap slots for a folio and
> > binds them to the folio by adding the folio to the swap cache. The
> > swap slots' swap count start with zero value.
> >
> > - folio_dup_swap() - Increase the swap count of one or more entries.
> > Context: The folio must be locked and in the swap cache. For now, the
> > caller still has to lock the new swap entry owner (e.g., PTL).
> > This increases the ref count of swap entries allocated to a folio.
> > Newly allocated swap slots' count has to be increased by this helper
> > as the folio got unmapped (and swap entries got installed).
> >
> > - folio_put_swap() - Decrease the swap count of one or more entries.
> > Context: The folio must be locked and in the swap cache. For now, the
> > caller still has to lock the new swap entry owner (e.g., PTL).
> > This decreases the ref count of swap entries allocated to a folio.
> > Typically, swapin will decrease the swap count as the folio got
> > installed back and the swap entry got uninstalled
> >
> > This won't remove the folio from the swap cache and free the
> > slot. Lazy freeing of swap cache is helpful for reducing IO.
> > There is already a folio_free_swap() for immediate cache reclaim.
> > This part could be further optimized later.
> >
> > The above locking constraints could be further relaxed when the swap
> > table if fully implemented. Currently dup still needs the caller
> > to lock the swap entry container (e.g. PTL), or a concurrent zap
> > may underflow the swap count.
> >
> > Some swap users need to interact with swap count without involving folio
> > (e.g. forking/zapping the page table or mapping truncate without swapin).
> > In such cases, the caller has to ensure there is no race condition on
> > whatever owns the swap count and use the below helpers:
> >
> > - swap_put_entries_direct() - Decrease the swap count directly.
> > Context: The caller must lock whatever is referencing the slots to
> > avoid a race.
> >
> > Typically the page table zapping or shmem mapping truncate will need
> > to free swap slots directly. If a slot is cached (has a folio bound),
> > this will also try to release the swap cache.
> >
> > - swap_dup_entry_direct() - Increase the swap count directly.
> > Context: The caller must lock whatever is referencing the entries to
> > avoid race, and the entries must already have a swap count > 1.
> >
> > Typically, forking will need to copy the page table and hence needs to
> > increase the swap count of the entries in the table. The page table is
> > locked while referencing the swap entries, so the entries all have a
> > swap count > 1 and can't be freed.
> >
> > Hibernation subsystem is a bit different, so two special wrappers are here:
> >
> > - swap_alloc_hibernation_slot() - Allocate one entry from one device.
> > - swap_free_hibernation_slot() - Free one entry allocated by the above
> > helper.
>
> During the code review, I found something that needs to be verified.
> It is not directly relevant to your patch, but I'm sending this email
> to confirm it and to discuss a possible fix in this patch.
>
> In the swap_alloc_hibernation_slot() function, nr_swap_pages is
> decreased, but I think it is already decreased in swap_range_alloc().
>
> nr_swap_pages is decremented along the call flow below:
>
> cluster_alloc_swap_entry -> alloc_swap_scan_cluster
> -> cluster_alloc_range -> swap_range_alloc
>
> Introduced on
> 4f78252da887ee7e9d1875dd6e07d9baa936c04f
> mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
>
Yeah, you are right, that's a bug introduced by 4f78252da887. Will you
send a patch to fix that? Or I can send one; just removing the
atomic_long_dec(&nr_swap_pages) in get_swap_page_of_type should be
enough.
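Something like this (just a sketch, context lines taken from the
snippet you quoted above; I'll leave the real hunk to whoever sends
the patch):

 	local_unlock(&percpu_swap_cluster.lock);
 	if (offset) {
 		entry = swp_entry(si->type, offset);
-		atomic_long_dec(&nr_swap_pages);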
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-11-01 8:59 ` Kairui Song
@ 2025-11-01 9:08 ` YoungJun Park
0 siblings, 0 replies; 50+ messages in thread
From: YoungJun Park @ 2025-11-01 9:08 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Sat, Nov 01, 2025 at 04:59:05PM +0800, Kairui Song wrote:
> On Sat, Nov 1, 2025 at 12:51 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:58:40PM +0800, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > During the code review, I found something that needs to be verified.
> > It is not directly relevant to your patch, but I'm sending this email
> > to confirm it and to discuss a possible fix in this patch.
> >
> > In the swap_alloc_hibernation_slot() function, nr_swap_pages is
> > decreased, but I think it is already decreased in swap_range_alloc().
> >
> > nr_swap_pages is decremented along the call flow below:
> >
> > cluster_alloc_swap_entry -> alloc_swap_scan_cluster
> > -> cluster_alloc_range -> swap_range_alloc
> >
> > Introduced on
> > 4f78252da887ee7e9d1875dd6e07d9baa936c04f
> > mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
> >
>
> Yeah, you are right, that's a bug introduced by 4f78252da887. Will you
> send a patch to fix that? Or I can send one; just removing the
> atomic_long_dec(&nr_swap_pages) in get_swap_page_of_type should be
> enough.
Thank you for double-checking. I will send a patch soon.
Regards,
Youngjun Park
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (13 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 16:52 ` Kairui Song
2025-10-31 5:56 ` YoungJun Park
2025-10-29 15:58 ` [PATCH 16/19] mm, swap: check swap table directly for checking cache Kairui Song
` (5 subsequent siblings)
20 siblings, 2 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
This pinning usage here can be dropped by adding the folio to swap
cache directly on allocation.
All swap allocations are folio-based now (except for hibernation), so
the swap allocator can always take the folio as the parameter. And now
both swap cache (swap table) and swap map are protected by the cluster
lock, scanning the map and inserting the folio can be done in the same
critical section. This eliminates the time window that a slot is pinned
by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock
multiple times.
This is both a cleanup and an optimization.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
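Caller-visible contract after this patch, condensed from the
folio_alloc_swap() hunk below (swapout_sketch() is a hypothetical
caller written for illustration, not code from this series):

static int swapout_sketch(struct folio *folio)
{
	/* folio is locked, uptodate and swapbacked */
	if (folio_alloc_swap(folio))
		return -ENOMEM;	/* nothing allocated, folio not in swap cache */

	/*
	 * Success: the allocator has already inserted the folio into the
	 * swap cache under the cluster lock, so folio->swap is set and
	 * folio_test_swapcache(folio) is true, with no window where a
	 * slot is pinned by SWAP_HAS_CACHE without a cache. The swap
	 * count starts at 0 and is raised later via folio_dup_swap()
	 * as the folio gets unmapped.
	 */
	return 0;
}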
include/linux/swap.h | 5 --
mm/swap.h | 8 +--
mm/swap_state.c | 56 +++++++++++-------
mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
4 files changed, 105 insertions(+), 125 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac3caa4c6999..4b4b81fbc6a3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
-void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
@@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
{
}
-static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
-{
-}
-
static inline int __swap_count(swp_entry_t entry)
{
return 0;
diff --git a/mm/swap.h b/mm/swap.h
index 74c61129d7b7..03694ffa662f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
*/
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
- void **shadow, bool alloc);
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx,
bool *alloced);
/* Below helpers require the caller to lock and pass in the swap cluster. */
+void __swap_cache_add_folio(struct swap_cluster_info *ci,
+ struct folio *folio, swp_entry_t entry);
void __swap_cache_del_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry, void *shadow);
void __swap_cache_replace_folio(struct swap_cluster_info *ci,
@@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
-static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
- void **shadow, bool alloc)
+static inline void __swap_cache_add_folio(struct swap_cluster_info *ci,
+ struct folio *folio, swp_entry_t entry)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d2bcca92b6e0..85d9f99c384f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
+void __swap_cache_add_folio(struct swap_cluster_info *ci,
+ struct folio *folio, swp_entry_t entry)
+{
+ unsigned long new_tb;
+ unsigned int ci_start, ci_off, ci_end;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+
+ new_tb = folio_to_swp_tb(folio);
+ ci_start = swp_cluster_offset(entry);
+ ci_off = ci_start;
+ ci_end = ci_start + nr_pages;
+ do {
+ VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
+ __swap_table_set(ci, ci_off, new_tb);
+ } while (++ci_off < ci_end);
+
+ folio_ref_add(folio, nr_pages);
+ folio_set_swapcache(folio);
+ folio->swap = entry;
+
+ node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+}
+
/**
* swap_cache_add_folio - Add a folio into the swap cache.
* @folio: The folio to be added.
@@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
* The caller also needs to update the corresponding swap_map slots with
* SWAP_HAS_CACHE bit to avoid race or conflict.
*/
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
- void **shadowp, bool alloc)
+static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadowp)
{
int err;
void *shadow = NULL;
+ unsigned long old_tb;
struct swap_info_struct *si;
- unsigned long old_tb, new_tb;
struct swap_cluster_info *ci;
unsigned int ci_start, ci_off, ci_end, offset;
unsigned long nr_pages = folio_nr_pages(folio);
- VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
-
si = __swap_entry_to_info(entry);
- new_tb = folio_to_swp_tb(folio);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
ci_off = ci_start;
@@ -168,7 +191,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
err = -EEXIST;
goto failed;
}
- if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
+ if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
err = -ENOENT;
goto failed;
}
@@ -184,20 +207,11 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
* Still need to pin the slots with SWAP_HAS_CACHE since
* swap allocator depends on that.
*/
- if (!alloc)
- __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
- __swap_table_set(ci, ci_off, new_tb);
+ __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
offset++;
} while (++ci_off < ci_end);
-
- folio_ref_add(folio, nr_pages);
- folio_set_swapcache(folio);
- folio->swap = entry;
+ __swap_cache_add_folio(ci, folio, entry);
swap_cluster_unlock(ci);
-
- node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
- lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
-
if (shadowp)
*shadowp = shadow;
return 0;
@@ -466,7 +480,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
for (;;) {
- ret = swap_cache_add_folio(folio, entry, &shadow, false);
+ ret = swap_cache_add_folio(folio, entry, &shadow);
if (!ret)
break;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 426b0b6d583f..8d98f28907bc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -875,28 +875,53 @@ static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
}
}
-static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
- unsigned int start, unsigned char usage,
- unsigned int order)
+static bool cluster_alloc_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ struct folio *folio,
+ unsigned int offset)
{
- unsigned int nr_pages = 1 << order;
+ unsigned long nr_pages;
+ unsigned int order;
lockdep_assert_held(&ci->lock);
if (!(si->flags & SWP_WRITEOK))
return false;
+ /*
+ * All mm swap allocations start with a folio (folio_alloc_swap),
+ * which is also the only allocation path for large orders.
+ * Such swap slots start with count == 0, and the count is
+ * increased when the folio gets unmapped.
+ *
+ * Else, it's an exclusive order 0 allocation for hibernation.
+ * The slot starts with count == 1 and never increases.
+ */
+ if (likely(folio)) {
+ order = folio_order(folio);
+ nr_pages = 1 << order;
+ /*
+ * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries.
+ * This is the legacy allocation behavior, will drop it very soon.
+ */
+ memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
+ __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
+ } else {
+ order = 0;
+ nr_pages = 1;
+ WARN_ON_ONCE(si->swap_map[offset]);
+ si->swap_map[offset] = 1;
+ swap_cluster_assert_table_empty(ci, offset, 1);
+ }
+
/*
* The first allocation in a cluster makes the
* cluster exclusive to this order
*/
if (cluster_is_empty(ci))
ci->order = order;
-
- memset(si->swap_map + start, usage, nr_pages);
- swap_cluster_assert_table_empty(ci, start, nr_pages);
- swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
+ swap_range_alloc(si, nr_pages);
return true;
}
@@ -904,13 +929,12 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
/* Try use a new cluster for current CPU and allocate from it. */
static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned int order,
- unsigned char usage)
+ struct folio *folio, unsigned long offset)
{
unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
+ unsigned int order = likely(folio) ? folio_order(folio) : 0;
unsigned int nr_pages = 1 << order;
bool need_reclaim;
@@ -930,7 +954,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
continue;
offset = found;
}
- if (!cluster_alloc_range(si, ci, offset, usage, order))
+ if (!cluster_alloc_range(si, ci, folio, offset))
break;
found = offset;
offset += nr_pages;
@@ -952,8 +976,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
struct list_head *list,
- unsigned int order,
- unsigned char usage,
+ struct folio *folio,
bool scan_all)
{
unsigned int found = SWAP_ENTRY_INVALID;
@@ -965,7 +988,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
if (!ci)
break;
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, offset);
if (found)
break;
} while (scan_all);
@@ -1026,10 +1049,11 @@ static void swap_reclaim_work(struct work_struct *work)
* Try to allocate swap entries with specified order and try set a new
* cluster for current CPU too.
*/
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
- unsigned char usage)
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
+ struct folio *folio)
{
struct swap_cluster_info *ci;
+ unsigned int order = likely(folio) ? folio_order(folio) : 0;
unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
/*
@@ -1051,8 +1075,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset,
- order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_cluster_unlock(ci);
}
@@ -1066,22 +1089,19 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* to spread out the writes.
*/
if (si->flags & SWP_PAGE_DISCARD) {
- found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
- false);
+ found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
if (found)
goto done;
}
if (order < PMD_ORDER) {
- found = alloc_swap_scan_list(si, &si->nonfull_clusters[order],
- order, usage, true);
+ found = alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, true);
if (found)
goto done;
}
if (!(si->flags & SWP_PAGE_DISCARD)) {
- found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
- false);
+ found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
if (found)
goto done;
}
@@ -1097,8 +1117,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* failure is not critical. Scanning one cluster still
* keeps the list rotated and reclaimed (for HAS_CACHE).
*/
- found = alloc_swap_scan_list(si, &si->frag_clusters[order], order,
- usage, false);
+ found = alloc_swap_scan_list(si, &si->frag_clusters[order], folio, false);
if (found)
goto done;
}
@@ -1112,13 +1131,11 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* Clusters here have at least one usable slots and can't fail order 0
* allocation, but reclaim may drop si->lock and race with another user.
*/
- found = alloc_swap_scan_list(si, &si->frag_clusters[o],
- 0, usage, true);
+ found = alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true);
if (found)
goto done;
- found = alloc_swap_scan_list(si, &si->nonfull_clusters[o],
- 0, usage, true);
+ found = alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true);
if (found)
goto done;
}
@@ -1309,12 +1326,12 @@ static bool get_swap_device_info(struct swap_info_struct *si)
* Fast path try to get swap entries with specified order from current
* CPU's swap entry pool (a cluster).
*/
-static bool swap_alloc_fast(swp_entry_t *entry,
- int order)
+static bool swap_alloc_fast(struct folio *folio)
{
+ unsigned int order = folio_order(folio);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
- unsigned int offset, found = SWAP_ENTRY_INVALID;
+ unsigned int offset;
/*
* Once allocated, swap_info_struct will never be completely freed,
@@ -1329,22 +1346,18 @@ static bool swap_alloc_fast(swp_entry_t *entry,
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
- if (found)
- *entry = swp_entry(si->type, found);
+ alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_cluster_unlock(ci);
}
put_swap_device(si);
- return !!found;
+ return folio_test_swapcache(folio);
}
/* Rotate the device and switch to a new cluster */
-static bool swap_alloc_slow(swp_entry_t *entry,
- int order)
+static void swap_alloc_slow(struct folio *folio)
{
- unsigned long offset;
struct swap_info_struct *si, *next;
spin_lock(&swap_avail_lock);
@@ -1354,14 +1367,12 @@ static bool swap_alloc_slow(swp_entry_t *entry,
plist_requeue(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
- offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
+ cluster_alloc_swap_entry(si, folio);
put_swap_device(si);
- if (offset) {
- *entry = swp_entry(si->type, offset);
- return true;
- }
- if (order)
- return false;
+ if (folio_test_swapcache(folio))
+ return;
+ if (folio_test_large(folio))
+ return;
}
spin_lock(&swap_avail_lock);
@@ -1423,7 +1434,6 @@ int folio_alloc_swap(struct folio *folio)
{
unsigned int order = folio_order(folio);
unsigned int size = 1 << order;
- swp_entry_t entry = {};
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -1448,39 +1458,23 @@ int folio_alloc_swap(struct folio *folio)
again:
local_lock(&percpu_swap_cluster.lock);
- if (!swap_alloc_fast(&entry, order))
- swap_alloc_slow(&entry, order);
+ if (!swap_alloc_fast(folio))
+ swap_alloc_slow(folio);
local_unlock(&percpu_swap_cluster.lock);
- if (unlikely(!order && !entry.val)) {
+ if (!order && unlikely(!folio_test_swapcache(folio))) {
if (swap_sync_discard())
goto again;
}
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
- if (mem_cgroup_try_charge_swap(folio, entry))
- goto out_free;
+ if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap)))
+ swap_cache_del_folio(folio);
- if (!entry.val)
+ if (unlikely(!folio_test_swapcache(folio)))
return -ENOMEM;
- /*
- * Allocator has pinned the slots with SWAP_HAS_CACHE
- * so it should never fail
- */
- WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
-
- /*
- * Allocator should always allocate aligned entries so folio based
- * operations never crossed more than one cluster.
- */
- VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
-
return 0;
-
-out_free:
- put_swap_folio(folio, entry);
- return -ENOMEM;
}
/**
@@ -1779,29 +1773,6 @@ static void swap_entries_free(struct swap_info_struct *si,
partial_free_cluster(si, ci);
}
-/*
- * Called after dropping swapcache to decrease refcnt to swap entries.
- */
-void put_swap_folio(struct folio *folio, swp_entry_t entry)
-{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset = swp_offset(entry);
- int size = 1 << swap_entry_order(folio_order(folio));
-
- si = _swap_info_get(entry);
- if (!si)
- return;
-
- ci = swap_cluster_lock(si, offset);
- if (swap_only_has_cache(si, offset, size))
- swap_entries_free(si, ci, entry, size);
- else
- for (int i = 0; i < size; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
- swap_cluster_unlock(ci);
-}
-
int __swap_count(swp_entry_t entry)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
@@ -2052,7 +2023,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
* with swap table allocation.
*/
local_lock(&percpu_swap_cluster.lock);
- offset = cluster_alloc_swap_entry(si, 0, 1);
+ offset = cluster_alloc_swap_entry(si, NULL);
local_unlock(&percpu_swap_cluster.lock);
if (offset) {
entry = swp_entry(si->type, offset);
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
@ 2025-10-29 16:52 ` Kairui Song
2025-10-31 5:56 ` YoungJun Park
1 sibling, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 16:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Thu, Oct 30, 2025 at 12:00 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
> SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
> This pinning usage here can be dropped by adding the folio to swap
> cache directly on allocation.
>
> All swap allocations are folio-based now (except for hibernation), so
> the swap allocator can always take the folio as a parameter. And since
> both the swap cache (swap table) and the swap map are now protected by
> the cluster lock, scanning the map and inserting the folio can be done
> in the same critical section. This eliminates the time window in which
> a slot is pinned by SWAP_HAS_CACHE but has no cache yet, and avoids
> taking the lock multiple times.
>
> This is both a cleanup and an optimization.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/swap.h | 5 --
> mm/swap.h | 8 +--
> mm/swap_state.c | 56 +++++++++++-------
> mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
> 4 files changed, 105 insertions(+), 125 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ac3caa4c6999..4b4b81fbc6a3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
> }
>
> extern void si_swapinfo(struct sysinfo *);
> -void put_swap_folio(struct folio *folio, swp_entry_t entry);
> extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> int swap_type_of(dev_t device, sector_t offset);
> int find_first_swap(dev_t *device);
> @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
> {
> }
>
> -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> -{
> -}
> -
> static inline int __swap_count(swp_entry_t entry)
> {
> return 0;
> diff --git a/mm/swap.h b/mm/swap.h
> index 74c61129d7b7..03694ffa662f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc);
> void swap_cache_del_folio(struct folio *folio);
> struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> struct mempolicy *mpol, pgoff_t ilx,
> bool *alloced);
> /* Below helpers require the caller to lock and pass in the swap cluster. */
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry);
> void __swap_cache_del_folio(struct swap_cluster_info *ci,
> struct folio *folio, swp_entry_t entry, void *shadow);
> void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> @@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc)
> +static inline void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> {
> }
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d2bcca92b6e0..85d9f99c384f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> +{
> + unsigned long new_tb;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> +
> + new_tb = folio_to_swp_tb(folio);
> + ci_start = swp_cluster_offset(entry);
> + ci_off = ci_start;
> + ci_end = ci_start + nr_pages;
> + do {
> + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> + __swap_table_set(ci, ci_off, new_tb);
> + } while (++ci_off < ci_end);
> +
> + folio_ref_add(folio, nr_pages);
> + folio_set_swapcache(folio);
> + folio->swap = entry;
> +
> + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> +}
> +
> /**
> * swap_cache_add_folio - Add a folio into the swap cache.
> * @folio: The folio to be added.
> @@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> * The caller also needs to update the corresponding swap_map slots with
> * SWAP_HAS_CACHE bit to avoid race or conflict.
> */
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadowp, bool alloc)
> +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> + void **shadowp)
> {
> int err;
> void *shadow = NULL;
> + unsigned long old_tb;
> struct swap_info_struct *si;
> - unsigned long old_tb, new_tb;
> struct swap_cluster_info *ci;
> unsigned int ci_start, ci_off, ci_end, offset;
> unsigned long nr_pages = folio_nr_pages(folio);
>
> - VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> - VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> - VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> -
> si = __swap_entry_to_info(entry);
> - new_tb = folio_to_swp_tb(folio);
> ci_start = swp_cluster_offset(entry);
> ci_end = ci_start + nr_pages;
> ci_off = ci_start;
> @@ -168,7 +191,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> err = -EEXIST;
> goto failed;
> }
> - if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
> + if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
> err = -ENOENT;
> goto failed;
> }
> @@ -184,20 +207,11 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> * Still need to pin the slots with SWAP_HAS_CACHE since
> * swap allocator depends on that.
> */
> - if (!alloc)
> - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
> - __swap_table_set(ci, ci_off, new_tb);
> + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
> offset++;
> } while (++ci_off < ci_end);
> -
> - folio_ref_add(folio, nr_pages);
> - folio_set_swapcache(folio);
> - folio->swap = entry;
> + __swap_cache_add_folio(ci, folio, entry);
> swap_cluster_unlock(ci);
> -
> - node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> -
> if (shadowp)
> *shadowp = shadow;
> return 0;
> @@ -466,7 +480,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
> for (;;) {
> - ret = swap_cache_add_folio(folio, entry, &shadow, false);
> + ret = swap_cache_add_folio(folio, entry, &shadow);
> if (!ret)
> break;
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 426b0b6d583f..8d98f28907bc 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -875,28 +875,53 @@ static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
> }
> }
>
> -static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
> - unsigned int start, unsigned char usage,
> - unsigned int order)
> +static bool cluster_alloc_range(struct swap_info_struct *si,
> + struct swap_cluster_info *ci,
> + struct folio *folio,
> + unsigned int offset)
> {
> - unsigned int nr_pages = 1 << order;
> + unsigned long nr_pages;
> + unsigned int order;
>
> lockdep_assert_held(&ci->lock);
>
> if (!(si->flags & SWP_WRITEOK))
> return false;
>
> + /*
> + * All mm swap allocations start with a folio (folio_alloc_swap),
> + * which is also the only allocation path for large orders.
> + * Such swap slots start with count == 0, and the count is
> + * increased when the folio gets unmapped.
> + *
> + * Else, it's an exclusive order 0 allocation for hibernation.
> + * The slot starts with count == 1 and never increases.
> + */
> + if (likely(folio)) {
> + order = folio_order(folio);
> + nr_pages = 1 << order;
> + /*
> + * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries.
> + * This is the legacy allocation behavior, will drop it very soon.
> + */
> + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
> + __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
> + } else {
> + order = 0;
> + nr_pages = 1;
> + WARN_ON_ONCE(si->swap_map[offset]);
> + si->swap_map[offset] = 1;
> + swap_cluster_assert_table_empty(ci, offset, 1);
> + }
> +
> /*
> * The first allocation in a cluster makes the
> * cluster exclusive to this order
> */
> if (cluster_is_empty(ci))
> ci->order = order;
> -
> - memset(si->swap_map + start, usage, nr_pages);
> - swap_cluster_assert_table_empty(ci, start, nr_pages);
> - swap_range_alloc(si, nr_pages);
> ci->count += nr_pages;
> + swap_range_alloc(si, nr_pages);
>
> return true;
> }
> @@ -904,13 +929,12 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> /* Try use a new cluster for current CPU and allocate from it. */
> static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> struct swap_cluster_info *ci,
> - unsigned long offset,
> - unsigned int order,
> - unsigned char usage)
> + struct folio *folio, unsigned long offset)
> {
> unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> + unsigned int order = likely(folio) ? folio_order(folio) : 0;
> unsigned int nr_pages = 1 << order;
> bool need_reclaim;
>
> @@ -930,7 +954,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> continue;
> offset = found;
> }
> - if (!cluster_alloc_range(si, ci, offset, usage, order))
> + if (!cluster_alloc_range(si, ci, folio, offset))
> break;
> found = offset;
> offset += nr_pages;
> @@ -952,8 +976,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>
> static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
> struct list_head *list,
> - unsigned int order,
> - unsigned char usage,
> + struct folio *folio,
> bool scan_all)
> {
> unsigned int found = SWAP_ENTRY_INVALID;
> @@ -965,7 +988,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
> if (!ci)
> break;
> offset = cluster_offset(si, ci);
> - found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
> + found = alloc_swap_scan_cluster(si, ci, folio, offset);
> if (found)
> break;
> } while (scan_all);
> @@ -1026,10 +1049,11 @@ static void swap_reclaim_work(struct work_struct *work)
> * Try to allocate swap entries with specified order and try set a new
> * cluster for current CPU too.
> */
> -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> - unsigned char usage)
> +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
> + struct folio *folio)
> {
> struct swap_cluster_info *ci;
> + unsigned int order = likely(folio) ? folio_order(folio) : 0;
> unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
>
> /*
> @@ -1051,8 +1075,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> if (cluster_is_usable(ci, order)) {
> if (cluster_is_empty(ci))
> offset = cluster_offset(si, ci);
> - found = alloc_swap_scan_cluster(si, ci, offset,
> - order, usage);
> + found = alloc_swap_scan_cluster(si, ci, folio, offset);
> } else {
> swap_cluster_unlock(ci);
> }
> @@ -1066,22 +1089,19 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> * to spread out the writes.
> */
> if (si->flags & SWP_PAGE_DISCARD) {
> - found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
> - false);
> + found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
> if (found)
> goto done;
> }
>
> if (order < PMD_ORDER) {
> - found = alloc_swap_scan_list(si, &si->nonfull_clusters[order],
> - order, usage, true);
> + found = alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, true);
> if (found)
> goto done;
> }
>
> if (!(si->flags & SWP_PAGE_DISCARD)) {
> - found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
> - false);
> + found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
> if (found)
> goto done;
> }
> @@ -1097,8 +1117,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> * failure is not critical. Scanning one cluster still
> * keeps the list rotated and reclaimed (for HAS_CACHE).
> */
> - found = alloc_swap_scan_list(si, &si->frag_clusters[order], order,
> - usage, false);
> + found = alloc_swap_scan_list(si, &si->frag_clusters[order], folio, false);
> if (found)
> goto done;
> }
> @@ -1112,13 +1131,11 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> * Clusters here have at least one usable slots and can't fail order 0
> * allocation, but reclaim may drop si->lock and race with another user.
> */
> - found = alloc_swap_scan_list(si, &si->frag_clusters[o],
> - 0, usage, true);
> + found = alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true);
> if (found)
> goto done;
>
> - found = alloc_swap_scan_list(si, &si->nonfull_clusters[o],
> - 0, usage, true);
> + found = alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true);
> if (found)
> goto done;
> }
> @@ -1309,12 +1326,12 @@ static bool get_swap_device_info(struct swap_info_struct *si)
> * Fast path try to get swap entries with specified order from current
> * CPU's swap entry pool (a cluster).
> */
> -static bool swap_alloc_fast(swp_entry_t *entry,
> - int order)
> +static bool swap_alloc_fast(struct folio *folio)
> {
> + unsigned int order = folio_order(folio);
> struct swap_cluster_info *ci;
> struct swap_info_struct *si;
> - unsigned int offset, found = SWAP_ENTRY_INVALID;
> + unsigned int offset;
>
> /*
> * Once allocated, swap_info_struct will never be completely freed,
> @@ -1329,22 +1346,18 @@ static bool swap_alloc_fast(swp_entry_t *entry,
> if (cluster_is_usable(ci, order)) {
> if (cluster_is_empty(ci))
> offset = cluster_offset(si, ci);
> - found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> - if (found)
> - *entry = swp_entry(si->type, found);
> + alloc_swap_scan_cluster(si, ci, folio, offset);
> } else {
> swap_cluster_unlock(ci);
> }
>
> put_swap_device(si);
> - return !!found;
> + return folio_test_swapcache(folio);
> }
>
> /* Rotate the device and switch to a new cluster */
> -static bool swap_alloc_slow(swp_entry_t *entry,
> - int order)
> +static void swap_alloc_slow(struct folio *folio)
> {
> - unsigned long offset;
> struct swap_info_struct *si, *next;
>
> spin_lock(&swap_avail_lock);
> @@ -1354,14 +1367,12 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> plist_requeue(&si->avail_list, &swap_avail_head);
> spin_unlock(&swap_avail_lock);
> if (get_swap_device_info(si)) {
> - offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> + cluster_alloc_swap_entry(si, folio);
> put_swap_device(si);
> - if (offset) {
> - *entry = swp_entry(si->type, offset);
> - return true;
> - }
> - if (order)
> - return false;
> + if (folio_test_swapcache(folio))
> + return;
> + if (folio_test_large(folio))
> + return;
> }
>
> spin_lock(&swap_avail_lock);
My bad, the following diff was lost during the rebase to mm-new;
swap_alloc_slow should return void now:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8d98f28907bc..0bc734eb32c4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1391,7 +1391,6 @@ static void swap_alloc_slow(struct folio *folio)
goto start_over;
}
spin_unlock(&swap_avail_lock);
- return false;
}
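With this fixup, callers only need to look at the folio's swapcache state
to tell whether allocation succeeded. Roughly (just a sketch; presumably
the real caller is folio_alloc_swap(), which is not quoted here, and the
exact error value is assumed):

	if (!swap_alloc_fast(folio))
		swap_alloc_slow(folio);
	if (!folio_test_swapcache(folio))
		return -ENOMEM;	/* assumed error value */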
^ permalink raw reply related	[flat|nested] 50+ messages in thread
* Re: [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-10-29 16:52 ` Kairui Song
@ 2025-10-31 5:56 ` YoungJun Park
2025-10-31 7:02 ` Kairui Song
1 sibling, 1 reply; 50+ messages in thread
From: YoungJun Park @ 2025-10-31 5:56 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:41PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
Hello Kairui
> The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
> SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
> This pinning usage here can be dropped by adding the folio to swap
> cache directly on allocation.
>
> All swap allocations are folio-based now (except for hibernation), so
> the swap allocator can always take the folio as the parameter. And now
> both swap cache (swap table) and swap map are protected by the cluster
> lock, scanning the map and inserting the folio can be done in the same
> critical section. This eliminates the time window that a slot is pinned
> by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock
> multiple times.
>
> This is both a cleanup and an optimization.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/swap.h | 5 --
> mm/swap.h | 8 +--
> mm/swap_state.c | 56 +++++++++++-------
> mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
> 4 files changed, 105 insertions(+), 125 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ac3caa4c6999..4b4b81fbc6a3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
> }
>
> extern void si_swapinfo(struct sysinfo *);
> -void put_swap_folio(struct folio *folio, swp_entry_t entry);
> extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> int swap_type_of(dev_t device, sector_t offset);
> int find_first_swap(dev_t *device);
> @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
> {
> }
>
> -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> -{
> -}
> -
> static inline int __swap_count(swp_entry_t entry)
> {
> return 0;
> diff --git a/mm/swap.h b/mm/swap.h
> index 74c61129d7b7..03694ffa662f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc);
> void swap_cache_del_folio(struct folio *folio);
> struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> struct mempolicy *mpol, pgoff_t ilx,
> bool *alloced);
> /* Below helpers require the caller to lock and pass in the swap cluster. */
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry);
> void __swap_cache_del_folio(struct swap_cluster_info *ci,
> struct folio *folio, swp_entry_t entry, void *shadow);
> void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> @@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc)
> +static inline void *__swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> {
> }
Just a nit: the !CONFIG_SWAP stub is declared to return void * but
returns nothing. Either change it to plain void (the original prototype
returned void), or simply remove the stub if it is no longer used when
!CONFIG_SWAP.
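e.g. a minimal sketch of the first option, purely illustrative, if the
stub is kept at all:

static inline void __swap_cache_add_folio(struct swap_cluster_info *ci,
					   struct folio *folio, swp_entry_t entry)
{
	/* no-op when !CONFIG_SWAP */
}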
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d2bcca92b6e0..85d9f99c384f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> +{
> + unsigned long new_tb;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> +
> + new_tb = folio_to_swp_tb(folio);
> + ci_start = swp_cluster_offset(entry);
> + ci_off = ci_start;
> + ci_end = ci_start + nr_pages;
> + do {
> + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> + __swap_table_set(ci, ci_off, new_tb);
> + } while (++ci_off < ci_end);
> +
> + folio_ref_add(folio, nr_pages);
> + folio_set_swapcache(folio);
> + folio->swap = entry;
> +
> + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> +}
> +
> /**
> * swap_cache_add_folio - Add a folio into the swap cache.
> * @folio: The folio to be added.
> @@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> * The caller also needs to update the corresponding swap_map slots with
> * SWAP_HAS_CACHE bit to avoid race or conflict.
> */
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadowp, bool alloc)
> +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> + void **shadowp)
Another small thing: since the "alloc" parameter is removed, the comment
above should be updated as well.
Thanks,
Youngjun Park
^ permalink raw reply	[flat|nested] 50+ messages in thread
* Re: [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-31 5:56 ` YoungJun Park
@ 2025-10-31 7:02 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-31 7:02 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Oct 31, 2025 at 1:56 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:41PM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
>
> Hello Kairui
>
> > The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
> > SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
> > This pinning usage here can be dropped by adding the folio to swap
> > cache directly on allocation.
> >
> > All swap allocations are folio-based now (except for hibernation), so
> > the swap allocator can always take the folio as the parameter. And now
> > both swap cache (swap table) and swap map are protected by the cluster
> > lock, scanning the map and inserting the folio can be done in the same
> > critical section. This eliminates the time window that a slot is pinned
> > by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock
> > multiple times.
> >
> > This is both a cleanup and an optimization.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > include/linux/swap.h | 5 --
> > mm/swap.h | 8 +--
> > mm/swap_state.c | 56 +++++++++++-------
> > mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
> > 4 files changed, 105 insertions(+), 125 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ac3caa4c6999..4b4b81fbc6a3 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
> > }
> >
> > extern void si_swapinfo(struct sysinfo *);
> > -void put_swap_folio(struct folio *folio, swp_entry_t entry);
> > extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > int swap_type_of(dev_t device, sector_t offset);
> > int find_first_swap(dev_t *device);
> > @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
> > {
> > }
> >
> > -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > -{
> > -}
> > -
> > static inline int __swap_count(swp_entry_t entry)
> > {
> > return 0;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 74c61129d7b7..03694ffa662f 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> > */
> > struct folio *swap_cache_get_folio(swp_entry_t entry);
> > void *swap_cache_get_shadow(swp_entry_t entry);
> > -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > - void **shadow, bool alloc);
> > void swap_cache_del_folio(struct folio *folio);
> > struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> > struct mempolicy *mpol, pgoff_t ilx,
> > bool *alloced);
> > /* Below helpers require the caller to lock and pass in the swap cluster. */
> > +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> > + struct folio *folio, swp_entry_t entry);
> > void __swap_cache_del_folio(struct swap_cluster_info *ci,
> > struct folio *folio, swp_entry_t entry, void *shadow);
> > void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> > @@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
> > return NULL;
> > }
> >
> > -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > - void **shadow, bool alloc)
> > +static inline void *__swap_cache_add_folio(struct swap_cluster_info *ci,
> > + struct folio *folio, swp_entry_t entry)
> > {
> > }
>
> Just a nit: the !CONFIG_SWAP stub is declared to return void * but
> returns nothing. Either change it to plain void (the original prototype
> returned void), or simply remove the stub if it is no longer used when
> !CONFIG_SWAP.
Thanks! Yeah, it can just be removed; nothing is using it under
!CONFIG_SWAP after this commit. Will clean it up.
>
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index d2bcca92b6e0..85d9f99c384f 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> > return NULL;
> > }
> >
> > +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> > + struct folio *folio, swp_entry_t entry)
> > +{
> > + unsigned long new_tb;
> > + unsigned int ci_start, ci_off, ci_end;
> > + unsigned long nr_pages = folio_nr_pages(folio);
> > +
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> > +
> > + new_tb = folio_to_swp_tb(folio);
> > + ci_start = swp_cluster_offset(entry);
> > + ci_off = ci_start;
> > + ci_end = ci_start + nr_pages;
> > + do {
> > + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> > + __swap_table_set(ci, ci_off, new_tb);
> > + } while (++ci_off < ci_end);
> > +
> > + folio_ref_add(folio, nr_pages);
> > + folio_set_swapcache(folio);
> > + folio->swap = entry;
> > +
> > + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> > + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> > +}
> > +
> > /**
> > * swap_cache_add_folio - Add a folio into the swap cache.
> > * @folio: The folio to be added.
> > @@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> > * The caller also needs to update the corresponding swap_map slots with
> > * SWAP_HAS_CACHE bit to avoid race or conflict.
> > */
> > -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > - void **shadowp, bool alloc)
> > +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > + void **shadowp)
>
> Another small thing: since the "alloc" parameter is removed, the
> comment above should be updated as well.
Nice suggestion, will clean up the comment too.
>
> Thanks,
> Youngjun Park
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 16/19] mm, swap: check swap table directly for checking cache
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (14 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-06 21:02 ` Barry Song
2025-10-29 15:58 ` [PATCH 17/19] mm, swap: clean up and improve swap entries freeing Kairui Song
` (4 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Instead of looking at the swap map, check the swap table directly to
tell if a swap slot is cached. This prepares for the removal of
SWAP_HAS_CACHE.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 11 ++++++++---
mm/swap_state.c | 16 ++++++++++++++++
mm/swapfile.c | 55 +++++++++++++++++++++++++++++--------------------------
mm/userfaultfd.c | 10 +++-------
4 files changed, 56 insertions(+), 36 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 03694ffa662f..73f07bcea5f0 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
* swap entries in the page table, similar to locking swap cache folio.
* - See the comment of get_swap_device() for more complex usage.
*/
+bool swap_cache_check_folio(swp_entry_t entry);
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_del_folio(struct folio *folio);
@@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
- struct swap_info_struct *si = __swap_entry_to_info(entry);
- pgoff_t offset = swp_offset(entry);
int i;
/*
@@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
* be in conflict with the folio in swap cache.
*/
for (i = 0; i < max_nr; i++) {
- if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
+ if (swap_cache_check_folio(entry))
return i;
+ entry.val++;
}
return i;
@@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
return 0;
}
+static inline bool swap_cache_check_folio(swp_entry_t entry)
+{
+ return false;
+}
+
static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 85d9f99c384f..41d4fa056203 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
return NULL;
}
+/**
+ * swap_cache_check_folio - Check if a swap slot has cache.
+ * @entry: swap entry indicating the slot.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap
+ * device with reference count or locks.
+ */
+bool swap_cache_check_folio(swp_entry_t entry)
+{
+ unsigned long swp_tb;
+
+ swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+ swp_cluster_offset(entry));
+ return swp_tb_is_folio(swp_tb);
+}
+
/**
* swap_cache_get_shadow - Looks up a shadow in the swap cache.
* @entry: swap entry used for the lookup.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8d98f28907bc..3b7df5768d7f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -788,23 +788,18 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
unsigned int nr_pages = 1 << order;
unsigned long offset = start, end = start + nr_pages;
unsigned char *map = si->swap_map;
- int nr_reclaim;
+ unsigned long swp_tb;
spin_unlock(&ci->lock);
do {
- switch (READ_ONCE(map[offset])) {
- case 0:
+ if (swap_count(READ_ONCE(map[offset])))
break;
- case SWAP_HAS_CACHE:
- nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
- if (nr_reclaim < 0)
- goto out;
- break;
- default:
- goto out;
+ swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ if (swp_tb_is_folio(swp_tb)) {
+ if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0)
+ break;
}
} while (++offset < end);
-out:
spin_lock(&ci->lock);
/*
@@ -820,37 +815,41 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
* Recheck the range no matter reclaim succeeded or not, the slot
* could have been be freed while we are not holding the lock.
*/
- for (offset = start; offset < end; offset++)
- if (READ_ONCE(map[offset]))
+ for (offset = start; offset < end; offset++) {
+ swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb))
return SWAP_ENTRY_INVALID;
+ }
return start;
}
static bool cluster_scan_range(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long start, unsigned int nr_pages,
+ unsigned long offset, unsigned int nr_pages,
bool *need_reclaim)
{
- unsigned long offset, end = start + nr_pages;
+ unsigned long end = offset + nr_pages;
unsigned char *map = si->swap_map;
+ unsigned long swp_tb;
if (cluster_is_empty(ci))
return true;
- for (offset = start; offset < end; offset++) {
- switch (READ_ONCE(map[offset])) {
- case 0:
- continue;
- case SWAP_HAS_CACHE:
+ do {
+ if (swap_count(map[offset]))
+ return false;
+ swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ if (swp_tb_is_folio(swp_tb)) {
+ WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE));
if (!vm_swap_full())
return false;
*need_reclaim = true;
- continue;
- default:
- return false;
+ } else {
+ /* An entry with no count and no cache must be null */
+ VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
}
- }
+ } while (++offset < end);
return true;
}
@@ -1013,7 +1012,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+ if (!swap_count(READ_ONCE(map[offset])) &&
+ swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
@@ -1957,6 +1957,7 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
struct swap_info_struct *si;
bool any_only_cache = false;
unsigned long offset;
+ unsigned long swp_tb;
si = get_swap_device(entry);
if (WARN_ON_ONCE(!si))
@@ -1981,7 +1982,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
*/
for (offset = start_offset; offset < end_offset; offset += nr) {
nr = 1;
- if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+ swp_tb = swap_table_get(__swap_offset_to_cluster(si, offset),
+ offset % SWAPFILE_CLUSTER);
+ if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_tb)) {
/*
* Folios are always naturally aligned in swap so
* advance forward to the next boundary. Zero means no
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 00122f42718c..5411fd340ac3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1184,17 +1184,13 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
* Check if the swap entry is cached after acquiring the src_pte
* lock. Otherwise, we might miss a newly loaded swap cache folio.
*
- * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
* We are trying to catch newly added swap cache, the only possible case is
* when a folio is swapped in and out again staying in swap cache, using the
* same entry before the PTE check above. The PTL is acquired and released
- * twice, each time after updating the swap_map's flag. So holding
- * the PTL here ensures we see the updated value. False positive is possible,
- * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
- * cache, or during the tiny synchronization window between swap cache and
- * swap_map, but it will be gone very quickly, worst result is retry jitters.
+ * twice, each time after updating the swap table. So holding
+ * the PTL here ensures we see the updated value.
*/
- if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+ if (swap_cache_check_folio(entry)) {
double_pt_unlock(dst_ptl, src_ptl);
return -EAGAIN;
}
--
2.51.1
^ permalink raw reply related	[flat|nested] 50+ messages in thread
* Re: [PATCH 16/19] mm, swap: check swap table directly for checking cache
2025-10-29 15:58 ` [PATCH 16/19] mm, swap: check swap table directly for checking cache Kairui Song
@ 2025-11-06 21:02 ` Barry Song
2025-11-07 3:13 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-06 21:02 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Thu, Oct 30, 2025 at 12:00 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Instead of looking at the swap map, check swap table directly to tell
> if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap.h | 11 ++++++++---
> mm/swap_state.c | 16 ++++++++++++++++
> mm/swapfile.c | 55 +++++++++++++++++++++++++++++--------------------------
> mm/userfaultfd.c | 10 +++-------
> 4 files changed, 56 insertions(+), 36 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 03694ffa662f..73f07bcea5f0 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> * swap entries in the page table, similar to locking swap cache folio.
> * - See the comment of get_swap_device() for more complex usage.
> */
> +bool swap_cache_check_folio(swp_entry_t entry);
> struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> void swap_cache_del_folio(struct folio *folio);
> @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
>
> static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> {
> - struct swap_info_struct *si = __swap_entry_to_info(entry);
> - pgoff_t offset = swp_offset(entry);
> int i;
>
> /*
> @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> * be in conflict with the folio in swap cache.
> */
> for (i = 0; i < max_nr; i++) {
> - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> + if (swap_cache_check_folio(entry))
> return i;
> + entry.val++;
> }
>
> return i;
> @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
> return 0;
> }
>
> +static inline bool swap_cache_check_folio(swp_entry_t entry)
> +{
> + return false;
> +}
> +
> static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> return NULL;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 85d9f99c384f..41d4fa056203 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
> return NULL;
> }
>
> +/**
> + * swap_cache_check_folio - Check if a swap slot has cache.
> + * @entry: swap entry indicating the slot.
> + *
> + * Context: Caller must ensure @entry is valid and protect the swap
> + * device with reference count or locks.
> + */
> +bool swap_cache_check_folio(swp_entry_t entry)
> +{
> + unsigned long swp_tb;
> +
> + swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
> + swp_cluster_offset(entry));
> + return swp_tb_is_folio(swp_tb);
> +}
> +
The name swap_cache_check_folio() sounds a bit odd to me — what we’re
actually doing is checking whether the swapcache contains (or is)
a folio, i.e., whether there’s a folio hit in the swapcache.
The word "check" could misleadingly suggest verifying the folio’s health
or validity instead.
what about swap_cache_has_folio() or simply:
struct folio *__swap_cache_get_folio(swp_entry_t entry);
This would return the folio without taking the lock, or NULL if not found?
Thanks
Barry
^ permalink raw reply	[flat|nested] 50+ messages in thread
* Re: [PATCH 16/19] mm, swap: check swap table directly for checking cache
2025-11-06 21:02 ` Barry Song
@ 2025-11-07 3:13 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-11-07 3:13 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Nov 7, 2025 at 5:03 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Oct 30, 2025 at 12:00 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Instead of looking at the swap map, check swap table directly to tell
> > if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swap.h | 11 ++++++++---
> > mm/swap_state.c | 16 ++++++++++++++++
> > mm/swapfile.c | 55 +++++++++++++++++++++++++++++--------------------------
> > mm/userfaultfd.c | 10 +++-------
> > 4 files changed, 56 insertions(+), 36 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 03694ffa662f..73f07bcea5f0 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> > * swap entries in the page table, similar to locking swap cache folio.
> > * - See the comment of get_swap_device() for more complex usage.
> > */
> > +bool swap_cache_check_folio(swp_entry_t entry);
> > struct folio *swap_cache_get_folio(swp_entry_t entry);
> > void *swap_cache_get_shadow(swp_entry_t entry);
> > void swap_cache_del_folio(struct folio *folio);
> > @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> >
> > static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > {
> > - struct swap_info_struct *si = __swap_entry_to_info(entry);
> > - pgoff_t offset = swp_offset(entry);
> > int i;
> >
> > /*
> > @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > * be in conflict with the folio in swap cache.
> > */
> > for (i = 0; i < max_nr; i++) {
> > - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> > + if (swap_cache_check_folio(entry))
> > return i;
> > + entry.val++;
> > }
> >
> > return i;
> > @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
> > return 0;
> > }
> >
> > +static inline bool swap_cache_check_folio(swp_entry_t entry)
> > +{
> > + return false;
> > +}
> > +
> > static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> > {
> > return NULL;
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 85d9f99c384f..41d4fa056203 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
> > return NULL;
> > }
> >
> > +/**
> > + * swap_cache_check_folio - Check if a swap slot has cache.
> > + * @entry: swap entry indicating the slot.
> > + *
> > + * Context: Caller must ensure @entry is valid and protect the swap
> > + * device with reference count or locks.
> > + */
> > +bool swap_cache_check_folio(swp_entry_t entry)
> > +{
> > + unsigned long swp_tb;
> > +
> > + swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
> > + swp_cluster_offset(entry));
> > + return swp_tb_is_folio(swp_tb);
> > +}
> > +
>
> The name swap_cache_check_folio() sounds a bit odd to me — what we’re
> actually doing is checking whether the swapcache contains (or is)
> a folio, i.e., whether there’s a folio hit in the swapcache.
> The word "check" could misleadingly suggest verifying the folio’s health
> or validity instead.
>
> what about swap_cache_has_folio() or simply:
>
> struct folio *__swap_cache_get_folio(swp_entry_t entry);
I was worried people might misuse this: the returned folio could be
invalidated at any time if the caller is not holding the RCU lock.
I think swap_cache_has_folio seems better indeed.
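Something like the below, keeping the same body as in this patch and
only changing the name (a sketch):

bool swap_cache_has_folio(swp_entry_t entry)
{
	unsigned long swp_tb;

	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
				swp_cluster_offset(entry));
	return swp_tb_is_folio(swp_tb);
}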
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 17/19] mm, swap: clean up and improve swap entries freeing
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (15 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 16/19] mm, swap: check swap table directly for checking cache Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 18/19] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
` (3 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
There are a few problems with the current freeing of swap entries.
When freeing a set of swap entries directly (swap_put_entries_direct,
typically from zapping the page table), the current code scans the whole
swap region multiple times. First, it scans the whole region to check
whether it can be batch freed and whether any entry has a cached folio.
Then it does a batch free only if every entry in the region has a swap
count of 1. And if any entry is cached, even just one, it has to walk
the whole region again to clean up the cache.
If any entry is not in a consistent state with the other entries, it
falls back to order 0 freeing. For example, if only one of them is
cached, the batch free falls back.
The current batch freeing workflow also relies on the swap map's
SWAP_HAS_CACHE bit for both the consistency check and the batch free
itself, which isn't compatible with the swap table design.
Tidy this up by introducing a new cluster-scoped helper for all swap
entry freeing work. It batch-frees all contiguous entries, and simply
starts a new batch whenever an inconsistent entry is found. This may
improve the batch size when the clusters are fragmented. It should also
be more robust with more sanity checks, and it makes clear that a slot
pinned by the swap cache will be cleared upon cache reclaim.
The cache reclaim scan is now also limited to each cluster. If a cluster
has any clean swap cache left after putting the swap count, only that
cluster is reclaimed instead of the whole region.
And since a folio's entries are always in the same cluster, putting swap
entries from a folio can also use the new helper directly.
This should be both an optimization and a cleanup, and the new helper is
adapted to the swap table.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 238 +++++++++++++++++++++++-----------------------------------
1 file changed, 96 insertions(+), 142 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3b7df5768d7f..12a1ab6f7b32 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -55,12 +55,14 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
static void free_swap_count_continuations(struct swap_info_struct *);
static void swap_entries_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages);
+ unsigned long start, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
-static bool swap_entries_put_map(struct swap_info_struct *si,
- swp_entry_t entry, int nr);
+static void swap_put_entry_locked(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset,
+ unsigned char usage);
static bool folio_swapcache_freeable(struct folio *folio);
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
@@ -197,25 +199,6 @@ static bool swap_only_has_cache(struct swap_info_struct *si,
return true;
}
-static bool swap_is_last_map(struct swap_info_struct *si,
- unsigned long offset, int nr_pages, bool *has_cache)
-{
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
- unsigned char count = *map;
-
- if (swap_count(count) != 1)
- return false;
-
- while (++map < map_end) {
- if (*map != count)
- return false;
- }
-
- *has_cache = !!(count & SWAP_HAS_CACHE);
- return true;
-}
-
/*
* returns number of pages in the folio that backs the swap entry. If positive,
* the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
@@ -1420,6 +1403,76 @@ static bool swap_sync_discard(void)
return false;
}
+/**
+ * swap_put_entries_cluster - Decrease the swap count of a set of slots.
+ * @si: The swap device.
+ * @start: start offset of slots.
+ * @nr: number of slots.
+ * @reclaim_cache: if true, also reclaim the swap cache.
+ *
+ * This helper decreases the swap count of a set of slots and tries to
+ * batch free them. Also reclaims the swap cache if @reclaim_cache is true.
+ * Context: The caller must ensure that all slots belong to the same
+ * cluster and their swap count doesn't underflow.
+ */
+static void swap_put_entries_cluster(struct swap_info_struct *si,
+ unsigned long start, int nr,
+ bool reclaim_cache)
+{
+ unsigned long offset = start, end = start + nr;
+ unsigned long batch_start = SWAP_ENTRY_INVALID;
+ struct swap_cluster_info *ci;
+ bool need_reclaim = false;
+ unsigned int nr_reclaimed;
+ unsigned long swp_tb;
+ unsigned int count;
+
+ ci = swap_cluster_lock(si, offset);
+ do {
+ swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ count = si->swap_map[offset];
+ VM_WARN_ON(swap_count(count) < 1 || count == SWAP_MAP_BAD);
+ if (swap_count(count) == 1) {
+ /* count == 1 and non-cached slots will be batch freed. */
+ if (!swp_tb_is_folio(swp_tb)) {
+ if (!batch_start)
+ batch_start = offset;
+ continue;
+ }
+ /* count will be 0 after put, slot can be reclaimed */
+ VM_WARN_ON(!(count & SWAP_HAS_CACHE));
+ need_reclaim = true;
+ }
+ /*
+ * A count != 1 or cached slot can't be freed. Put its swap
+ * count and then free the interrupted pending batch. Cached
+ * slots will be freed when folio is removed from swap cache
+ * (__swap_cache_del_folio).
+ */
+ swap_put_entry_locked(si, ci, offset, 1);
+ if (batch_start) {
+ swap_entries_free(si, ci, batch_start, offset - batch_start);
+ batch_start = SWAP_ENTRY_INVALID;
+ }
+ } while (++offset < end);
+
+ if (batch_start)
+ swap_entries_free(si, ci, batch_start, offset - batch_start);
+ swap_cluster_unlock(ci);
+
+ if (!need_reclaim || !reclaim_cache)
+ return;
+
+ offset = start;
+ do {
+ nr_reclaimed = __try_to_reclaim_swap(si, offset,
+ TTRS_UNMAPPED | TTRS_FULL);
+ offset++;
+ if (nr_reclaimed)
+ offset = round_up(offset, abs(nr_reclaimed));
+ } while (offset < end);
+}
+
/**
* folio_alloc_swap - allocate swap space for a folio
* @folio: folio we want to move to swap
@@ -1521,6 +1574,7 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
{
swp_entry_t entry = folio->swap;
unsigned long nr_pages = folio_nr_pages(folio);
+ struct swap_info_struct *si = __swap_entry_to_info(entry);
VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
@@ -1530,7 +1584,7 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
nr_pages = 1;
}
- swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages);
+ swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
}
static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
@@ -1567,12 +1621,11 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
return NULL;
}
-static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry,
- unsigned char usage)
+static void swap_put_entry_locked(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset,
+ unsigned char usage)
{
- unsigned long offset = swp_offset(entry);
unsigned char count;
unsigned char has_cache;
@@ -1598,9 +1651,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
if (usage)
WRITE_ONCE(si->swap_map[offset], usage);
else
- swap_entries_free(si, ci, entry, 1);
-
- return usage;
+ swap_entries_free(si, ci, offset, 1);
}
/*
@@ -1668,70 +1719,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-static bool swap_entries_put_map(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- unsigned long offset = swp_offset(entry);
- struct swap_cluster_info *ci;
- bool has_cache = false;
- unsigned char count;
- int i;
-
- if (nr <= 1)
- goto fallback;
- count = swap_count(data_race(si->swap_map[offset]));
- if (count != 1)
- goto fallback;
-
- ci = swap_cluster_lock(si, offset);
- if (!swap_is_last_map(si, offset, nr, &has_cache)) {
- goto locked_fallback;
- }
- if (!has_cache)
- swap_entries_free(si, ci, entry, nr);
- else
- for (i = 0; i < nr; i++)
- WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
- swap_cluster_unlock(ci);
-
- return has_cache;
-
-fallback:
- ci = swap_cluster_lock(si, offset);
-locked_fallback:
- for (i = 0; i < nr; i++, entry.val++) {
- count = swap_entry_put_locked(si, ci, entry, 1);
- if (count == SWAP_HAS_CACHE)
- has_cache = true;
- }
- swap_cluster_unlock(ci);
- return has_cache;
-}
-
-/*
- * Only functions with "_nr" suffix are able to free entries spanning
- * cross multi clusters, so ensure the range is within a single cluster
- * when freeing entries with functions without "_nr" suffix.
- */
-static bool swap_entries_put_map_nr(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- int cluster_nr, cluster_rest;
- unsigned long offset = swp_offset(entry);
- bool has_cache = false;
-
- cluster_rest = SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER;
- while (nr) {
- cluster_nr = min(nr, cluster_rest);
- has_cache |= swap_entries_put_map(si, entry, cluster_nr);
- cluster_rest = SWAPFILE_CLUSTER;
- nr -= cluster_nr;
- entry.val += cluster_nr;
- }
-
- return has_cache;
-}
-
/*
* Check if it's the last ref of swap entry in the freeing path.
*/
@@ -1746,9 +1733,9 @@ static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
*/
static void swap_entries_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages)
+ unsigned long offset, unsigned int nr_pages)
{
- unsigned long offset = swp_offset(entry);
+ swp_entry_t entry = swp_entry(si->type, offset);
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
@@ -1954,10 +1941,8 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
{
const unsigned long start_offset = swp_offset(entry);
const unsigned long end_offset = start_offset + nr;
+ unsigned long offset, cluster_end;
struct swap_info_struct *si;
- bool any_only_cache = false;
- unsigned long offset;
- unsigned long swp_tb;
si = get_swap_device(entry);
if (WARN_ON_ONCE(!si))
@@ -1965,44 +1950,13 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
if (WARN_ON_ONCE(end_offset > si->max))
goto out;
- /*
- * First free all entries in the range.
- */
- any_only_cache = swap_entries_put_map_nr(si, entry, nr);
-
- /*
- * Short-circuit the below loop if none of the entries had their
- * reference drop to zero.
- */
- if (!any_only_cache)
- goto out;
-
- /*
- * Now go back over the range trying to reclaim the swap cache.
- */
- for (offset = start_offset; offset < end_offset; offset += nr) {
- nr = 1;
- swp_tb = swap_table_get(__swap_offset_to_cluster(si, offset),
- offset % SWAPFILE_CLUSTER);
- if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_tb)) {
- /*
- * Folios are always naturally aligned in swap so
- * advance forward to the next boundary. Zero means no
- * folio was found for the swap entry, so advance by 1
- * in this case. Negative value means folio was found
- * but could not be reclaimed. Here we can still advance
- * to the next boundary.
- */
- nr = __try_to_reclaim_swap(si, offset,
- TTRS_UNMAPPED | TTRS_FULL);
- if (nr == 0)
- nr = 1;
- else if (nr < 0)
- nr = -nr;
- nr = ALIGN(offset + 1, nr) - offset;
- }
- }
-
+ /* Put entries and reclaim cache in each cluster */
+ offset = start_offset;
+ do {
+ cluster_end = min(round_up(offset + 1, SWAPFILE_CLUSTER), end_offset);
+ swap_put_entries_cluster(si, offset, cluster_end - offset, true);
+ offset = cluster_end;
+ } while (offset < end_offset);
out:
put_swap_device(si);
}
@@ -2051,7 +2005,7 @@ void swap_free_hibernation_slot(swp_entry_t entry)
return;
ci = swap_cluster_lock(si, offset);
- swap_entry_put_locked(si, ci, entry, 1);
+ swap_put_entry_locked(si, ci, offset, 1);
WARN_ON(swap_entry_swapped(si, offset));
swap_cluster_unlock(ci);
@@ -3799,10 +3753,10 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
swp_entry_t entry, unsigned int nr)
{
if (swap_only_has_cache(si, swp_offset(entry), nr)) {
- swap_entries_free(si, ci, entry, nr);
+ swap_entries_free(si, ci, swp_offset(entry), nr);
} else {
for (int i = 0; i < nr; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE);
}
}
@@ -3923,7 +3877,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
* into, carry if so, or else fail until a new continuation page is allocated;
* when the original swap_map count is decremented from 0 with continuation,
* borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or caller of swap_entry_put_locked()
+ * Called while __swap_duplicate() or caller of swap_put_entry_locked()
* holds cluster lock.
*/
static bool swap_count_continued(struct swap_info_struct *si,
--
2.51.1
^ permalink raw reply related	[flat|nested] 50+ messages in thread
* [PATCH 18/19] mm, swap: drop the SWAP_HAS_CACHE flag
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (16 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 17/19] mm, swap: clean up and improve swap entries freeing Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 19/19] mm, swap: remove no longer needed _swap_info_get Kairui Song
` (2 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now the swap cache is managed by the swap table, and all swap cache
users query the swap table directly for the swap cache state.
SWAP_HAS_CACHE is now just a temporary pin, either before the first
increase of a slot's swap count from 0 to 1 (swap_dup_entries), or
before the final free of slots pinned by a folio in the swap cache
(put_swap_folio).
Drop these two usages. For the first dup, the SWAP_HAS_CACHE pinning was
hard to kill because the flag used to have multiple meanings beyond just
"a slot is cached". We have simplified that and defined that the first
dup is always done with the folio locked in the swap cache
(folio_dup_swap), so it can just check the swap cache (swap table)
directly.
As for freeing, just let the swap cache free all swap entries of a folio
that have a swap count of zero directly upon folio removal. Freeing has
also just been cleaned up to cover the swap cache usage in the swap
table: a slot with swap cache will not be freed until its cache is gone.
Now that removing a folio and freeing its slots are done in the same
critical section, this should improve performance and gets rid of the
SWAP_HAS_CACHE pin.
After these two changes, SWAP_HAS_CACHE no longer has any users. Remove
all related logic and helpers. swap_map is now only used for tracking
the count, so all swap_map users can just read it directly, ignoring
the swap_count helper, which was previously used to filter out the
SWAP_HAS_CACHE bit.
The idea of dropping SWAP_HAS_CACHE and using the swap table directly
initially came from Chris's idea of merging all the metadata usage of
all swaps into one place.
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 -
mm/swap.h | 13 ++--
mm/swap_state.c | 28 +++++----
mm/swapfile.c | 163 ++++++++++++++++-----------------------------------
4 files changed, 71 insertions(+), 134 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4b4b81fbc6a3..dcb1760e36c3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,7 +224,6 @@ enum {
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
/* Bit flag in swap_map */
-#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
#define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */
/* Special value in first swap_map */
diff --git a/mm/swap.h b/mm/swap.h
index 73f07bcea5f0..331424366487 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -205,6 +205,11 @@ int folio_alloc_swap(struct folio *folio);
int folio_dup_swap(struct folio *folio, struct page *subpage);
void folio_put_swap(struct folio *folio, struct page *subpage);
+/* For internal use */
+extern void swap_entries_free(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset, unsigned int nr_pages);
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -256,14 +261,6 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
return folio_entry.val == round_down(entry.val, nr_pages);
}
-/* Temporary internal helpers */
-void __swapcache_set_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry);
-void __swapcache_clear_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr);
-
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stablize the device by any of the following ways:
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 41d4fa056203..2bf72d58f6ee 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -215,17 +215,6 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
shadow = swp_tb_to_shadow(old_tb);
offset++;
} while (++ci_off < ci_end);
-
- ci_off = ci_start;
- offset = swp_offset(entry);
- do {
- /*
- * Still need to pin the slots with SWAP_HAS_CACHE since
- * swap allocator depends on that.
- */
- __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
- offset++;
- } while (++ci_off < ci_end);
__swap_cache_add_folio(ci, folio, entry);
swap_cluster_unlock(ci);
if (shadowp)
@@ -256,6 +245,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
struct swap_info_struct *si;
unsigned long old_tb, new_tb;
unsigned int ci_start, ci_off, ci_end;
+ bool folio_swapped = false, need_free = false;
unsigned long nr_pages = folio_nr_pages(folio);
VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
@@ -273,13 +263,27 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
old_tb = __swap_table_xchg(ci, ci_off, new_tb);
WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
swp_tb_to_folio(old_tb) != folio);
+ if (__swap_count(swp_entry(si->type,
+ swp_offset(entry) + ci_off - ci_start)))
+ folio_swapped = true;
+ else
+ need_free = true;
} while (++ci_off < ci_end);
folio->swap.val = 0;
folio_clear_swapcache(folio);
node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
- __swapcache_clear_cached(si, ci, entry, nr_pages);
+
+ if (!folio_swapped) {
+ swap_entries_free(si, ci, swp_offset(entry), nr_pages);
+ } else if (need_free) {
+ do {
+ if (!__swap_count(entry))
+ swap_entries_free(si, ci, swp_offset(entry), 1);
+ entry.val++;
+ } while (--nr_pages);
+ }
}
/**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12a1ab6f7b32..49916fdb8b70 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,21 +48,18 @@
#include <linux/swap_cgroup.h>
#include "swap_table.h"
#include "internal.h"
+#include "swap_table.h"
#include "swap.h"
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entries_free(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long start, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
static void swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned char usage);
+ unsigned long offset);
static bool folio_swapcache_freeable(struct folio *folio);
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
@@ -149,11 +146,6 @@ static struct swap_info_struct *swap_entry_to_info(swp_entry_t entry)
return swap_type_to_info(swp_type(entry));
}
-static inline unsigned char swap_count(unsigned char ent)
-{
- return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
-}
-
/*
* Use the second highest bit of inuse_pages counter as the indicator
* if one swap device is on the available plist, so the atomic can
@@ -185,15 +177,20 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
#define TTRS_FULL 0x4
static bool swap_only_has_cache(struct swap_info_struct *si,
- unsigned long offset, int nr_pages)
+ struct swap_cluster_info *ci,
+ unsigned long offset, int nr_pages)
{
+ unsigned int ci_off = offset % SWAPFILE_CLUSTER;
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
+ unsigned long swp_tb;
do {
- VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
- if (*map != SWAP_HAS_CACHE)
+ swp_tb = __swap_table_get(ci, ci_off);
+ VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb));
+ if (*map)
return false;
+ ++ci_off;
} while (++map < map_end);
return true;
@@ -254,7 +251,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* reference or pending writeback, and can't be allocated to others.
*/
ci = swap_cluster_lock(si, offset);
- need_reclaim = swap_only_has_cache(si, offset, nr_pages);
+ need_reclaim = swap_only_has_cache(si, ci, offset, nr_pages);
swap_cluster_unlock(ci);
if (!need_reclaim)
goto out_unlock;
@@ -775,7 +772,7 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
spin_unlock(&ci->lock);
do {
- if (swap_count(READ_ONCE(map[offset])))
+ if (READ_ONCE(map[offset]))
break;
swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
if (swp_tb_is_folio(swp_tb)) {
@@ -800,7 +797,7 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
*/
for (offset = start; offset < end; offset++) {
swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
- if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb))
+ if (map[offset] || !swp_tb_is_null(swp_tb))
return SWAP_ENTRY_INVALID;
}
@@ -820,11 +817,10 @@ static bool cluster_scan_range(struct swap_info_struct *si,
return true;
do {
- if (swap_count(map[offset]))
+ if (map[offset])
return false;
swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
if (swp_tb_is_folio(swp_tb)) {
- WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE));
if (!vm_swap_full())
return false;
*need_reclaim = true;
@@ -882,11 +878,6 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
if (likely(folio)) {
order = folio_order(folio);
nr_pages = 1 << order;
- /*
- * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries.
- * This is the legacy allocation behavior, will drop it very soon.
- */
- memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
__swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
} else {
order = 0;
@@ -995,8 +986,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (!swap_count(READ_ONCE(map[offset])) &&
- swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
+ if (!READ_ONCE(map[offset]) &&
+ swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
@@ -1431,8 +1422,8 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
do {
swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
count = si->swap_map[offset];
- VM_WARN_ON(swap_count(count) < 1 || count == SWAP_MAP_BAD);
- if (swap_count(count) == 1) {
+ VM_WARN_ON(count < 1 || count == SWAP_MAP_BAD);
+ if (count == 1) {
/* count == 1 and non-cached slots will be batch freed. */
if (!swp_tb_is_folio(swp_tb)) {
if (!batch_start)
@@ -1440,7 +1431,6 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
continue;
}
/* count will be 0 after put, slot can be reclaimed */
- VM_WARN_ON(!(count & SWAP_HAS_CACHE));
need_reclaim = true;
}
/*
@@ -1449,7 +1439,7 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
* slots will be freed when folio is removed from swap cache
* (__swap_cache_del_folio).
*/
- swap_put_entry_locked(si, ci, offset, 1);
+ swap_put_entry_locked(si, ci, offset);
if (batch_start) {
swap_entries_free(si, ci, batch_start, offset - batch_start);
batch_start = SWAP_ENTRY_INVALID;
@@ -1602,13 +1592,8 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
offset = swp_offset(entry);
if (offset >= si->max)
goto bad_offset;
- if (data_race(!si->swap_map[swp_offset(entry)]))
- goto bad_free;
return si;
-bad_free:
- pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
- goto out;
bad_offset:
pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
goto out;
@@ -1623,21 +1608,12 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
static void swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned char usage)
+ unsigned long offset)
{
unsigned char count;
- unsigned char has_cache;
count = si->swap_map[offset];
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE) {
- VM_BUG_ON(!has_cache);
- has_cache = 0;
- } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
+ if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
if (count == COUNT_CONTINUED) {
if (swap_count_continued(si, offset, count))
count = SWAP_MAP_MAX | COUNT_CONTINUED;
@@ -1647,10 +1623,8 @@ static void swap_put_entry_locked(struct swap_info_struct *si,
count--;
}
- usage = count | has_cache;
- if (usage)
- WRITE_ONCE(si->swap_map[offset], usage);
- else
+ WRITE_ONCE(si->swap_map[offset], count);
+ if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)))
swap_entries_free(si, ci, offset, 1);
}
@@ -1719,21 +1693,13 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-/*
- * Check if it's the last ref of swap entry in the freeing path.
- */
-static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
-{
- return (count == SWAP_HAS_CACHE) || (count == 1);
-}
-
/*
* Drop the last ref of swap entries, caller have to ensure all entries
* belong to the same cgroup and cluster.
*/
-static void swap_entries_free(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long offset, unsigned int nr_pages)
+void swap_entries_free(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset, unsigned int nr_pages)
{
swp_entry_t entry = swp_entry(si->type, offset);
unsigned char *map = si->swap_map + offset;
@@ -1746,7 +1712,7 @@ static void swap_entries_free(struct swap_info_struct *si,
ci->count -= nr_pages;
do {
- VM_BUG_ON(!swap_is_last_ref(*map));
+ VM_WARN_ON(*map > 1);
*map = 0;
} while (++map < map_end);
@@ -1765,7 +1731,7 @@ int __swap_count(swp_entry_t entry)
struct swap_info_struct *si = __swap_entry_to_info(entry);
pgoff_t offset = swp_offset(entry);
- return swap_count(si->swap_map[offset]);
+ return si->swap_map[offset];
}
/**
@@ -1779,7 +1745,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, unsigned long offset)
int count;
ci = swap_cluster_lock(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
swap_cluster_unlock(ci);
return count && count != SWAP_MAP_BAD;
@@ -1806,7 +1772,7 @@ int swp_swapcount(swp_entry_t entry)
ci = swap_cluster_lock(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
if (!(count & COUNT_CONTINUED))
goto out;
@@ -1844,12 +1810,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
ci = swap_cluster_lock(si, offset);
if (nr_pages == 1) {
- if (swap_count(map[roffset]))
+ if (map[roffset])
ret = true;
goto unlock_out;
}
for (i = 0; i < nr_pages; i++) {
- if (swap_count(map[offset + i])) {
+ if (map[offset + i]) {
ret = true;
break;
}
@@ -2005,7 +1971,7 @@ void swap_free_hibernation_slot(swp_entry_t entry)
return;
ci = swap_cluster_lock(si, offset);
- swap_put_entry_locked(si, ci, offset, 1);
+ swap_put_entry_locked(si, ci, offset);
WARN_ON(swap_entry_swapped(si, offset));
swap_cluster_unlock(ci);
@@ -2412,6 +2378,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
unsigned int prev)
{
unsigned int i;
+ unsigned long swp_tb;
unsigned char count;
/*
@@ -2422,7 +2389,11 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
*/
for (i = prev + 1; i < si->max; i++) {
count = READ_ONCE(si->swap_map[i]);
- if (count && swap_count(count) != SWAP_MAP_BAD)
+ swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
+ i % SWAPFILE_CLUSTER);
+ if (count == SWAP_MAP_BAD)
+ continue;
+ if (count || swp_tb_is_folio(swp_tb))
break;
if ((i % LATENCY_LIMIT) == 0)
cond_resched();
@@ -3649,39 +3620,26 @@ static int swap_dup_entries(struct swap_info_struct *si,
unsigned char usage, int nr)
{
int i;
- unsigned char count, has_cache;
+ unsigned char count;
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
-
/*
* Allocator never allocates bad slots, and readahead is guarded
* by swap_entry_swapped.
*/
- if (WARN_ON(swap_count(count) == SWAP_MAP_BAD))
- return -ENOENT;
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (!count && !has_cache) {
- return -ENOENT;
- } else if (usage == SWAP_HAS_CACHE) {
- if (has_cache)
- return -EEXIST;
- } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
- return -EINVAL;
- }
+ VM_WARN_ON(count == SWAP_MAP_BAD);
+ /*
+ * Swap count duplication is guaranteed by either locked swap cache
+ * folio (folio_dup_swap) or external lock (swap_dup_entry_direct).
+ */
+ VM_WARN_ON(!count &&
+ !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)));
}
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE)
- has_cache = SWAP_HAS_CACHE;
- else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
count += usage;
else if (swap_count_continued(si, offset + i, count))
count = COUNT_CONTINUED;
@@ -3693,7 +3651,7 @@ static int swap_dup_entries(struct swap_info_struct *si,
return -ENOMEM;
}
- WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
+ WRITE_ONCE(si->swap_map[offset + i], count);
}
return 0;
@@ -3739,27 +3697,6 @@ int swap_dup_entry_direct(swp_entry_t entry)
return err;
}
-/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */
-void __swapcache_set_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry)
-{
- WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1));
-}
-
-/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock */
-void __swapcache_clear_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr)
-{
- if (swap_only_has_cache(si, swp_offset(entry), nr)) {
- swap_entries_free(si, ci, swp_offset(entry), nr);
- } else {
- for (int i = 0; i < nr; i++, entry.val++)
- swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE);
- }
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
@@ -3805,7 +3742,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
ci = swap_cluster_lock(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
/*
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 19/19] mm, swap: remove no longer needed _swap_info_get
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (17 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 18/19] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-30 23:04 ` [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Yosry Ahmed
2025-11-05 7:39 ` Chris Li
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
There are now only two users of _swap_info_get left after consolidating
its callers: folio_try_reclaim_swap and swp_swapcount.
folio_free_swap already holds the folio lock and the folio is in the
swap cache, so _swap_info_get is redundant there.
For swp_swapcount, it can just use get_swap_device instead. It only
needs to check the swap count; both are fine, except that
get_swap_device increases the device ref count, which is actually a bit
safer. Its only current user is the smaps walk, so the performance
change here is tiny.
And after these changes, _swap_info_get is no longer used, so we can
safely remove it.
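For reference, the resulting swp_swapcount locking pattern is roughly
the following. This is only a minimal sketch based on the hunks below;
the COUNT_CONTINUED walk and error handling are omitted, and it is not
the exact kernel code:

int swp_swapcount(swp_entry_t entry)
{
	struct swap_info_struct *si;
	struct swap_cluster_info *ci;
	pgoff_t offset = swp_offset(entry);
	int count;

	/* Pin the swap device and validate the entry. */
	si = get_swap_device(entry);
	if (!si)
		return 0;

	ci = swap_cluster_lock(si, offset);
	count = si->swap_map[offset];
	/*
	 * If COUNT_CONTINUED is set, the real code walks the continuation
	 * pages to compute the full count; that part is omitted here.
	 */
	swap_cluster_unlock(ci);

	/* Drop the reference taken by get_swap_device(). */
	put_swap_device(si);
	return count;
}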
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 39 ++++++---------------------------------
1 file changed, 6 insertions(+), 33 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 49916fdb8b70..150916f4640c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1577,35 +1577,6 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
}
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
-{
- struct swap_info_struct *si;
- unsigned long offset;
-
- if (!entry.val)
- goto out;
- si = swap_entry_to_info(entry);
- if (!si)
- goto bad_nofile;
- if (data_race(!(si->flags & SWP_USED)))
- goto bad_device;
- offset = swp_offset(entry);
- if (offset >= si->max)
- goto bad_offset;
- return si;
-
-bad_offset:
- pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
- goto out;
-bad_device:
- pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val);
- goto out;
-bad_nofile:
- pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
-out:
- return NULL;
-}
-
static void swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
unsigned long offset)
@@ -1764,7 +1735,7 @@ int swp_swapcount(swp_entry_t entry)
pgoff_t offset;
unsigned char *map;
- si = _swap_info_get(entry);
+ si = get_swap_device(entry);
if (!si)
return 0;
@@ -1794,6 +1765,7 @@ int swp_swapcount(swp_entry_t entry)
} while (tmp_count & COUNT_CONTINUED);
out:
swap_cluster_unlock(ci);
+ put_swap_device(si);
return count;
}
@@ -1828,11 +1800,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
static bool folio_swapped(struct folio *folio)
{
swp_entry_t entry = folio->swap;
- struct swap_info_struct *si = _swap_info_get(entry);
+ struct swap_info_struct *si;
- if (!si)
- return false;
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ si = __swap_entry_to_info(entry);
if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
return swap_entry_swapped(si, swp_offset(entry));
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II)
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (18 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 19/19] mm, swap: remove no longer needed _swap_info_get Kairui Song
@ 2025-10-30 23:04 ` Yosry Ahmed
2025-10-31 6:58 ` Kairui Song
2025-11-05 7:39 ` Chris Li
20 siblings, 1 reply; 50+ messages in thread
From: Yosry Ahmed @ 2025-10-30 23:04 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:26PM +0800, Kairui Song wrote:
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
> special swap bits including SWAP_HAS_CACHE, along with many historical
> issues. The performance is about ~20% better for some workloads, like
> Redis with persistence. This also cleans up the code to prepare for
> later phases, some patches are from a previously posted series.
>
> Swap cache bypassing and swap synchronization in general had many
> issues. Some are solved as workarounds, and some are still there [1]. To
> resolve them in a clean way, one good solution is to always use swap
> cache as the synchronization layer [2]. So we have to remove the swap
> cache bypass swap-in path first. It wasn't very doable due to
> performance issues, but now combined with the swap table, removing
> the swap cache bypass path will instead improve the performance,
> there is no reason to keep it.
>
> Now we can rework the swap entry and cache synchronization following
> the new design. Swap cache synchronization was heavily relying on
> SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> of special swap map bits and related workarounds, we get a cleaner code
> base and prepare for merging the swap count into the swap table in the
> next step.
>
> Test results:
>
> Redis / Valkey bench:
> =====================
>
> Testing on a ARM64 VM 1.5G memory:
> Server: valkey-server --maxmemory 2560M
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
> no persistence with BGSAVE
> Before: 460475.84 RPS 311591.19 RPS
> After: 451943.34 RPS (-1.9%) 371379.06 RPS (+19.2%)
>
> Testing on a x86_64 VM with 4G memory (system components takes about 2G):
> Server:
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
> no persistence with BGSAVE
> Before: 306044.38 RPS 102745.88 RPS
> After: 309645.44 RPS (+1.2%) 125313.28 RPS (+22.0%)
>
> The performance is a lot better when persistence is applied. This should
> apply to many other workloads that involve sharing memory and COW. A
> slight performance drop was observed for the ARM64 Redis test: We are
> still using swap_map to track the swap count, which is causing redundant
> cache and CPU overhead and is not very performance-friendly for some
> arches. This will be improved once we merge the swap map into the swap
> table (as already demonstrated previously [3]).
>
> vm-scalability
> ==============
> usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> simulated PMEM as swap), average result of 6 test run:
>
> Before: After:
> System time: 282.22s 283.47s
> Sum Throughput: 5677.35 MB/s 5688.78 MB/s
> Single process Throughput: 176.41 MB/s 176.23 MB/s
> Free latency: 518477.96 us 521488.06 us
>
> Which is almost identical.
>
> Build kernel test:
> ==================
> Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
>
> Before After:
> System time: 1379.91s 1364.22s (-0.11%)
>
> Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
>
> Before After:
> System time: 1822.52s 1803.33s (-0.11%)
>
> Which is almost identical.
>
> MySQL:
> ======
> sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> 512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).
>
> Before: 318162.18 qps
> After: 318512.01 qps (+0.01%)
>
> In conclusion, the result is looking better or identical for most cases,
> and it's especially better for workloads with swap count > 1 on SYNC_IO
> devices, about ~20% gain in above test. Next phases will start to merge
> swap count into swap table and reduce memory usage.
>
> One more gain here is that we now have better support for THP swapin.
> Previously, the THP swapin was bound with swap cache bypassing, which
> only works for single-mapped folios. Removing the bypassing path also
> enabled THP swapin for all folios. It's still limited to SYNC_IO
> devices, though, this limitation can will be removed later. This may
> cause more serious thrashing for certain workloads, but that's not an
> issue caused by this series, it's a common THP issue we should resolve
> separately.
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Unfortunately I don't have time to go through the series and review it,
but I wanted to just say awesome work here. The special cases in the
swap code to avoid using the swapcache have always been a pain.
In fact, there's one more special case that we can probably remove in
zswap_load() now, the one introduced by commit 25cd241408a2 ("mm: zswap:
fix data loss on SWP_SYNCHRONOUS_IO devices").
> ---
> Kairui Song (18):
> mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
> mm, swap: split swap cache preparation loop into a standalone helper
> mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
> mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
> mm, swap: simplify the code and reduce indention
> mm, swap: free the swap cache after folio is mapped
> mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
> mm, swap: swap entry of a bad slot should not be considered as swapped out
> mm, swap: consolidate cluster reclaim and check logic
> mm, swap: split locked entry duplicating into a standalone helper
> mm, swap: use swap cache as the swap in synchronize layer
> mm, swap: remove workaround for unsynchronized swap map cache state
> mm, swap: sanitize swap entry management workflow
> mm, swap: add folio to swap cache directly on allocation
> mm, swap: check swap table directly for checking cache
> mm, swap: clean up and improve swap entries freeing
> mm, swap: drop the SWAP_HAS_CACHE flag
> mm, swap: remove no longer needed _swap_info_get
>
> Nhat Pham (1):
> mm/shmem, swap: remove SWAP_MAP_SHMEM
>
> arch/s390/mm/pgtable.c | 2 +-
> include/linux/swap.h | 77 ++---
> kernel/power/swap.c | 10 +-
> mm/madvise.c | 2 +-
> mm/memory.c | 270 +++++++---------
> mm/rmap.c | 7 +-
> mm/shmem.c | 75 ++---
> mm/swap.h | 69 +++-
> mm/swap_state.c | 341 +++++++++++++-------
> mm/swapfile.c | 849 +++++++++++++++++++++----------------------------
> mm/userfaultfd.c | 10 +-
> mm/vmscan.c | 1 -
> mm/zswap.c | 4 +-
> 13 files changed, 840 insertions(+), 877 deletions(-)
> ---
> base-commit: f30d294530d939fa4b77d61bc60f25c4284841fa
> change-id: 20251007-swap-table-p2-7d3086e5c38a
>
> Best regards,
> --
> Kairui Song <kasong@tencent.com>
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II)
2025-10-30 23:04 ` [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Yosry Ahmed
@ 2025-10-31 6:58 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-31 6:58 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Oct 31, 2025 at 7:05 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:26PM +0800, Kairui Song wrote:
> > This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
> > special swap bits including SWAP_HAS_CACHE, along with many historical
> > issues. The performance is about ~20% better for some workloads, like
> > Redis with persistence. This also cleans up the code to prepare for
> > later phases, some patches are from a previously posted series.
> >
> > Swap cache bypassing and swap synchronization in general had many
> > issues. Some are solved as workarounds, and some are still there [1]. To
> > resolve them in a clean way, one good solution is to always use swap
> > cache as the synchronization layer [2]. So we have to remove the swap
> > cache bypass swap-in path first. It wasn't very doable due to
> > performance issues, but now combined with the swap table, removing
> > the swap cache bypass path will instead improve the performance,
> > there is no reason to keep it.
> >
> > Now we can rework the swap entry and cache synchronization following
> > the new design. Swap cache synchronization was heavily relying on
> > SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> > of special swap map bits and related workarounds, we get a cleaner code
> > base and prepare for merging the swap count into the swap table in the
> > next step.
> >
> > Test results:
> >
> > Redis / Valkey bench:
> > =====================
> >
> > Testing on a ARM64 VM 1.5G memory:
> > Server: valkey-server --maxmemory 2560M
> > Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> >
> > no persistence with BGSAVE
> > Before: 460475.84 RPS 311591.19 RPS
> > After: 451943.34 RPS (-1.9%) 371379.06 RPS (+19.2%)
> >
> > Testing on a x86_64 VM with 4G memory (system components takes about 2G):
> > Server:
> > Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> >
> > no persistence with BGSAVE
> > Before: 306044.38 RPS 102745.88 RPS
> > After: 309645.44 RPS (+1.2%) 125313.28 RPS (+22.0%)
> >
> > The performance is a lot better when persistence is applied. This should
> > apply to many other workloads that involve sharing memory and COW. A
> > slight performance drop was observed for the ARM64 Redis test: We are
> > still using swap_map to track the swap count, which is causing redundant
> > cache and CPU overhead and is not very performance-friendly for some
> > arches. This will be improved once we merge the swap map into the swap
> > table (as already demonstrated previously [3]).
> >
> > vm-scabiity
> > ===========
> > usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> > simulated PMEM as swap), average result of 6 test run:
> >
> > Before: After:
> > System time: 282.22s 283.47s
> > Sum Throughput: 5677.35 MB/s 5688.78 MB/s
> > Single process Throughput: 176.41 MB/s 176.23 MB/s
> > Free latency: 518477.96 us 521488.06 us
> >
> > Which is almost identical.
> >
> > Build kernel test:
> > ==================
> > Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
> > with 4G RAM, under global pressure, avg of 32 test run:
> >
> > Before After:
> > System time: 1379.91s 1364.22s (-0.11%)
> >
> > Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
> > with 4G RAM, under global pressure, avg of 32 test run:
> >
> > Before After:
> > System time: 1822.52s 1803.33s (-0.11%)
> >
> > Which is almost identical.
> >
> > MySQL:
> > ======
> > sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> > --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> > 512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).
> >
> > Before: 318162.18 qps
> > After: 318512.01 qps (+0.01%)
> >
> > In conclusion, the result is looking better or identical for most cases,
> > and it's especially better for workloads with swap count > 1 on SYNC_IO
> > devices, about ~20% gain in above test. Next phases will start to merge
> > swap count into swap table and reduce memory usage.
> >
> > One more gain here is that we now have better support for THP swapin.
> > Previously, the THP swapin was bound with swap cache bypassing, which
> > only works for single-mapped folios. Removing the bypassing path also
> > enabled THP swapin for all folios. It's still limited to SYNC_IO
> > devices, though, this limitation can will be removed later. This may
> > cause more serious thrashing for certain workloads, but that's not an
> > issue caused by this series, it's a common THP issue we should resolve
> > separately.
> >
> > Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> > Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> > Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
> >
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Unfortunately I don't have time to go through the series and review it,
> but I wanted to just say awesome work here. The special cases in the
> swap code to avoid using the swapcache have always been a pain.
>
> In fact, there's one more special case that we can probably remove in
> zswap_load() now, the one introduced by commit 25cd241408a2 ("mm: zswap:
> fix data loss on SWP_SYNCHRONOUS_IO devices").
Thanks! Oh, now I remember that one; it can indeed be removed. There
are several more cleanups and optimizations that can be done after this
series; it's getting too long already, so I didn't include everything.
But removing 25cd241408a2 is easy to do and easy to review, so I can
include it in the next update.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II)
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (19 preceding siblings ...)
2025-10-30 23:04 ` [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Yosry Ahmed
@ 2025-11-05 7:39 ` Chris Li
20 siblings, 0 replies; 50+ messages in thread
From: Chris Li @ 2025-11-05 7:39 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
Sorry, I have been super busy and am late to the review party.
I am still catching up on my backlog.
The cover letter title is a bit too long; I suggest putting "swap table
phase II" at the beginning of the title rather than at the end, since
"phase II" currently gets wrapped onto another line. Maybe just using
"swap table phase II" as the cover letter title is good enough. You can
explain what this series does in more detail in the body of the cover
letter.
Also, we could mention the estimated total number of phases for the
swap table work (4-5 phases?). It does not need to be precise; it just
serves as an overall indication of the swap table progress bar.
On Wed, Oct 29, 2025 at 8:59 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
Great job!
> special swap bits including SWAP_HAS_CACHE, along with many historical
> issues. The performance is about ~20% better for some workloads, like
> Redis with persistence. This also cleans up the code to prepare for
> later phases, some patches are from a previously posted series.
It is wonderful that we can remove SWAP_HAS_CACHE and the sync IO swap
cache bypass. The swap table is so fast that the bypass does not make
sense any more.
> [...]
>
> One more gain here is that we now have better support for THP swapin.
> Previously, the THP swapin was bound with swap cache bypassing, which
> only works for single-mapped folios. Removing the bypassing path also
> enabled THP swapin for all folios. It's still limited to SYNC_IO
> devices, though, this limitation can will be removed later. This may
Grammar: "though, this", "can will be".
The THP swapin is still limited to SYNC_IO devices. This limitation
can be removed later.
Chris
^ permalink raw reply [flat|nested] 50+ messages in thread