[PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags

Linux Power Management development
 help / color / mirror / Atom feed

* [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags
@ 2025-12-19 19:43 Kairui Song
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Kairui Song @ 2025-12-19 19:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
	Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code and
special swap flag bits including SWAP_HAS_CACHE, along with many historical
issues. The performance is about ~20% better for some workloads, like
Redis with persistence. This also cleans up the code to prepare for
later phases, some patches are from a previously posted series.

Swap cache bypassing and swap synchronization in general had many
issues. Some are solved as workarounds, and some are still there [1]. To
resolve them in a clean way, one good solution is to always use swap
cache as the synchronization layer [2]. So we have to remove the swap
cache bypass swap-in path first. It wasn't very doable due to
performance issues, but now combined with the swap table, removing
the swap cache bypass path will instead improve the performance,
there is no reason to keep it.

Now we can rework the swap entry and cache synchronization following
the new design. Swap cache synchronization was heavily relying on
SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
of special swap map bits and related workarounds, we get a cleaner code
base and prepare for merging the swap count into the swap table in the
next step.

And swap_map is now only used for swap count, so in the next phase,
swap_map can be merged into the swap table, which will clean up more
things and start to reduce the static memory usage. Removal of
swap_cgroup_ctrl is also doable, but needs to be done after we also
simplify the allocation of swapin folios: always use the new
swap_cache_alloc_folio helper so the accounting will also be managed by
the swap layer by then.

Test results:

Redis / Valkey bench:
=====================

Testing on a ARM64 VM 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 460475.84 RPS               311591.19 RPS
After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)

Testing on a x86_64 VM with 4G memory (system components takes about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 306044.38 RPS               102745.88 RPS
After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)

The performance is a lot better when persistence is applied. This should
apply to many other workloads that involve sharing memory and COW. A
slight performance drop was observed for the ARM64 Redis test: We are
still using swap_map to track the swap count, which is causing redundant
cache and CPU overhead and is not very performance-friendly for some
arches. This will be improved once we merge the swap map into the swap
table (as already demonstrated previously [3]).

vm-scabiity
===========
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
simulated PMEM as swap), average result of 6 test run:

                           Before:         After:
System time:               282.22s         283.47s
Sum Throughput:            5677.35 MB/s    5688.78 MB/s
Single process Throughput: 176.41 MB/s     176.23 MB/s
Free latency:              518477.96 us    521488.06 us

Which is almost identical.

Build kernel test:
==================
Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
with 4G RAM, under global pressure, avg of 32 test run:

                Before            After:
System time:    1379.91s          1364.22s (-0.11%)

Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
with 4G RAM, under global pressure, avg of 32 test run:

                Before            After:
System time:    1822.52s          1803.33s (-0.11%)

Which is almost identical.

MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
--table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).

Before: 318162.18 qps
After:  318512.01 qps (+0.01%)

In conclusion, the result is looking better or identical for most cases,
and it's especially better for workloads with swap count > 1 on SYNC_IO
devices, about ~20% gain in above test. Next phases will start to merge
swap count into swap table and reduce memory usage.

One more gain here is that we now have better support for THP swapin.
Previously, the THP swapin was bound with swap cache bypassing, which
only works for single-mapped folios. Removing the bypassing path also
enabled THP swapin for all folios. The THP swapin is still limited to
SYNC_IO devices, the limitation can be removed later.

This may cause more serious THP thrashing for certain workloads, but that's
not an issue caused by this series, it's a common THP issue we should resolve
separately.

Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Changes in v5:
Rebased on top of current mm-unstalbe, also appliable on mm-new.
- Solve trivial conlicts with 6.19 rc1 for easier reviewing.
- Don't change the argument for swap_entry_swapped [ Baoquan He ].
- Update commit message and comment [ Baoquan He ].
- Add a WARN in swap_dup_entries to catch potential swap count
  overflow. No error was ever observed for this but the check existed
  before, so just keep it to be very careful.
- Link to v4: https://lore.kernel.org/r/20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com

Changes in v4:
- Rebase on latest mm-unstable, should be also mergeable with mm-new.
- Update the shmem update commit message as suggested by, and reviewed
  by [ Baolin Wang ].
- Add a WARN_ON to catch more potential issue and update a few comments.
- Link to v3: https://lore.kernel.org/r/20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com

Changes in v3:
- Imporve and update comments [ Barry Song, YoungJun Park, Chris Li ]
- Simplify the changes of cluster_reclaim_range a bit, as YoungJun points
  out the change looked confusing.
- Fix a few typos I found during self review.
- Fix a few build error and warns.
- Link to v2: https://lore.kernel.org/r/20251117-swap-table-p2-v2-0-37730e6ea6d5@tencent.com

Changes in v2:
- Rebased on latest mm-new to resolve conflicts, also appliable to
  mm-unstable.
- Imporve comment, and commit messages in multiple commits, many thanks to
  [Barry Song, YoungJun Park, Yosry Ahmed ]
- Fix cluster usable check in allocator [ YoungJun Park]
- Improve cover letter [ Chris Li ]
- Collect Reviewed-by [ Yosry Ahmed ]
- Fix a few build warning and issues from build bot.
- Link to v1: https://lore.kernel.org/r/20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com

---
Kairui Song (18):
      mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
      mm, swap: split swap cache preparation loop into a standalone helper
      mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
      mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
      mm, swap: simplify the code and reduce indention
      mm, swap: free the swap cache after folio is mapped
      mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
      mm, swap: swap entry of a bad slot should not be considered as swapped out
      mm, swap: consolidate cluster reclaim and usability check
      mm, swap: split locked entry duplicating into a standalone helper
      mm, swap: use swap cache as the swap in synchronize layer
      mm, swap: remove workaround for unsynchronized swap map cache state
      mm, swap: cleanup swap entry management workflow
      mm, swap: add folio to swap cache directly on allocation
      mm, swap: check swap table directly for checking cache
      mm, swap: clean up and improve swap entries freeing
      mm, swap: drop the SWAP_HAS_CACHE flag
      mm, swap: remove no longer needed _swap_info_get

Nhat Pham (1):
      mm/shmem, swap: remove SWAP_MAP_SHMEM

 arch/s390/mm/gmap_helpers.c |   2 +-
 arch/s390/mm/pgtable.c      |   2 +-
 include/linux/swap.h        |  71 ++--
 kernel/power/swap.c         |  10 +-
 mm/madvise.c                |   2 +-
 mm/memory.c                 | 276 +++++++-------
 mm/rmap.c                   |   7 +-
 mm/shmem.c                  |  75 ++--
 mm/swap.h                   |  70 +++-
 mm/swap_state.c             | 338 +++++++++++------
 mm/swapfile.c               | 861 ++++++++++++++++++++------------------------
 mm/userfaultfd.c            |  10 +-
 mm/vmscan.c                 |   1 -
 mm/zswap.c                  |   4 +-
 14 files changed, 858 insertions(+), 871 deletions(-)
---
base-commit: dc9f44261a74a4db5fe8ed570fc8b3edc53a28a2
change-id: 20251007-swap-table-p2-7d3086e5c38a

Best regards,
-- 
Kairui Song <kasong@tencent.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-19 19:43 [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
@ 2025-12-19 19:43 ` Kairui Song
  2025-12-20  4:02   ` Baoquan He
                     ` (4 more replies)
  2025-12-19 20:05 ` [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
  2025-12-20 12:34 ` Baoquan He
  2 siblings, 5 replies; 16+ messages in thread
From: Kairui Song @ 2025-12-19 19:43 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
	Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

From: Kairui Song <kasong@tencent.com>

The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.

This commit introduces a proper definition of how swap entries would be
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we now have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with new added sanity checks. Also making more optimization
possible.

Swap entry will be mostly allocated and free with a folio bound.
The folio lock will be useful for resolving many swap ralated races.

Now swap allocation (except hibernation) always starts with a folio in
the swap cache, and gets duped/freed protected by the folio lock:

- folio_alloc_swap() - The only allocation entry point now.
  Context: The folio must be locked.
  This allocates one or a set of continuous swap slots for a folio and
  binds them to the folio by adding the folio to the swap cache. The
  swap slots' swap count start with zero value.

- folio_dup_swap() - Increase the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This increases the ref count of swap entries allocated to a folio.
  Newly allocated swap slots' count has to be increased by this helper
  as the folio got unmapped (and swap entries got installed).

- folio_put_swap() - Decrease the swap count of one or more entries.
  Context: The folio must be locked and in the swap cache. For now, the
  caller still has to lock the new swap entry owner (e.g., PTL).
  This decreases the ref count of swap entries allocated to a folio.
  Typically, swapin will decrease the swap count as the folio got
  installed back and the swap entry got uninstalled

  This won't remove the folio from the swap cache and free the
  slot. Lazy freeing of swap cache is helpful for reducing IO.
  There is already a folio_free_swap() for immediate cache reclaim.
  This part could be further optimized later.

The above locking constraints could be further relaxed when the swap
table if fully implemented. Currently dup still needs the caller
to lock the swap entry container (e.g. PTL), or a concurrent zap
may underflow the swap count.

Some swap users need to interact with swap count without involving folio
(e.g. forking/zapping the page table or mapping truncate without swapin).
In such cases, the caller has to ensure there is no race condition on
whatever owns the swap count and use the below helpers:

- swap_put_entries_direct() - Decrease the swap count directly.
  Context: The caller must lock whatever is referencing the slots to
  avoid a race.

  Typically the page table zapping or shmem mapping truncate will need
  to free swap slots directly. If a slot is cached (has a folio bound),
  this will also try to release the swap cache.

- swap_dup_entry_direct() - Increase the swap count directly.
  Context: The caller must lock whatever is referencing the entries to
  avoid race, and the entries must already have a swap count > 1.

  Typically, forking will need to copy the page table and hence needs to
  increase the swap count of the entries in the table. The page table is
  locked while referencing the swap entries, so the entries all have a
  swap count > 1 and can't be freed.

Hibernation subsystem is a bit different, so two special wrappers are here:

- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
helper.

All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.

By separating the workflows, it will be possible to bind folio more
tightly with swap cache and get rid of the SWAP_HAS_CACHE as a temporary
pin.

This commit should not introduce any behavior change

Cc: linux-pm@vger.kernel.org
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 arch/s390/mm/gmap_helpers.c |   2 +-
 arch/s390/mm/pgtable.c      |   2 +-
 include/linux/swap.h        |  58 ++++++++---------
 kernel/power/swap.c         |  10 +--
 mm/madvise.c                |   2 +-
 mm/memory.c                 |  15 +++--
 mm/rmap.c                   |   7 +-
 mm/shmem.c                  |  10 +--
 mm/swap.h                   |  37 +++++++++++
 mm/swapfile.c               | 152 +++++++++++++++++++++++++++++++-------------
 10 files changed, 197 insertions(+), 98 deletions(-)

diff --git a/arch/s390/mm/gmap_helpers.c b/arch/s390/mm/gmap_helpers.c
index d41b19925a5a..dd89fce28531 100644
--- a/arch/s390/mm/gmap_helpers.c
+++ b/arch/s390/mm/gmap_helpers.c
@@ -32,7 +32,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
 		dec_mm_counter(mm, MM_SWAPENTS);
 	else if (softleaf_is_migration(entry))
 		dec_mm_counter(mm, mm_counter(softleaf_to_folio(entry)));
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
 }
 
 /**
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 666adcd681ab..b22181e1079e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -682,7 +682,7 @@ static void ptep_zap_softleaf_entry(struct mm_struct *mm, softleaf_t entry)
 
 		dec_mm_counter(mm, mm_counter(folio));
 	}
-	free_swap_and_cache(entry);
+	swap_put_entries_direct(entry, 1);
 }
 
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 74df3004c850..aaa868f60b9c 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio);
-bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
 int swap_type_of(dev_t device, sector_t offset);
 int find_first_swap(dev_t *device);
 extern unsigned int count_swap_pages(int, int);
@@ -471,6 +465,29 @@ struct backing_dev_info;
 extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
 sector_t swap_folio_sector(struct folio *folio);
 
+/*
+ * If there is an existing swap slot reference (swap entry) and the caller
+ * guarantees that there is no race modification of it (e.g., PTL
+ * protecting the swap entry in page table; shmem's cmpxchg protects t
+ * he swap entry in shmem mapping), these two helpers below can be used
+ * to put/dup the entries directly.
+ *
+ * All entries must be allocated by folio_alloc_swap(). And they must have
+ * a swap count > 1. See comments of folio_*_swap helpers for more info.
+ */
+int swap_dup_entry_direct(swp_entry_t entry);
+void swap_put_entries_direct(swp_entry_t entry, int nr);
+
+/*
+ * folio_free_swap tries to free the swap entries pinned by a swap cache
+ * folio, it has to be here to be called by other components.
+ */
+bool folio_free_swap(struct folio *folio);
+
+/* Allocate / free (hibernation) exclusive entries */
+swp_entry_t swap_alloc_hibernation_slot(int type);
+void swap_free_hibernation_slot(swp_entry_t entry);
+
 static inline void put_swap_device(struct swap_info_struct *si)
 {
 	percpu_ref_put(&si->users);
@@ -498,10 +515,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
 #define free_pages_and_swap_cache(pages, nr) \
 	release_pages((pages), (nr));
 
-static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
-}
-
 static inline void free_swap_cache(struct folio *folio)
 {
 }
@@ -511,12 +524,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 	return 0;
 }
 
-static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
+static inline int swap_dup_entry_direct(swp_entry_t ent)
 {
 	return 0;
 }
 
-static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
 {
 }
 
@@ -539,11 +552,6 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-static inline int folio_alloc_swap(struct folio *folio)
-{
-	return -EINVAL;
-}
-
 static inline bool folio_free_swap(struct folio *folio)
 {
 	return false;
@@ -556,22 +564,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
 	return -EINVAL;
 }
 #endif /* CONFIG_SWAP */
-
-static inline int swap_duplicate(swp_entry_t entry)
-{
-	return swap_duplicate_nr(entry, 1);
-}
-
-static inline void free_swap_and_cache(swp_entry_t entry)
-{
-	free_swap_and_cache_nr(entry, 1);
-}
-
-static inline void swap_free(swp_entry_t entry)
-{
-	swap_free_nr(entry, 1);
-}
-
 #ifdef CONFIG_MEMCG
 static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
 {
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 33a186373bef..859476a714ac 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -174,10 +174,10 @@ sector_t alloc_swapdev_block(int swap)
 	 * Allocate a swap page and register that it has been allocated, so that
 	 * it can be freed in case of an error.
 	 */
-	offset = swp_offset(get_swap_page_of_type(swap));
+	offset = swp_offset(swap_alloc_hibernation_slot(swap));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free_hibernation_slot(swp_entry(swap, offset));
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)
 
 void free_all_swap_pages(int swap)
 {
+	unsigned long offset;
 	struct rb_node *node;
 
 	/*
@@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)
 
 		ext = rb_entry(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
-		swap_free_nr(swp_entry(swap, ext->start),
-			     ext->end - ext->start + 1);
+
+		for (offset = ext->start; offset < ext->end; offset++)
+			swap_free_hibernation_slot(swp_entry(swap, offset));
 
 		kfree(ext);
 	}
diff --git a/mm/madvise.c b/mm/madvise.c
index 6bf7009fa5ce..5f79f6fabfc0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -694,7 +694,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				max_nr = (end - addr) / PAGE_SIZE;
 				nr = swap_pte_batch(pte, max_nr, ptent);
 				nr_swap -= nr;
-				free_swap_and_cache_nr(entry, nr);
+				swap_put_entries_direct(entry, nr);
 				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
 			} else if (softleaf_is_hwpoison(entry) ||
 				   softleaf_is_poison_marker(entry)) {
diff --git a/mm/memory.c b/mm/memory.c
index a4c58341c44a..a61508107f6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -934,7 +934,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	struct page *page;
 
 	if (likely(softleaf_is_swap(entry))) {
-		if (swap_duplicate(entry) < 0)
+		if (swap_dup_entry_direct(entry) < 0)
 			return -EIO;
 
 		/* make sure dst_mm is on swapoff's mmlist. */
@@ -1744,7 +1744,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 
 		nr = swap_pte_batch(pte, max_nr, ptent);
 		rss[MM_SWAPENTS] -= nr;
-		free_swap_and_cache_nr(entry, nr);
+		swap_put_entries_direct(entry, nr);
 	} else if (softleaf_is_migration(entry)) {
 		struct folio *folio = softleaf_to_folio(entry);
 
@@ -4933,7 +4933,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	/*
 	 * Some architectures may have to restore extra metadata to the page
 	 * when reading from swap. This metadata may be indexed by swap entry
-	 * so this must be called before swap_free().
+	 * so this must be called before folio_put_swap().
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
@@ -4971,6 +4971,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio != swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
+		folio_put_swap(swapcache, NULL);
 	} else if (!folio_test_anon(folio)) {
 		/*
 		 * We currently only expect !anon folios that are fully
@@ -4979,9 +4980,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+		folio_put_swap(folio, NULL);
 	} else {
+		VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
-					rmap_flags);
+					 rmap_flags);
+		folio_put_swap(folio, nr_pages == 1 ? page : NULL);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
@@ -4995,7 +4999,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * Do it after mapping, so raced page faults will likely see the folio
 	 * in swap cache and wait on the folio lock.
 	 */
-	swap_free_nr(entry, nr_pages);
 	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
 		folio_free_swap(folio);
 
@@ -5005,7 +5008,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		 * Hold the lock to avoid the swap entry to be reused
 		 * until we take the PT lock for the pte_same() check
 		 * (to avoid false positives from pte_same). For
-		 * further safety release the lock after the swap_free
+		 * further safety release the lock after the folio_put_swap
 		 * so that the swap count won't change under a
 		 * parallel locked swapcache.
 		 */
diff --git a/mm/rmap.c b/mm/rmap.c
index d6799afe1114..e805ddc5a27b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -82,6 +82,7 @@
 #include <trace/events/migrate.h>
 
 #include "internal.h"
+#include "swap.h"
 
 static struct kmem_cache *anon_vma_cachep;
 static struct kmem_cache *anon_vma_chain_cachep;
@@ -2147,7 +2148,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				goto discard;
 			}
 
-			if (swap_duplicate(entry) < 0) {
+			if (folio_dup_swap(folio, subpage) < 0) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
@@ -2158,7 +2159,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			 * so we'll not check/care.
 			 */
 			if (arch_unmap_one(mm, vma, address, pteval) < 0) {
-				swap_free(entry);
+				folio_put_swap(folio, subpage);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
@@ -2166,7 +2167,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 			/* See folio_try_share_anon_rmap(): clear PTE first. */
 			if (anon_exclusive &&
 			    folio_try_share_anon_rmap_pte(folio, subpage)) {
-				swap_free(entry);
+				folio_put_swap(folio, subpage);
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				goto walk_abort;
 			}
diff --git a/mm/shmem.c b/mm/shmem.c
index e36330cdd066..df346f0c8ddc 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -970,7 +970,7 @@ static long shmem_free_swap(struct address_space *mapping,
 	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
 	if (old != radswap)
 		return 0;
-	free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+	swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
 
 	return 1 << order;
 }
@@ -1667,7 +1667,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 			spin_unlock(&shmem_swaplist_lock);
 		}
 
-		swap_duplicate_nr(folio->swap, nr_pages);
+		folio_dup_swap(folio, NULL);
 		shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
 
 		BUG_ON(folio_mapped(folio));
@@ -1688,7 +1688,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
 		/* Swap entry might be erased by racing shmem_free_swap() */
 		if (!error) {
 			shmem_recalc_inode(inode, 0, -nr_pages);
-			swap_free_nr(folio->swap, nr_pages);
+			folio_put_swap(folio, NULL);
 		}
 
 		/*
@@ -2174,6 +2174,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 
 	nr_pages = folio_nr_pages(folio);
 	folio_wait_writeback(folio);
+	folio_put_swap(folio, NULL);
 	swap_cache_del_folio(folio);
 	/*
 	 * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2181,7 +2182,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
 	 * in shmem_evict_inode().
 	 */
 	shmem_recalc_inode(inode, -nr_pages, -nr_pages);
-	swap_free_nr(swap, nr_pages);
 }
 
 static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
@@ -2404,9 +2404,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	if (sgp == SGP_WRITE)
 		folio_mark_accessed(folio);
 
+	folio_put_swap(folio, NULL);
 	swap_cache_del_folio(folio);
 	folio_mark_dirty(folio);
-	swap_free_nr(swap, nr_pages);
 	put_swap_device(si);
 
 	*foliop = folio;
diff --git a/mm/swap.h b/mm/swap.h
index 6777b2ab9d92..9ed12936b889 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
 	spin_unlock_irq(&ci->lock);
 }
 
+/*
+ * Below are the core routines for doing swap for a folio.
+ * All helpers requires the folio to be locked, and a locked folio
+ * in the swap cache pins the swap entries / slots allocated to the
+ * folio, swap relies heavily on the swap cache and folio lock for
+ * synchronization.
+ *
+ * folio_alloc_swap(): the entry point for a folio to be swapped
+ * out. It allocates swap slots and pins the slots with swap cache.
+ * The slots start with a swap count of zero.
+ *
+ * folio_dup_swap(): increases the swap count of a folio, usually
+ * during it gets unmapped and a swap entry is installed to replace
+ * it (e.g., swap entry in page table). A swap slot with swap
+ * count == 0 should only be increasd by this helper.
+ *
+ * folio_put_swap(): does the opposite thing of folio_dup_swap().
+ */
+int folio_alloc_swap(struct folio *folio);
+int folio_dup_swap(struct folio *folio, struct page *subpage);
+void folio_put_swap(struct folio *folio, struct page *subpage);
+
 /* linux/mm/page_io.c */
 int sio_pool_init(void);
 struct swap_iocb;
@@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
 	return NULL;
 }
 
+static inline int folio_alloc_swap(struct folio *folio)
+{
+	return -EINVAL;
+}
+
+static inline int folio_dup_swap(struct folio *folio, struct page *page)
+{
+	return -EINVAL;
+}
+
+static inline void folio_put_swap(struct folio *folio, struct page *page)
+{
+}
+
 static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 {
 }
+
 static inline void swap_write_unplug(struct swap_iocb *sio)
 {
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 38f3c369df72..f812fdea68b3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si,
 			      swp_entry_t entry, unsigned int nr_pages);
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
+static bool swap_entries_put_map(struct swap_info_struct *si,
+				 swp_entry_t entry, int nr);
 static bool folio_swapcache_freeable(struct folio *folio);
 static void move_cluster(struct swap_info_struct *si,
 			 struct swap_cluster_info *ci, struct list_head *list,
@@ -1482,6 +1485,12 @@ int folio_alloc_swap(struct folio *folio)
 	 */
 	WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
 
+	/*
+	 * Allocator should always allocate aligned entries so folio based
+	 * operations never crossed more than one cluster.
+	 */
+	VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
+
 	return 0;
 
 out_free:
@@ -1489,6 +1498,66 @@ int folio_alloc_swap(struct folio *folio)
 	return -ENOMEM;
 }
 
+/**
+ * folio_dup_swap() - Increase swap count of swap entries of a folio.
+ * @folio: folio with swap entries bounded.
+ * @subpage: if not NULL, only increase the swap count of this subpage.
+ *
+ * Typically called when the folio is unmapped and have its swap entry to
+ * take its palce.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * NOTE: The caller also has to ensure there is no raced call to
+ * swap_put_entries_direct on its swap entry before this helper returns, or
+ * the swap map may underflow. Currently, we only accept @subpage == NULL
+ * for shmem due to the limitation of swap continuation: shmem always
+ * duplicates the swap entry only once, so there is no such issue for it.
+ */
+int folio_dup_swap(struct folio *folio, struct page *subpage)
+{
+	int err = 0;
+	swp_entry_t entry = folio->swap;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+	if (subpage) {
+		entry.val += folio_page_idx(folio, subpage);
+		nr_pages = 1;
+	}
+
+	while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
+		err = add_swap_count_continuation(entry, GFP_ATOMIC);
+
+	return err;
+}
+
+/**
+ * folio_put_swap() - Decrease swap count of swap entries of a folio.
+ * @folio: folio with swap entries bounded, must be in swap cache and locked.
+ * @subpage: if not NULL, only decrease the swap count of this subpage.
+ *
+ * This won't free the swap slots even if swap count drops to zero, they are
+ * still pinned by the swap cache. User may call folio_free_swap to free them.
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ */
+void folio_put_swap(struct folio *folio, struct page *subpage)
+{
+	swp_entry_t entry = folio->swap;
+	unsigned long nr_pages = folio_nr_pages(folio);
+
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+	if (subpage) {
+		entry.val += folio_page_idx(folio, subpage);
+		nr_pages = 1;
+	}
+
+	swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages);
+}
+
 static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 {
 	struct swap_info_struct *si;
@@ -1729,28 +1798,6 @@ static void swap_entries_free(struct swap_info_struct *si,
 		partial_free_cluster(si, ci);
 }
 
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free_nr(swp_entry_t entry, int nr_pages)
-{
-	int nr;
-	struct swap_info_struct *sis;
-	unsigned long offset = swp_offset(entry);
-
-	sis = _swap_info_get(entry);
-	if (!sis)
-		return;
-
-	while (nr_pages) {
-		nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-		swap_entries_put_map(sis, swp_entry(sis->type, offset), nr);
-		offset += nr;
-		nr_pages -= nr;
-	}
-}
-
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
  */
@@ -1940,16 +1987,19 @@ bool folio_free_swap(struct folio *folio)
 }
 
 /**
- * free_swap_and_cache_nr() - Release reference on range of swap entries and
- *                            reclaim their cache if no more references remain.
+ * swap_put_entries_direct() - Release reference on range of swap entries and
+ *                             reclaim their cache if no more references remain.
  * @entry: First entry of range.
  * @nr: Number of entries in range.
  *
  * For each swap entry in the contiguous range, release a reference. If any swap
  * entries become free, try to reclaim their underlying folios, if present. The
  * offset range is defined by [entry.offset, entry.offset + nr).
+ *
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner. e.g., locking the PTL of a PTE containing the entry being released.
  */
-void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+void swap_put_entries_direct(swp_entry_t entry, int nr)
 {
 	const unsigned long start_offset = swp_offset(entry);
 	const unsigned long end_offset = start_offset + nr;
@@ -1958,10 +2008,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 	unsigned long offset;
 
 	si = get_swap_device(entry);
-	if (!si)
+	if (WARN_ON_ONCE(!si))
 		return;
-
-	if (WARN_ON(end_offset > si->max))
+	if (WARN_ON_ONCE(end_offset > si->max))
 		goto out;
 
 	/*
@@ -2005,8 +2054,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
 }
 
 #ifdef CONFIG_HIBERNATION
-
-swp_entry_t get_swap_page_of_type(int type)
+/* Allocate a slot for hibernation */
+swp_entry_t swap_alloc_hibernation_slot(int type)
 {
 	struct swap_info_struct *si = swap_type_to_info(type);
 	unsigned long offset;
@@ -2034,6 +2083,27 @@ swp_entry_t get_swap_page_of_type(int type)
 	return entry;
 }
 
+/* Free a slot allocated by swap_alloc_hibernation_slot */
+void swap_free_hibernation_slot(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	pgoff_t offset = swp_offset(entry);
+
+	si = get_swap_device(entry);
+	if (WARN_ON(!si))
+		return;
+
+	ci = swap_cluster_lock(si, offset);
+	swap_entry_put_locked(si, ci, entry, 1);
+	WARN_ON(swap_entry_swapped(si, entry));
+	swap_cluster_unlock(ci);
+
+	/* In theory readahead might add it to the swap cache by accident */
+	__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+	put_swap_device(si);
+}
+
 /*
  * Find the swap type that corresponds to given device (if any).
  *
@@ -2195,7 +2265,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	/*
 	 * Some architectures may have to restore extra metadata to the page
 	 * when reading from swap. This metadata may be indexed by swap entry
-	 * so this must be called before swap_free().
+	 * so this must be called before folio_put_swap().
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
@@ -2236,7 +2306,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		new_pte = pte_mkuffd_wp(new_pte);
 setpte:
 	set_pte_at(vma->vm_mm, addr, pte, new_pte);
-	swap_free(entry);
+	folio_put_swap(folio, page);
 out:
 	if (pte)
 		pte_unmap_unlock(pte, ptl);
@@ -3746,28 +3816,22 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	return err;
 }
 
-/**
- * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
- *                       by 1.
- *
+/*
+ * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
  * @entry: first swap entry from which we want to increase the refcount.
- * @nr: Number of entries in range.
  *
  * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
  * but could not be atomically allocated.  Returns 0, just as if it succeeded,
  * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
  * might occur if a page table entry has got corrupted.
  *
- * Note that we are currently not handling the case where nr > 1 and we need to
- * add swap count continuation. This is OK, because no such user exists - shmem
- * is the only user that can pass nr > 1, and it never re-duplicates any swap
- * entry it owns.
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner. e.g., locking the PTL of a PTE containing the entry being increased.
  */
-int swap_duplicate_nr(swp_entry_t entry, int nr)
+int swap_dup_entry_direct(swp_entry_t entry)
 {
 	int err = 0;
-
-	while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
+	while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
 		err = add_swap_count_continuation(entry, GFP_ATOMIC);
 	return err;
 }

-- 
2.52.0


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
@ 2025-12-20  4:02   ` Baoquan He
  2025-12-22  2:43     ` Kairui Song
  2026-01-14 12:16   ` Chris Mason
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Baoquan He @ 2025-12-20  4:02 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
	Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

On 12/20/25 at 03:43am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
> 
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.
> 
> Swap entry will be mostly allocated and free with a folio bound.
                                          ~~~~
                                          freed, typo
> The folio lock will be useful for resolving many swap ralated races.
> 
> Now swap allocation (except hibernation) always starts with a folio in
> the swap cache, and gets duped/freed protected by the folio lock:
> 
> - folio_alloc_swap() - The only allocation entry point now.
>   Context: The folio must be locked.
>   This allocates one or a set of continuous swap slots for a folio and
>   binds them to the folio by adding the folio to the swap cache. The
>   swap slots' swap count start with zero value.
> 
> - folio_dup_swap() - Increase the swap count of one or more entries.
>   Context: The folio must be locked and in the swap cache. For now, the
>   caller still has to lock the new swap entry owner (e.g., PTL).
>   This increases the ref count of swap entries allocated to a folio.
>   Newly allocated swap slots' count has to be increased by this helper
>   as the folio got unmapped (and swap entries got installed).
> 
> - folio_put_swap() - Decrease the swap count of one or more entries.
>   Context: The folio must be locked and in the swap cache. For now, the
>   caller still has to lock the new swap entry owner (e.g., PTL).
>   This decreases the ref count of swap entries allocated to a folio.
>   Typically, swapin will decrease the swap count as the folio got
>   installed back and the swap entry got uninstalled
> 
>   This won't remove the folio from the swap cache and free the
>   slot. Lazy freeing of swap cache is helpful for reducing IO.
>   There is already a folio_free_swap() for immediate cache reclaim.
>   This part could be further optimized later.
> 
> The above locking constraints could be further relaxed when the swap
> table if fully implemented. Currently dup still needs the caller
        ~~ s/if/is/ typo

> to lock the swap entry container (e.g. PTL), or a concurrent zap
> may underflow the swap count.
......


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-20  4:02   ` Baoquan He
@ 2025-12-22  2:43     ` Kairui Song
  2026-01-07 16:05       ` Kairui Song
  0 siblings, 1 reply; 16+ messages in thread
From: Kairui Song @ 2025-12-22  2:43 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel)

On Sat, Dec 20, 2025 at 12:02 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 12/20/25 at 03:43am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > The current swap entry allocation/freeing workflow has never had a clear
> > definition. This makes it hard to debug or add new optimizations.
> >
> > This commit introduces a proper definition of how swap entries would be
> > allocated and freed. Now, most operations are folio based, so they will
> > never exceed one swap cluster, and we now have a cleaner border between
> > swap and the rest of mm, making it much easier to follow and debug,
> > especially with new added sanity checks. Also making more optimization
> > possible.
> >
> > Swap entry will be mostly allocated and free with a folio bound.
>                                           ~~~~
>                                           freed, typo

Ack, nice catch.

> > The folio lock will be useful for resolving many swap ralated races.
> >
> > Now swap allocation (except hibernation) always starts with a folio in
> > the swap cache, and gets duped/freed protected by the folio lock:
> >
> > - folio_alloc_swap() - The only allocation entry point now.
> >   Context: The folio must be locked.
> >   This allocates one or a set of continuous swap slots for a folio and
> >   binds them to the folio by adding the folio to the swap cache. The
> >   swap slots' swap count start with zero value.
> >
> > - folio_dup_swap() - Increase the swap count of one or more entries.
> >   Context: The folio must be locked and in the swap cache. For now, the
> >   caller still has to lock the new swap entry owner (e.g., PTL).
> >   This increases the ref count of swap entries allocated to a folio.
> >   Newly allocated swap slots' count has to be increased by this helper
> >   as the folio got unmapped (and swap entries got installed).
> >
> > - folio_put_swap() - Decrease the swap count of one or more entries.
> >   Context: The folio must be locked and in the swap cache. For now, the
> >   caller still has to lock the new swap entry owner (e.g., PTL).
> >   This decreases the ref count of swap entries allocated to a folio.
> >   Typically, swapin will decrease the swap count as the folio got
> >   installed back and the swap entry got uninstalled
> >
> >   This won't remove the folio from the swap cache and free the
> >   slot. Lazy freeing of swap cache is helpful for reducing IO.
> >   There is already a folio_free_swap() for immediate cache reclaim.
> >   This part could be further optimized later.
> >
> > The above locking constraints could be further relaxed when the swap
> > table if fully implemented. Currently dup still needs the caller
>         ~~ s/if/is/ typo

Ack, Thanks!

>
> > to lock the swap entry container (e.g. PTL), or a concurrent zap
> > may underflow the swap count.
> ......
>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-22  2:43     ` Kairui Song
@ 2026-01-07 16:05       ` Kairui Song
  0 siblings, 0 replies; 16+ messages in thread
From: Kairui Song @ 2026-01-07 16:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Baoquan He, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel)

On Mon, Dec 22, 2025 at 10:43 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Dec 20, 2025 at 12:02 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 12/20/25 at 03:43am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > The current swap entry allocation/freeing workflow has never had a clear
> > > definition. This makes it hard to debug or add new optimizations.
> > >
> > > This commit introduces a proper definition of how swap entries would be
> > > allocated and freed. Now, most operations are folio based, so they will
> > > never exceed one swap cluster, and we now have a cleaner border between
> > > swap and the rest of mm, making it much easier to follow and debug,
> > > especially with new added sanity checks. Also making more optimization
> > > possible.
> > >
> > > Swap entry will be mostly allocated and free with a folio bound.
> >                                           ~~~~
> >                                           freed, typo
>
> Ack, nice catch.
>
> > > The folio lock will be useful for resolving many swap ralated races.
> > >
> > > Now swap allocation (except hibernation) always starts with a folio in
> > > the swap cache, and gets duped/freed protected by the folio lock:
> > >
> > > - folio_alloc_swap() - The only allocation entry point now.
> > >   Context: The folio must be locked.
> > >   This allocates one or a set of continuous swap slots for a folio and
> > >   binds them to the folio by adding the folio to the swap cache. The
> > >   swap slots' swap count start with zero value.
> > >
> > > - folio_dup_swap() - Increase the swap count of one or more entries.
> > >   Context: The folio must be locked and in the swap cache. For now, the
> > >   caller still has to lock the new swap entry owner (e.g., PTL).
> > >   This increases the ref count of swap entries allocated to a folio.
> > >   Newly allocated swap slots' count has to be increased by this helper
> > >   as the folio got unmapped (and swap entries got installed).
> > >
> > > - folio_put_swap() - Decrease the swap count of one or more entries.
> > >   Context: The folio must be locked and in the swap cache. For now, the
> > >   caller still has to lock the new swap entry owner (e.g., PTL).
> > >   This decreases the ref count of swap entries allocated to a folio.
> > >   Typically, swapin will decrease the swap count as the folio got
> > >   installed back and the swap entry got uninstalled
> > >
> > >   This won't remove the folio from the swap cache and free the
> > >   slot. Lazy freeing of swap cache is helpful for reducing IO.
> > >   There is already a folio_free_swap() for immediate cache reclaim.
> > >   This part could be further optimized later.
> > >
> > > The above locking constraints could be further relaxed when the swap
> > > table if fully implemented. Currently dup still needs the caller
> >         ~~ s/if/is/ typo
>
> Ack, Thanks!

Hi Andrew,

There are no other problems reported with the series so far, for the
two typos here, could you help update the commit message?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
  2025-12-20  4:02   ` Baoquan He
@ 2026-01-14 12:16   ` Chris Mason
  2026-01-14 16:18     ` Kairui Song
  2026-01-14 13:28   ` Lai, Yi
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Chris Mason @ 2026-01-14 12:16 UTC (permalink / raw)
  To: Kairui Song
  Cc: Chris Mason, linux-mm, Andrew Morton, Baoquan He, Barry Song,
	Chris Li, Nhat Pham, Yosry Ahmed, David Hildenbrand,
	Johannes Weiner, Youngjun Park, Hugh Dickins, Baolin Wang,
	Ying Huang, Kemeng Shi, Lorenzo Stoakes, Matthew Wilcox (Oracle),
	linux-kernel, Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

On Sat, 20 Dec 2025 03:43:43 +0800 Kairui Song <ryncsn@gmail.com> wrote:

Hi everyone,

I ran mm-new through the AI review prompts, and this one was flagged:

> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 8050e51828351..036489d3b01a9 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c

[ ... ]

> @@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)
>
>  void free_all_swap_pages(int swap)
>  {
> +	unsigned long offset;
>  	struct rb_node *node;
>
>  	/*
> @@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)
>
>  		ext = rb_entry(node, struct swsusp_extent, node);
>  		rb_erase(node, &swsusp_extents);
> -		swap_free_nr(swp_entry(swap, ext->start),
> -			     ext->end - ext->start + 1);
> +
> +		for (offset = ext->start; offset < ext->end; offset++)
> +			swap_free_hibernation_slot(swp_entry(swap, offset));
                                           ^^^^^^^^^^^

Is this loop missing the final slot at ext->end?

The old code freed (ext->end - ext->start + 1) entries, which is the
inclusive range [ext->start, ext->end]. Looking at swsusp_extents_insert(),
when a single slot is inserted it sets ext->start = ext->end = offset, and
merging extends the range by incrementing ext->end or decrementing
ext->start. This means ext->end is inclusive.

The new loop uses "offset < ext->end" which iterates [ext->start, ext->end),
missing the slot at ext->end. Should this be "offset <= ext->end" instead?



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2026-01-14 12:16   ` Chris Mason
@ 2026-01-14 16:18     ` Kairui Song
  0 siblings, 0 replies; 16+ messages in thread
From: Kairui Song @ 2026-01-14 16:18 UTC (permalink / raw)
  To: Chris Mason
  Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
	Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner,
	Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel)

On Wed, Jan 14, 2026 at 8:17 PM Chris Mason <clm@meta.com> wrote:
>
> On Sat, 20 Dec 2025 03:43:43 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> Hi everyone,
>
> I ran mm-new through the AI review prompts, and this one was flagged:
>
> > diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> > index 8050e51828351..036489d3b01a9 100644
> > --- a/kernel/power/swap.c
> > +++ b/kernel/power/swap.c
>
> [ ... ]
>
> > @@ -186,6 +186,7 @@ sector_t alloc_swapdev_block(int swap)
> >
> >  void free_all_swap_pages(int swap)
> >  {
> > +     unsigned long offset;
> >       struct rb_node *node;
> >
> >       /*
> > @@ -197,8 +198,9 @@ void free_all_swap_pages(int swap)
> >
> >               ext = rb_entry(node, struct swsusp_extent, node);
> >               rb_erase(node, &swsusp_extents);
> > -             swap_free_nr(swp_entry(swap, ext->start),
> > -                          ext->end - ext->start + 1);
> > +
> > +             for (offset = ext->start; offset < ext->end; offset++)
> > +                     swap_free_hibernation_slot(swp_entry(swap, offset));
>                                            ^^^^^^^^^^^
>
> Is this loop missing the final slot at ext->end?
>
> The old code freed (ext->end - ext->start + 1) entries, which is the
> inclusive range [ext->start, ext->end]. Looking at swsusp_extents_insert(),
> when a single slot is inserted it sets ext->start = ext->end = offset, and
> merging extends the range by incrementing ext->end or decrementing
> ext->start. This means ext->end is inclusive.
>
> The new loop uses "offset < ext->end" which iterates [ext->start, ext->end),
> missing the slot at ext->end. Should this be "offset <= ext->end" instead?

Wow, nice catch. Indeed that would be one swap leak for each
hibernation snapshot release I think. I only tested normal
hibernations, didn't realize there is issue with the "release before
use" path. `offset <= ext->end` is the right one here.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
  2025-12-20  4:02   ` Baoquan He
  2026-01-14 12:16   ` Chris Mason
@ 2026-01-14 13:28   ` Lai, Yi
  2026-01-14 16:22     ` Kairui Song
  2026-01-14 16:53   ` Kairui Song
  2026-01-29 19:32   ` Chris Mason
  4 siblings, 1 reply; 16+ messages in thread
From: Lai, Yi @ 2026-01-14 13:28 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
	Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner,
	Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
	Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

Hi Kairui Song,

Greetings!

I used Syzkaller and found that there is possible deadlock in swap_free_hibernation_slot in linux-next next-20260113.

After bisection and the first bad commit is:
"
33be6f68989d mm. swap: cleanup swap entry management workflow
"

All detailed into can be found at:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot
Syzkaller repro code:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.c
Syzkaller repro syscall steps:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.prog
Syzkaller report:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.report
Kconfig(make olddefconfig):
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/kconfig_origin
Bisect info:
https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/bisect_info.log
bzImage:
https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/260114_102849_swap_free_hibernation_slot/bzImage_0f853ca2a798ead9d24d39cad99b0966815c582a
Issue dmesg:
https://github.com/laifryiee/syzkaller_logs/blob/main/260114_102849_swap_free_hibernation_slot/0f853ca2a798ead9d24d39cad99b0966815c582a_dmesg.log

"
[   62.477554] ============================================
[   62.477802] WARNING: possible recursive locking detected
[   62.478059] 6.19.0-rc5-next-20260113-0f853ca2a798 #1 Not tainted
[   62.478324] --------------------------------------------
[   62.478549] repro/668 is trying to acquire lock:
[   62.478759] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0x13e/0x2a0
[   62.479271]
[   62.479271] but task is already holding lock:
[   62.479519] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x2a0
[   62.479984]
[   62.479984] other info that might help us debug this:
[   62.480293]  Possible unsafe locking scenario:
[   62.480293]
[   62.480565]        CPU0
[   62.480686]        ----
[   62.480809]   lock(&cluster_info[i].lock);
[   62.481010]   lock(&cluster_info[i].lock);
[   62.481205]
[   62.481205]  *** DEADLOCK ***
[   62.481205]
[   62.481481]  May be due to missing lock nesting notation
[   62.481481]
[   62.481802] 2 locks held by repro/668:
[   62.481981]  #0: ffffffff87542e28 (system_transition_mutex){+.+.}-{4:4}, at: lock_system_sleep+0x92/0xb0
[   62.482439]  #1: ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x0
[   62.482936]
[   62.482936] stack backtrace:
[   62.483131] CPU: 0 UID: 0 PID: 668 Comm: repro Not tainted 6.19.0-rc5-next-20260113-0f853ca2a798 #1 PREEMPT(l
[   62.483143] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.q4
[   62.483151] Call Trace:
[   62.483156]  <TASK>
[   62.483160]  dump_stack_lvl+0xea/0x150
[   62.483195]  dump_stack+0x19/0x20
[   62.483206]  print_deadlock_bug+0x22e/0x300
[   62.483215]  __lock_acquire+0x1325/0x2210
[   62.483226]  lock_acquire+0x170/0x2f0
[   62.483234]  ? swap_free_hibernation_slot+0x13e/0x2a0
[   62.483249]  _raw_spin_lock+0x38/0x50
[   62.483267]  ? swap_free_hibernation_slot+0x13e/0x2a0
[   62.483279]  swap_free_hibernation_slot+0x13e/0x2a0
[   62.483291]  ? __pfx_swap_free_hibernation_slot+0x10/0x10
[   62.483303]  ? locks_remove_file+0xe2/0x7f0
[   62.483322]  ? __pfx_snapshot_release+0x10/0x10
[   62.483331]  free_all_swap_pages+0xdd/0x160
[   62.483339]  ? __pfx_snapshot_release+0x10/0x10
[   62.483346]  snapshot_release+0xac/0x200
[   62.483353]  __fput+0x41f/0xb70
[   62.483369]  ____fput+0x22/0x30
[   62.483376]  task_work_run+0x19e/0x2b0
[   62.483391]  ? __pfx_task_work_run+0x10/0x10
[   62.483398]  ? nsproxy_free+0x2da/0x5b0
[   62.483410]  ? switch_task_namespaces+0x118/0x130
[   62.483421]  do_exit+0x869/0x2810
[   62.483435]  ? do_group_exit+0x1d8/0x2c0
[   62.483445]  ? __pfx_do_exit+0x10/0x10
[   62.483451]  ? __this_cpu_preempt_check+0x21/0x30
[   62.483463]  ? _raw_spin_unlock_irq+0x2c/0x60
[   62.483474]  ? lockdep_hardirqs_on+0x85/0x110
[   62.483486]  ? _raw_spin_unlock_irq+0x2c/0x60
[   62.483498]  ? trace_hardirqs_on+0x26/0x130
[   62.483516]  do_group_exit+0xe4/0x2c0
[   62.483524]  __x64_sys_exit_group+0x4d/0x60
[   62.483531]  x64_sys_call+0x21a2/0x21b0
[   62.483544]  do_syscall_64+0x6d/0x1180
[   62.483560]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   62.483584] RIP: 0033:0x7fe84fb18a4d
[   62.483595] Code: Unable to access opcode bytes at 0x7fe84fb18a23.
[   62.483602] RSP: 002b:00007fff3e35c928 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[   62.483610] RAX: ffffffffffffffda RBX: 00007fe84fbf69e0 RCX: 00007fe84fb18a4d
[   62.483615] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000001
[   62.483620] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000020
[   62.483624] R10: 00007fff3e35c7d0 R11: 0000000000000246 R12: 00007fe84fbf69e0
[   62.483629] R13: 00007fe84fbfbf00 R14: 0000000000000001 R15: 00007fe84fbfbee8
[   62.483640]  </TASK>
"

Hope this cound be insightful to you.

Regards,
Yi Lai

---

If you don't need the following environment to reproduce the problem or if you
already have one reproduced environment, please ignore the following information.

How to reproduce:
git clone https://gitlab.com/xupengfe/repro_vm_env.git
cd repro_vm_env
tar -xvf repro_vm_env.tar.gz
cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
  // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
  // You could change the bzImage_xxx as you want
  // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
You could use below command to log in, there is no password for root.
ssh -p 10023 root@localhost

After login vm(virtual machine) successfully, you could transfer reproduced
binary to the vm by below way, and reproduce the problem in vm:
gcc -pthread -o repro repro.c
scp -P 10023 repro root@localhost:/root/

Get the bzImage for target kernel:
Please use target kconfig and copy it to kernel_src/.config
make olddefconfig
make -jx bzImage           //x should equal or less than cpu num your pc has

Fill the bzImage file into above start3.sh to load the target kernel in vm.


Tips:
If you already have qemu-system-x86_64, please ignore below info.
If you want to install qemu v7.1.0 version:
git clone https://github.com/qemu/qemu.git
cd qemu
git checkout -f v7.1.0
mkdir build
cd build
yum install -y ninja-build.x86_64
yum -y install libslirp-devel.x86_64
../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
make
make install 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2026-01-14 13:28   ` Lai, Yi
@ 2026-01-14 16:22     ` Kairui Song
  0 siblings, 0 replies; 16+ messages in thread
From: Kairui Song @ 2026-01-14 16:22 UTC (permalink / raw)
  To: Lai, Yi
  Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
	Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner,
	Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel)

On Wed, Jan 14, 2026 at 9:28 PM Lai, Yi <yi1.lai@linux.intel.com> wrote:
>
> Hi Kairui Song,
>
> Greetings!
>
> I used Syzkaller and found that there is possible deadlock in swap_free_hibernation_slot in linux-next next-20260113.
>
> After bisection and the first bad commit is:
> "
> 33be6f68989d mm. swap: cleanup swap entry management workflow
> "
>
> All detailed into can be found at:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot
> Syzkaller repro code:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.c
> Syzkaller repro syscall steps:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.prog
> Syzkaller report:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/repro.report
> Kconfig(make olddefconfig):
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/kconfig_origin
> Bisect info:
> https://github.com/laifryiee/syzkaller_logs/tree/main/260114_102849_swap_free_hibernation_slot/bisect_info.log
> bzImage:
> https://github.com/laifryiee/syzkaller_logs/raw/refs/heads/main/260114_102849_swap_free_hibernation_slot/bzImage_0f853ca2a798ead9d24d39cad99b0966815c582a
> Issue dmesg:
> https://github.com/laifryiee/syzkaller_logs/blob/main/260114_102849_swap_free_hibernation_slot/0f853ca2a798ead9d24d39cad99b0966815c582a_dmesg.log
>
> "
> [   62.477554] ============================================
> [   62.477802] WARNING: possible recursive locking detected
> [   62.478059] 6.19.0-rc5-next-20260113-0f853ca2a798 #1 Not tainted
> [   62.478324] --------------------------------------------
> [   62.478549] repro/668 is trying to acquire lock:
> [   62.478759] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0x13e/0x2a0
> [   62.479271]
> [   62.479271] but task is already holding lock:
> [   62.479519] ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x2a0
> [   62.479984]
> [   62.479984] other info that might help us debug this:
> [   62.480293]  Possible unsafe locking scenario:
> [   62.480293]
> [   62.480565]        CPU0
> [   62.480686]        ----
> [   62.480809]   lock(&cluster_info[i].lock);
> [   62.481010]   lock(&cluster_info[i].lock);
> [   62.481205]
> [   62.481205]  *** DEADLOCK ***
> [   62.481205]
> [   62.481481]  May be due to missing lock nesting notation
> [   62.481481]
> [   62.481802] 2 locks held by repro/668:
> [   62.481981]  #0: ffffffff87542e28 (system_transition_mutex){+.+.}-{4:4}, at: lock_system_sleep+0x92/0xb0
> [   62.482439]  #1: ffff888011664018 (&cluster_info[i].lock){+.+.}-{3:3}, at: swap_free_hibernation_slot+0xfa/0x0
> [   62.482936]
> [   62.482936] stack backtrace:
> [   62.483131] CPU: 0 UID: 0 PID: 668 Comm: repro Not tainted 6.19.0-rc5-next-20260113-0f853ca2a798 #1 PREEMPT(l
> [   62.483143] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.q4
> [   62.483151] Call Trace:
> [   62.483156]  <TASK>
> [   62.483160]  dump_stack_lvl+0xea/0x150
> [   62.483195]  dump_stack+0x19/0x20
> [   62.483206]  print_deadlock_bug+0x22e/0x300
> [   62.483215]  __lock_acquire+0x1325/0x2210
> [   62.483226]  lock_acquire+0x170/0x2f0
> [   62.483234]  ? swap_free_hibernation_slot+0x13e/0x2a0
> [   62.483249]  _raw_spin_lock+0x38/0x50
> [   62.483267]  ? swap_free_hibernation_slot+0x13e/0x2a0
> [   62.483279]  swap_free_hibernation_slot+0x13e/0x2a0
> [   62.483291]  ? __pfx_swap_free_hibernation_slot+0x10/0x10
> [   62.483303]  ? locks_remove_file+0xe2/0x7f0
> [   62.483322]  ? __pfx_snapshot_release+0x10/0x10
> [   62.483331]  free_all_swap_pages+0xdd/0x160
> [   62.483339]  ? __pfx_snapshot_release+0x10/0x10
> [   62.483346]  snapshot_release+0xac/0x200
> [   62.483353]  __fput+0x41f/0xb70
> [   62.483369]  ____fput+0x22/0x30
> [   62.483376]  task_work_run+0x19e/0x2b0
> [   62.483391]  ? __pfx_task_work_run+0x10/0x10
> [   62.483398]  ? nsproxy_free+0x2da/0x5b0
> [   62.483410]  ? switch_task_namespaces+0x118/0x130
> [   62.483421]  do_exit+0x869/0x2810
> [   62.483435]  ? do_group_exit+0x1d8/0x2c0
> [   62.483445]  ? __pfx_do_exit+0x10/0x10
> [   62.483451]  ? __this_cpu_preempt_check+0x21/0x30
> [   62.483463]  ? _raw_spin_unlock_irq+0x2c/0x60
> [   62.483474]  ? lockdep_hardirqs_on+0x85/0x110
> [   62.483486]  ? _raw_spin_unlock_irq+0x2c/0x60
> [   62.483498]  ? trace_hardirqs_on+0x26/0x130
> [   62.483516]  do_group_exit+0xe4/0x2c0
> [   62.483524]  __x64_sys_exit_group+0x4d/0x60
> [   62.483531]  x64_sys_call+0x21a2/0x21b0
> [   62.483544]  do_syscall_64+0x6d/0x1180
> [   62.483560]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [   62.483584] RIP: 0033:0x7fe84fb18a4d
> [   62.483595] Code: Unable to access opcode bytes at 0x7fe84fb18a23.
> [   62.483602] RSP: 002b:00007fff3e35c928 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
> [   62.483610] RAX: ffffffffffffffda RBX: 00007fe84fbf69e0 RCX: 00007fe84fb18a4d
> [   62.483615] RDX: 00000000000000e7 RSI: ffffffffffffff80 RDI: 0000000000000001
> [   62.483620] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000020
> [   62.483624] R10: 00007fff3e35c7d0 R11: 0000000000000246 R12: 00007fe84fbf69e0
> [   62.483629] R13: 00007fe84fbfbf00 R14: 0000000000000001 R15: 00007fe84fbfbee8
> [   62.483640]  </TASK>
> "
>
> Hope this cound be insightful to you.
>
> Regards,
> Yi Lai
>
> ---
>
> If you don't need the following environment to reproduce the problem or if you
> already have one reproduced environment, please ignore the following information.
>
> How to reproduce:
> git clone https://gitlab.com/xupengfe/repro_vm_env.git
> cd repro_vm_env
> tar -xvf repro_vm_env.tar.gz
> cd repro_vm_env; ./start3.sh  // it needs qemu-system-x86_64 and I used v7.1.0
>   // start3.sh will load bzImage_2241ab53cbb5cdb08a6b2d4688feb13971058f65 v6.2-rc5 kernel
>   // You could change the bzImage_xxx as you want
>   // Maybe you need to remove line "-drive if=pflash,format=raw,readonly=on,file=./OVMF_CODE.fd \" for different qemu version
> You could use below command to log in, there is no password for root.
> ssh -p 10023 root@localhost
>
> After login vm(virtual machine) successfully, you could transfer reproduced
> binary to the vm by below way, and reproduce the problem in vm:
> gcc -pthread -o repro repro.c
> scp -P 10023 repro root@localhost:/root/
>
> Get the bzImage for target kernel:
> Please use target kconfig and copy it to kernel_src/.config
> make olddefconfig
> make -jx bzImage           //x should equal or less than cpu num your pc has
>
> Fill the bzImage file into above start3.sh to load the target kernel in vm.
>
>
> Tips:
> If you already have qemu-system-x86_64, please ignore below info.
> If you want to install qemu v7.1.0 version:
> git clone https://github.com/qemu/qemu.git
> cd qemu
> git checkout -f v7.1.0
> mkdir build
> cd build
> yum install -y ninja-build.x86_64
> yum -y install libslirp-devel.x86_64
> ../configure --target-list=x86_64-softmmu --enable-kvm --enable-vnc --enable-gtk --enable-sdl --enable-usb-redir --enable-slirp
> make
> make install
>

Thanks Lai!

The issue is with the WARN_ON I added... Didn't notice that the
WARN_ON requires ci lock, so we better just remove that.

Following change should fix the issue you reported:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 85bf4f7d9ae7..8c0f31363c1f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2096,7 +2096,6 @@ void swap_free_hibernation_slot(swp_entry_t entry)

        ci = swap_cluster_lock(si, offset);
        swap_entry_put_locked(si, ci, entry, 1);
-       WARN_ON(swap_entry_swapped(si, entry));
        swap_cluster_unlock(ci);

        /* In theory readahead might add it to the swap cache by accident */

---

swap_entry_swapped requires CI lock. There wasn't any WARN_ON before,
it was added to just ensure things worked as expected, really not
needed.

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
                     ` (2 preceding siblings ...)
  2026-01-14 13:28   ` Lai, Yi
@ 2026-01-14 16:53   ` Kairui Song
  2026-01-14 22:29     ` Andrew Morton
  2026-01-29 19:32   ` Chris Mason
  4 siblings, 1 reply; 16+ messages in thread
From: Kairui Song @ 2026-01-14 16:53 UTC (permalink / raw)
  To: Andrew Morton, linux-mm
  Cc: Baoquan He, Barry Song, Chris Li, Nhat Pham, Yosry Ahmed,
	David Hildenbrand, Johannes Weiner, Youngjun Park, Hugh Dickins,
	Baolin Wang, Ying Huang, Kemeng Shi, Lorenzo Stoakes,
	Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel)

[-- Attachment #1: Type: text/plain, Size: 1165 bytes --]

On Sat, Dec 20, 2025 at 3:45 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
>
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.

...

>
> Cc: linux-pm@vger.kernel.org
> Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---

Hi Andrew,

Is it convenient for you to squash this attached fix into this patch?
That's the two issues from Chris Mason and Lai Yi combined in a clean
to apply format, only 3 lines change.

There might be minor conflict by removing the WARN_ON in two following
patches, but should be easy to resolve. I can send a v6 if that's
troublesome.

[-- Attachment #2: 0001-mm-swap-fix-locking-and-leaking-with-hibernation-sna.patch --]
[-- Type: application/x-patch, Size: 1474 bytes --]

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2026-01-14 16:53   ` Kairui Song
@ 2026-01-14 22:29     ` Andrew Morton
  2026-01-16 10:57       ` Chris Li
  0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2026-01-14 22:29 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Baoquan He, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel), Chris Mason, Yi Lai

On Thu, 15 Jan 2026 00:53:41 +0800 Kairui Song <ryncsn@gmail.com> wrote:

> Is it convenient for you to squash this attached fix into this patch?

Done, below

> That's the two issues from Chris Mason and Lai Yi combined in a clean
> to apply format, only 3 lines change.

Let's cc them!

> There might be minor conflict by removing the WARN_ON in two following
> patches, but should be easy to resolve. I can send a v6 if that's
> troublesome.

All fixed up, thanks.


From: Kairui Song <kasong@tencent.com>
Subject: mm, swap: fix locking and leaking with hibernation snapshot releasing
Date: Thu, 15 Jan 2026 00:15:27 +0800

fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi

Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Lai Yi <yi1.lai@linux.intel.com>
Cc: Chris Mason <clm@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/power/swap.c |    2 +-
 mm/swapfile.c       |    1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

--- a/kernel/power/swap.c~mm-swap-cleanup-swap-entry-management-workflow-fix
+++ a/kernel/power/swap.c
@@ -199,7 +199,7 @@ void free_all_swap_pages(int swap)
 		ext = rb_entry(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
 
-		for (offset = ext->start; offset < ext->end; offset++)
+		for (offset = ext->start; offset <= ext->end; offset++)
 			swap_free_hibernation_slot(swp_entry(swap, offset));
 
 		kfree(ext);
--- a/mm/swapfile.c~mm-swap-cleanup-swap-entry-management-workflow-fix
+++ a/mm/swapfile.c
@@ -2096,7 +2096,6 @@ void swap_free_hibernation_slot(swp_entr
 
 	ci = swap_cluster_lock(si, offset);
 	swap_entry_put_locked(si, ci, entry, 1);
-	WARN_ON(swap_entry_swapped(si, entry));
 	swap_cluster_unlock(ci);
 
 	/* In theory readahead might add it to the swap cache by accident */
_


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2026-01-14 22:29     ` Andrew Morton
@ 2026-01-16 10:57       ` Chris Li
  0 siblings, 0 replies; 16+ messages in thread
From: Chris Li @ 2026-01-16 10:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kairui Song, linux-mm, Baoquan He, Barry Song, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel), Chris Mason, Yi Lai

On Wed, Jan 14, 2026 at 2:29 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 15 Jan 2026 00:53:41 +0800 Kairui Song <ryncsn@gmail.com> wrote:
>
> > Is it convenient for you to squash this attached fix into this patch?
>
> Done, below
>
> > That's the two issues from Chris Mason and Lai Yi combined in a clean
> > to apply format, only 3 lines change.
>
> Let's cc them!
>
> > There might be minor conflict by removing the WARN_ON in two following
> > patches, but should be easy to resolve. I can send a v6 if that's
> > troublesome.
>
> All fixed up, thanks.
>
>
> From: Kairui Song <kasong@tencent.com>
> Subject: mm, swap: fix locking and leaking with hibernation snapshot releasing
> Date: Thu, 15 Jan 2026 00:15:27 +0800
>
> fix leak, per Chris Mason.  Remove WARN_ON, per Lai Yi

That is a great catch. Thanks.

>
> Link: https://lkml.kernel.org/r/CAMgjq7AUz10uETVm8ozDWcB3XohkOqf0i33KGrAquvEVvfp5cg@mail.gmail.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Cc: Lai Yi <yi1.lai@linux.intel.com>
> Cc: Chris Mason <clm@fb.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

That small fix looks good to me.

Chris

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
                     ` (3 preceding siblings ...)
  2026-01-14 16:53   ` Kairui Song
@ 2026-01-29 19:32   ` Chris Mason
  2026-01-30 16:48     ` Kairui Song
  4 siblings, 1 reply; 16+ messages in thread
From: Chris Mason @ 2026-01-29 19:32 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
	Nhat Pham, Yosry Ahmed, David Hildenbrand, Johannes Weiner,
	Youngjun Park, Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
	Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

Kairui Song <ryncsn@gmail.com> wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
> 
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.
> 
> Swap entry will be mostly allocated and free with a folio bound.
> The folio lock will be useful for resolving many swap ralated races.
> 
> Now swap allocation (except hibernation) always starts with a folio in
> the swap cache, and gets duped/freed protected by the folio lock:

[ ... ]

Hi everyone,

Apologies for looping back around on this one, but I've got some prompt
changes that break down larger commits into sections and run them
independently through review.  I reran it on mm-new, and it found something
new:

> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 64970ee11fcf7..85bf4f7d9ae7b 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c

[ ... ]

> @@ -2195,7 +2265,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  	/*
>  	 * Some architectures may have to restore extra metadata to the page
>  	 * when reading from swap. This metadata may be indexed by swap entry
> -	 * so this must be called before swap_free().
> +	 * so this must be called before folio_put_swap().
>  	 */
>  	arch_swap_restore(folio_swap(entry, folio), folio);
>
> @@ -2236,7 +2306,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
>  		new_pte = pte_mkuffd_wp(new_pte);
>  setpte:
>  	set_pte_at(vma->vm_mm, addr, pte, new_pte);
> -	swap_free(entry);
> +	folio_put_swap(folio, page);

Does this corrupt the swap map when ksm_might_need_to_copy() returns a
new folio? In that case, folio != swapcache, and the new folio is NOT in
the swap cache with folio->swap = 0. This would trigger
VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio) and call
swap_entries_put_map() with entry.val = 0.

Compare with do_swap_page() which correctly uses folio_put_swap(swapcache,
NULL) when folio != swapcache. Should this use the original entry parameter
or the swapcache folio instead?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow
  2026-01-29 19:32   ` Chris Mason
@ 2026-01-30 16:48     ` Kairui Song
  0 siblings, 0 replies; 16+ messages in thread
From: Kairui Song @ 2026-01-30 16:48 UTC (permalink / raw)
  To: Chris Mason, Andrew Morton
  Cc: linux-mm, Baoquan He, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
	Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

On Thu, Jan 29, 2026 at 11:32:38AM +0800, Chris Mason wrote:
> Kairui Song <ryncsn@gmail.com> wrote:
> > From: Kairui Song <kasong@tencent.com>
> > 
> > The current swap entry allocation/freeing workflow has never had a clear
> > definition. This makes it hard to debug or add new optimizations.
> > 
> > This commit introduces a proper definition of how swap entries would be
> > allocated and freed. Now, most operations are folio based, so they will
> > never exceed one swap cluster, and we now have a cleaner border between
> > swap and the rest of mm, making it much easier to follow and debug,
> > especially with new added sanity checks. Also making more optimization
> > possible.
> > 
> > Swap entry will be mostly allocated and free with a folio bound.
> > The folio lock will be useful for resolving many swap ralated races.
> > 
> > Now swap allocation (except hibernation) always starts with a folio in
> > the swap cache, and gets duped/freed protected by the folio lock:
> 
> [ ... ]
> 
> Hi everyone,
> 
> Apologies for looping back around on this one, but I've got some prompt
> changes that break down larger commits into sections and run them
> independently through review.  I reran it on mm-new, and it found something
> new:
> 
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 64970ee11fcf7..85bf4f7d9ae7b 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> 
> [ ... ]
> 
> > @@ -2195,7 +2265,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> >  	/*
> >  	 * Some architectures may have to restore extra metadata to the page
> >  	 * when reading from swap. This metadata may be indexed by swap entry
> > -	 * so this must be called before swap_free().
> > +	 * so this must be called before folio_put_swap().
> >  	 */
> >  	arch_swap_restore(folio_swap(entry, folio), folio);
> >
> > @@ -2236,7 +2306,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
> >  		new_pte = pte_mkuffd_wp(new_pte);
> >  setpte:
> >  	set_pte_at(vma->vm_mm, addr, pte, new_pte);
> > -	swap_free(entry);
> > +	folio_put_swap(folio, page);
> 
> Does this corrupt the swap map when ksm_might_need_to_copy() returns a
> new folio? In that case, folio != swapcache, and the new folio is NOT in
> the swap cache with folio->swap = 0. This would trigger
> VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio) and call
> swap_entries_put_map() with entry.val = 0.
> 
> Compare with do_swap_page() which correctly uses folio_put_swap(swapcache,
> NULL) when folio != swapcache. Should this use the original entry parameter
> or the swapcache folio instead?

Thanks again for running the AI review. And it's really helpful.

This is a valid case, I missed the KSM copy pages for swapoff indeed.

We do need the following change squashed as you suggested.

Hi Andrew, can you help squash add following fix? I just ran more
stress tests with KSM and racing swapoff, and everything is looking
good now.

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8c0f31363c1f..d652486898de 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -2305,7 +2305,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 		new_pte = pte_mkuffd_wp(new_pte);
 setpte:
 	set_pte_at(vma->vm_mm, addr, pte, new_pte);
-	folio_put_swap(folio, page);
+	folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry)));
 out:
 	if (pte)
 		pte_unmap_unlock(pte, ptl);

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags
  2025-12-19 19:43 [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
@ 2025-12-19 20:05 ` Kairui Song
  2025-12-20 12:34 ` Baoquan He
  2 siblings, 0 replies; 16+ messages in thread
From: Kairui Song @ 2025-12-19 20:05 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel, linux-pm,
	Rafael J. Wysocki (Intel)

On Sat, Dec 20, 2025 at 3:44 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code and
> special swap flag bits including SWAP_HAS_CACHE, along with many historical
> issues. The performance is about ~20% better for some workloads, like
> Redis with persistence. This also cleans up the code to prepare for
> later phases, some patches are from a previously posted series.
>
> Swap cache bypassing and swap synchronization in general had many
> issues. Some are solved as workarounds, and some are still there [1]. To
> resolve them in a clean way, one good solution is to always use swap
> cache as the synchronization layer [2]. So we have to remove the swap
> cache bypass swap-in path first. It wasn't very doable due to
> performance issues, but now combined with the swap table, removing
> the swap cache bypass path will instead improve the performance,
> there is no reason to keep it.
>
> Now we can rework the swap entry and cache synchronization following
> the new design. Swap cache synchronization was heavily relying on
> SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> of special swap map bits and related workarounds, we get a cleaner code
> base and prepare for merging the swap count into the swap table in the
> next step.
>
> And swap_map is now only used for swap count, so in the next phase,
> swap_map can be merged into the swap table, which will clean up more
> things and start to reduce the static memory usage. Removal of
> swap_cgroup_ctrl is also doable, but needs to be done after we also
> simplify the allocation of swapin folios: always use the new
> swap_cache_alloc_folio helper so the accounting will also be managed by
> the swap layer by then.
>
> Test results:
>
> Redis / Valkey bench:
> =====================
>
> Testing on a ARM64 VM 1.5G memory:
> Server: valkey-server --maxmemory 2560M
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
>         no persistence              with BGSAVE
> Before: 460475.84 RPS               311591.19 RPS
> After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)
>
> Testing on a x86_64 VM with 4G memory (system components takes about 2G):
> Server:
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
>         no persistence              with BGSAVE
> Before: 306044.38 RPS               102745.88 RPS
> After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)
>
> The performance is a lot better when persistence is applied. This should
> apply to many other workloads that involve sharing memory and COW. A
> slight performance drop was observed for the ARM64 Redis test: We are
> still using swap_map to track the swap count, which is causing redundant
> cache and CPU overhead and is not very performance-friendly for some
> arches. This will be improved once we merge the swap map into the swap
> table (as already demonstrated previously [3]).
>
> vm-scabiity
> ===========
> usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> simulated PMEM as swap), average result of 6 test run:
>
>                            Before:         After:
> System time:               282.22s         283.47s
> Sum Throughput:            5677.35 MB/s    5688.78 MB/s
> Single process Throughput: 176.41 MB/s     176.23 MB/s
> Free latency:              518477.96 us    521488.06 us
>
> Which is almost identical.
>
> Build kernel test:
> ==================
> Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
>
>                 Before            After:
> System time:    1379.91s          1364.22s (-0.11%)
>
> Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
>
>                 Before            After:
> System time:    1822.52s          1803.33s (-0.11%)
>
> Which is almost identical.
>
> MySQL:
> ======
> sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> 512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).
>
> Before: 318162.18 qps
> After:  318512.01 qps (+0.01%)
>
> In conclusion, the result is looking better or identical for most cases,
> and it's especially better for workloads with swap count > 1 on SYNC_IO
> devices, about ~20% gain in above test. Next phases will start to merge
> swap count into swap table and reduce memory usage.
>
> One more gain here is that we now have better support for THP swapin.
> Previously, the THP swapin was bound with swap cache bypassing, which
> only works for single-mapped folios. Removing the bypassing path also
> enabled THP swapin for all folios. The THP swapin is still limited to
> SYNC_IO devices, the limitation can be removed later.
>
> This may cause more serious THP thrashing for certain workloads, but that's
> not an issue caused by this series, it's a common THP issue we should resolve
> separately.
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> Changes in v5:
> Rebased on top of current mm-unstalbe, also appliable on mm-new.
> - Solve trivial conlicts with 6.19 rc1 for easier reviewing.
> - Don't change the argument for swap_entry_swapped [ Baoquan He ].
> - Update commit message and comment [ Baoquan He ].
> - Add a WARN in swap_dup_entries to catch potential swap count
>   overflow. No error was ever observed for this but the check existed
>   before, so just keep it to be very careful.
> - Link to v4: https://lore.kernel.org/r/20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com
>
> Changes in v4:
> - Rebase on latest mm-unstable, should be also mergeable with mm-new.
> - Update the shmem update commit message as suggested by, and reviewed
>   by [ Baolin Wang ].
> - Add a WARN_ON to catch more potential issue and update a few comments.
> - Link to v3: https://lore.kernel.org/r/20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com
>
> Changes in v3:
> - Imporve and update comments [ Barry Song, YoungJun Park, Chris Li ]
> - Simplify the changes of cluster_reclaim_range a bit, as YoungJun points
>   out the change looked confusing.
> - Fix a few typos I found during self review.
> - Fix a few build error and warns.
> - Link to v2: https://lore.kernel.org/r/20251117-swap-table-p2-v2-0-37730e6ea6d5@tencent.com
>
> Changes in v2:
> - Rebased on latest mm-new to resolve conflicts, also appliable to
>   mm-unstable.
> - Imporve comment, and commit messages in multiple commits, many thanks to
>   [Barry Song, YoungJun Park, Yosry Ahmed ]
> - Fix cluster usable check in allocator [ YoungJun Park]
> - Improve cover letter [ Chris Li ]
> - Collect Reviewed-by [ Yosry Ahmed ]
> - Fix a few build warning and issues from build bot.
> - Link to v1: https://lore.kernel.org/r/20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com
>
> ---
> Kairui Song (18):
>       mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
>       mm, swap: split swap cache preparation loop into a standalone helper
>       mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
>       mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
>       mm, swap: simplify the code and reduce indention
>       mm, swap: free the swap cache after folio is mapped
>       mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO

Gmail blocked my Patch 7 so I have to resend it manually, it still
appears on lore thread just fine but the order seems a bit odd. Hope
this won't cause trouble for everyone.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags
  2025-12-19 19:43 [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
  2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
  2025-12-19 20:05 ` [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
@ 2025-12-20 12:34 ` Baoquan He
  2 siblings, 0 replies; 16+ messages in thread
From: Baoquan He @ 2025-12-20 12:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Barry Song, Chris Li, Nhat Pham,
	Yosry Ahmed, David Hildenbrand, Johannes Weiner, Youngjun Park,
	Hugh Dickins, Baolin Wang, Ying Huang, Kemeng Shi,
	Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
	Kairui Song, linux-pm, Rafael J. Wysocki (Intel)

On 12/20/25 at 03:43am, Kairui Song wrote:
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass swapin code and
> special swap flag bits including SWAP_HAS_CACHE, along with many historical
> issues. The performance is about ~20% better for some workloads, like
> Redis with persistence. This also cleans up the code to prepare for
> later phases, some patches are from a previously posted series.

Thanks for the great effort on the swap table phase II redesign, optimization
and improvement. I am done with the whole patchset reviewing, with my
limited knowledge, I didn't see some major issues, just rased several
minor concerns. All in all, the whole patchset looks good to me.

It's not easy to check patch by patch in this big patch series, especially
some patches are involving a lot of changes, and some change could be related
to later patch. I think it's worth being put in next or mergd for more testing.
Looking forward to seeing the phase III patchset.

FWIW, for the whole series,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> Swap cache bypassing and swap synchronization in general had many
> issues. Some are solved as workarounds, and some are still there [1]. To
> resolve them in a clean way, one good solution is to always use swap
> cache as the synchronization layer [2]. So we have to remove the swap
> cache bypass swap-in path first. It wasn't very doable due to
> performance issues, but now combined with the swap table, removing
> the swap cache bypass path will instead improve the performance,
> there is no reason to keep it.
> 
> Now we can rework the swap entry and cache synchronization following
> the new design. Swap cache synchronization was heavily relying on
> SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> of special swap map bits and related workarounds, we get a cleaner code
> base and prepare for merging the swap count into the swap table in the
> next step.
> 
> And swap_map is now only used for swap count, so in the next phase,
> swap_map can be merged into the swap table, which will clean up more
> things and start to reduce the static memory usage. Removal of
> swap_cgroup_ctrl is also doable, but needs to be done after we also
> simplify the allocation of swapin folios: always use the new
> swap_cache_alloc_folio helper so the accounting will also be managed by
> the swap layer by then.
> 
> Test results:
> 
> Redis / Valkey bench:
> =====================
> 
> Testing on a ARM64 VM 1.5G memory:
> Server: valkey-server --maxmemory 2560M
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> 
>         no persistence              with BGSAVE
> Before: 460475.84 RPS               311591.19 RPS
> After:  451943.34 RPS (-1.9%)       371379.06 RPS (+19.2%)
> 
> Testing on a x86_64 VM with 4G memory (system components takes about 2G):
> Server:
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> 
>         no persistence              with BGSAVE
> Before: 306044.38 RPS               102745.88 RPS
> After:  309645.44 RPS (+1.2%)       125313.28 RPS (+22.0%)
> 
> The performance is a lot better when persistence is applied. This should
> apply to many other workloads that involve sharing memory and COW. A
> slight performance drop was observed for the ARM64 Redis test: We are
> still using swap_map to track the swap count, which is causing redundant
> cache and CPU overhead and is not very performance-friendly for some
> arches. This will be improved once we merge the swap map into the swap
> table (as already demonstrated previously [3]).
> 
> vm-scabiity
> ===========
> usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> simulated PMEM as swap), average result of 6 test run:
> 
>                            Before:         After:
> System time:               282.22s         283.47s
> Sum Throughput:            5677.35 MB/s    5688.78 MB/s
> Single process Throughput: 176.41 MB/s     176.23 MB/s
> Free latency:              518477.96 us    521488.06 us
> 
> Which is almost identical.
> 
> Build kernel test:
> ==================
> Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
> 
>                 Before            After:
> System time:    1379.91s          1364.22s (-0.11%)
> 
> Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
> 
>                 Before            After:
> System time:    1822.52s          1803.33s (-0.11%)
> 
> Which is almost identical.
> 
> MySQL:
> ======
> sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> 512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).
> 
> Before: 318162.18 qps
> After:  318512.01 qps (+0.01%)
> 
> In conclusion, the result is looking better or identical for most cases,
> and it's especially better for workloads with swap count > 1 on SYNC_IO
> devices, about ~20% gain in above test. Next phases will start to merge
> swap count into swap table and reduce memory usage.
> 
> One more gain here is that we now have better support for THP swapin.
> Previously, the THP swapin was bound with swap cache bypassing, which
> only works for single-mapped folios. Removing the bypassing path also
> enabled THP swapin for all folios. The THP swapin is still limited to
> SYNC_IO devices, the limitation can be removed later.
> 
> This may cause more serious THP thrashing for certain workloads, but that's
> not an issue caused by this series, it's a common THP issue we should resolve
> separately.
> 
> Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
> 
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> Changes in v5:
> Rebased on top of current mm-unstalbe, also appliable on mm-new.
> - Solve trivial conlicts with 6.19 rc1 for easier reviewing.
> - Don't change the argument for swap_entry_swapped [ Baoquan He ].
> - Update commit message and comment [ Baoquan He ].
> - Add a WARN in swap_dup_entries to catch potential swap count
>   overflow. No error was ever observed for this but the check existed
>   before, so just keep it to be very careful.
> - Link to v4: https://lore.kernel.org/r/20251205-swap-table-p2-v4-0-cb7e28a26a40@tencent.com
> 
> Changes in v4:
> - Rebase on latest mm-unstable, should be also mergeable with mm-new.
> - Update the shmem update commit message as suggested by, and reviewed
>   by [ Baolin Wang ].
> - Add a WARN_ON to catch more potential issue and update a few comments.
> - Link to v3: https://lore.kernel.org/r/20251125-swap-table-p2-v3-0-33f54f707a5c@tencent.com
> 
> Changes in v3:
> - Imporve and update comments [ Barry Song, YoungJun Park, Chris Li ]
> - Simplify the changes of cluster_reclaim_range a bit, as YoungJun points
>   out the change looked confusing.
> - Fix a few typos I found during self review.
> - Fix a few build error and warns.
> - Link to v2: https://lore.kernel.org/r/20251117-swap-table-p2-v2-0-37730e6ea6d5@tencent.com
> 
> Changes in v2:
> - Rebased on latest mm-new to resolve conflicts, also appliable to
>   mm-unstable.
> - Imporve comment, and commit messages in multiple commits, many thanks to
>   [Barry Song, YoungJun Park, Yosry Ahmed ]
> - Fix cluster usable check in allocator [ YoungJun Park]
> - Improve cover letter [ Chris Li ]
> - Collect Reviewed-by [ Yosry Ahmed ]
> - Fix a few build warning and issues from build bot.
> - Link to v1: https://lore.kernel.org/r/20251029-swap-table-p2-v1-0-3d43f3b6ec32@tencent.com
> 
> ---
> Kairui Song (18):
>       mm, swap: rename __read_swap_cache_async to swap_cache_alloc_folio
>       mm, swap: split swap cache preparation loop into a standalone helper
>       mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
>       mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
>       mm, swap: simplify the code and reduce indention
>       mm, swap: free the swap cache after folio is mapped
>       mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
>       mm, swap: swap entry of a bad slot should not be considered as swapped out
>       mm, swap: consolidate cluster reclaim and usability check
>       mm, swap: split locked entry duplicating into a standalone helper
>       mm, swap: use swap cache as the swap in synchronize layer
>       mm, swap: remove workaround for unsynchronized swap map cache state
>       mm, swap: cleanup swap entry management workflow
>       mm, swap: add folio to swap cache directly on allocation
>       mm, swap: check swap table directly for checking cache
>       mm, swap: clean up and improve swap entries freeing
>       mm, swap: drop the SWAP_HAS_CACHE flag
>       mm, swap: remove no longer needed _swap_info_get
> 
> Nhat Pham (1):
>       mm/shmem, swap: remove SWAP_MAP_SHMEM
> 
>  arch/s390/mm/gmap_helpers.c |   2 +-
>  arch/s390/mm/pgtable.c      |   2 +-
>  include/linux/swap.h        |  71 ++--
>  kernel/power/swap.c         |  10 +-
>  mm/madvise.c                |   2 +-
>  mm/memory.c                 | 276 +++++++-------
>  mm/rmap.c                   |   7 +-
>  mm/shmem.c                  |  75 ++--
>  mm/swap.h                   |  70 +++-
>  mm/swap_state.c             | 338 +++++++++++------
>  mm/swapfile.c               | 861 ++++++++++++++++++++------------------------
>  mm/userfaultfd.c            |  10 +-
>  mm/vmscan.c                 |   1 -
>  mm/zswap.c                  |   4 +-
>  14 files changed, 858 insertions(+), 871 deletions(-)
> ---
> base-commit: dc9f44261a74a4db5fe8ed570fc8b3edc53a28a2
> change-id: 20251007-swap-table-p2-7d3086e5c38a
> 
> Best regards,
> -- 
> Kairui Song <kasong@tencent.com>
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2026-01-30 16:48 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-12-19 19:43 [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
2025-12-19 19:43 ` [PATCH v5 14/19] mm, swap: cleanup swap entry management workflow Kairui Song
2025-12-20  4:02   ` Baoquan He
2025-12-22  2:43     ` Kairui Song
2026-01-07 16:05       ` Kairui Song
2026-01-14 12:16   ` Chris Mason
2026-01-14 16:18     ` Kairui Song
2026-01-14 13:28   ` Lai, Yi
2026-01-14 16:22     ` Kairui Song
2026-01-14 16:53   ` Kairui Song
2026-01-14 22:29     ` Andrew Morton
2026-01-16 10:57       ` Chris Li
2026-01-29 19:32   ` Chris Mason
2026-01-30 16:48     ` Kairui Song
2025-12-19 20:05 ` [PATCH v5 00/19] mm, swap: swap table phase II: unify swapin use swap cache and cleanup flags Kairui Song
2025-12-20 12:34 ` Baoquan He

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox