* [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I)
@ 2025-08-22 19:20 Kairui Song
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
` (10 more replies)
0 siblings, 11 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
This is the first phase of the bigger series implementing the basic
infrastructure for the Swap Table idea proposed at the LSF/MM/BPF
topic "Integrate swap cache, swap maps with swap allocator" [1].
Phase I contains 9 patches; it introduces the swap table infrastructure
and uses it as the swap cache backend. With this, we see an up to
~5-20% performance gain in throughput, RPS or build time in benchmark
and workload tests. The design is based on Chris Li's idea of using
cluster-sized atomic arrays to implement the swap cache, which reduces
contention on swap cache access: the cluster granularity is much finer
than the 64M address space split, which is removed in this phase. It
also unifies and cleans up the swap code base.
Each swap cluster dynamically allocates its swap table, an atomic array
covering every swap slot in the cluster. It replaces the swap cache
backed by the Xarray. In phase I, the statically allocated swap_map
still co-exists with the swap table. The memory usage is about the same
as the original on average; a few exceptional test cases show about 1%
higher memory usage. In the following phases of the series, swap_map
will be merged into the swap table without additional memory
allocation, resulting in a net memory reduction compared to the
original swap cache.
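To illustrate the idea, the core structure is roughly the following (a
conceptual sketch only, with simplified names; the actual layout and
helpers live in mm/swap_table.h introduced later in this series):

/*
 * Conceptual sketch, not the exact mm/swap_table.h code: each cluster
 * owns one table, and each atomic entry mirrors one swap slot of that
 * cluster, holding either a folio pointer, a shadow (xa_value), or NULL.
 */
struct swap_table {
	atomic_long_t entries[SWAPFILE_CLUSTER];
};

/* Sketch-only decoding helper: non-NULL, non-shadow entries are folios. */
static inline struct folio *swp_tb_folio_sketch(unsigned long swp_tb)
{
	if (!swp_tb || xa_is_value((void *)swp_tb))
		return NULL;
	return (struct folio *)swp_tb;
}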
Testing has shown that phase I brings a significant performance
improvement, on machines ranging from an 8c/1G ARM box to 48c96t/128G
x86_64 servers, in many practical workloads.
The full picture with a summary can be found at [2]. An older, larger
series of 28 patches was posted at [3].
vm-scalability test:
====================
Test with:
usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap)
                 Before:            After:
System time:     220.86s            160.42s        (-27.36%)
Throughput:      4775.18 MB/s       6381.43 MB/s   (+33.63%)
Free latency:    174492 us          132122 us      (-24.28%)
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
PMEM as swap)
                 Before:            After:
System time:     355.23s            295.28s        (-16.87%)
Throughput:      4659.89 MB/s       5765.80 MB/s   (+23.73%)
Free latency:    500417 us          477098 us      (-4.66%)
This shows an improvement of more than 20% in most readings.
Build kernel test:
==================
Building the kernel with defconfig on tmpfs, swapping to ZSWAP / ZRAM,
also looks good. The results below show a test matrix using different
memory pressures and setups. Tests are done on a shmem (tmpfs)
filesystem with the same build config, measuring sys and real time in
seconds (user time is almost identical, as expected):
-j<NR> / Mem | Sys before / after | Real before / after
Using 16G ZRAM with memcg limit:
12 / 256M | 6475 / 6232 -3.75% | 814 / 793 -2.58%
24 / 384M | 5904 / 5560 -5.82% | 413 / 397 -3.87%
48 / 768M | 4762 / 4242 -10.9% | 187 / 179 -4.27%
With 64k folio:
24 / 512M | 4196 / 4062 -3.19% | 325 / 319 -1.84%
48 / 1G | 3622 / 3544 -2.15% | 148 / 146 -1.37%
With ZSWAP and a 3G memcg (using a higher limit due to kmem accounting):
48 / 3G | 605 / 571 -5.61% | 81 / 79 -2.47%
For extremely high global pressure: using ZSWAP with 32G NVMEs in a
48c VM that has 4G memory and no memcg limit. System components take up
about 1.5G, so the pressure is high. Building with make -j48:
Before: sys time: 2061.72s real time: 135.61s
After: sys time: 1990.96s (-3.43%) real time: 134.03s (-1.16%)
All cases are faster, with no regression even under heavy global
memory pressure.
Redis / Valkey bench:
=====================
The test machine is an ARM64 VM with 1.5G memory; Redis is set to
use 2.5G memory:
Testing with:
redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get
            no BGSAVE                   with BGSAVE
Before:     433015.08 RPS               271421.15 RPS
After:      431537.61 RPS (-0.34%)      290441.79 RPS (+7.0%)
Testing with:
redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get
            no BGSAVE                   with BGSAVE
Before:     446339.45 RPS               274845.19 RPS
After:      442697.29 RPS (-0.81%)      293053.59 RPS (+6.6%)
With BGSAVE enabled, most Redis memory will have a swap count > 1, so
the swap cache is heavily in use, and we see a >5% performance gain.
The no-BGSAVE case is very slightly slower (<1%) due to the higher
memory pressure from the co-existence of swap_map and the swap table.
This will be optimized into a net gain, with up to a 20% gain in the
BGSAVE case, in the following phases.
Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
Kairui Song (9):
mm, swap: use unified helper for swap cache look up
mm, swap: always lock and check the swap cache folio before use
mm, swap: rename and move some swap cluster definition and helpers
mm, swap: tidy up swap device and cluster info helpers
mm/shmem, swap: remove redundant error handling for replacing folio
mm, swap: use the swap table for the swap cache and switch API
mm, swap: remove contention workaround for swap cache
mm, swap: implement dynamic allocation of swap table
mm, swap: use a single page for swap table when the size fits
MAINTAINERS | 1 +
include/linux/swap.h | 42 ----
mm/filemap.c | 2 +-
mm/huge_memory.c | 16 +-
mm/memory-failure.c | 2 +-
mm/memory.c | 30 +--
mm/migrate.c | 28 +--
mm/mincore.c | 3 +-
mm/page_io.c | 12 +-
mm/shmem.c | 56 ++----
mm/swap.h | 268 +++++++++++++++++++++----
mm/swap_state.c | 404 +++++++++++++++++++-------------------
mm/swap_table.h | 136 +++++++++++++
mm/swapfile.c | 456 ++++++++++++++++++++++++++++---------------
mm/userfaultfd.c | 5 +-
mm/vmscan.c | 20 +-
mm/zswap.c | 9 +-
17 files changed, 954 insertions(+), 536 deletions(-)
create mode 100644 mm/swap_table.h
---
I was trying some new tools like b4 for branch management, and it seems
a draft version was sent out by accident but got rejected. I'm not sure
if anyone received a duplicated or malformed email. If so, please
accept my apology and use this series for review, discussion or
merging.
--
2.51.0
* [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-27 2:47 ` Chris Li
` (5 more replies)
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
` (9 subsequent siblings)
10 siblings, 6 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Always use swap_cache_get_folio for swap cache folio lookup. The reason
it is not currently used everywhere is that it also updates the
readahead info, and some callsites want to avoid that.
So decouple the readahead update from the swap cache lookup into a
standalone helper, and let the caller invoke the readahead update
helper when needed. Then convert all swap cache lookups to use
swap_cache_get_folio.
After this commit, there are only three special cases left that access
the swap cache space directly: huge memory splitting, migration, and
shmem replacement, because they need to lock the Xarray. Following
commits will wrap their swap cache accesses with dedicated helpers as
well.
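The resulting calling convention is roughly the following (a sketch;
the actual conversions are in the hunks below):

	/* Plain lookup with no readahead side effects. */
	folio = swap_cache_get_folio(entry);
	if (folio) {
		/* Only callers that want readahead accounting do this: */
		swap_update_readahead(folio, vma, addr);
	}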
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 6 ++-
mm/mincore.c | 3 +-
mm/shmem.c | 4 +-
mm/swap.h | 13 +++++--
mm/swap_state.c | 99 +++++++++++++++++++++++-------------------------
mm/swapfile.c | 11 +++---
mm/userfaultfd.c | 5 +--
7 files changed, 72 insertions(+), 69 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index d9de6c056179..10ef528a5f44 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(!si))
goto out;
- folio = swap_cache_get_folio(entry, vma, vmf->address);
- if (folio)
+ folio = swap_cache_get_folio(entry);
+ if (folio) {
+ swap_update_readahead(folio, vma, vmf->address);
page = folio_file_page(folio, swp_offset(entry));
+ }
swapcache = folio;
if (!folio) {
diff --git a/mm/mincore.c b/mm/mincore.c
index 2f3e1816a30d..8ec4719370e1 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
if (!si)
return 0;
}
- folio = filemap_get_entry(swap_address_space(entry),
- swap_cache_index(entry));
+ folio = swap_cache_get_folio(entry);
if (shmem)
put_swap_device(si);
/* The swap cache space contains either folio, shadow or NULL */
diff --git a/mm/shmem.c b/mm/shmem.c
index 13cc51df3893..e9d0d2784cd5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
}
/* Look it up and read it in.. */
- folio = swap_cache_get_folio(swap, NULL, 0);
+ folio = swap_cache_get_folio(swap);
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
/* Direct swapin skipping swap cache & readahead */
@@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
+ } else {
+ swap_update_readahead(folio, NULL, 0);
}
if (order > folio_order(folio)) {
diff --git a/mm/swap.h b/mm/swap.h
index 1ae44d4193b1..efb6d7ff9f30 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
void clear_shadow_from_swap_cache(int type, unsigned long begin,
unsigned long end);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr);
+struct folio *swap_cache_get_folio(swp_entry_t entry);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
@@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
+void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
+ unsigned long addr);
static inline unsigned int folio_swap_flags(struct folio *folio)
{
@@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}
+static inline void swap_update_readahead(struct folio *folio,
+ struct vm_area_struct *vma, unsigned long addr)
+{
+}
+
static inline int swap_writeout(struct folio *folio,
struct swap_iocb **swap_plug)
{
@@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
{
}
-static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr)
+static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 99513b74b5d8..ff9eb761a103 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -69,6 +69,21 @@ void show_swap_cache_info(void)
printk("Total swap = %lukB\n", K(total_swap_pages));
}
+/*
+ * Lookup a swap entry in the swap cache. A found folio will be returned
+ * unlocked and with its refcount incremented.
+ *
+ * Caller must lock the swap device or hold a reference to keep it valid.
+ */
+struct folio *swap_cache_get_folio(swp_entry_t entry)
+{
+ struct folio *folio = filemap_get_folio(swap_address_space(entry),
+ swap_cache_index(entry));
+ if (!IS_ERR(folio))
+ return folio;
+ return NULL;
+}
+
void *get_shadow_from_swap_cache(swp_entry_t entry)
{
struct address_space *address_space = swap_address_space(entry);
@@ -273,54 +288,40 @@ static inline bool swap_use_vma_readahead(void)
}
/*
- * Lookup a swap entry in the swap cache. A found folio will be returned
- * unlocked and with its refcount incremented - we rely on the kernel
- * lock getting page table operations atomic even if we drop the folio
- * lock before returning.
- *
- * Caller must lock the swap device or hold a reference to keep it valid.
+ * Update the readahead statistics of a vma or globally.
*/
-struct folio *swap_cache_get_folio(swp_entry_t entry,
- struct vm_area_struct *vma, unsigned long addr)
+void swap_update_readahead(struct folio *folio,
+ struct vm_area_struct *vma,
+ unsigned long addr)
{
- struct folio *folio;
-
- folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
- if (!IS_ERR(folio)) {
- bool vma_ra = swap_use_vma_readahead();
- bool readahead;
+ bool readahead, vma_ra = swap_use_vma_readahead();
- /*
- * At the moment, we don't support PG_readahead for anon THP
- * so let's bail out rather than confusing the readahead stat.
- */
- if (unlikely(folio_test_large(folio)))
- return folio;
-
- readahead = folio_test_clear_readahead(folio);
- if (vma && vma_ra) {
- unsigned long ra_val;
- int win, hits;
-
- ra_val = GET_SWAP_RA_VAL(vma);
- win = SWAP_RA_WIN(ra_val);
- hits = SWAP_RA_HITS(ra_val);
- if (readahead)
- hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
- atomic_long_set(&vma->swap_readahead_info,
- SWAP_RA_VAL(addr, win, hits));
- }
-
- if (readahead) {
- count_vm_event(SWAP_RA_HIT);
- if (!vma || !vma_ra)
- atomic_inc(&swapin_readahead_hits);
- }
- } else {
- folio = NULL;
+ /*
+ * At the moment, we don't support PG_readahead for anon THP
+ * so let's bail out rather than confusing the readahead stat.
+ */
+ if (unlikely(folio_test_large(folio)))
+ return;
+
+ readahead = folio_test_clear_readahead(folio);
+ if (vma && vma_ra) {
+ unsigned long ra_val;
+ int win, hits;
+
+ ra_val = GET_SWAP_RA_VAL(vma);
+ win = SWAP_RA_WIN(ra_val);
+ hits = SWAP_RA_HITS(ra_val);
+ if (readahead)
+ hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
+ atomic_long_set(&vma->swap_readahead_info,
+ SWAP_RA_VAL(addr, win, hits));
}
- return folio;
+ if (readahead) {
+ count_vm_event(SWAP_RA_HIT);
+ if (!vma || !vma_ra)
+ atomic_inc(&swapin_readahead_hits);
+ }
}
struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
@@ -336,14 +337,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*new_page_allocated = false;
for (;;) {
int err;
- /*
- * First check the swap cache. Since this is normally
- * called after swap_cache_get_folio() failed, re-calling
- * that would confuse statistics.
- */
- folio = filemap_get_folio(swap_address_space(entry),
- swap_cache_index(entry));
- if (!IS_ERR(folio))
+
+ /* Check the swap cache in case the folio is already there */
+ folio = swap_cache_get_folio(entry);
+ if (folio)
goto got_folio;
/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a7ffabbe65ef..4b8ab2cb49ca 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
swp_entry_t entry = swp_entry(si->type, offset);
- struct address_space *address_space = swap_address_space(entry);
struct swap_cluster_info *ci;
struct folio *folio;
int ret, nr_pages;
bool need_reclaim;
again:
- folio = filemap_get_folio(address_space, swap_cache_index(entry));
- if (IS_ERR(folio))
+ folio = swap_cache_get_folio(entry);
+ if (!folio)
return 0;
nr_pages = folio_nr_pages(folio);
@@ -2131,7 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
pte_unmap(pte);
pte = NULL;
- folio = swap_cache_get_folio(entry, vma, addr);
+ folio = swap_cache_get_folio(entry);
if (!folio) {
struct vm_fault vmf = {
.vma = vma,
@@ -2357,8 +2356,8 @@ static int try_to_unuse(unsigned int type)
(i = find_next_to_unuse(si, i)) != 0) {
entry = swp_entry(type, i);
- folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
- if (IS_ERR(folio))
+ folio = swap_cache_get_folio(entry);
+ if (!folio)
continue;
/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 50aaa8dcd24c..af61b95c89e4 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1489,9 +1489,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
* separately to allow proper handling.
*/
if (!src_folio)
- folio = filemap_get_folio(swap_address_space(entry),
- swap_cache_index(entry));
- if (!IS_ERR_OR_NULL(folio)) {
+ folio = swap_cache_get_folio(entry);
+ if (folio) {
if (folio_test_large(folio)) {
ret = -EBUSY;
folio_put(folio);
--
2.51.0
* [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-27 6:13 ` Chris Li
` (3 more replies)
2025-08-22 19:20 ` [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
` (8 subsequent siblings)
10 siblings, 4 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Swap cache lookup is lockless; it only increases the reference count
of the returned folio. That's not enough to ensure a folio is stable in
the swap cache: the folio could be removed from the swap cache at any
time. The caller always has to lock and check the folio before use.
Document this as a comment, and introduce a helper for swap cache folio
verification with proper sanity checks.
Also, convert all current users to follow this convention, and use the
new helper where possible for easier debugging. Some existing callers
won't cause any major problem right now, only trivial issues like an
incorrect readahead statistic (swapin) or a wasted loop (swapoff). It's
better to always follow this convention to make things robust.
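The resulting usage pattern looks roughly like this (a sketch; the
real call sites are in the hunks below):

	/* Lockless lookup, then lock and re-check before any use. */
	folio = swap_cache_get_folio(entry);
	if (!folio)
		return;
	folio_lock(folio);
	if (!folio_contains_swap(folio, entry)) {
		/* Raced with removal from the swap cache: bail out or retry. */
		folio_unlock(folio);
		folio_put(folio);
		return;
	}
	/* The folio is now known to be the swap cache folio for @entry. */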
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 28 +++++++++++++---------------
mm/shmem.c | 4 ++--
mm/swap.h | 28 ++++++++++++++++++++++++++++
mm/swap_state.c | 13 +++++++++----
mm/swapfile.c | 10 ++++++++--
5 files changed, 60 insertions(+), 23 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 10ef528a5f44..9ca8e1873c6e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4661,12 +4661,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out;
folio = swap_cache_get_folio(entry);
- if (folio) {
- swap_update_readahead(folio, vma, vmf->address);
- page = folio_file_page(folio, swp_offset(entry));
- }
swapcache = folio;
-
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
@@ -4735,20 +4730,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
ret = VM_FAULT_MAJOR;
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
- page = folio_file_page(folio, swp_offset(entry));
- } else if (PageHWPoison(page)) {
- /*
- * hwpoisoned dirty swapcache pages are kept for killing
- * owner processes (which may be unknown at hwpoison time)
- */
- ret = VM_FAULT_HWPOISON;
- goto out_release;
}
ret |= folio_lock_or_retry(folio, vmf);
if (ret & VM_FAULT_RETRY)
goto out_release;
+ page = folio_file_page(folio, swp_offset(entry));
if (swapcache) {
/*
* Make sure folio_free_swap() or swapoff did not release the
@@ -4757,10 +4745,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* swapcache, we need to check that the page's swap has not
* changed.
*/
- if (unlikely(!folio_test_swapcache(folio) ||
- page_swap_entry(page).val != entry.val))
+ if (!folio_contains_swap(folio, entry))
goto out_page;
+ if (PageHWPoison(page)) {
+ /*
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
+ */
+ ret = VM_FAULT_HWPOISON;
+ goto out_page;
+ }
+
+ swap_update_readahead(folio, vma, vmf->address);
+
/*
* KSM sometimes has to copy on read faults, for example, if
* folio->index of non-ksm folios would be nonlinear inside the
diff --git a/mm/shmem.c b/mm/shmem.c
index e9d0d2784cd5..b4d39f2a1e0a 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
count_vm_event(PGMAJFAULT);
count_memcg_event_mm(fault_mm, PGMAJFAULT);
}
- } else {
- swap_update_readahead(folio, NULL, 0);
}
if (order > folio_order(folio)) {
@@ -2431,6 +2429,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
error = -EIO;
goto failed;
}
+ if (!skip_swapcache)
+ swap_update_readahead(folio, NULL, 0);
folio_wait_writeback(folio);
nr_pages = folio_nr_pages(folio);
diff --git a/mm/swap.h b/mm/swap.h
index efb6d7ff9f30..bb2adbfd64a9 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -52,6 +52,29 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
}
+/**
+ * folio_contains_swap - Does this folio contain this swap entry?
+ * @folio: The folio.
+ * @entry: The swap entry to check against.
+ *
+ * Swap version of folio_contains()
+ *
+ * Context: The caller should have the folio locked to ensure
+ * nothing will move it out of the swap cache.
+ * Return: true or false.
+ */
+static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
+{
+ pgoff_t offset = swp_offset(entry);
+
+ VM_WARN_ON_ONCE(!folio_test_locked(folio));
+ if (unlikely(!folio_test_swapcache(folio)))
+ return false;
+ if (unlikely(swp_type(entry) != swp_type(folio->swap)))
+ return false;
+ return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
+}
+
void show_swap_cache_info(void);
void *get_shadow_from_swap_cache(swp_entry_t entry);
int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
@@ -144,6 +167,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
return 0;
}
+static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
+{
+ return false;
+}
+
static inline void show_swap_cache_info(void)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ff9eb761a103..be0d96494dc1 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -70,10 +70,12 @@ void show_swap_cache_info(void)
}
/*
- * Lookup a swap entry in the swap cache. A found folio will be returned
- * unlocked and with its refcount incremented.
+ * swap_cache_get_folio - Lookup a swap entry in the swap cache.
*
- * Caller must lock the swap device or hold a reference to keep it valid.
+ * A found folio will be returned unlocked and with its refcount increased.
+ *
+ * Context: Caller must ensure @entry is valid and pin the swap device, also
+ * check the returned folio after locking it (e.g. folio_contains_swap).
*/
struct folio *swap_cache_get_folio(swp_entry_t entry)
{
@@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
for (;;) {
int err;
- /* Check the swap cache in case the folio is already there */
+ /*
+ * Check the swap cache first, if a cached folio is found,
+ * return it unlocked. The caller will lock and check it.
+ */
folio = swap_cache_get_folio(entry);
if (folio)
goto got_folio;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4b8ab2cb49ca..12f2580ebe8d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* Offset could point to the middle of a large folio, or folio
* may no longer point to the expected offset before it's locked.
*/
- entry = folio->swap;
- if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
+ if (!folio_contains_swap(folio, entry)) {
folio_unlock(folio);
folio_put(folio);
goto again;
}
+ entry = folio->swap;
offset = swp_offset(entry);
need_reclaim = ((flags & TTRS_ANYWAY) ||
@@ -2150,6 +2150,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
}
folio_lock(folio);
+ if (!folio_contains_swap(folio, entry)) {
+ folio_unlock(folio);
+ folio_put(folio);
+ continue;
+ }
+
folio_wait_writeback(folio);
ret = unuse_pte(vma, pmd, addr, entry, folio);
if (ret < 0) {
--
2.51.0
* [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-30 2:31 ` Chris Li
` (2 more replies)
2025-08-22 19:20 ` [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers Kairui Song
` (7 subsequent siblings)
10 siblings, 3 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
No feature change. Move the cluster-related definitions and helpers to
mm/swap.h, tidy them up, and add a "swap_" prefix to the cluster
lock/unlock helpers so they can be used outside of the swap files.
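After the rename, call sites follow this pattern (sketch):

	struct swap_cluster_info *ci;

	ci = swap_cluster_lock(si, offset);	/* was lock_cluster() */
	/* ... cluster fields and the matching swap_map range are stable ... */
	swap_cluster_unlock(ci);		/* was unlock_cluster() */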
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 34 ---------------
mm/swap.h | 63 ++++++++++++++++++++++++++++
mm/swapfile.c | 99 ++++++++++++++------------------------------
3 files changed, 93 insertions(+), 103 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index c2da85cb7fe7..20efd9a34034 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -235,40 +235,6 @@ enum {
/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX 0x7f /* Max count */
-/*
- * We use this to track usage of a cluster. A cluster is a block of swap disk
- * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
- * free clusters are organized into a list. We fetch an entry from the list to
- * get a free cluster.
- *
- * The flags field determines if a cluster is free. This is
- * protected by cluster lock.
- */
-struct swap_cluster_info {
- spinlock_t lock; /*
- * Protect swap_cluster_info fields
- * other than list, and swap_info_struct->swap_map
- * elements corresponding to the swap cluster.
- */
- u16 count;
- u8 flags;
- u8 order;
- struct list_head list;
-};
-
-/* All on-list cluster must have a non-zero flag. */
-enum swap_cluster_flags {
- CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
- CLUSTER_FLAG_FREE,
- CLUSTER_FLAG_NONFULL,
- CLUSTER_FLAG_FRAG,
- /* Clusters with flags above are allocatable */
- CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
- CLUSTER_FLAG_FULL,
- CLUSTER_FLAG_DISCARD,
- CLUSTER_FLAG_MAX,
-};
-
/*
* The first page in the swap file is the swap header, which is always marked
* bad to prevent it from being allocated as an entry. This also prevents the
diff --git a/mm/swap.h b/mm/swap.h
index bb2adbfd64a9..223b40f2d37e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -7,10 +7,73 @@ struct swap_iocb;
extern int page_cluster;
+#ifdef CONFIG_THP_SWAP
+#define SWAPFILE_CLUSTER HPAGE_PMD_NR
+#define swap_entry_order(order) (order)
+#else
+#define SWAPFILE_CLUSTER 256
+#define swap_entry_order(order) 0
+#endif
+
+/*
+ * We use this to track usage of a cluster. A cluster is a block of swap disk
+ * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
+ * free clusters are organized into a list. We fetch an entry from the list to
+ * get a free cluster.
+ *
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
+ */
+struct swap_cluster_info {
+ spinlock_t lock; /*
+ * Protect swap_cluster_info fields
+ * other than list, and swap_info_struct->swap_map
+ * elements corresponding to the swap cluster.
+ */
+ u16 count;
+ u8 flags;
+ u8 order;
+ struct list_head list;
+};
+
+/* All on-list cluster must have a non-zero flag. */
+enum swap_cluster_flags {
+ CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
+ CLUSTER_FLAG_FREE,
+ CLUSTER_FLAG_NONFULL,
+ CLUSTER_FLAG_FRAG,
+ /* Clusters with flags above are allocatable */
+ CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
+ CLUSTER_FLAG_FULL,
+ CLUSTER_FLAG_DISCARD,
+ CLUSTER_FLAG_MAX,
+};
+
#ifdef CONFIG_SWAP
#include <linux/swapops.h> /* for swp_offset */
#include <linux/blk_types.h> /* for bio_end_io_t */
+static inline struct swap_cluster_info *swp_offset_cluster(
+ struct swap_info_struct *si, pgoff_t offset)
+{
+ return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+}
+
+static inline struct swap_cluster_info *swap_cluster_lock(
+ struct swap_info_struct *si,
+ unsigned long offset)
+{
+ struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+
+ spin_lock(&ci->lock);
+ return ci;
+}
+
+static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
+{
+ spin_unlock(&ci->lock);
+}
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12f2580ebe8d..618cf4333a3d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,9 +58,6 @@ static void swap_entries_free(struct swap_info_struct *si,
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
- unsigned long offset);
-static inline void unlock_cluster(struct swap_cluster_info *ci);
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -259,9 +256,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* swap_map is HAS_CACHE only, which means the slots have no page table
* reference or pending writeback, and can't be allocated to others.
*/
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
need_reclaim = swap_only_has_cache(si, offset, nr_pages);
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
if (!need_reclaim)
goto out_unlock;
@@ -386,20 +383,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
}
}
-#ifdef CONFIG_THP_SWAP
-#define SWAPFILE_CLUSTER HPAGE_PMD_NR
-
-#define swap_entry_order(order) (order)
-#else
-#define SWAPFILE_CLUSTER 256
-
-/*
- * Define swap_entry_order() as constant to let compiler to optimize
- * out some code if !CONFIG_THP_SWAP
- */
-#define swap_entry_order(order) 0
-#endif
-#define LATENCY_LIMIT 256
+#define LATENCY_LIMIT 256
static inline bool cluster_is_empty(struct swap_cluster_info *info)
{
@@ -426,34 +410,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
return ci - si->cluster_info;
}
-static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
- unsigned long offset)
-{
- return &si->cluster_info[offset / SWAPFILE_CLUSTER];
-}
-
static inline unsigned int cluster_offset(struct swap_info_struct *si,
struct swap_cluster_info *ci)
{
return cluster_index(si, ci) * SWAPFILE_CLUSTER;
}
-static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
- unsigned long offset)
-{
- struct swap_cluster_info *ci;
-
- ci = offset_to_cluster(si, offset);
- spin_lock(&ci->lock);
-
- return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
- spin_unlock(&ci->lock);
-}
-
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
enum swap_cluster_flags new_flags)
@@ -809,7 +771,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
}
out:
relocate_cluster(si, ci);
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
if (si->flags & SWP_SOLIDSTATE) {
this_cpu_write(percpu_swap_cluster.offset[order], next);
this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -876,7 +838,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
if (ci->flags == CLUSTER_FLAG_NONE)
relocate_cluster(si, ci);
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
if (to_scan <= 0)
break;
}
@@ -915,7 +877,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (offset == SWAP_ENTRY_INVALID)
goto new_cluster;
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
/* Cluster could have been used by another order */
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
@@ -923,7 +885,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
found = alloc_swap_scan_cluster(si, ci, offset,
order, usage);
} else {
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
}
if (found)
goto done;
@@ -1204,7 +1166,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
if (!si || !offset || !get_swap_device_info(si))
return false;
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
@@ -1212,7 +1174,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
if (found)
*entry = swp_entry(si->type, found);
} else {
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
}
put_swap_device(si);
@@ -1480,14 +1442,14 @@ static void swap_entries_put_cache(struct swap_info_struct *si,
unsigned long offset = swp_offset(entry);
struct swap_cluster_info *ci;
- ci = lock_cluster(si, offset);
- if (swap_only_has_cache(si, offset, nr))
+ ci = swap_cluster_lock(si, offset);
+ if (swap_only_has_cache(si, offset, nr)) {
swap_entries_free(si, ci, entry, nr);
- else {
+ } else {
for (int i = 0; i < nr; i++, entry.val++)
swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
}
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
}
static bool swap_entries_put_map(struct swap_info_struct *si,
@@ -1505,7 +1467,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
if (count != 1 && count != SWAP_MAP_SHMEM)
goto fallback;
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
if (!swap_is_last_map(si, offset, nr, &has_cache)) {
goto locked_fallback;
}
@@ -1514,21 +1476,20 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
else
for (i = 0; i < nr; i++)
WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
return has_cache;
fallback:
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
locked_fallback:
for (i = 0; i < nr; i++, entry.val++) {
count = swap_entry_put_locked(si, ci, entry, 1);
if (count == SWAP_HAS_CACHE)
has_cache = true;
}
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
return has_cache;
-
}
/*
@@ -1578,7 +1539,7 @@ static void swap_entries_free(struct swap_info_struct *si,
unsigned char *map_end = map + nr_pages;
/* It should never free entries across different clusters */
- VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
+ VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
VM_BUG_ON(cluster_is_empty(ci));
VM_BUG_ON(ci->count < nr_pages);
@@ -1653,9 +1614,9 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
struct swap_cluster_info *ci;
int count;
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
count = swap_count(si->swap_map[offset]);
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
return !!count;
}
@@ -1678,7 +1639,7 @@ int swp_swapcount(swp_entry_t entry)
offset = swp_offset(entry);
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
count = swap_count(si->swap_map[offset]);
if (!(count & COUNT_CONTINUED))
@@ -1701,7 +1662,7 @@ int swp_swapcount(swp_entry_t entry)
n *= (SWAP_CONT_MAX + 1);
} while (tmp_count & COUNT_CONTINUED);
out:
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
return count;
}
@@ -1716,7 +1677,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
int i;
bool ret = false;
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
if (nr_pages == 1) {
if (swap_count(map[roffset]))
ret = true;
@@ -1729,7 +1690,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
}
}
unlock_out:
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
return ret;
}
@@ -2662,8 +2623,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
BUG_ON(si->flags & SWP_WRITEOK);
for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
- ci = lock_cluster(si, offset);
- unlock_cluster(ci);
+ ci = swap_cluster_lock(si, offset);
+ swap_cluster_unlock(ci);
}
}
@@ -3579,7 +3540,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
offset = swp_offset(entry);
VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
VM_WARN_ON(usage == 1 && nr > 1);
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
err = 0;
for (i = 0; i < nr; i++) {
@@ -3634,7 +3595,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
}
unlock_out:
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
return err;
}
@@ -3733,7 +3694,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
offset = swp_offset(entry);
- ci = lock_cluster(si, offset);
+ ci = swap_cluster_lock(si, offset);
count = swap_count(si->swap_map[offset]);
@@ -3793,7 +3754,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
out_unlock_cont:
spin_unlock(&si->cont_lock);
out:
- unlock_cluster(ci);
+ swap_cluster_unlock(ci);
put_swap_device(si);
outer:
if (page)
--
2.51.0
* [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (2 preceding siblings ...)
2025-08-22 19:20 ` [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-27 3:47 ` Baoquan He
` (2 more replies)
2025-08-22 19:20 ` [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
` (6 subsequent siblings)
10 siblings, 3 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
swp_swap_info is the most commonly used helper for retrieving swap
info. It has an internal check that may lead to a NULL return value,
but almost none of its callers check the return value, making the
internal check pointless. In fact, most of these callers have already
ensured the entry is valid and never expect a NULL value.
Tidy this up and shorten the name. If the caller can make sure the
swap entry/type is valid and the device is pinned, use the newly
introduced swp_info/swp_type_info instead. They have more debug sanity
checks and lower overhead as they are inlined.
Callers that may expect a NULL value should use
swp_get_info/swp_type_get_info instead.
No feature change. The rearranged code should have no effect, or the
callers would already have been hitting NULL dereference bugs. Some new
sanity checks are added to the debug build to catch potential misuse.
The new helpers will also be used by the swap cache when working with
locked swap cache folios, as a locked swap cache folio ensures the
entries are valid and stable.
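A rough sketch of the intended split between the two flavors
(illustrative only; swp_get_info() stays internal to swapfile.c):

	/* Entry known valid and device pinned (e.g. a locked swap cache
	 * folio): cheap, inlined, with extra debug sanity checks. */
	struct swap_info_struct *si = swp_info(entry);

	/* Entry that might be invalid or stale: must check for NULL. */
	struct swap_info_struct *si2 = swp_get_info(entry);
	if (!si2)
		return -EINVAL;	/* illustrative error handling */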
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 6 ------
mm/page_io.c | 12 ++++++------
mm/swap.h | 33 ++++++++++++++++++++++++++++++---
mm/swap_state.c | 4 ++--
mm/swapfile.c | 35 ++++++++++++++++++-----------------
5 files changed, 56 insertions(+), 34 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 20efd9a34034..cb59c13fef42 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -469,7 +469,6 @@ extern sector_t swapdev_block(int, pgoff_t);
extern int __swap_count(swp_entry_t entry);
extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
-struct swap_info_struct *swp_swap_info(swp_entry_t entry);
struct backing_dev_info;
extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
extern void exit_swap_address_space(unsigned int type);
@@ -482,11 +481,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
}
#else /* CONFIG_SWAP */
-static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
- return NULL;
-}
-
static inline struct swap_info_struct *get_swap_device(swp_entry_t entry)
{
return NULL;
diff --git a/mm/page_io.c b/mm/page_io.c
index a2056a5ecb13..bc164677d70b 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -204,7 +204,7 @@ static bool is_folio_zero_filled(struct folio *folio)
static void swap_zeromap_folio_set(struct folio *folio)
{
struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
int nr_pages = folio_nr_pages(folio);
swp_entry_t entry;
unsigned int i;
@@ -223,7 +223,7 @@ static void swap_zeromap_folio_set(struct folio *folio)
static void swap_zeromap_folio_clear(struct folio *folio)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
swp_entry_t entry;
unsigned int i;
@@ -374,7 +374,7 @@ static void sio_write_complete(struct kiocb *iocb, long ret)
static void swap_writepage_fs(struct folio *folio, struct swap_iocb **swap_plug)
{
struct swap_iocb *sio = swap_plug ? *swap_plug : NULL;
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
struct file *swap_file = sis->swap_file;
loff_t pos = swap_dev_pos(folio->swap);
@@ -446,7 +446,7 @@ static void swap_writepage_bdev_async(struct folio *folio,
void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
/*
@@ -537,7 +537,7 @@ static bool swap_read_folio_zeromap(struct folio *folio)
static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
struct swap_iocb *sio = NULL;
loff_t pos = swap_dev_pos(folio->swap);
@@ -608,7 +608,7 @@ static void swap_read_folio_bdev_async(struct folio *folio,
void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO;
bool workingset = folio_test_workingset(folio);
unsigned long pflags;
diff --git a/mm/swap.h b/mm/swap.h
index 223b40f2d37e..7b3efaa51624 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -15,6 +15,8 @@ extern int page_cluster;
#define swap_entry_order(order) 0
#endif
+extern struct swap_info_struct *swap_info[];
+
/*
* We use this to track usage of a cluster. A cluster is a block of swap disk
* space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
@@ -53,9 +55,28 @@ enum swap_cluster_flags {
#include <linux/swapops.h> /* for swp_offset */
#include <linux/blk_types.h> /* for bio_end_io_t */
+/*
+ * Callers of all swp_* helpers here must ensure the entry is valid, and
+ * pin the swap device by reference or in other ways.
+ */
+static inline struct swap_info_struct *swp_type_info(int type)
+{
+ struct swap_info_struct *si;
+
+ si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
+ VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
+ return si;
+}
+
+static inline struct swap_info_struct *swp_info(swp_entry_t entry)
+{
+ return swp_type_info(swp_type(entry));
+}
+
static inline struct swap_cluster_info *swp_offset_cluster(
struct swap_info_struct *si, pgoff_t offset)
{
+ VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}
@@ -65,6 +86,7 @@ static inline struct swap_cluster_info *swap_cluster_lock(
{
struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
+ VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
spin_lock(&ci->lock);
return ci;
}
@@ -164,7 +186,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
static inline unsigned int folio_swap_flags(struct folio *folio)
{
- return swp_swap_info(folio->swap)->flags;
+ return swp_info(folio->swap)->flags;
}
/*
@@ -175,7 +197,7 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
bool *is_zeromap)
{
- struct swap_info_struct *sis = swp_swap_info(entry);
+ struct swap_info_struct *sis = swp_info(entry);
unsigned long start = swp_offset(entry);
unsigned long end = start + max_nr;
bool first_bit;
@@ -194,7 +216,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
pgoff_t offset = swp_offset(entry);
int i;
@@ -213,6 +235,11 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
#else /* CONFIG_SWAP */
struct swap_iocb;
+static inline struct swap_info_struct *swp_info(swp_entry_t entry)
+{
+ return NULL;
+}
+
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index be0d96494dc1..721ff1a5e73a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -330,7 +330,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
bool skip_if_exists)
{
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
struct folio *folio;
struct folio *new_folio = NULL;
struct folio *result = NULL;
@@ -554,7 +554,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
unsigned long offset = entry_offset;
unsigned long start_offset, end_offset;
unsigned long mask;
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
struct blk_plug plug;
struct swap_iocb *splug = NULL;
bool page_allocated;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 618cf4333a3d..85606fbebf0f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -102,7 +102,7 @@ static PLIST_HEAD(swap_active_head);
static struct plist_head *swap_avail_heads;
static DEFINE_SPINLOCK(swap_avail_lock);
-static struct swap_info_struct *swap_info[MAX_SWAPFILES];
+struct swap_info_struct *swap_info[MAX_SWAPFILES];
static DEFINE_MUTEX(swapon_mutex);
@@ -124,14 +124,20 @@ static DEFINE_PER_CPU(struct percpu_swap_cluster, percpu_swap_cluster) = {
.lock = INIT_LOCAL_LOCK(),
};
-static struct swap_info_struct *swap_type_to_swap_info(int type)
+/* May return NULL on invalid type, caller must check for NULL return */
+static struct swap_info_struct *swp_type_get_info(int type)
{
if (type >= MAX_SWAPFILES)
return NULL;
-
return READ_ONCE(swap_info[type]); /* rcu_dereference() */
}
+/* May return NULL on invalid entry, caller must check for NULL return */
+static struct swap_info_struct *swp_get_info(swp_entry_t entry)
+{
+ return swp_type_get_info(swp_type(entry));
+}
+
static inline unsigned char swap_count(unsigned char ent)
{
return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
@@ -343,7 +349,7 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset)
sector_t swap_folio_sector(struct folio *folio)
{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
+ struct swap_info_struct *sis = swp_info(folio->swap);
struct swap_extent *se;
sector_t sector;
pgoff_t offset;
@@ -1301,7 +1307,7 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
if (!entry.val)
goto out;
- si = swp_swap_info(entry);
+ si = swp_get_info(entry);
if (!si)
goto bad_nofile;
if (data_race(!(si->flags & SWP_USED)))
@@ -1416,7 +1422,7 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
if (!entry.val)
goto out;
- si = swp_swap_info(entry);
+ si = swp_get_info(entry);
if (!si)
goto bad_nofile;
if (!get_swap_device_info(si))
@@ -1597,7 +1603,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
int __swap_count(swp_entry_t entry)
{
- struct swap_info_struct *si = swp_swap_info(entry);
+ struct swap_info_struct *si = swp_info(entry);
pgoff_t offset = swp_offset(entry);
return swap_count(si->swap_map[offset]);
@@ -1828,7 +1834,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
swp_entry_t get_swap_page_of_type(int type)
{
- struct swap_info_struct *si = swap_type_to_swap_info(type);
+ struct swap_info_struct *si = swp_type_get_info(type);
unsigned long offset;
swp_entry_t entry = {0};
@@ -1909,7 +1915,7 @@ int find_first_swap(dev_t *device)
*/
sector_t swapdev_block(int type, pgoff_t offset)
{
- struct swap_info_struct *si = swap_type_to_swap_info(type);
+ struct swap_info_struct *si = swp_type_get_info(type);
struct swap_extent *se;
if (!si || !(si->flags & SWP_WRITEOK))
@@ -2837,7 +2843,7 @@ static void *swap_start(struct seq_file *swap, loff_t *pos)
if (!l)
return SEQ_START_TOKEN;
- for (type = 0; (si = swap_type_to_swap_info(type)); type++) {
+ for (type = 0; (si = swp_type_get_info(type)); type++) {
if (!(si->flags & SWP_USED) || !si->swap_map)
continue;
if (!--l)
@@ -2858,7 +2864,7 @@ static void *swap_next(struct seq_file *swap, void *v, loff_t *pos)
type = si->type + 1;
++(*pos);
- for (; (si = swap_type_to_swap_info(type)); type++) {
+ for (; (si = swp_type_get_info(type)); type++) {
if (!(si->flags & SWP_USED) || !si->swap_map)
continue;
return si;
@@ -3531,7 +3537,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
unsigned char has_cache;
int err, i;
- si = swp_swap_info(entry);
+ si = swp_get_info(entry);
if (WARN_ON_ONCE(!si)) {
pr_err("%s%08lx\n", Bad_file, entry.val);
return -EINVAL;
@@ -3646,11 +3652,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
swap_entries_put_cache(si, entry, nr);
}
-struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-{
- return swap_type_to_swap_info(swp_type(entry));
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
--
2.51.0
* [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (3 preceding siblings ...)
2025-08-22 19:20 ` [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-25 3:02 ` Baolin Wang
2025-09-03 8:25 ` David Hildenbrand
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
` (5 subsequent siblings)
10 siblings, 2 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Shmem may replace a folio in the swap cache if the cached one doesn't
fit the swapin's GFP zone. When doing so, shmem has already
double-checked that the swap cache folio is locked, still has the swap
cache flag set, and contains the wanted swap entry, so it is impossible
to fail due to an Xarray mismatch. There is even a comment about that.
Delete the defensive error handling path, and add a WARN_ON instead: if
that ever happened, something would have broken the basic principle of
how the swap cache works, and we should catch and fix it.
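For reference, the caller-side verification that makes the removed
error path dead code looks roughly like this (a simplified sketch of
the existing shmem swapin path, not new code):

	/* Under the folio lock, shmem_swapin_folio() has already verified
	 * that the folio is still this entry's swap cache folio before
	 * shmem_replace_folio() is called. */
	if (!folio_test_swapcache(folio) || folio->swap.val != swap.val)
		goto failed;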
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 28 +++-------------------------
1 file changed, 3 insertions(+), 25 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index b4d39f2a1e0a..e03793cc5169 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2158,35 +2158,13 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
/* Swap cache still stores N entries instead of a high-order entry */
xa_lock_irq(&swap_mapping->i_pages);
for (i = 0; i < nr_pages; i++) {
- void *item = xas_load(&xas);
-
- if (item != old) {
- error = -ENOENT;
- break;
- }
-
- xas_store(&xas, new);
+ WARN_ON_ONCE(xas_store(&xas, new) != old);
xas_next(&xas);
}
- if (!error) {
- mem_cgroup_replace_folio(old, new);
- shmem_update_stats(new, nr_pages);
- shmem_update_stats(old, -nr_pages);
- }
xa_unlock_irq(&swap_mapping->i_pages);
- if (unlikely(error)) {
- /*
- * Is this possible? I think not, now that our callers
- * check both the swapcache flag and folio->private
- * after getting the folio lock; but be defensive.
- * Reverse old to newpage for clear and free.
- */
- old = new;
- } else {
- folio_add_lru(new);
- *foliop = new;
- }
+ folio_add_lru(new);
+ *foliop = new;
folio_clear_swapcache(old);
old->private = NULL;
--
2.51.0
* [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (4 preceding siblings ...)
2025-08-22 19:20 ` [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-30 1:54 ` Baoquan He
` (3 more replies)
2025-08-22 19:20 ` [PATCH 7/9] mm, swap: remove contention workaround for swap cache Kairui Song
` (4 subsequent siblings)
10 siblings, 4 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Introduce the basic swap table infrastructure, which is for now just a
fixed-size flat array inside each swap cluster, with access wrappers.
Each cluster contains a swap table of 512 entries. Each table entry is
an opaque atomic long that can hold one of three types: a shadow
(XA_VALUE), a folio (pointer), or NULL.
In this first step, it only supports storing a folio or a shadow, and
it is a drop-in replacement for the current swap cache. Convert all
swap cache users to the new set of APIs. Chris Li has been suggesting a
new infrastructure for the swap cache for better performance, and that
idea combines well with the swap table as the new backing structure.
Now the lock contention range is reduced to 2M clusters, which is much
smaller than the 64M address_space, and we can also drop the multiple
address_space design.
All the internal work is done with the swap_cache_get_* helpers. Swap
cache lookup is still lockless as before, and the helpers' context
requirements are the same as for the original swap cache helpers: they
still require a pin on the swap device to prevent the backing data from
being freed.
Swap cache updates are now protected by the swap cluster lock
instead of the Xarray lock. This is mostly handled internally, but new
__swap_cache_* helpers require the caller to lock the cluster. So, a
few new cluster access and locking helpers are also introduced.
A fully cluster-based unified swap table can be implemented on top
of this to take care of all count tracking and synchronization work,
with dynamic allocation. It should reduce the memory usage while
making the performance even better.
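Roughly, the new update and lookup patterns look like this (a sketch
based on the helpers added below; error handling omitted):

	struct swap_cluster_info *ci;

	/* Updates lock only the 2M cluster backing the folio's entries: */
	ci = swap_cluster_lock_by_folio(folio);
	__swap_cache_replace_folio(ci, folio->swap, folio, new_folio);
	swap_cluster_unlock(ci);

	/* Lookups stay lockless, same as before: */
	folio = swap_cache_get_folio(entry);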
Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
MAINTAINERS | 1 +
include/linux/swap.h | 2 -
mm/filemap.c | 2 +-
mm/huge_memory.c | 16 +--
mm/memory-failure.c | 2 +-
mm/memory.c | 2 +-
mm/migrate.c | 28 ++--
mm/shmem.c | 26 ++--
mm/swap.h | 151 +++++++++++++++------
mm/swap_state.c | 315 +++++++++++++++++++++----------------------
mm/swap_table.h | 106 +++++++++++++++
mm/swapfile.c | 105 +++++++++++----
mm/vmscan.c | 20 ++-
mm/zswap.c | 2 +-
14 files changed, 500 insertions(+), 278 deletions(-)
create mode 100644 mm/swap_table.h
diff --git a/MAINTAINERS b/MAINTAINERS
index b6f7c6939ff8..b78adfb3c7f0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16214,6 +16214,7 @@ F: include/linux/swapops.h
F: mm/page_io.c
F: mm/swap.c
F: mm/swap.h
+F: mm/swap_table.h
F: mm/swap_state.c
F: mm/swapfile.c
diff --git a/include/linux/swap.h b/include/linux/swap.h
index cb59c13fef42..7455df9bf340 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -470,8 +470,6 @@ extern int __swap_count(swp_entry_t entry);
extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
extern int swp_swapcount(swp_entry_t entry);
struct backing_dev_info;
-extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
-extern void exit_swap_address_space(unsigned int type);
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
diff --git a/mm/filemap.c b/mm/filemap.c
index e4a5a46db89b..1fd0565b56e4 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4504,7 +4504,7 @@ static void filemap_cachestat(struct address_space *mapping,
* invalidation, so there might not be
* a shadow in the swapcache (yet).
*/
- shadow = get_shadow_from_swap_cache(swp);
+ shadow = swap_cache_get_shadow(swp);
if (!shadow)
goto resched;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2a47cd3bb649..209580d395a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3721,7 +3721,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
/* Prevent deferred_split_scan() touching ->_refcount */
spin_lock(&ds_queue->split_queue_lock);
if (folio_ref_freeze(folio, 1 + extra_pins)) {
- struct address_space *swap_cache = NULL;
+ struct swap_cluster_info *swp_ci = NULL;
struct lruvec *lruvec;
int expected_refs;
@@ -3765,8 +3765,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
goto fail;
}
- swap_cache = swap_address_space(folio->swap);
- xa_lock(&swap_cache->i_pages);
+ swp_ci = swap_cluster_lock_by_folio(folio);
}
/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
@@ -3798,10 +3797,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
* Anonymous folio with swap cache.
* NOTE: shmem in swap cache is not supported yet.
*/
- if (swap_cache) {
- __xa_store(&swap_cache->i_pages,
- swap_cache_index(new_folio->swap),
- new_folio, 0);
+ if (swp_ci) {
+ __swap_cache_replace_folio(swp_ci, new_folio->swap,
+ folio, new_folio);
continue;
}
@@ -3836,8 +3834,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
unlock_page_lruvec(lruvec);
- if (swap_cache)
- xa_unlock(&swap_cache->i_pages);
+ if (swp_ci)
+ swap_cluster_unlock(swp_ci);
} else {
spin_unlock(&ds_queue->split_queue_lock);
ret = -EAGAIN;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c15ffee7d32b..bb92d0c72aec 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1199,7 +1199,7 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
struct folio *folio = page_folio(p);
int ret;
- delete_from_swap_cache(folio);
+ swap_cache_del_folio(folio);
ret = delete_from_lru_cache(folio) ? MF_FAILED : MF_RECOVERED;
folio_unlock(folio);
diff --git a/mm/memory.c b/mm/memory.c
index 9ca8e1873c6e..f81bf06e6ff5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4696,7 +4696,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
memcg1_swapin(entry, nr_pages);
- shadow = get_shadow_from_swap_cache(entry);
+ shadow = swap_cache_get_shadow(entry);
if (shadow)
workingset_refault(folio, shadow);
diff --git a/mm/migrate.c b/mm/migrate.c
index 8e435a078fc3..74db32caba2d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -563,10 +563,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
struct folio *newfolio, struct folio *folio, int expected_count)
{
XA_STATE(xas, &mapping->i_pages, folio_index(folio));
+ struct swap_cluster_info *swp_ci = NULL;
struct zone *oldzone, *newzone;
int dirty;
long nr = folio_nr_pages(folio);
- long entries, i;
if (!mapping) {
/* Take off deferred split queue while frozen and memcg set */
@@ -592,9 +592,16 @@ static int __folio_migrate_mapping(struct address_space *mapping,
oldzone = folio_zone(folio);
newzone = folio_zone(newfolio);
- xas_lock_irq(&xas);
+ if (folio_test_swapcache(folio))
+ swp_ci = swap_cluster_lock_by_folio_irq(folio);
+ else
+ xas_lock_irq(&xas);
+
if (!folio_ref_freeze(folio, expected_count)) {
- xas_unlock_irq(&xas);
+ if (swp_ci)
+ swap_cluster_unlock(swp_ci);
+ else
+ xas_unlock_irq(&xas);
return -EAGAIN;
}
@@ -615,9 +622,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
if (folio_test_swapcache(folio)) {
folio_set_swapcache(newfolio);
newfolio->private = folio_get_private(folio);
- entries = nr;
- } else {
- entries = 1;
}
/* Move dirty while folio refs frozen and newfolio not yet exposed */
@@ -627,11 +631,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
folio_set_dirty(newfolio);
}
- /* Swap cache still stores N entries instead of a high-order entry */
- for (i = 0; i < entries; i++) {
+ if (folio_test_swapcache(folio))
+ __swap_cache_replace_folio(swp_ci, folio->swap, folio, newfolio);
+ else
xas_store(&xas, newfolio);
- xas_next(&xas);
- }
/*
* Drop cache reference from old folio by unfreezing
@@ -640,8 +643,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,
*/
folio_ref_unfreeze(folio, expected_count - nr);
- xas_unlock(&xas);
/* Leave irq disabled to prevent preemption while updating stats */
+ if (swp_ci)
+ swap_cluster_unlock(swp_ci);
+ else
+ xas_unlock(&xas);
/*
* If moved to a different zone then also account
diff --git a/mm/shmem.c b/mm/shmem.c
index e03793cc5169..f088115cf209 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1698,13 +1698,13 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
}
/*
- * The delete_from_swap_cache() below could be left for
+ * The swap_cache_del_folio() below could be left for
* shrink_folio_list()'s folio_free_swap() to dispose of;
* but I'm a little nervous about letting this folio out of
* shmem_writeout() in a hybrid half-tmpfs-half-swap state
* e.g. folio_mapping(folio) might give an unexpected answer.
*/
- delete_from_swap_cache(folio);
+ swap_cache_del_folio(folio);
goto redirty;
}
if (nr_pages > 1)
@@ -2082,7 +2082,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
new->swap = entry;
memcg1_swapin(entry, nr_pages);
- shadow = get_shadow_from_swap_cache(entry);
+ shadow = swap_cache_get_shadow(entry);
if (shadow)
workingset_refault(new, shadow);
folio_add_lru(new);
@@ -2120,13 +2120,11 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index,
struct vm_area_struct *vma)
{
+ struct swap_cluster_info *ci;
struct folio *new, *old = *foliop;
swp_entry_t entry = old->swap;
- struct address_space *swap_mapping = swap_address_space(entry);
- pgoff_t swap_index = swap_cache_index(entry);
- XA_STATE(xas, &swap_mapping->i_pages, swap_index);
int nr_pages = folio_nr_pages(old);
- int error = 0, i;
+ int error = 0;
/*
* We have arrived here because our zones are constrained, so don't
@@ -2155,13 +2153,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
new->swap = entry;
folio_set_swapcache(new);
- /* Swap cache still stores N entries instead of a high-order entry */
- xa_lock_irq(&swap_mapping->i_pages);
- for (i = 0; i < nr_pages; i++) {
- WARN_ON_ONCE(xas_store(&xas, new));
- xas_next(&xas);
- }
- xa_unlock_irq(&swap_mapping->i_pages);
+ ci = swap_cluster_lock_by_folio_irq(old);
+ __swap_cache_replace_folio(ci, entry, old, new);
+ swap_cluster_unlock_irq(ci);
folio_add_lru(new);
*foliop = new;
@@ -2198,7 +2192,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
if (!skip_swapcache)
- delete_from_swap_cache(folio);
+ swap_cache_del_folio(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
* won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2438,7 +2432,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio->swap.val = 0;
swapcache_clear(si, swap, nr_pages);
} else {
- delete_from_swap_cache(folio);
+ swap_cache_del_folio(folio);
}
folio_mark_dirty(folio);
swap_free_nr(swap, nr_pages);
diff --git a/mm/swap.h b/mm/swap.h
index 7b3efaa51624..4af42bc2cd72 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -2,6 +2,7 @@
#ifndef _MM_SWAP_H
#define _MM_SWAP_H
+#include <linux/atomic.h> /* for atomic_long_t */
struct mempolicy;
struct swap_iocb;
@@ -35,6 +36,7 @@ struct swap_cluster_info {
u16 count;
u8 flags;
u8 order;
+ atomic_long_t *table; /* Swap table entries, see mm/swap_table.h */
struct list_head list;
};
@@ -80,22 +82,62 @@ static inline struct swap_cluster_info *swp_offset_cluster(
return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}
-static inline struct swap_cluster_info *swap_cluster_lock(
- struct swap_info_struct *si,
- unsigned long offset)
+static inline struct swap_cluster_info *swp_cluster(swp_entry_t entry)
+{
+ return swp_offset_cluster(swp_info(entry), swp_offset(entry));
+}
+
+static inline unsigned int swp_cluster_offset(swp_entry_t entry)
+{
+ return swp_offset(entry) % SWAPFILE_CLUSTER;
+}
+
+/*
+ * Lock the swap cluster of the given offset. The caller must ensure the swap
+ * offset is valid and that the following accesses won't go beyond the locked
+ * cluster. swap_cluster_lock_by_folio is preferred when possible.
+ */
+static __always_inline struct swap_cluster_info *__swap_cluster_lock(
+ struct swap_info_struct *si, unsigned long offset, bool irq)
{
struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
- spin_lock(&ci->lock);
+ if (irq)
+ spin_lock_irq(&ci->lock);
+ else
+ spin_lock(&ci->lock);
return ci;
}
+#define swap_cluster_lock(si, off) __swap_cluster_lock(si, off, false)
+
+/*
+ * Lock the swap cluster that holds a folio's swap entries. Caller needs to lock
+ * the folio and ensure it's in the swap cache, and only touch the folio's swap
+ * entries. A folio's entries are always in one cluster, and holding the folio
+ * lock ensures it won't be freed from the swap cache, hence stabilizing the device.
+ */
+static inline struct swap_cluster_info *__swap_cluster_lock_by_folio(
+ struct folio *folio, bool irq)
+{
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ return __swap_cluster_lock(swp_info(folio->swap),
+ swp_offset(folio->swap), irq);
+}
+#define swap_cluster_lock_by_folio(folio) __swap_cluster_lock_by_folio(folio, false)
+#define swap_cluster_lock_by_folio_irq(folio) __swap_cluster_lock_by_folio(folio, true)
static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
{
spin_unlock(&ci->lock);
}
+static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
+{
+ spin_unlock_irq(&ci->lock);
+}
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -115,10 +157,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
#define SWAP_ADDRESS_SPACE_SHIFT 14
#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
-extern struct address_space *swapper_spaces[];
-#define swap_address_space(entry) \
- (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
- >> SWAP_ADDRESS_SPACE_SHIFT])
+extern struct address_space swap_space __ro_after_init;
+static inline struct address_space *swap_address_space(swp_entry_t entry)
+{
+ return &swap_space;
+}
/*
* Return the swap device position of the swap entry.
@@ -128,15 +171,6 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
}
-/*
- * Return the swap cache index of the swap entry.
- */
-static inline pgoff_t swap_cache_index(swp_entry_t entry)
-{
- BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
- return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
-}
-
/**
* folio_contains_swap - Does this folio contain this swap entry?
* @folio: The folio.
@@ -160,17 +194,31 @@ static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
}
+/*
+ * All swap cache helpers below require the caller to ensure the swap entries
+ * are valid and pin the device. This can be guaranteed by:
+ * - get_swap_device: this ensures a single entry is valid and increases the
+ * swap device's refcount.
+ * - Locking a folio in the swap cache: this ensures the folio won't be freed
+ * from the swap cache, and stabilizes both its entries and the swap device.
+ * - Locking anything referencing the swap entry: e.g. locking the PTL that
+ * protects swap entries in the page table, so they won't be freed.
+ */
+extern struct folio *swap_cache_get_folio(swp_entry_t entry);
+extern void *swap_cache_get_shadow(swp_entry_t entry);
+extern int swap_cache_add_folio(swp_entry_t entry,
+ struct folio *folio, void **shadow);
+extern void swap_cache_del_folio(struct folio *folio);
+/* Below helpers also require the caller to lock the swap cluster. */
+extern void __swap_cache_del_folio(swp_entry_t entry,
+ struct folio *folio, void *shadow);
+extern void __swap_cache_replace_folio(struct swap_cluster_info *ci,
+ swp_entry_t entry, struct folio *old,
+ struct folio *new);
+extern void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
+
void show_swap_cache_info(void);
-void *get_shadow_from_swap_cache(swp_entry_t entry);
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
- gfp_t gfp, void **shadowp);
-void __delete_from_swap_cache(struct folio *folio,
- swp_entry_t entry, void *shadow);
-void delete_from_swap_cache(struct folio *folio);
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end);
void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
-struct folio *swap_cache_get_folio(swp_entry_t entry);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
@@ -235,6 +283,33 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
#else /* CONFIG_SWAP */
struct swap_iocb;
+
+static inline struct swap_cluster_info *swap_cluster_lock(
+ struct swap_info_struct *si, pgoff_t offset, bool irq)
+{
+ return NULL;
+}
+
+static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
+ struct folio *folio)
+{
+ return NULL;
+}
+
+static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
+ struct folio *folio)
+{
+ return NULL;
+}
+
+static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
+{
+}
+
+static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
+{
+}
+
static inline struct swap_info_struct *swp_info(swp_entry_t entry)
{
return NULL;
@@ -252,11 +327,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
return NULL;
}
-static inline pgoff_t swap_cache_index(swp_entry_t entry)
-{
- return 0;
-}
-
static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
{
return false;
@@ -298,28 +368,27 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
return NULL;
}
-static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
+static inline void *swap_cache_get_shadow(swp_entry_t entry)
{
return NULL;
}
-static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
- gfp_t gfp_mask, void **shadowp)
+static inline int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadow)
{
- return -1;
+ return -EINVAL;
}
-static inline void __delete_from_swap_cache(struct folio *folio,
- swp_entry_t entry, void *shadow)
+static inline void swap_cache_del_folio(struct folio *folio)
{
}
-static inline void delete_from_swap_cache(struct folio *folio)
+static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
{
}
-static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end)
+static inline void __swap_cache_replace_folio(
+ struct swap_cluster_info *ci, swp_entry_t entry,
+ struct folio *old, struct folio *new)
{
}
@@ -354,7 +423,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
static inline pgoff_t folio_index(struct folio *folio)
{
if (unlikely(folio_test_swapcache(folio)))
- return swap_cache_index(folio->swap);
+ return swp_offset(folio->swap);
return folio->index;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 721ff1a5e73a..c0342024b4a8 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -23,6 +23,7 @@
#include <linux/huge_mm.h>
#include <linux/shmem_fs.h>
#include "internal.h"
+#include "swap_table.h"
#include "swap.h"
/*
@@ -36,8 +37,11 @@ static const struct address_space_operations swap_aops = {
#endif
};
-struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
-static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
+/* swap_space is set read-only, as the swap cache is handled by the swap table */
+struct address_space swap_space __ro_after_init = {
+ .a_ops = &swap_aops,
+};
+
static bool enable_vma_readahead __read_mostly = true;
#define SWAP_RA_ORDER_CEILING 5
@@ -69,7 +73,7 @@ void show_swap_cache_info(void)
printk("Total swap = %lukB\n", K(total_swap_pages));
}
-/*
+/**
* swap_cache_get_folio - Lookup a swap entry in the swap cache.
*
* A found folio will be returned unlocked and with its refcount increased.
@@ -79,155 +83,179 @@ void show_swap_cache_info(void)
*/
struct folio *swap_cache_get_folio(swp_entry_t entry)
{
- struct folio *folio = filemap_get_folio(swap_address_space(entry),
- swap_cache_index(entry));
- if (!IS_ERR(folio))
- return folio;
+ unsigned long swp_tb;
+ struct folio *folio;
+
+ for (;;) {
+ swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
+ if (!swp_tb_is_folio(swp_tb))
+ return NULL;
+ folio = swp_tb_to_folio(swp_tb);
+ if (folio_try_get(folio))
+ return folio;
+ }
+
return NULL;
}
-void *get_shadow_from_swap_cache(swp_entry_t entry)
+/**
+ * swap_cache_get_shadow - Lookup a shadow in the swap cache.
+ *
+ * Context: Caller must ensure @entry is valid and pin the swap device.
+ */
+void *swap_cache_get_shadow(swp_entry_t entry)
{
- struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swap_cache_index(entry);
- void *shadow;
+ unsigned long swp_tb;
+
+ swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
+ if (swp_tb_is_shadow(swp_tb))
+ return swp_tb_to_shadow(swp_tb);
- shadow = xa_load(&address_space->i_pages, idx);
- if (xa_is_value(shadow))
- return shadow;
return NULL;
}
-/*
- * add_to_swap_cache resembles filemap_add_folio on swapper_space,
- * but sets SwapCache flag and 'swap' instead of mapping and index.
+/**
+ * swap_cache_add_folio - add a folio into the swap cache.
+ *
+ * The folio will be used for swapin or swapout of swap entries
+ * starting with @entry. May fail due to race.
+ *
+ * Context: Caller must ensure @entry is valid and pin the swap device.
*/
-int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
- gfp_t gfp, void **shadowp)
+int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadowp)
{
- struct address_space *address_space = swap_address_space(entry);
- pgoff_t idx = swap_cache_index(entry);
- XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
- unsigned long i, nr = folio_nr_pages(folio);
- void *old;
-
- xas_set_update(&xas, workingset_update_node);
-
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
- VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
+ unsigned long exist;
+ void *shadow = NULL;
+ struct swap_cluster_info *ci;
+ unsigned int ci_start, ci_off, ci_end;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+
+ ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
+ ci_start = swp_cluster_offset(entry);
+ ci_end = ci_start + nr_pages;
+ ci_off = ci_start;
+ do {
+ exist = __swap_table_get(ci, ci_off);
+ if (unlikely(swp_tb_is_folio(exist)))
+ goto fail;
+ if (swp_tb_is_shadow(exist))
+ shadow = swp_tb_to_shadow(exist);
+ } while (++ci_off < ci_end);
+
+ ci_off = ci_start;
+ do {
+ __swap_table_set_folio(ci, ci_off, folio);
+ } while (++ci_off < ci_end);
- folio_ref_add(folio, nr);
+ folio_ref_add(folio, nr_pages);
folio_set_swapcache(folio);
folio->swap = entry;
+ swap_cluster_unlock(ci);
- do {
- xas_lock_irq(&xas);
- xas_create_range(&xas);
- if (xas_error(&xas))
- goto unlock;
- for (i = 0; i < nr; i++) {
- VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
- if (shadowp) {
- old = xas_load(&xas);
- if (xa_is_value(old))
- *shadowp = old;
- }
- xas_store(&xas, folio);
- xas_next(&xas);
- }
- address_space->nrpages += nr;
- __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
- __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
-unlock:
- xas_unlock_irq(&xas);
- } while (xas_nomem(&xas, gfp));
-
- if (!xas_error(&xas))
- return 0;
+ node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
- folio_clear_swapcache(folio);
- folio_ref_sub(folio, nr);
- return xas_error(&xas);
+ if (shadowp)
+ *shadowp = shadow;
+ return 0;
+fail:
+ swap_cluster_unlock(ci);
+ return -EEXIST;
}
/*
- * This must be called only on folios that have
- * been verified to be in the swap cache.
+ * Caller must ensure the folio is in the swap cache and locked,
+ * and must also hold the swap cluster lock.
*/
-void __delete_from_swap_cache(struct folio *folio,
- swp_entry_t entry, void *shadow)
+void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio,
+ void *shadow)
{
- struct address_space *address_space = swap_address_space(entry);
- int i;
- long nr = folio_nr_pages(folio);
- pgoff_t idx = swap_cache_index(entry);
- XA_STATE(xas, &address_space->i_pages, idx);
-
- xas_set_update(&xas, workingset_update_node);
-
- VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
- VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
- VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
-
- for (i = 0; i < nr; i++) {
- void *entry = xas_store(&xas, shadow);
- VM_BUG_ON_PAGE(entry != folio, entry);
- xas_next(&xas);
- }
+ unsigned long exist;
+ struct swap_cluster_info *ci;
+ unsigned int ci_start, ci_off, ci_end;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+
+ ci = swp_offset_cluster(swp_info(entry), swp_offset(entry));
+ ci_start = swp_cluster_offset(entry);
+ ci_end = ci_start + nr_pages;
+ ci_off = ci_start;
+ do {
+ exist = __swap_table_get(ci, ci_off);
+ VM_WARN_ON_ONCE(swp_tb_to_folio(exist) != folio);
+ /* If shadow is NULL, we set an empty shadow */
+ __swap_table_set_shadow(ci, ci_off, shadow);
+ } while (++ci_off < ci_end);
+
folio->swap.val = 0;
folio_clear_swapcache(folio);
- address_space->nrpages -= nr;
- __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
- __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
+ node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
}
/*
- * This must be called only on folios that have
- * been verified to be in the swap cache and locked.
- * It will never put the folio into the free list,
- * the caller has a reference on the folio.
+ * Replace an old folio in the swap cache with a new one. The caller must
+ * hold the cluster lock and set the new folio's entry and flags.
*/
-void delete_from_swap_cache(struct folio *folio)
+void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
+ struct folio *old, struct folio *new)
+{
+ unsigned int ci_off = swp_cluster_offset(entry);
+ unsigned long nr_pages = folio_nr_pages(new);
+ unsigned int ci_end = ci_off + nr_pages;
+
+ VM_WARN_ON_ONCE(entry.val != new->swap.val);
+ VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
+ VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
+ do {
+ WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
+ __swap_table_set_folio(ci, ci_off, new);
+ } while (++ci_off < ci_end);
+
+ /*
+ * If the old folio is partially replaced (e.g., splitting a large
+ * folio, the old folio is shrunk in place, and new split sub folios
+ * are added to cache), ensure the new folio doesn't overlap it.
+ */
+ if (IS_ENABLED(CONFIG_DEBUG_VM) &&
+ folio_order(old) != folio_order(new)) {
+ ci_off = swp_cluster_offset(old->swap);
+ ci_end = ci_off + folio_nr_pages(old);
+ while (ci_off < ci_end)
+ WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off++)) != old);
+ }
+}
+
+void swap_cache_del_folio(struct folio *folio)
{
+ struct swap_cluster_info *ci;
swp_entry_t entry = folio->swap;
- struct address_space *address_space = swap_address_space(entry);
- xa_lock_irq(&address_space->i_pages);
- __delete_from_swap_cache(folio, entry, NULL);
- xa_unlock_irq(&address_space->i_pages);
+ ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
+ __swap_cache_del_folio(entry, folio, NULL);
+ swap_cluster_unlock(ci);
put_swap_folio(folio, entry);
folio_ref_sub(folio, folio_nr_pages(folio));
}
-void clear_shadow_from_swap_cache(int type, unsigned long begin,
- unsigned long end)
+void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
{
- unsigned long curr = begin;
- void *old;
-
- for (;;) {
- swp_entry_t entry = swp_entry(type, curr);
- unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
- struct address_space *address_space = swap_address_space(entry);
- XA_STATE(xas, &address_space->i_pages, index);
-
- xas_set_update(&xas, workingset_update_node);
-
- xa_lock_irq(&address_space->i_pages);
- xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
- if (!xa_is_value(old))
- continue;
- xas_store(&xas, NULL);
- }
- xa_unlock_irq(&address_space->i_pages);
+ struct swap_cluster_info *ci = swp_cluster(entry);
+ unsigned int ci_off = swp_cluster_offset(entry), ci_end;
- /* search the next swapcache until we meet end */
- curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
- if (curr > end)
- break;
- }
+ ci_end = ci_off + nr_ents;
+ do {
+ WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
+ __swap_table_init_null(ci, ci_off);
+ } while (++ci_off < ci_end);
}
/*
@@ -292,8 +320,7 @@ static inline bool swap_use_vma_readahead(void)
/*
* Update the readahead statistics of a vma or globally.
*/
-void swap_update_readahead(struct folio *folio,
- struct vm_area_struct *vma,
+void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr)
{
bool readahead, vma_ra = swap_use_vma_readahead();
@@ -387,7 +414,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
goto put_and_return;
/*
- * We might race against __delete_from_swap_cache(), and
+ * We might race against __swap_cache_del_folio(), and
* stumble across a swap_map entry whose SWAP_HAS_CACHE
* has not yet been cleared. Or race against another
* __read_swap_cache_async(), which has set SWAP_HAS_CACHE
@@ -405,8 +432,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
goto fail_unlock;
- /* May fail (-ENOMEM) if XArray node allocation failed. */
- if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
+ if (swap_cache_add_folio(entry, new_folio, &shadow))
goto fail_unlock;
memcg1_swapin(entry, 1);
@@ -572,11 +598,11 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
end_offset = si->max - 1;
blk_start_plug(&plug);
- for (offset = start_offset; offset <= end_offset ; offset++) {
+ for (offset = start_offset; offset <= end_offset; offset++) {
/* Ok, do the async read-ahead now */
folio = __read_swap_cache_async(
- swp_entry(swp_type(entry), offset),
- gfp_mask, mpol, ilx, &page_allocated, false);
+ swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
+ &page_allocated, false);
if (!folio)
continue;
if (page_allocated) {
@@ -600,41 +626,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
return folio;
}
-int init_swap_address_space(unsigned int type, unsigned long nr_pages)
-{
- struct address_space *spaces, *space;
- unsigned int i, nr;
-
- nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
- spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
- if (!spaces)
- return -ENOMEM;
- for (i = 0; i < nr; i++) {
- space = spaces + i;
- xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
- atomic_set(&space->i_mmap_writable, 0);
- space->a_ops = &swap_aops;
- /* swap cache doesn't use writeback related tags */
- mapping_set_no_writeback_tags(space);
- }
- nr_swapper_spaces[type] = nr;
- swapper_spaces[type] = spaces;
-
- return 0;
-}
-
-void exit_swap_address_space(unsigned int type)
-{
- int i;
- struct address_space *spaces = swapper_spaces[type];
-
- for (i = 0; i < nr_swapper_spaces[type]; i++)
- VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
- kvfree(spaces);
- nr_swapper_spaces[type] = 0;
- swapper_spaces[type] = NULL;
-}
-
static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
unsigned long *end)
{
@@ -807,7 +798,7 @@ static const struct attribute_group swap_attr_group = {
.attrs = swap_attrs,
};
-static int __init swap_init_sysfs(void)
+static int __init swap_init(void)
{
int err;
struct kobject *swap_kobj;
@@ -822,11 +813,13 @@ static int __init swap_init_sysfs(void)
pr_err("failed to register swap group\n");
goto delete_obj;
}
+ /* swap_space is set RO after init, so do it here before init ends. */
+ mapping_set_no_writeback_tags(&swap_space);
return 0;
delete_obj:
kobject_put(swap_kobj);
return err;
}
-subsys_initcall(swap_init_sysfs);
+subsys_initcall(swap_init);
#endif
diff --git a/mm/swap_table.h b/mm/swap_table.h
new file mode 100644
index 000000000000..ed9676547071
--- /dev/null
+++ b/mm/swap_table.h
@@ -0,0 +1,106 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _MM_SWAP_TABLE_H
+#define _MM_SWAP_TABLE_H
+
+#include "swap.h"
+
+/*
+ * A swap table entry represents the status of a swap slot on a swap
+ * (physical or virtual) device. The swap table in each cluster is a
+ * 1:1 map of the swap slots in this cluster.
+ *
+ * Each swap table entry can be a pointer (folio), an XA_VALUE
+ * (shadow), or NULL.
+ */
+
+/*
+ * Helpers for casting one type of info into a swap table entry.
+ */
+static inline unsigned long null_to_swp_tb(void)
+{
+ BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
+ return 0;
+}
+
+static inline unsigned long folio_to_swp_tb(struct folio *folio)
+{
+ BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
+ return (unsigned long)folio;
+}
+
+static inline unsigned long shadow_swp_to_tb(void *shadow)
+{
+ BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
+ BITS_PER_BYTE * sizeof(unsigned long));
+ VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
+ return (unsigned long)shadow;
+}
+
+/*
+ * Helpers for swap table entry type checking.
+ */
+static inline bool swp_tb_is_null(unsigned long swp_tb)
+{
+ return !swp_tb;
+}
+
+static inline bool swp_tb_is_folio(unsigned long swp_tb)
+{
+ return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
+}
+
+static inline bool swp_tb_is_shadow(unsigned long swp_tb)
+{
+ return xa_is_value((void *)swp_tb);
+}
+
+/*
+ * Helpers for retrieving info from swap table.
+ */
+static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
+{
+ VM_WARN_ON(!swp_tb_is_folio(swp_tb));
+ return (void *)swp_tb;
+}
+
+static inline void *swp_tb_to_shadow(unsigned long swp_tb)
+{
+ VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
+ return (void *)swp_tb;
+}
+
+/*
+ * Helpers for accessing or modifying the swap table of a cluster,
+ * the swap cluster must be locked.
+ */
+static inline void __swap_table_set(struct swap_cluster_info *ci,
+ unsigned int off, unsigned long swp_tb)
+{
+ VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+ atomic_long_set(&ci->table[off], swp_tb);
+}
+
+static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
+ unsigned int off)
+{
+ VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+ return atomic_long_read(&ci->table[off]);
+}
+
+static inline void __swap_table_set_folio(struct swap_cluster_info *ci,
+ unsigned int off, struct folio *folio)
+{
+ __swap_table_set(ci, off, folio_to_swp_tb(folio));
+}
+
+static inline void __swap_table_set_shadow(struct swap_cluster_info *ci,
+ unsigned int off, void *shadow)
+{
+ __swap_table_set(ci, off, shadow_swp_to_tb(shadow));
+}
+
+static inline void __swap_table_init_null(struct swap_cluster_info *ci, unsigned int off)
+{
+ __swap_table_set(ci, off, null_to_swp_tb());
+}
+#endif
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 85606fbebf0f..df68b5e242a6 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -46,6 +46,7 @@
#include <asm/tlbflush.h>
#include <linux/swapops.h>
#include <linux/swap_cgroup.h>
+#include "swap_table.h"
#include "internal.h"
#include "swap.h"
@@ -268,7 +269,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
if (!need_reclaim)
goto out_unlock;
- delete_from_swap_cache(folio);
+ swap_cache_del_folio(folio);
folio_set_dirty(folio);
ret = nr_pages;
out_unlock:
@@ -422,6 +423,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
return cluster_index(si, ci) * SWAPFILE_CLUSTER;
}
+static int swap_table_alloc_table(struct swap_cluster_info *ci)
+{
+ WARN_ON(ci->table);
+ ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
+ if (!ci->table)
+ return -ENOMEM;
+ return 0;
+}
+
+static void swap_cluster_free_table(struct swap_cluster_info *ci)
+{
+ unsigned int ci_off;
+ unsigned long swp_tb;
+
+ if (!ci->table)
+ return;
+
+ for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
+ swp_tb = __swap_table_get(ci, ci_off);
+ if (!swp_tb_is_null(swp_tb))
+ pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
+ swp_tb);
+ }
+
+ kfree(ci->table);
+ ci->table = NULL;
+}
+
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
enum swap_cluster_flags new_flags)
@@ -704,6 +733,25 @@ static bool cluster_scan_range(struct swap_info_struct *si,
return true;
}
+/*
+ * Currently, the swap table is not used for count tracking;
+ * just do a sanity check to ensure nothing went wrong.
+ */
+static void cluster_table_check(struct swap_cluster_info *ci,
+ unsigned int start, unsigned int nr)
+{
+ unsigned int ci_off = start % SWAPFILE_CLUSTER;
+ unsigned int ci_end = ci_off + nr;
+ unsigned long swp_tb;
+
+ if (IS_ENABLED(CONFIG_DEBUG_VM)) {
+ do {
+ swp_tb = __swap_table_get(ci, ci_off);
+ VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
+ } while (++ci_off < ci_end);
+ }
+}
+
static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
unsigned int start, unsigned char usage,
unsigned int order)
@@ -723,6 +771,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
ci->order = order;
memset(si->swap_map + start, usage, nr_pages);
+ cluster_table_check(ci, start, nr_pages);
swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
@@ -1100,8 +1149,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries)
{
- unsigned long begin = offset;
- unsigned long end = offset + nr_entries - 1;
+ unsigned long start = offset, end = offset + nr_entries - 1;
void (*swap_slot_free_notify)(struct block_device *, unsigned long);
unsigned int i;
@@ -1125,7 +1173,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
swap_slot_free_notify(si->bdev, offset);
offset++;
}
- clear_shadow_from_swap_cache(si->type, begin, end);
+ __swap_cache_clear_shadow(swp_entry(si->type, start), nr_entries);
/*
* Make sure that try_to_unuse() observes si->inuse_pages reaching 0
@@ -1282,15 +1330,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
if (!entry.val)
return -ENOMEM;
- /*
- * XArray node allocations from PF_MEMALLOC contexts could
- * completely exhaust the page allocator. __GFP_NOMEMALLOC
- * stops emergency reserves from being allocated.
- *
- * TODO: this could cause a theoretical memory reclaim
- * deadlock in the swap out path.
- */
- if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
+ if (swap_cache_add_folio(entry, folio, NULL))
goto out_free;
return 0;
@@ -1557,6 +1597,7 @@ static void swap_entries_free(struct swap_info_struct *si,
mem_cgroup_uncharge_swap(entry, nr_pages);
swap_range_free(si, offset, nr_pages);
+ cluster_table_check(ci, offset, nr_pages);
if (!ci->count)
free_cluster(si, ci);
@@ -1760,7 +1801,7 @@ bool folio_free_swap(struct folio *folio)
if (folio_swapped(folio))
return false;
- delete_from_swap_cache(folio);
+ swap_cache_del_folio(folio);
folio_set_dirty(folio);
return true;
}
@@ -2634,6 +2675,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
}
}
+static void free_cluster_info(struct swap_cluster_info *cluster_info,
+ unsigned long maxpages)
+{
+ int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
+
+ if (!cluster_info)
+ return;
+ for (i = 0; i < nr_clusters; i++)
+ swap_cluster_free_table(&cluster_info[i]);
+ kvfree(cluster_info);
+}
+
/*
* Called after swap device's reference count is dead, so
* neither scan nor allocation will use it.
@@ -2768,12 +2821,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
swap_file = p->swap_file;
p->swap_file = NULL;
- p->max = 0;
swap_map = p->swap_map;
p->swap_map = NULL;
zeromap = p->zeromap;
p->zeromap = NULL;
cluster_info = p->cluster_info;
+ free_cluster_info(cluster_info, p->max);
+ p->max = 0;
p->cluster_info = NULL;
spin_unlock(&p->lock);
spin_unlock(&swap_lock);
@@ -2784,10 +2838,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->global_cluster = NULL;
vfree(swap_map);
kvfree(zeromap);
- kvfree(cluster_info);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
- exit_swap_address_space(p->type);
inode = mapping->host;
@@ -3171,8 +3223,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
if (!cluster_info)
goto err;
- for (i = 0; i < nr_clusters; i++)
+ for (i = 0; i < nr_clusters; i++) {
spin_lock_init(&cluster_info[i].lock);
+ if (swap_table_alloc_table(&cluster_info[i]))
+ goto err_free;
+ }
if (!(si->flags & SWP_SOLIDSTATE)) {
si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3233,9 +3288,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
}
return cluster_info;
-
err_free:
- kvfree(cluster_info);
+ free_cluster_info(cluster_info, maxpages);
err:
return ERR_PTR(err);
}
@@ -3429,13 +3483,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
}
}
- error = init_swap_address_space(si->type, maxpages);
- if (error)
- goto bad_swap_unlock_inode;
-
error = zswap_swapon(si->type, maxpages);
if (error)
- goto free_swap_address_space;
+ goto bad_swap_unlock_inode;
/*
* Flush any pending IO and dirty mappings before we start using this
@@ -3470,8 +3520,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
goto out;
free_swap_zswap:
zswap_swapoff(si->type);
-free_swap_address_space:
- exit_swap_address_space(si->type);
bad_swap_unlock_inode:
inode_unlock(inode);
bad_swap:
@@ -3486,7 +3534,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
spin_unlock(&swap_lock);
vfree(swap_map);
kvfree(zeromap);
- kvfree(cluster_info);
+ if (cluster_info)
+ free_cluster_info(cluster_info, maxpages);
if (inced_nr_rotate_swap)
atomic_dec(&nr_rotate_swap);
if (swap_file)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b0afd7f41a22..1ed3cf9dac4e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
{
int refcount;
void *shadow = NULL;
+ struct swap_cluster_info *ci;
BUG_ON(!folio_test_locked(folio));
BUG_ON(mapping != folio_mapping(folio));
- if (!folio_test_swapcache(folio))
+ if (folio_test_swapcache(folio)) {
+ ci = swap_cluster_lock_by_folio_irq(folio);
+ } else {
spin_lock(&mapping->host->i_lock);
- xa_lock_irq(&mapping->i_pages);
+ xa_lock_irq(&mapping->i_pages);
+ }
+
/*
* The non racy check for a busy folio.
*
@@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
if (reclaimed && !mapping_exiting(mapping))
shadow = workingset_eviction(folio, target_memcg);
- __delete_from_swap_cache(folio, swap, shadow);
+ __swap_cache_del_folio(swap, folio, shadow);
memcg1_swapout(folio, swap);
- xa_unlock_irq(&mapping->i_pages);
+ swap_cluster_unlock_irq(ci);
put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
@@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
return 1;
cannot_free:
- xa_unlock_irq(&mapping->i_pages);
- if (!folio_test_swapcache(folio))
+ if (folio_test_swapcache(folio)) {
+ swap_cluster_unlock_irq(ci);
+ } else {
+ xa_unlock_irq(&mapping->i_pages);
spin_unlock(&mapping->host->i_lock);
+ }
return 0;
}
diff --git a/mm/zswap.c b/mm/zswap.c
index ee443b317ac7..c869859eec77 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1166,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
out:
if (ret && ret != -EEXIST) {
- delete_from_swap_cache(folio);
+ swap_cache_del_folio(folio);
folio_unlock(folio);
}
folio_put(folio);
--
2.51.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH 7/9] mm, swap: remove contention workaround for swap cache
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (5 preceding siblings ...)
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-30 4:07 ` Chris Li
2025-09-02 10:06 ` Barry Song
2025-08-22 19:20 ` [PATCH 8/9] mm, swap: implement dynamic allocation of swap table Kairui Song
` (3 subsequent siblings)
10 siblings, 2 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song, kernel test robot
From: Kairui Song <kasong@tencent.com>
Swap cluster setup currently shuffles the clusters on initialization.
This was helpful for avoiding contention on the swap cache space: the
cluster size (2M) was much smaller than each swap cache space (64M), so
shuffling the clusters made the allocator hand out swap slots that are
in different swap cache spaces for each CPU, reducing the chance of two
CPUs using the same swap cache space, and hence reducing the contention.
Now that the swap cache is managed by swap clusters, this shuffle is
pointless. Just remove it, and clean up the related macros.
This should also improve HDD swap performance, since shuffling the IO
is a bad idea for HDDs, and now the shuffling is gone.
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 4 ----
mm/swapfile.c | 32 ++++++++------------------------
mm/zswap.c | 7 +++++--
3 files changed, 13 insertions(+), 30 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 4af42bc2cd72..ce3ec62cc05e 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -153,10 +153,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
/* linux/mm/swap_state.c */
-/* One swap address space for each 64M swap space */
-#define SWAP_ADDRESS_SPACE_SHIFT 14
-#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
-#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
extern struct address_space swap_space __ro_after_init;
static inline struct address_space *swap_address_space(swp_entry_t entry)
{
diff --git a/mm/swapfile.c b/mm/swapfile.c
index df68b5e242a6..0c8001c99f30 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3203,21 +3203,14 @@ static int setup_swap_map(struct swap_info_struct *si,
return 0;
}
-#define SWAP_CLUSTER_INFO_COLS \
- DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
-#define SWAP_CLUSTER_SPACE_COLS \
- DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
-#define SWAP_CLUSTER_COLS \
- max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
-
static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
union swap_header *swap_header,
unsigned long maxpages)
{
unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
struct swap_cluster_info *cluster_info;
- unsigned long i, j, idx;
int err = -ENOMEM;
+ unsigned long i;
cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
if (!cluster_info)
@@ -3266,22 +3259,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
INIT_LIST_HEAD(&si->frag_clusters[i]);
}
- /*
- * Reduce false cache line sharing between cluster_info and
- * sharing same address space.
- */
- for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
- for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
- struct swap_cluster_info *ci;
- idx = i * SWAP_CLUSTER_COLS + j;
- ci = cluster_info + idx;
- if (idx >= nr_clusters)
- continue;
- if (ci->count) {
- ci->flags = CLUSTER_FLAG_NONFULL;
- list_add_tail(&ci->list, &si->nonfull_clusters[0]);
- continue;
- }
+ for (i = 0; i < nr_clusters; i++) {
+ struct swap_cluster_info *ci = &cluster_info[i];
+
+ if (ci->count) {
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ list_add_tail(&ci->list, &si->nonfull_clusters[0]);
+ } else {
ci->flags = CLUSTER_FLAG_FREE;
list_add_tail(&ci->list, &si->free_clusters);
}
diff --git a/mm/zswap.c b/mm/zswap.c
index c869859eec77..c0a9be14a725 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -237,10 +237,13 @@ static bool zswap_has_pool;
* helpers and fwd declarations
**********************************/
+/* One swap address space for each 64M swap space */
+#define ZSWAP_ADDRESS_SPACE_SHIFT 14
+#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
{
return &zswap_trees[swp_type(swp)][swp_offset(swp)
- >> SWAP_ADDRESS_SPACE_SHIFT];
+ >> ZSWAP_ADDRESS_SPACE_SHIFT];
}
#define zswap_pool_debug(msg, p) \
@@ -1771,7 +1774,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
struct xarray *trees, *tree;
unsigned int nr, i;
- nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
+ nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
if (!trees) {
pr_err("alloc failed, zswap disabled for swap type %d\n", type);
--
2.51.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (6 preceding siblings ...)
2025-08-22 19:20 ` [PATCH 7/9] mm, swap: remove contention workaround for swap cache Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-30 4:17 ` Chris Li
2025-09-02 11:15 ` Barry Song
2025-08-22 19:20 ` [PATCH 9/9] mm, swap: use a single page for swap table when the size fits Kairui Song
` (2 subsequent siblings)
10 siblings, 2 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the swap table is cluster based, a free cluster can free its
table, since no one should be modifying it.
There could still be speculative readers, such as swap cache lookup;
protect them by making the table RCU safe. Every swap table must be
filled with null entries before it is freed, so such readers will either
see a NULL pointer or a null-filled table being lazily freed.
On allocation, allocate the table when a cluster is used by any order.
This way, the memory usage of large swap devices can be reduced
significantly.
The idea of dynamically releasing unused swap cluster data was initially
suggested by Chris Li while proposing the cluster swap allocator, and it
suits the swap table idea very well.
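For reference, a speculative reader then looks roughly like the sketch
below, mirroring the swap_table_get() helper added in mm/swap_table.h
(illustrative only, the function name is made up):
    /* Minimal sketch: RCU-safe read of one entry from a cluster's table. */
    static unsigned long speculative_get_example(struct swap_cluster_info *ci,
                                                 unsigned int off)
    {
            atomic_long_t *table;
            unsigned long swp_tb = 0;       /* treat a missing table as a null entry */

            rcu_read_lock();
            table = rcu_dereference(ci->table);
            if (table)
                    swp_tb = atomic_long_read(&table[off]);
            rcu_read_unlock();
            return swp_tb;
    }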
Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 2 +-
mm/swap_state.c | 9 ++-
mm/swap_table.h | 32 +++++++-
mm/swapfile.c | 202 ++++++++++++++++++++++++++++++++++++++----------
4 files changed, 197 insertions(+), 48 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index ce3ec62cc05e..ee33733027f4 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -36,7 +36,7 @@ struct swap_cluster_info {
u16 count;
u8 flags;
u8 order;
- atomic_long_t *table; /* Swap table entries, see mm/swap_table.h */
+ atomic_long_t __rcu *table; /* Swap table entries, see mm/swap_table.h */
struct list_head list;
};
diff --git a/mm/swap_state.c b/mm/swap_state.c
index c0342024b4a8..a0120d822fbe 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -87,7 +87,8 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
struct folio *folio;
for (;;) {
- swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
+ swp_tb = swap_table_get(swp_cluster(entry),
+ swp_cluster_offset(entry));
if (!swp_tb_is_folio(swp_tb))
return NULL;
folio = swp_tb_to_folio(swp_tb);
@@ -107,10 +108,9 @@ void *swap_cache_get_shadow(swp_entry_t entry)
{
unsigned long swp_tb;
- swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
+ swp_tb = swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
if (swp_tb_is_shadow(swp_tb))
return swp_tb_to_shadow(swp_tb);
-
return NULL;
}
@@ -135,6 +135,9 @@ int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadowp)
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
+ if (unlikely(!ci->table))
+ goto fail;
+
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
ci_off = ci_start;
diff --git a/mm/swap_table.h b/mm/swap_table.h
index ed9676547071..4e97513b11ef 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -2,8 +2,15 @@
#ifndef _MM_SWAP_TABLE_H
#define _MM_SWAP_TABLE_H
+#include <linux/rcupdate.h>
+#include <linux/atomic.h>
#include "swap.h"
+/* A typical flat array in each cluster as swap table */
+struct swap_table {
+ atomic_long_t entries[SWAPFILE_CLUSTER];
+};
+
/*
* A swap table entry represents the status of a swap slot on a swap
* (physical or virtual) device. The swap table in each cluster is a
@@ -76,15 +83,36 @@ static inline void *swp_tb_to_shadow(unsigned long swp_tb)
static inline void __swap_table_set(struct swap_cluster_info *ci,
unsigned int off, unsigned long swp_tb)
{
+ atomic_long_t *table = rcu_dereference_protected(ci->table, true);
+
+ lockdep_assert_held(&ci->lock);
VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
- atomic_long_set(&ci->table[off], swp_tb);
+ atomic_long_set(&table[off], swp_tb);
}
static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
unsigned int off)
{
+ atomic_long_t *table;
+
VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
- return atomic_long_read(&ci->table[off]);
+ table = rcu_dereference_check(ci->table, lockdep_is_held(&ci->lock));
+
+ return atomic_long_read(&table[off]);
+}
+
+static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
+ unsigned int off)
+{
+ atomic_long_t *table;
+ unsigned long swp_tb;
+
+ rcu_read_lock();
+ table = rcu_dereference(ci->table);
+ swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
+ rcu_read_unlock();
+
+ return swp_tb;
}
static inline void __swap_table_set_folio(struct swap_cluster_info *ci,
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0c8001c99f30..00651e947eb2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -105,6 +105,8 @@ static DEFINE_SPINLOCK(swap_avail_lock);
struct swap_info_struct *swap_info[MAX_SWAPFILES];
+static struct kmem_cache *swap_table_cachep;
+
static DEFINE_MUTEX(swapon_mutex);
static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
@@ -402,10 +404,17 @@ static inline bool cluster_is_discard(struct swap_cluster_info *info)
return info->flags == CLUSTER_FLAG_DISCARD;
}
+static inline bool cluster_table_is_alloced(struct swap_cluster_info *ci)
+{
+ return rcu_dereference_protected(ci->table, lockdep_is_held(&ci->lock));
+}
+
static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
{
if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
return false;
+ if (!cluster_table_is_alloced(ci))
+ return false;
if (!order)
return true;
return cluster_is_empty(ci) || order == ci->order;
@@ -423,32 +432,98 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
return cluster_index(si, ci) * SWAPFILE_CLUSTER;
}
-static int swap_table_alloc_table(struct swap_cluster_info *ci)
+static void swap_cluster_free_table(struct swap_cluster_info *ci)
{
- WARN_ON(ci->table);
- ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
- if (!ci->table)
- return -ENOMEM;
- return 0;
+ unsigned int ci_off;
+ struct swap_table *table;
+
+ /* Only an empty cluster's table is allowed to be freed */
+ lockdep_assert_held(&ci->lock);
+ VM_WARN_ON_ONCE(!cluster_is_empty(ci));
+ for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
+ VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
+ table = (void *)rcu_dereference_protected(ci->table, true);
+ rcu_assign_pointer(ci->table, NULL);
+
+ kmem_cache_free(swap_table_cachep, table);
}
-static void swap_cluster_free_table(struct swap_cluster_info *ci)
+/*
+ * Allocate a swap table may need to sleep, which leads to migration,
+ * so attempt an atomic allocation first then fallback and handle
+ * potential race.
+ */
+static struct swap_cluster_info *
+swap_cluster_alloc_table(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ int order)
{
- unsigned int ci_off;
- unsigned long swp_tb;
+ struct swap_cluster_info *pcp_ci;
+ struct swap_table *table;
+ unsigned long offset;
- if (!ci->table)
- return;
+ /*
+ * Only cluster isolation from the allocator does table allocation.
+ * Swap allocator uses a percpu cluster and holds the local lock.
+ */
+ lockdep_assert_held(&ci->lock);
+ lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
+
+ table = kmem_cache_zalloc(swap_table_cachep,
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ if (table) {
+ rcu_assign_pointer(ci->table, table);
+ return ci;
+ }
+
+ /*
+ * Try a sleeping allocation. Each isolated free cluster may cause
+ * such an allocation, but there is only a limited number of them, so
+ * the potential recursive allocation is also limited.
+ */
+ spin_unlock(&ci->lock);
+ if (!(si->flags & SWP_SOLIDSTATE))
+ spin_unlock(&si->global_cluster_lock);
+ local_unlock(&percpu_swap_cluster.lock);
+ table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
- for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
- swp_tb = __swap_table_get(ci, ci_off);
- if (!swp_tb_is_null(swp_tb))
- pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
- swp_tb);
+ local_lock(&percpu_swap_cluster.lock);
+ if (!(si->flags & SWP_SOLIDSTATE))
+ spin_lock(&si->global_cluster_lock);
+ /*
+ * Back to atomic context. First, check if we migrated to a new
+ * CPU with a usable percpu cluster. If so, try using that instead.
+ * No need to check this for spinning devices, as allocation on
+ * them is serialized by the global cluster lock.
+ *
+ * The is_usable check is a bit rough, but ensures order 0 success.
+ */
+ offset = this_cpu_read(percpu_swap_cluster.offset[order]);
+ if ((si->flags & SWP_SOLIDSTATE) && offset) {
+ pcp_ci = swap_cluster_lock(si, offset);
+ if (cluster_is_usable(pcp_ci, order) &&
+ pcp_ci->count < SWAPFILE_CLUSTER) {
+ ci = pcp_ci;
+ goto free_table;
+ }
+ swap_cluster_unlock(pcp_ci);
}
- kfree(ci->table);
- ci->table = NULL;
+ if (!table)
+ return NULL;
+
+ spin_lock(&ci->lock);
+ /* Nothing should have touched the dangling empty cluster. */
+ if (WARN_ON_ONCE(cluster_table_is_alloced(ci)))
+ goto free_table;
+
+ rcu_assign_pointer(ci->table, table);
+ return ci;
+
+free_table:
+ if (table)
+ kmem_cache_free(swap_table_cachep, table);
+ return ci;
}
static void move_cluster(struct swap_info_struct *si,
@@ -480,7 +555,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- lockdep_assert_held(&ci->lock);
+ swap_cluster_free_table(ci);
move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
ci->order = 0;
}
@@ -495,15 +570,11 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
* this returns NULL for a non-empty list.
*/
static struct swap_cluster_info *isolate_lock_cluster(
- struct swap_info_struct *si, struct list_head *list)
+ struct swap_info_struct *si, struct list_head *list, int order)
{
- struct swap_cluster_info *ci, *ret = NULL;
+ struct swap_cluster_info *ci, *found = NULL;
spin_lock(&si->lock);
-
- if (unlikely(!(si->flags & SWP_WRITEOK)))
- goto out;
-
list_for_each_entry(ci, list, list) {
if (!spin_trylock(&ci->lock))
continue;
@@ -515,13 +586,19 @@ static struct swap_cluster_info *isolate_lock_cluster(
list_del(&ci->list);
ci->flags = CLUSTER_FLAG_NONE;
- ret = ci;
+ found = ci;
break;
}
-out:
spin_unlock(&si->lock);
- return ret;
+ if (found && !cluster_table_is_alloced(found)) {
+ /* Only an empty free cluster's swap table can be freed. */
+ VM_WARN_ON_ONCE(list != &si->free_clusters);
+ VM_WARN_ON_ONCE(!cluster_is_empty(found));
+ return swap_cluster_alloc_table(si, found, order);
+ }
+
+ return found;
}
/*
@@ -654,17 +731,27 @@ static void relocate_cluster(struct swap_info_struct *si,
* added to free cluster list and its usage counter will be increased by 1.
* Only used for initialization.
*/
-static void inc_cluster_info_page(struct swap_info_struct *si,
+static int inc_cluster_info_page(struct swap_info_struct *si,
struct swap_cluster_info *cluster_info, unsigned long page_nr)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+ struct swap_table *table;
struct swap_cluster_info *ci;
ci = cluster_info + idx;
+ if (!ci->table) {
+ table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
+ if (!table)
+ return -ENOMEM;
+ rcu_assign_pointer(ci->table, table);
+ }
+
ci->count++;
VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
VM_BUG_ON(ci->flags);
+
+ return 0;
}
static bool cluster_reclaim_range(struct swap_info_struct *si,
@@ -845,7 +932,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
unsigned int found = SWAP_ENTRY_INVALID;
do {
- struct swap_cluster_info *ci = isolate_lock_cluster(si, list);
+ struct swap_cluster_info *ci = isolate_lock_cluster(si, list, order);
unsigned long offset;
if (!ci)
@@ -870,7 +957,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
if (force)
to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
- while ((ci = isolate_lock_cluster(si, &si->full_clusters))) {
+ while ((ci = isolate_lock_cluster(si, &si->full_clusters, 0))) {
offset = cluster_offset(si, ci);
end = min(si->max, offset + SWAPFILE_CLUSTER);
to_scan--;
@@ -1018,6 +1105,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
done:
if (!(si->flags & SWP_SOLIDSTATE))
spin_unlock(&si->global_cluster_lock);
+
return found;
}
@@ -1885,7 +1973,13 @@ swp_entry_t get_swap_page_of_type(int type)
/* This is called for allocating swap entry, not cache */
if (get_swap_device_info(si)) {
if (si->flags & SWP_WRITEOK) {
+ /*
+ * Grab the local lock to be compliant
+ * with swap table allocation.
+ */
+ local_lock(&percpu_swap_cluster.lock);
offset = cluster_alloc_swap_entry(si, 0, 1);
+ local_unlock(&percpu_swap_cluster.lock);
if (offset) {
entry = swp_entry(si->type, offset);
atomic_long_dec(&nr_swap_pages);
@@ -2678,12 +2772,21 @@ static void wait_for_allocation(struct swap_info_struct *si)
static void free_cluster_info(struct swap_cluster_info *cluster_info,
unsigned long maxpages)
{
+ struct swap_cluster_info *ci;
int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
if (!cluster_info)
return;
- for (i = 0; i < nr_clusters; i++)
- swap_cluster_free_table(&cluster_info[i]);
+ for (i = 0; i < nr_clusters; i++) {
+ ci = cluster_info + i;
+ /* Clusters with bad pages counted still have a remaining table */
+ spin_lock(&ci->lock);
+ if (rcu_dereference_protected(ci->table, true)) {
+ ci->count = 0;
+ swap_cluster_free_table(ci);
+ }
+ spin_unlock(&ci->lock);
+ }
kvfree(cluster_info);
}
@@ -2719,6 +2822,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
struct address_space *mapping;
struct inode *inode;
struct filename *pathname;
+ unsigned int maxpages;
int err, found = 0;
if (!capable(CAP_SYS_ADMIN))
@@ -2825,8 +2929,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->swap_map = NULL;
zeromap = p->zeromap;
p->zeromap = NULL;
+ maxpages = p->max;
cluster_info = p->cluster_info;
- free_cluster_info(cluster_info, p->max);
p->max = 0;
p->cluster_info = NULL;
spin_unlock(&p->lock);
@@ -2838,6 +2942,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
p->global_cluster = NULL;
vfree(swap_map);
kvfree(zeromap);
+ free_cluster_info(cluster_info, maxpages);
/* Destroy swap account information */
swap_cgroup_swapoff(p->type);
@@ -3216,11 +3321,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
if (!cluster_info)
goto err;
- for (i = 0; i < nr_clusters; i++) {
+ for (i = 0; i < nr_clusters; i++)
spin_lock_init(&cluster_info[i].lock);
- if (swap_table_alloc_table(&cluster_info[i]))
- goto err_free;
- }
if (!(si->flags & SWP_SOLIDSTATE)) {
si->global_cluster = kmalloc(sizeof(*si->global_cluster),
@@ -3239,16 +3341,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
* See setup_swap_map(): header page, bad pages,
* and the EOF part of the last cluster.
*/
- inc_cluster_info_page(si, cluster_info, 0);
+ err = inc_cluster_info_page(si, cluster_info, 0);
+ if (err)
+ goto err;
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr >= maxpages)
continue;
- inc_cluster_info_page(si, cluster_info, page_nr);
+ err = inc_cluster_info_page(si, cluster_info, page_nr);
+ if (err)
+ goto err;
+ }
+ for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
+ err = inc_cluster_info_page(si, cluster_info, i);
+ if (err)
+ goto err;
}
- for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
- inc_cluster_info_page(si, cluster_info, i);
INIT_LIST_HEAD(&si->free_clusters);
INIT_LIST_HEAD(&si->full_clusters);
@@ -3962,6 +4071,15 @@ static int __init swapfile_init(void)
swapfile_maximum_size = arch_max_swapfile_size();
+ /*
+ * Once a cluster is freed, its swap table content is read
+ * only, and all swap cache readers (swap_cache_*) verify
+ * the content before use. So it's safe to use RCU slab here.
+ */
+ swap_table_cachep = kmem_cache_create("swap_table",
+ sizeof(struct swap_table),
+ 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
+
#ifdef CONFIG_MIGRATION
if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
swap_migration_ad_supported = true;
--
2.51.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* [PATCH 9/9] mm, swap: use a single page for swap table when the size fits
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (7 preceding siblings ...)
2025-08-22 19:20 ` [PATCH 8/9] mm, swap: implement dynamic allocation of swap table Kairui Song
@ 2025-08-22 19:20 ` Kairui Song
2025-08-30 4:23 ` Chris Li
2025-08-26 22:00 ` [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Chris Li
2025-08-30 5:44 ` Chris Li
10 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-08-22 19:20 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, Kairui Song
From: Kairui Song <kasong@tencent.com>
We have a cluster size of 512 slots. Each slot consumes 8 bytes in the
swap table, so the swap table size of each cluster is exactly one page
(4K). If that condition holds, allocate one page directly and disable
the slab cache to reduce the memory usage of the swap table and avoid
fragmentation.
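A quick illustration of the size arithmetic (a standalone sketch only,
not part of the patch; it assumes a 64-bit build with 4K pages and uses
a plain long as a stand-in for atomic_long_t):

/* Why one cluster's swap table is exactly one page on such configs. */
#define SWAPFILE_CLUSTER 512                    /* slots per cluster */
struct swap_table_example {
        long entries[SWAPFILE_CLUSTER];         /* 512 * 8 bytes */
};
_Static_assert(sizeof(struct swap_table_example) == 4096,
               "one cluster's table fills a 4K page, so SWP_TABLE_USE_PAGE is true");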
Co-developed-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap_table.h | 2 ++
mm/swapfile.c | 50 ++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 43 insertions(+), 9 deletions(-)
diff --git a/mm/swap_table.h b/mm/swap_table.h
index 4e97513b11ef..984474e37dd7 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -11,6 +11,8 @@ struct swap_table {
atomic_long_t entries[SWAPFILE_CLUSTER];
};
+#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
+
/*
* A swap table entry represents the status of a swap slot on a swap
* (physical or virtual) device. The swap table in each cluster is a
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 00651e947eb2..7539ee26d59a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -432,6 +432,38 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
return cluster_index(si, ci) * SWAPFILE_CLUSTER;
}
+static struct swap_table *swap_table_alloc(gfp_t gfp)
+{
+ struct folio *folio;
+
+ if (!SWP_TABLE_USE_PAGE)
+ return kmem_cache_zalloc(swap_table_cachep, gfp);
+
+ folio = folio_alloc(gfp | __GFP_ZERO, 0);
+ if (folio)
+ return folio_address(folio);
+ return NULL;
+}
+
+static void swap_table_free_folio_rcu_cb(struct rcu_head *head)
+{
+ struct folio *folio;
+
+ folio = page_folio(container_of(head, struct page, rcu_head));
+ folio_put(folio);
+}
+
+static void swap_table_free(struct swap_table *table)
+{
+ if (!SWP_TABLE_USE_PAGE) {
+ kmem_cache_free(swap_table_cachep, table);
+ return;
+ }
+
+ call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head),
+ swap_table_free_folio_rcu_cb);
+}
+
static void swap_cluster_free_table(struct swap_cluster_info *ci)
{
unsigned int ci_off;
@@ -445,7 +477,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
table = (void *)rcu_dereference_protected(ci->table, true);
rcu_assign_pointer(ci->table, NULL);
- kmem_cache_free(swap_table_cachep, table);
+ swap_table_free(table);
}
/*
@@ -469,8 +501,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
lockdep_assert_held(&ci->lock);
lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
- table = kmem_cache_zalloc(swap_table_cachep,
- __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
if (table) {
rcu_assign_pointer(ci->table, table);
return ci;
@@ -485,7 +516,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
if (!(si->flags & SWP_SOLIDSTATE))
spin_unlock(&si->global_cluster_lock);
local_unlock(&percpu_swap_cluster.lock);
- table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
+ table = swap_table_alloc(__GFP_HIGH | GFP_KERNEL);
local_lock(&percpu_swap_cluster.lock);
if (!(si->flags & SWP_SOLIDSTATE))
@@ -522,7 +553,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
free_table:
if (table)
- kmem_cache_free(swap_table_cachep, table);
+ swap_table_free(table);
return ci;
}
@@ -740,7 +771,7 @@ static int inc_cluster_info_page(struct swap_info_struct *si,
ci = cluster_info + idx;
if (!ci->table) {
- table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
+ table = swap_table_alloc(GFP_KERNEL);
if (!table)
return -ENOMEM;
rcu_assign_pointer(ci->table, table);
@@ -4076,9 +4107,10 @@ static int __init swapfile_init(void)
* only, and all swap cache readers (swap_cache_*) verify
* the content before use. So it's safe to use RCU slab here.
*/
- swap_table_cachep = kmem_cache_create("swap_table",
- sizeof(struct swap_table),
- 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
+ if (!SWP_TABLE_USE_PAGE)
+ swap_table_cachep = kmem_cache_create("swap_table",
+ sizeof(struct swap_table),
+ 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
#ifdef CONFIG_MIGRATION
if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
--
2.51.0
^ permalink raw reply related [flat|nested] 90+ messages in thread
* Re: [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio
2025-08-22 19:20 ` [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
@ 2025-08-25 3:02 ` Baolin Wang
2025-08-25 9:45 ` Kairui Song
2025-09-03 8:25 ` David Hildenbrand
1 sibling, 1 reply; 90+ messages in thread
From: Baolin Wang @ 2025-08-25 3:02 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang, Johannes Weiner,
David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 2025/8/23 03:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Shmem may replace a folio in the swap cache if the cached one doesn't
> fit the swapin's GFP zone. When doing so, shmem has already double
> checked that the swap cache folio is locked, still has the swap cache
> flag set, and contains the wanted swap entry. So it is impossible to
> fail due to an Xarray mismatch. There is even a comment for that.
>
> Delete the defensive error handling path, and add a WARN_ON instead:
> if that happened, something has broken the basic principle of how the
> swap cache works, we should catch and fix that.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/shmem.c | 28 +++-------------------------
> 1 file changed, 3 insertions(+), 25 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b4d39f2a1e0a..e03793cc5169 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2158,35 +2158,13 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> /* Swap cache still stores N entries instead of a high-order entry */
> xa_lock_irq(&swap_mapping->i_pages);
> for (i = 0; i < nr_pages; i++) {
> - void *item = xas_load(&xas);
> -
> - if (item != old) {
> - error = -ENOENT;
> - break;
> - }
> -
> - xas_store(&xas, new);
> + WARN_ON_ONCE(xas_store(&xas, new));
> xas_next(&xas);
> }
> - if (!error) {
> - mem_cgroup_replace_folio(old, new);
> - shmem_update_stats(new, nr_pages);
> - shmem_update_stats(old, -nr_pages);
> - }
It looks like the shmem statistics update was mistakenly deleted?
(Still working through the whole series :) )
> xa_unlock_irq(&swap_mapping->i_pages);
>
> - if (unlikely(error)) {
> - /*
> - * Is this possible? I think not, now that our callers
> - * check both the swapcache flag and folio->private
> - * after getting the folio lock; but be defensive.
> - * Reverse old to newpage for clear and free.
> - */
> - old = new;
> - } else {
> - folio_add_lru(new);
> - *foliop = new;
> - }
> + folio_add_lru(new);
> + *foliop = new;
>
> folio_clear_swapcache(old);
> old->private = NULL;
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio
2025-08-25 3:02 ` Baolin Wang
@ 2025-08-25 9:45 ` Kairui Song
2025-08-30 2:41 ` Chris Li
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-08-25 9:45 UTC (permalink / raw)
To: Baolin Wang
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Mon, Aug 25, 2025 at 11:09 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/8/23 03:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Shmem may replace a folio in the swap cache if the cached one doesn't
> > fit the swapin's GFP zone. When doing so, shmem has already double
> > checked that the swap cache folio is locked, still has the swap cache
> > flag set, and contains the wanted swap entry. So it is impossible to
> > fail due to an Xarray mismatch. There is even a comment for that.
> >
> > Delete the defensive error handling path, and add a WARN_ON instead:
> > if that happened, something has broken the basic principle of how the
> > swap cache works, we should catch and fix that.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/shmem.c | 28 +++-------------------------
> > 1 file changed, 3 insertions(+), 25 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index b4d39f2a1e0a..e03793cc5169 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2158,35 +2158,13 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> > /* Swap cache still stores N entries instead of a high-order entry */
> > xa_lock_irq(&swap_mapping->i_pages);
> > for (i = 0; i < nr_pages; i++) {
> > - void *item = xas_load(&xas);
> > -
> > - if (item != old) {
> > - error = -ENOENT;
> > - break;
> > - }
> > -
> > - xas_store(&xas, new);
> > + WARN_ON_ONCE(xas_store(&xas, new));
> > xas_next(&xas);
> > }
> > - if (!error) {
> > - mem_cgroup_replace_folio(old, new);
> > - shmem_update_stats(new, nr_pages);
> > - shmem_update_stats(old, -nr_pages);
> > - }
>
> It looks like the shmem statistics update was mistakenly deleted?
Ah, you are right, I'll need to add it back. I somehow misread this as
part of the error handling path. I'll add it back and just drop the
!error check.
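For reference, a minimal sketch of what the restored hunk might look
like (just my plan, assuming the stats update goes back in unchanged
with the !error check dropped):

        for (i = 0; i < nr_pages; i++) {
                WARN_ON_ONCE(xas_store(&xas, new));
                xas_next(&xas);
        }
        mem_cgroup_replace_folio(old, new);
        shmem_update_stats(new, nr_pages);
        shmem_update_stats(old, -nr_pages);
        xa_unlock_irq(&swap_mapping->i_pages);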
Thanks for the review.
>
> ( Continue to understand the whole series:) )
>
> > xa_unlock_irq(&swap_mapping->i_pages);
> >
> > - if (unlikely(error)) {
> > - /*
> > - * Is this possible? I think not, now that our callers
> > - * check both the swapcache flag and folio->private
> > - * after getting the folio lock; but be defensive.
> > - * Reverse old to newpage for clear and free.
> > - */
> > - old = new;
> > - } else {
> > - folio_add_lru(new);
> > - *foliop = new;
> > - }
> > + folio_add_lru(new);
> > + *foliop = new;
> >
> > folio_clear_swapcache(old);
> > old->private = NULL;
>
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I)
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (8 preceding siblings ...)
2025-08-22 19:20 ` [PATCH 9/9] mm, swap: use a single page for swap table when the size fits Kairui Song
@ 2025-08-26 22:00 ` Chris Li
2025-08-30 5:44 ` Chris Li
10 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-26 22:00 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Fri, Aug 22, 2025 at 12:20 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This is the first phase of the bigger series implementing basic
> infrastructures for the Swap Table idea proposed at the LSF/MM/BPF
> topic "Integrate swap cache, swap maps with swap allocator" [1].
>
> This phase I contains 9 patches, introduces the swap table infrastructure
> and uses it as the swap cache backend. By doing so, we have up to ~5-20%
> performance gain in throughput, RPS or build time for benchmark and
> workload tests. This is based on Chris Li's idea of using cluster size
> atomic arrays to implement swap cache. It has less contention on the swap
> cache access. The cluster size is much finer-grained than the 64M address
> space split, which is removed in this phase I. It also unifies and cleans
> up the swap code base.
Thanks for making this happen. It has come a long way from my early
messy experimental patches replacing the xarray in the swap cache.
Beating the original swap_map in terms of memory usage is particularly
hard. I once received feedback from Matthew that whoever wants to
replace the swap cache is asking for a lot of pain and suffering. He is
absolutely right.
I am so glad that we are finally seeing the light at the other end of
the tunnel. We are close to a state where we can beat the original swap
layer both in terms of memory usage and CPU performance.
Just to recap: the current swap layer's per-slot memory usage is 3 + 8
bytes. The 3 bytes are static, up-front allocations (1 from the swap
map, 2 from the swap cgroup); the 8 bytes are dynamic allocations from
the xarray of the swap cache.
At the end of this full series (27+ patches) we can completely get rid
of the 3-byte up-front allocation and only dynamically allocate 8 bytes
per slot entry. That is a straight win in terms of memory allocation;
no compromise was made there.
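To put rough numbers on it, here is a back-of-the-envelope sketch (not
kernel code; it assumes 4K pages and ignores xarray node overhead):

#include <stdio.h>

int main(void)
{
        unsigned long page_size = 4096;                 /* assuming 4K pages */
        unsigned long slots = (1UL << 30) / page_size;  /* swap slots per GiB of swap */
        unsigned long old_static = 3 * slots;           /* 1 B swap_map + 2 B swap cgroup */
        unsigned long old_dynamic = 8 * slots;          /* swap cache xarray entries */
        unsigned long new_dynamic = 8 * slots;          /* swap table entries, per cluster, on demand */

        printf("per GiB of swap: old = %lu KiB static + ~%lu KiB dynamic\n",
               old_static >> 10, old_dynamic >> 10);
        printf("                 new = ~%lu KiB dynamic only\n", new_dynamic >> 10);
        return 0;
}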
The reason we can beat the previous CPU usage is that each cluster has
only 512 entries, much smaller than the 64M xarray tree, and the
cluster lock is a much finer-grained lock than the xarray tree lock. We
can also do lockless atomic lookups on the swap cache, which is pretty
cool as well.
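Conceptually, a lockless lookup against the per-cluster atomic array
looks something like the sketch below. This is not the series' actual
helper: swp_tb_decode_folio() is a made-up name, I'm assuming ci->table
is an RCU pointer to the cluster's struct swap_table, and I'm ignoring
that an entry may also hold a shadow value:

static struct folio *swap_table_peek(struct swap_cluster_info *ci,
                                     unsigned int ci_off)
{
        struct swap_table *table;
        unsigned long swp_tb;

        rcu_read_lock();
        table = rcu_dereference(ci->table);
        if (!table) {
                rcu_read_unlock();
                return NULL;
        }
        swp_tb = atomic_long_read(&table->entries[ci_off]);
        rcu_read_unlock();

        /* The caller still must lock and re-check the folio before use. */
        return swp_tb_is_null(swp_tb) ? NULL : swp_tb_decode_folio(swp_tb);
}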
I will do one more review pass on this series soon.
Very exciting.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
@ 2025-08-27 2:47 ` Chris Li
2025-08-27 3:50 ` Chris Li
2025-08-27 13:45 ` Kairui Song
2025-08-27 3:52 ` Baoquan He
` (4 subsequent siblings)
5 siblings, 2 replies; 90+ messages in thread
From: Chris Li @ 2025-08-27 2:47 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
Hi Kairui,
This commit message could use some improvement; I feel the part I am
interested in, namely what actually changed, is buried in a lot of
detail.
The background is that swap_cache_get_folio() used to do the readahead
update as well, and it took the VMA as an argument. However, the
hibernation usage does not map a swap entry to any VMA, so it was
forced to call filemap_get_entry() on the swap cache instead.
So the TL;DR of what this patch does:
Split the swap readahead update out of swap_cache_get_folio(), so that
the non-VMA hibernation usage can reuse swap_cache_get_folio() as well.
No more calling filemap_get_entry() on the swap cache due to the lack
of a VMA.
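In other words, the caller pattern becomes (a sketch based on the hunks
below):

        folio = swap_cache_get_folio(entry);
        if (folio)
                swap_update_readahead(folio, vma, addr);

and callers without a VMA, like hibernation, simply skip the
swap_update_readahead() call.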
The code itself looks fine. It has gone through some rounds of
feedback from me already. We can always update the commit message on
the next iteration.
Acked-by: Chris Li <chrisl@kernel.org>
Chris
On Fri, Aug 22, 2025 at 12:20 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Always use swap_cache_get_folio for swap cache folio look up. The reason
> we are not using it in all places is that it also updates the readahead
> info, and some callsites want to avoid that.
>
> So decouple readahead update with swap cache lookup into a standalone
> helper, let the caller call the readahead update helper if that's
> needed. And convert all swap cache lookups to use swap_cache_get_folio.
>
> After this commit, there are only three special cases for accessing swap
> cache space now: huge memory splitting, migration and shmem replacing,
> because they need to lock the Xarray. Following commits will wrap their
I commonly see it written as xarray or XArray.
> accesses to the swap cache too with special helpers.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 6 ++-
> mm/mincore.c | 3 +-
> mm/shmem.c | 4 +-
> mm/swap.h | 13 +++++--
> mm/swap_state.c | 99 +++++++++++++++++++++++-------------------------
> mm/swapfile.c | 11 +++---
> mm/userfaultfd.c | 5 +--
> 7 files changed, 72 insertions(+), 69 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d9de6c056179..10ef528a5f44 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (unlikely(!si))
> goto out;
>
> - folio = swap_cache_get_folio(entry, vma, vmf->address);
> - if (folio)
> + folio = swap_cache_get_folio(entry);
> + if (folio) {
> + swap_update_readahead(folio, vma, vmf->address);
> page = folio_file_page(folio, swp_offset(entry));
> + }
> swapcache = folio;
>
> if (!folio) {
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 2f3e1816a30d..8ec4719370e1 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
> if (!si)
> return 0;
> }
> - folio = filemap_get_entry(swap_address_space(entry),
> - swap_cache_index(entry));
> + folio = swap_cache_get_folio(entry);
> if (shmem)
> put_swap_device(si);
> /* The swap cache space contains either folio, shadow or NULL */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 13cc51df3893..e9d0d2784cd5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> }
>
> /* Look it up and read it in.. */
> - folio = swap_cache_get_folio(swap, NULL, 0);
> + folio = swap_cache_get_folio(swap);
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> /* Direct swapin skipping swap cache & readahead */
> @@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> + } else {
> + swap_update_readahead(folio, NULL, 0);
> }
>
> if (order > folio_order(folio)) {
> diff --git a/mm/swap.h b/mm/swap.h
> index 1ae44d4193b1..efb6d7ff9f30 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr);
> +struct folio *swap_cache_get_folio(swp_entry_t entry);
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> struct swap_iocb **plug);
> @@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> struct mempolicy *mpol, pgoff_t ilx);
> struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
> struct vm_fault *vmf);
> +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> + unsigned long addr);
>
> static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> @@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> return NULL;
> }
>
> +static inline void swap_update_readahead(struct folio *folio,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> +}
> +
> static inline int swap_writeout(struct folio *folio,
> struct swap_iocb **swap_plug)
> {
> @@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
> {
> }
>
> -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr)
> +static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> return NULL;
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 99513b74b5d8..ff9eb761a103 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
> printk("Total swap = %lukB\n", K(total_swap_pages));
> }
>
> +/*
> + * Lookup a swap entry in the swap cache. A found folio will be returned
> + * unlocked and with its refcount incremented.
> + *
> + * Caller must lock the swap device or hold a reference to keep it valid.
> + */
> +struct folio *swap_cache_get_folio(swp_entry_t entry)
> +{
> + struct folio *folio = filemap_get_folio(swap_address_space(entry),
> + swap_cache_index(entry));
> + if (!IS_ERR(folio))
> + return folio;
> + return NULL;
> +}
> +
> void *get_shadow_from_swap_cache(swp_entry_t entry)
> {
> struct address_space *address_space = swap_address_space(entry);
> @@ -273,54 +288,40 @@ static inline bool swap_use_vma_readahead(void)
> }
>
> /*
> - * Lookup a swap entry in the swap cache. A found folio will be returned
> - * unlocked and with its refcount incremented - we rely on the kernel
> - * lock getting page table operations atomic even if we drop the folio
> - * lock before returning.
> - *
> - * Caller must lock the swap device or hold a reference to keep it valid.
> + * Update the readahead statistics of a vma or globally.
> */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr)
> +void swap_update_readahead(struct folio *folio,
> + struct vm_area_struct *vma,
> + unsigned long addr)
> {
> - struct folio *folio;
> -
> - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> - if (!IS_ERR(folio)) {
> - bool vma_ra = swap_use_vma_readahead();
> - bool readahead;
> + bool readahead, vma_ra = swap_use_vma_readahead();
>
> - /*
> - * At the moment, we don't support PG_readahead for anon THP
> - * so let's bail out rather than confusing the readahead stat.
> - */
> - if (unlikely(folio_test_large(folio)))
> - return folio;
> -
> - readahead = folio_test_clear_readahead(folio);
> - if (vma && vma_ra) {
> - unsigned long ra_val;
> - int win, hits;
> -
> - ra_val = GET_SWAP_RA_VAL(vma);
> - win = SWAP_RA_WIN(ra_val);
> - hits = SWAP_RA_HITS(ra_val);
> - if (readahead)
> - hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> - atomic_long_set(&vma->swap_readahead_info,
> - SWAP_RA_VAL(addr, win, hits));
> - }
> -
> - if (readahead) {
> - count_vm_event(SWAP_RA_HIT);
> - if (!vma || !vma_ra)
> - atomic_inc(&swapin_readahead_hits);
> - }
> - } else {
> - folio = NULL;
> + /*
> + * At the moment, we don't support PG_readahead for anon THP
> + * so let's bail out rather than confusing the readahead stat.
> + */
> + if (unlikely(folio_test_large(folio)))
> + return;
> +
> + readahead = folio_test_clear_readahead(folio);
> + if (vma && vma_ra) {
> + unsigned long ra_val;
> + int win, hits;
> +
> + ra_val = GET_SWAP_RA_VAL(vma);
> + win = SWAP_RA_WIN(ra_val);
> + hits = SWAP_RA_HITS(ra_val);
> + if (readahead)
> + hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> + atomic_long_set(&vma->swap_readahead_info,
> + SWAP_RA_VAL(addr, win, hits));
> }
>
> - return folio;
> + if (readahead) {
> + count_vm_event(SWAP_RA_HIT);
> + if (!vma || !vma_ra)
> + atomic_inc(&swapin_readahead_hits);
> + }
> }
>
> struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> @@ -336,14 +337,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> *new_page_allocated = false;
> for (;;) {
> int err;
> - /*
> - * First check the swap cache. Since this is normally
> - * called after swap_cache_get_folio() failed, re-calling
> - * that would confuse statistics.
> - */
> - folio = filemap_get_folio(swap_address_space(entry),
> - swap_cache_index(entry));
> - if (!IS_ERR(folio))
> +
> + /* Check the swap cache in case the folio is already there */
> + folio = swap_cache_get_folio(entry);
> + if (folio)
> goto got_folio;
>
> /*
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index a7ffabbe65ef..4b8ab2cb49ca 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> unsigned long offset, unsigned long flags)
> {
> swp_entry_t entry = swp_entry(si->type, offset);
> - struct address_space *address_space = swap_address_space(entry);
> struct swap_cluster_info *ci;
> struct folio *folio;
> int ret, nr_pages;
> bool need_reclaim;
>
> again:
> - folio = filemap_get_folio(address_space, swap_cache_index(entry));
> - if (IS_ERR(folio))
> + folio = swap_cache_get_folio(entry);
> + if (!folio)
> return 0;
>
> nr_pages = folio_nr_pages(folio);
> @@ -2131,7 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> pte_unmap(pte);
> pte = NULL;
>
> - folio = swap_cache_get_folio(entry, vma, addr);
> + folio = swap_cache_get_folio(entry);
> if (!folio) {
> struct vm_fault vmf = {
> .vma = vma,
> @@ -2357,8 +2356,8 @@ static int try_to_unuse(unsigned int type)
> (i = find_next_to_unuse(si, i)) != 0) {
>
> entry = swp_entry(type, i);
> - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> - if (IS_ERR(folio))
> + folio = swap_cache_get_folio(entry);
> + if (!folio)
> continue;
>
> /*
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 50aaa8dcd24c..af61b95c89e4 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -1489,9 +1489,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> * separately to allow proper handling.
> */
> if (!src_folio)
> - folio = filemap_get_folio(swap_address_space(entry),
> - swap_cache_index(entry));
> - if (!IS_ERR_OR_NULL(folio)) {
> + folio = swap_cache_get_folio(entry);
> + if (folio) {
> if (folio_test_large(folio)) {
> ret = -EBUSY;
> folio_put(folio);
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-22 19:20 ` [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers Kairui Song
@ 2025-08-27 3:47 ` Baoquan He
2025-08-27 17:44 ` Chris Li
2025-09-02 6:02 ` Barry Song
2025-09-02 13:33 ` David Hildenbrand
2 siblings, 1 reply; 90+ messages in thread
From: Baoquan He @ 2025-08-27 3:47 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On 08/23/25 at 03:20am, Kairui Song wrote:
......
> diff --git a/mm/swap.h b/mm/swap.h
> index 223b40f2d37e..7b3efaa51624 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -15,6 +15,8 @@ extern int page_cluster;
> #define swap_entry_order(order) 0
> #endif
>
> +extern struct swap_info_struct *swap_info[];
> +
> /*
> * We use this to track usage of a cluster. A cluster is a block of swap disk
> * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> @@ -53,9 +55,28 @@ enum swap_cluster_flags {
> #include <linux/swapops.h> /* for swp_offset */
> #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +/*
> + * Callers of all swp_* helpers here must ensure the entry is valid, and
> + * pin the swap device by reference or in other ways.
> + */
> +static inline struct swap_info_struct *swp_type_info(int type)
> +{
> + struct swap_info_struct *si;
> +
> + si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
> + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> + return si;
> +}
> +
> +static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> +{
> + return swp_type_info(swp_type(entry));
> +}
swp_type_info() is only used by swp_info() in the whole series; can we
open code it in swp_info()?
If you plan to use it in a later phase of the swap table patchset, then
please ignore this.
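i.e. something like the below sketch of the open-coded version:

static inline struct swap_info_struct *swp_info(swp_entry_t entry)
{
        struct swap_info_struct *si;

        si = READ_ONCE(swap_info[swp_type(entry)]); /* rcu_dereference() */
        VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
        return si;
}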
> +
> static inline struct swap_cluster_info *swp_offset_cluster(
> struct swap_info_struct *si, pgoff_t offset)
> {
> + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> return &si->cluster_info[offset / SWAPFILE_CLUSTER];
> }
>
> @@ -65,6 +86,7 @@ static inline struct swap_cluster_info *swap_cluster_lock(
> {
> struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
>
> + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> spin_lock(&ci->lock);
> return ci;
> }
> @@ -164,7 +186,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
>
> static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> - return swp_swap_info(folio->swap)->flags;
> + return swp_info(folio->swap)->flags;
> }
>
> /*
> @@ -175,7 +197,7 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> bool *is_zeromap)
> {
> - struct swap_info_struct *sis = swp_swap_info(entry);
> + struct swap_info_struct *sis = swp_info(entry);
> unsigned long start = swp_offset(entry);
> unsigned long end = start + max_nr;
> bool first_bit;
> @@ -194,7 +216,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
>
> static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> {
> - struct swap_info_struct *si = swp_swap_info(entry);
> + struct swap_info_struct *si = swp_info(entry);
> pgoff_t offset = swp_offset(entry);
> int i;
>
> @@ -213,6 +235,11 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>
> #else /* CONFIG_SWAP */
> struct swap_iocb;
> +static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> +{
> + return NULL;
> +}
> +
> static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> {
> }
......
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-27 2:47 ` Chris Li
@ 2025-08-27 3:50 ` Chris Li
2025-08-27 13:45 ` Kairui Song
1 sibling, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-27 3:50 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
BTW, I think this patch can add a "no functional change expected" note
to the commit message.
Chris
On Tue, Aug 26, 2025 at 7:47 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> This commit message can use some improvement, I feel the part I am
> interested in, what changed is buried in a lot of detail.
>
> The background is that swap_cache_get_folio() used to do readahead
> update as well. It has VMA as part of the argument. However, the
> hibernation usage does not map swap entry to VMA. It was forced to
> call filemap_get_entry() on swap cache instead, due to no VMA.
>
> So the TL; DR; of what this patch does:
>
> Split the swap readahead outside of swap_cache_get_folio(), so that
> the hibernation non VMA usage can reuse swap_cache_get_folio() as
> well. No more calling filemap_get_entry() on swap cache due to lack
> of VMA.
>
> The code itself looks fine. It has gone through some rounds of
> feedback from me already. We can always update the commit message on
> the next iteration.
>
> Acked-by: Chris Li <chrisl@kernel.org>
>
> Chris
>
>
> On Fri, Aug 22, 2025 at 12:20 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Always use swap_cache_get_folio for swap cache folio look up. The reason
> > we are not using it in all places is that it also updates the readahead
> > info, and some callsites want to avoid that.
> >
> > So decouple readahead update with swap cache lookup into a standalone
> > helper, let the caller call the readahead update helper if that's
> > needed. And convert all swap cache lookups to use swap_cache_get_folio.
> >
> > After this commit, there are only three special cases for accessing swap
> > cache space now: huge memory splitting, migration and shmem replacing,
> > because they need to lock the Xarray. Following commits will wrap their
> I commonly saw using xarray or XArray.
> > accesses to the swap cache too with special helpers.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 6 ++-
> > mm/mincore.c | 3 +-
> > mm/shmem.c | 4 +-
> > mm/swap.h | 13 +++++--
> > mm/swap_state.c | 99 +++++++++++++++++++++++-------------------------
> > mm/swapfile.c | 11 +++---
> > mm/userfaultfd.c | 5 +--
> > 7 files changed, 72 insertions(+), 69 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d9de6c056179..10ef528a5f44 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (unlikely(!si))
> > goto out;
> >
> > - folio = swap_cache_get_folio(entry, vma, vmf->address);
> > - if (folio)
> > + folio = swap_cache_get_folio(entry);
> > + if (folio) {
> > + swap_update_readahead(folio, vma, vmf->address);
> > page = folio_file_page(folio, swp_offset(entry));
> > + }
> > swapcache = folio;
> >
> > if (!folio) {
> > diff --git a/mm/mincore.c b/mm/mincore.c
> > index 2f3e1816a30d..8ec4719370e1 100644
> > --- a/mm/mincore.c
> > +++ b/mm/mincore.c
> > @@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
> > if (!si)
> > return 0;
> > }
> > - folio = filemap_get_entry(swap_address_space(entry),
> > - swap_cache_index(entry));
> > + folio = swap_cache_get_folio(entry);
> > if (shmem)
> > put_swap_device(si);
> > /* The swap cache space contains either folio, shadow or NULL */
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 13cc51df3893..e9d0d2784cd5 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > }
> >
> > /* Look it up and read it in.. */
> > - folio = swap_cache_get_folio(swap, NULL, 0);
> > + folio = swap_cache_get_folio(swap);
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > /* Direct swapin skipping swap cache & readahead */
> > @@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > count_vm_event(PGMAJFAULT);
> > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > }
> > + } else {
> > + swap_update_readahead(folio, NULL, 0);
> > }
> >
> > if (order > folio_order(folio)) {
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 1ae44d4193b1..efb6d7ff9f30 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
> > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > unsigned long end);
> > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> > -struct folio *swap_cache_get_folio(swp_entry_t entry,
> > - struct vm_area_struct *vma, unsigned long addr);
> > +struct folio *swap_cache_get_folio(swp_entry_t entry);
> > struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > struct vm_area_struct *vma, unsigned long addr,
> > struct swap_iocb **plug);
> > @@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> > struct mempolicy *mpol, pgoff_t ilx);
> > struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > struct vm_fault *vmf);
> > +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> > + unsigned long addr);
> >
> > static inline unsigned int folio_swap_flags(struct folio *folio)
> > {
> > @@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > return NULL;
> > }
> >
> > +static inline void swap_update_readahead(struct folio *folio,
> > + struct vm_area_struct *vma, unsigned long addr)
> > +{
> > +}
> > +
> > static inline int swap_writeout(struct folio *folio,
> > struct swap_iocb **swap_plug)
> > {
> > @@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
> > {
> > }
> >
> > -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> > - struct vm_area_struct *vma, unsigned long addr)
> > +static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> > {
> > return NULL;
> > }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 99513b74b5d8..ff9eb761a103 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
> > printk("Total swap = %lukB\n", K(total_swap_pages));
> > }
> >
> > +/*
> > + * Lookup a swap entry in the swap cache. A found folio will be returned
> > + * unlocked and with its refcount incremented.
> > + *
> > + * Caller must lock the swap device or hold a reference to keep it valid.
> > + */
> > +struct folio *swap_cache_get_folio(swp_entry_t entry)
> > +{
> > + struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > + swap_cache_index(entry));
> > + if (!IS_ERR(folio))
> > + return folio;
> > + return NULL;
> > +}
> > +
> > void *get_shadow_from_swap_cache(swp_entry_t entry)
> > {
> > struct address_space *address_space = swap_address_space(entry);
> > @@ -273,54 +288,40 @@ static inline bool swap_use_vma_readahead(void)
> > }
> >
> > /*
> > - * Lookup a swap entry in the swap cache. A found folio will be returned
> > - * unlocked and with its refcount incremented - we rely on the kernel
> > - * lock getting page table operations atomic even if we drop the folio
> > - * lock before returning.
> > - *
> > - * Caller must lock the swap device or hold a reference to keep it valid.
> > + * Update the readahead statistics of a vma or globally.
> > */
> > -struct folio *swap_cache_get_folio(swp_entry_t entry,
> > - struct vm_area_struct *vma, unsigned long addr)
> > +void swap_update_readahead(struct folio *folio,
> > + struct vm_area_struct *vma,
> > + unsigned long addr)
> > {
> > - struct folio *folio;
> > -
> > - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > - if (!IS_ERR(folio)) {
> > - bool vma_ra = swap_use_vma_readahead();
> > - bool readahead;
> > + bool readahead, vma_ra = swap_use_vma_readahead();
> >
> > - /*
> > - * At the moment, we don't support PG_readahead for anon THP
> > - * so let's bail out rather than confusing the readahead stat.
> > - */
> > - if (unlikely(folio_test_large(folio)))
> > - return folio;
> > -
> > - readahead = folio_test_clear_readahead(folio);
> > - if (vma && vma_ra) {
> > - unsigned long ra_val;
> > - int win, hits;
> > -
> > - ra_val = GET_SWAP_RA_VAL(vma);
> > - win = SWAP_RA_WIN(ra_val);
> > - hits = SWAP_RA_HITS(ra_val);
> > - if (readahead)
> > - hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> > - atomic_long_set(&vma->swap_readahead_info,
> > - SWAP_RA_VAL(addr, win, hits));
> > - }
> > -
> > - if (readahead) {
> > - count_vm_event(SWAP_RA_HIT);
> > - if (!vma || !vma_ra)
> > - atomic_inc(&swapin_readahead_hits);
> > - }
> > - } else {
> > - folio = NULL;
> > + /*
> > + * At the moment, we don't support PG_readahead for anon THP
> > + * so let's bail out rather than confusing the readahead stat.
> > + */
> > + if (unlikely(folio_test_large(folio)))
> > + return;
> > +
> > + readahead = folio_test_clear_readahead(folio);
> > + if (vma && vma_ra) {
> > + unsigned long ra_val;
> > + int win, hits;
> > +
> > + ra_val = GET_SWAP_RA_VAL(vma);
> > + win = SWAP_RA_WIN(ra_val);
> > + hits = SWAP_RA_HITS(ra_val);
> > + if (readahead)
> > + hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> > + atomic_long_set(&vma->swap_readahead_info,
> > + SWAP_RA_VAL(addr, win, hits));
> > }
> >
> > - return folio;
> > + if (readahead) {
> > + count_vm_event(SWAP_RA_HIT);
> > + if (!vma || !vma_ra)
> > + atomic_inc(&swapin_readahead_hits);
> > + }
> > }
> >
> > struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > @@ -336,14 +337,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > *new_page_allocated = false;
> > for (;;) {
> > int err;
> > - /*
> > - * First check the swap cache. Since this is normally
> > - * called after swap_cache_get_folio() failed, re-calling
> > - * that would confuse statistics.
> > - */
> > - folio = filemap_get_folio(swap_address_space(entry),
> > - swap_cache_index(entry));
> > - if (!IS_ERR(folio))
> > +
> > + /* Check the swap cache in case the folio is already there */
> > + folio = swap_cache_get_folio(entry);
> > + if (folio)
> > goto got_folio;
> >
> > /*
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index a7ffabbe65ef..4b8ab2cb49ca 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -213,15 +213,14 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > unsigned long offset, unsigned long flags)
> > {
> > swp_entry_t entry = swp_entry(si->type, offset);
> > - struct address_space *address_space = swap_address_space(entry);
> > struct swap_cluster_info *ci;
> > struct folio *folio;
> > int ret, nr_pages;
> > bool need_reclaim;
> >
> > again:
> > - folio = filemap_get_folio(address_space, swap_cache_index(entry));
> > - if (IS_ERR(folio))
> > + folio = swap_cache_get_folio(entry);
> > + if (!folio)
> > return 0;
> >
> > nr_pages = folio_nr_pages(folio);
> > @@ -2131,7 +2130,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > pte_unmap(pte);
> > pte = NULL;
> >
> > - folio = swap_cache_get_folio(entry, vma, addr);
> > + folio = swap_cache_get_folio(entry);
> > if (!folio) {
> > struct vm_fault vmf = {
> > .vma = vma,
> > @@ -2357,8 +2356,8 @@ static int try_to_unuse(unsigned int type)
> > (i = find_next_to_unuse(si, i)) != 0) {
> >
> > entry = swp_entry(type, i);
> > - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > - if (IS_ERR(folio))
> > + folio = swap_cache_get_folio(entry);
> > + if (!folio)
> > continue;
> >
> > /*
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 50aaa8dcd24c..af61b95c89e4 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -1489,9 +1489,8 @@ static long move_pages_ptes(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd
> > * separately to allow proper handling.
> > */
> > if (!src_folio)
> > - folio = filemap_get_folio(swap_address_space(entry),
> > - swap_cache_index(entry));
> > - if (!IS_ERR_OR_NULL(folio)) {
> > + folio = swap_cache_get_folio(entry);
> > + if (folio) {
> > if (folio_test_large(folio)) {
> > ret = -EBUSY;
> > folio_put(folio);
> > --
> > 2.51.0
> >
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
2025-08-27 2:47 ` Chris Li
@ 2025-08-27 3:52 ` Baoquan He
2025-08-27 13:46 ` Kairui Song
2025-08-28 3:20 ` Baolin Wang
` (3 subsequent siblings)
5 siblings, 1 reply; 90+ messages in thread
From: Baoquan He @ 2025-08-27 3:52 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On 08/23/25 at 03:20am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
......snip...
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 99513b74b5d8..ff9eb761a103 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
> printk("Total swap = %lukB\n", K(total_swap_pages));
> }
>
> +/*
> + * Lookup a swap entry in the swap cache. A found folio will be returned
Lookup is a noun; should we use 'look up', which is a verb, here
instead? The same goes for all the other places in the swap code, even
though they are not introduced by this patchset. Just a nitpick.
> + * unlocked and with its refcount incremented.
> + *
> + * Caller must lock the swap device or hold a reference to keep it valid.
> + */
> +struct folio *swap_cache_get_folio(swp_entry_t entry)
> +{
> + struct folio *folio = filemap_get_folio(swap_address_space(entry),
> + swap_cache_index(entry));
> + if (!IS_ERR(folio))
> + return folio;
> + return NULL;
> +}
> +
> void *get_shadow_from_swap_cache(swp_entry_t entry)
> {
> struct address_space *address_space = swap_address_space(entry);
> @@ -273,54 +288,40 @@ static inline bool swap_use_vma_readahead(void)
> }
>
> /*
> - * Lookup a swap entry in the swap cache. A found folio will be returned
> - * unlocked and with its refcount incremented - we rely on the kernel
> - * lock getting page table operations atomic even if we drop the folio
> - * lock before returning.
> - *
> - * Caller must lock the swap device or hold a reference to keep it valid.
> + * Update the readahead statistics of a vma or globally.
> */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr)
> +void swap_update_readahead(struct folio *folio,
> + struct vm_area_struct *vma,
> + unsigned long addr)
> {
> - struct folio *folio;
> -
> - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> - if (!IS_ERR(folio)) {
> - bool vma_ra = swap_use_vma_readahead();
> - bool readahead;
> + bool readahead, vma_ra = swap_use_vma_readahead();
>
> - /*
> - * At the moment, we don't support PG_readahead for anon THP
> - * so let's bail out rather than confusing the readahead stat.
> - */
> - if (unlikely(folio_test_large(folio)))
> - return folio;
> -
> - readahead = folio_test_clear_readahead(folio);
> - if (vma && vma_ra) {
> - unsigned long ra_val;
> - int win, hits;
> -
> - ra_val = GET_SWAP_RA_VAL(vma);
> - win = SWAP_RA_WIN(ra_val);
> - hits = SWAP_RA_HITS(ra_val);
> - if (readahead)
> - hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> - atomic_long_set(&vma->swap_readahead_info,
> - SWAP_RA_VAL(addr, win, hits));
> - }
> -
> - if (readahead) {
> - count_vm_event(SWAP_RA_HIT);
> - if (!vma || !vma_ra)
> - atomic_inc(&swapin_readahead_hits);
> - }
> - } else {
> - folio = NULL;
> + /*
> + * At the moment, we don't support PG_readahead for anon THP
> + * so let's bail out rather than confusing the readahead stat.
> + */
> + if (unlikely(folio_test_large(folio)))
> + return;
> +
> + readahead = folio_test_clear_readahead(folio);
> + if (vma && vma_ra) {
> + unsigned long ra_val;
> + int win, hits;
> +
> + ra_val = GET_SWAP_RA_VAL(vma);
> + win = SWAP_RA_WIN(ra_val);
> + hits = SWAP_RA_HITS(ra_val);
> + if (readahead)
> + hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX);
> + atomic_long_set(&vma->swap_readahead_info,
> + SWAP_RA_VAL(addr, win, hits));
> }
>
> - return folio;
> + if (readahead) {
> + count_vm_event(SWAP_RA_HIT);
> + if (!vma || !vma_ra)
> + atomic_inc(&swapin_readahead_hits);
> + }
> }
>
> struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> @@ -336,14 +337,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> *new_page_allocated = false;
> for (;;) {
> int err;
> - /*
> - * First check the swap cache. Since this is normally
> - * called after swap_cache_get_folio() failed, re-calling
> - * that would confuse statistics.
> - */
> - folio = filemap_get_folio(swap_address_space(entry),
> - swap_cache_index(entry));
> - if (!IS_ERR(folio))
> +
> + /* Check the swap cache in case the folio is already there */
> + folio = swap_cache_get_folio(entry);
> + if (folio)
> goto got_folio;
>
> /*
......
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
@ 2025-08-27 6:13 ` Chris Li
2025-08-27 13:44 ` Kairui Song
2025-08-27 7:03 ` Chris Li
` (2 subsequent siblings)
3 siblings, 1 reply; 90+ messages in thread
From: Chris Li @ 2025-08-27 6:13 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, LKML
On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cache lookup is lockless, it only increases the reference count
> of the returned folio. That's not enough to ensure a folio is stable in
> the swap cache, so the folio could be removed from the swap cache at any
> time. The caller always has to lock and check the folio before use.
>
> Document this as a comment, and introduce a helper for swap cache folio
> verification with proper sanity checks.
>
> Also, sanitize all current users to use this convention, and use the new
> helper when possible for easier debugging. Some existing callers won't
> cause any major problem right now, only trivial issues like incorrect
> readahead statistic (swapin) or wasted loop (swapoff). It's better to
> always follow this convention to make things robust.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 28 +++++++++++++---------------
> mm/shmem.c | 4 ++--
> mm/swap.h | 28 ++++++++++++++++++++++++++++
> mm/swap_state.c | 13 +++++++++----
> mm/swapfile.c | 10 ++++++++--
> 5 files changed, 60 insertions(+), 23 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 10ef528a5f44..9ca8e1873c6e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4661,12 +4661,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out;
>
> folio = swap_cache_get_folio(entry);
> - if (folio) {
> - swap_update_readahead(folio, vma, vmf->address);
> - page = folio_file_page(folio, swp_offset(entry));
> - }
> swapcache = folio;
> -
Can simplify as:
folio = swapcache = swap_cache_get_folio(entry);
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> __swap_count(entry) == 1) {
> @@ -4735,20 +4730,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> ret = VM_FAULT_MAJOR;
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> - page = folio_file_page(folio, swp_offset(entry));
> - } else if (PageHWPoison(page)) {
> - /*
> - * hwpoisoned dirty swapcache pages are kept for killing
> - * owner processes (which may be unknown at hwpoison time)
> - */
> - ret = VM_FAULT_HWPOISON;
> - goto out_release;
Here you move the PageHWPoison(page) bail out from before taking the
page lock to after the page lock. A HWPoison page should be able to
bail out without taking the lock.
> }
>
> ret |= folio_lock_or_retry(folio, vmf);
> if (ret & VM_FAULT_RETRY)
> goto out_release;
>
> + page = folio_file_page(folio, swp_offset(entry));
> if (swapcache) {
> /*
> * Make sure folio_free_swap() or swapoff did not release the
> @@ -4757,10 +4745,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> * swapcache, we need to check that the page's swap has not
> * changed.
> */
> - if (unlikely(!folio_test_swapcache(folio) ||
> - page_swap_entry(page).val != entry.val))
> + if (!folio_contains_swap(folio, entry))
> goto out_page;
>
> + if (PageHWPoison(page)) {
> + /*
> + * hwpoisoned dirty swapcache pages are kept for killing
> + * owner processes (which may be unknown at hwpoison time)
> + */
> + ret = VM_FAULT_HWPOISON;
> + goto out_page;
It seems you bail out with the page still locked; that looks like a bug to me.
I think reordering this PageHWPoison() check with the page lock is problematic.
Can you double check?
To be continued.
Chris
> + }
> +
> + swap_update_readahead(folio, vma, vmf->address);
> +
> /*
> * KSM sometimes has to copy on read faults, for example, if
> * folio->index of non-ksm folios would be nonlinear inside the
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e9d0d2784cd5..b4d39f2a1e0a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> - } else {
> - swap_update_readahead(folio, NULL, 0);
> }
>
> if (order > folio_order(folio)) {
> @@ -2431,6 +2429,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> error = -EIO;
> goto failed;
> }
> + if (!skip_swapcache)
> + swap_update_readahead(folio, NULL, 0);
> folio_wait_writeback(folio);
> nr_pages = folio_nr_pages(folio);
>
> diff --git a/mm/swap.h b/mm/swap.h
> index efb6d7ff9f30..bb2adbfd64a9 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -52,6 +52,29 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> }
>
> +/**
> + * folio_contains_swap - Does this folio contain this swap entry?
> + * @folio: The folio.
> + * @entry: The swap entry to check against.
> + *
> + * Swap version of folio_contains()
> + *
> + * Context: The caller should have the folio locked to ensure
> + * nothing will move it out of the swap cache.
> + * Return: true or false.
> + */
> +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> +{
> + pgoff_t offset = swp_offset(entry);
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> + if (unlikely(!folio_test_swapcache(folio)))
> + return false;
> + if (unlikely(swp_type(entry) != swp_type(folio->swap)))
> + return false;
> + return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
> +}
> +
> void show_swap_cache_info(void);
> void *get_shadow_from_swap_cache(swp_entry_t entry);
> int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> @@ -144,6 +167,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> return 0;
> }
>
> +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> +{
> + return false;
> +}
> +
> static inline void show_swap_cache_info(void)
> {
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff9eb761a103..be0d96494dc1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -70,10 +70,12 @@ void show_swap_cache_info(void)
> }
>
> /*
> - * Lookup a swap entry in the swap cache. A found folio will be returned
> - * unlocked and with its refcount incremented.
> + * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> *
> - * Caller must lock the swap device or hold a reference to keep it valid.
> + * A found folio will be returned unlocked and with its refcount increased.
> + *
> + * Context: Caller must ensure @entry is valid and pin the swap device, also
> + * check the returned folio after locking it (e.g. folio_swap_contains).
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> for (;;) {
> int err;
>
> - /* Check the swap cache in case the folio is already there */
> + /*
> + * Check the swap cache first, if a cached folio is found,
> + * return it unlocked. The caller will lock and check it.
> + */
> folio = swap_cache_get_folio(entry);
> if (folio)
> goto got_folio;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4b8ab2cb49ca..12f2580ebe8d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> * Offset could point to the middle of a large folio, or folio
> * may no longer point to the expected offset before it's locked.
> */
> - entry = folio->swap;
> - if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> + if (!folio_contains_swap(folio, entry)) {
> folio_unlock(folio);
> folio_put(folio);
> goto again;
> }
> + entry = folio->swap;
> offset = swp_offset(entry);
>
> need_reclaim = ((flags & TTRS_ANYWAY) ||
> @@ -2150,6 +2150,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> }
>
> folio_lock(folio);
> + if (!folio_contains_swap(folio, entry)) {
> + folio_unlock(folio);
> + folio_put(folio);
> + continue;
> + }
> +
> folio_wait_writeback(folio);
> ret = unuse_pte(vma, pmd, addr, entry, folio);
> if (ret < 0) {
> --
> 2.51.0
>
>
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
2025-08-27 6:13 ` Chris Li
@ 2025-08-27 7:03 ` Chris Li
2025-08-27 14:35 ` Kairui Song
2025-09-02 5:40 ` Barry Song
2025-09-02 10:18 ` David Hildenbrand
3 siblings, 1 reply; 90+ messages in thread
From: Chris Li @ 2025-08-27 7:03 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e9d0d2784cd5..b4d39f2a1e0a 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> - } else {
> - swap_update_readahead(folio, NULL, 0);
Also, moving this readahead update later might have a similar problem.
Any bail-out before the new location will lose the readahead status update.
The readahead itself has already been done, so missing the status update
seems incorrect.
> }
>
> if (order > folio_order(folio)) {
> @@ -2431,6 +2429,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> error = -EIO;
> goto failed;
> }
> + if (!skip_swapcache)
> + swap_update_readahead(folio, NULL, 0);
> folio_wait_writeback(folio);
> nr_pages = folio_nr_pages(folio);
>
> diff --git a/mm/swap.h b/mm/swap.h
> index efb6d7ff9f30..bb2adbfd64a9 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -52,6 +52,29 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> }
>
> +/**
> + * folio_contains_swap - Does this folio contain this swap entry?
> + * @folio: The folio.
> + * @entry: The swap entry to check against.
> + *
> + * Swap version of folio_contains()
> + *
> + * Context: The caller should have the folio locked to ensure
> + * nothing will move it out of the swap cache.
> + * Return: true or false.
> + */
> +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> +{
> + pgoff_t offset = swp_offset(entry);
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> + if (unlikely(!folio_test_swapcache(folio)))
> + return false;
> + if (unlikely(swp_type(entry) != swp_type(folio->swap)))
> + return false;
> + return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
> +}
> +
> void show_swap_cache_info(void);
> void *get_shadow_from_swap_cache(swp_entry_t entry);
> int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> @@ -144,6 +167,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> return 0;
> }
>
> +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> +{
> + return false;
> +}
> +
> static inline void show_swap_cache_info(void)
> {
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ff9eb761a103..be0d96494dc1 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -70,10 +70,12 @@ void show_swap_cache_info(void)
> }
>
> /*
> - * Lookup a swap entry in the swap cache. A found folio will be returned
> - * unlocked and with its refcount incremented.
> + * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> *
> - * Caller must lock the swap device or hold a reference to keep it valid.
> + * A found folio will be returned unlocked and with its refcount increased.
> + *
> + * Context: Caller must ensure @entry is valid and pin the swap device, also
Is the "pin" the same as "lock the swap device or hold a reference"?
Not sure why you changed that comment to "pin".
It seems to me that you want to add the comment for the return value check.
Is that it?
> + * check the returned folio after locking it (e.g. folio_swap_contains).
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> for (;;) {
> int err;
>
> - /* Check the swap cache in case the folio is already there */
> + /*
> + * Check the swap cache first, if a cached folio is found,
> + * return it unlocked. The caller will lock and check it.
> + */
> folio = swap_cache_get_folio(entry);
> if (folio)
> goto got_folio;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 4b8ab2cb49ca..12f2580ebe8d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> * Offset could point to the middle of a large folio, or folio
> * may no longer point to the expected offset before it's locked.
> */
> - entry = folio->swap;
> - if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> + if (!folio_contains_swap(folio, entry)) {
> folio_unlock(folio);
> folio_put(folio);
> goto again;
> }
> + entry = folio->swap;
Can you check this as well? With the "goto again" path, @entry is now
left unassigned compared to before.
It is too late for me to think straight right now about whether that
causes a problem.
To be continued.
Chris
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-27 6:13 ` Chris Li
@ 2025-08-27 13:44 ` Kairui Song
2025-08-30 1:42 ` Chris Li
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-08-27 13:44 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, LKML
On Wed, Aug 27, 2025 at 4:06 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Swap cache lookup is lockless, it only increases the reference count
> > of the returned folio. That's not enough to ensure a folio is stable in
> > the swap cache, so the folio could be removed from the swap cache at any
> > time. The caller always has to lock and check the folio before use.
> >
> > Document this as a comment, and introduce a helper for swap cache folio
> > verification with proper sanity checks.
> >
> > Also, sanitize all current users to use this convention, and use the new
> > helper when possible for easier debugging. Some existing callers won't
> > cause any major problem right now, only trivial issues like incorrect
> > readahead statistic (swapin) or wasted loop (swapoff). It's better to
> > always follow this convention to make things robust.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 28 +++++++++++++---------------
> > mm/shmem.c | 4 ++--
> > mm/swap.h | 28 ++++++++++++++++++++++++++++
> > mm/swap_state.c | 13 +++++++++----
> > mm/swapfile.c | 10 ++++++++--
> > 5 files changed, 60 insertions(+), 23 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 10ef528a5f44..9ca8e1873c6e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4661,12 +4661,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > goto out;
> >
> > folio = swap_cache_get_folio(entry);
> > - if (folio) {
> > - swap_update_readahead(folio, vma, vmf->address);
> > - page = folio_file_page(folio, swp_offset(entry));
> > - }
> > swapcache = folio;
> > -
>
> Can simplify as:
> folio = swapcache = swap_cache_get_folio(entry);
checkpatch.pl is unhappy about it:
checkpatch: multiple assignments should be avoided
>
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > __swap_count(entry) == 1) {
> > @@ -4735,20 +4730,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > ret = VM_FAULT_MAJOR;
> > count_vm_event(PGMAJFAULT);
> > count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > - page = folio_file_page(folio, swp_offset(entry));
> > - } else if (PageHWPoison(page)) {
> > - /*
> > - * hwpoisoned dirty swapcache pages are kept for killing
> > - * owner processes (which may be unknown at hwpoison time)
> > - */
> > - ret = VM_FAULT_HWPOISON;
> > - goto out_release;
>
> Here you move the HWPosion(page) bail out from before taking the page
> lock to after the page lock. The HWPosion page should be able to bail
> out without taking the lock.
>
>
> > }
> >
> > ret |= folio_lock_or_retry(folio, vmf);
> > if (ret & VM_FAULT_RETRY)
> > goto out_release;
> >
> > + page = folio_file_page(folio, swp_offset(entry));
> > if (swapcache) {
> > /*
> > * Make sure folio_free_swap() or swapoff did not release the
> > @@ -4757,10 +4745,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > * swapcache, we need to check that the page's swap has not
> > * changed.
> > */
> > - if (unlikely(!folio_test_swapcache(folio) ||
> > - page_swap_entry(page).val != entry.val))
> > + if (!folio_contains_swap(folio, entry))
> > goto out_page;
> >
> > + if (PageHWPoison(page)) {
> > + /*
> > + * hwpoisoned dirty swapcache pages are kept for killing
> > + * owner processes (which may be unknown at hwpoison time)
> > + */
> > + ret = VM_FAULT_HWPOISON;
> > + goto out_page;
>
> It seems you bail out with the page still locked, that seems like a bug to me.
I think it's the original behaviour that is fragile. The folio
returned by swap_cache_get_folio is unstable unless locked, so it
could have been removed from the swap cache or marked HWPoison by
other threads at any time. The PageHWPoison check here could therefore
be testing an unrelated page, which leads to false positive
PageHWPoison results. We have encountered several similar bugs caused
by using folios returned by swap cache lookup without locking and
checking them first.
So checking HWPoison after locking is actually safer.
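To make the ordering concrete, here is a minimal sketch of the
lock-then-verify convention described above (condensed from the
do_swap_page() hunk; labels and surrounding error handling are
simplified and only illustrative):

    folio = swap_cache_get_folio(entry);
    if (folio) {
            /* The lockless lookup only pins the folio, nothing more. */
            ret |= folio_lock_or_retry(folio, vmf);
            if (ret & VM_FAULT_RETRY)
                    goto out_release;
            /* Now the folio is stable, verify it still belongs to @entry. */
            if (!folio_contains_swap(folio, entry))
                    goto out_page;
            /* Safe: this page is known to back @entry. */
            page = folio_file_page(folio, swp_offset(entry));
            if (PageHWPoison(page)) {
                    ret = VM_FAULT_HWPOISON;
                    goto out_page;  /* out_page unlocks and puts the folio */
            }
    }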
>
> I think this HWPoision() check move order with the page lock is problematic.
>
> Can you double check?
>
> To be continued.
>
> Chris
>
> > + }
> > +
> > + swap_update_readahead(folio, vma, vmf->address);
> > +
> > /*
> > * KSM sometimes has to copy on read faults, for example, if
> > * folio->index of non-ksm folios would be nonlinear inside the
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index e9d0d2784cd5..b4d39f2a1e0a 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > count_vm_event(PGMAJFAULT);
> > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > }
> > - } else {
> > - swap_update_readahead(folio, NULL, 0);
> > }
> >
> > if (order > folio_order(folio)) {
> > @@ -2431,6 +2429,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > error = -EIO;
> > goto failed;
> > }
> > + if (!skip_swapcache)
> > + swap_update_readahead(folio, NULL, 0);
> > folio_wait_writeback(folio);
> > nr_pages = folio_nr_pages(folio);
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index efb6d7ff9f30..bb2adbfd64a9 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -52,6 +52,29 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> > return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> > }
> >
> > +/**
> > + * folio_contains_swap - Does this folio contain this swap entry?
> > + * @folio: The folio.
> > + * @entry: The swap entry to check against.
> > + *
> > + * Swap version of folio_contains()
> > + *
> > + * Context: The caller should have the folio locked to ensure
> > + * nothing will move it out of the swap cache.
> > + * Return: true or false.
> > + */
> > +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> > +{
> > + pgoff_t offset = swp_offset(entry);
> > +
> > + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > + if (unlikely(!folio_test_swapcache(folio)))
> > + return false;
> > + if (unlikely(swp_type(entry) != swp_type(folio->swap)))
> > + return false;
> > + return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
> > +}
> > +
> > void show_swap_cache_info(void);
> > void *get_shadow_from_swap_cache(swp_entry_t entry);
> > int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > @@ -144,6 +167,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> > return 0;
> > }
> >
> > +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> > +{
> > + return false;
> > +}
> > +
> > static inline void show_swap_cache_info(void)
> > {
> > }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ff9eb761a103..be0d96494dc1 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -70,10 +70,12 @@ void show_swap_cache_info(void)
> > }
> >
> > /*
> > - * Lookup a swap entry in the swap cache. A found folio will be returned
> > - * unlocked and with its refcount incremented.
> > + * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> > *
> > - * Caller must lock the swap device or hold a reference to keep it valid.
> > + * A found folio will be returned unlocked and with its refcount increased.
> > + *
> > + * Context: Caller must ensure @entry is valid and pin the swap device, also
> > + * check the returned folio after locking it (e.g. folio_swap_contains).
> > */
> > struct folio *swap_cache_get_folio(swp_entry_t entry)
> > {
> > @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > for (;;) {
> > int err;
> >
> > - /* Check the swap cache in case the folio is already there */
> > + /*
> > + * Check the swap cache first, if a cached folio is found,
> > + * return it unlocked. The caller will lock and check it.
> > + */
> > folio = swap_cache_get_folio(entry);
> > if (folio)
> > goto got_folio;
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4b8ab2cb49ca..12f2580ebe8d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > * Offset could point to the middle of a large folio, or folio
> > * may no longer point to the expected offset before it's locked.
> > */
> > - entry = folio->swap;
> > - if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> > + if (!folio_contains_swap(folio, entry)) {
> > folio_unlock(folio);
> > folio_put(folio);
> > goto again;
> > }
> > + entry = folio->swap;
> > offset = swp_offset(entry);
> >
> > need_reclaim = ((flags & TTRS_ANYWAY) ||
> > @@ -2150,6 +2150,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > }
> >
> > folio_lock(folio);
> > + if (!folio_contains_swap(folio, entry)) {
> > + folio_unlock(folio);
> > + folio_put(folio);
> > + continue;
> > + }
> > +
> > folio_wait_writeback(folio);
> > ret = unuse_pte(vma, pmd, addr, entry, folio);
> > if (ret < 0) {
> > --
> > 2.51.0
> >
> >
>
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-27 2:47 ` Chris Li
2025-08-27 3:50 ` Chris Li
@ 2025-08-27 13:45 ` Kairui Song
1 sibling, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-27 13:45 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Wed, Aug 27, 2025 at 10:59 AM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> This commit message can use some improvement, I feel the part I am
> interested in, what changed is buried in a lot of detail.
>
> The background is that swap_cache_get_folio() used to do readahead
> update as well. It has VMA as part of the argument. However, the
> hibernation usage does not map swap entry to VMA. It was forced to
> call filemap_get_entry() on swap cache instead, due to no VMA.
>
> So the TL; DR; of what this patch does:
>
> Split the swap readahead outside of swap_cache_get_folio(), so that
> the hibernation non VMA usage can reuse swap_cache_get_folio() as
> well. No more calling filemap_get_entry() on swap cache due to lack
> of VMA.
>
> The code itself looks fine. It has gone through some rounds of
> feedback from me already. We can always update the commit message on
> the next iteration.
>
> Acked-by: Chris Li <chrisl@kernel.org>
Thanks for the review and suggestions. Sounds good to me, I'll update
this commit message accordingly in the next version.
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-27 3:52 ` Baoquan He
@ 2025-08-27 13:46 ` Kairui Song
0 siblings, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-27 13:46 UTC (permalink / raw)
To: Baoquan He
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Wed, Aug 27, 2025 at 12:48 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 08/23/25 at 03:20am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> ......snip...
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 99513b74b5d8..ff9eb761a103 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
> > printk("Total swap = %lukB\n", K(total_swap_pages));
> > }
> >
> > +/*
> > + * Lookup a swap entry in the swap cache. A found folio will be returned
>
> Lookup is a noun, we should use 'look up' which is a verb here instead?
Hi Baoquan,
I just checked filemap.c to see how the page cache helpers describe
themselves; 'Look up' is indeed better. Thanks for the review.
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-27 7:03 ` Chris Li
@ 2025-08-27 14:35 ` Kairui Song
2025-08-28 3:41 ` Baolin Wang
2025-08-30 1:53 ` Chris Li
0 siblings, 2 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-27 14:35 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Wed, Aug 27, 2025 at 4:21 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index e9d0d2784cd5..b4d39f2a1e0a 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > count_vm_event(PGMAJFAULT);
> > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > }
> > - } else {
> > - swap_update_readahead(folio, NULL, 0);
>
> Also this update readahead move to later might have a similar problem.
> All the bail out in the move will lose the readahead status update.
>
> The readahead deed is already done. Missing the status update seems
> incorrect.
Thanks for the detailed review.
The only change I wanted here is that the swap readahead update should
be done after checking that the folio still corresponds to the swap
entry triggering the swapin. That should have little to no effect
compared to before, considering the extremely tiny time window. We are
only following the convention more strictly.
In theory it might even help reduce false updates: if the folio no
longer corresponds to the swap entry, we are hitting an unrelated
folio, and doing a readahead update will either mislead the VMA
readahead's address hint, or clear the readahead flag of an unrelated
folio without actually using it. If that folio does get hit in the
future, the statistics will be wrong due to the missing readahead
flag.
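A condensed sketch of the ordering argued for here (hypothetical, not
the exact shmem_swapin_folio() flow; the bail-out label is only
illustrative):

    folio_lock(folio);
    if (!folio_contains_swap(folio, entry)) {
            /* Raced with removal: unrelated folio, skip the update. */
            folio_unlock(folio);
            folio_put(folio);
            goto failed;            /* illustrative label */
    }
    /* Folio verified against @entry, so the readahead hit is real. */
    if (!skip_swapcache)
            swap_update_readahead(folio, NULL, 0);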
>
>
> > }
> >
> > if (order > folio_order(folio)) {
> > @@ -2431,6 +2429,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > error = -EIO;
> > goto failed;
> > }
> > + if (!skip_swapcache)
> > + swap_update_readahead(folio, NULL, 0);
> > folio_wait_writeback(folio);
> > nr_pages = folio_nr_pages(folio);
>
>
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index efb6d7ff9f30..bb2adbfd64a9 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -52,6 +52,29 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> > return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> > }
> >
> > +/**
> > + * folio_contains_swap - Does this folio contain this swap entry?
> > + * @folio: The folio.
> > + * @entry: The swap entry to check against.
> > + *
> > + * Swap version of folio_contains()
> > + *
> > + * Context: The caller should have the folio locked to ensure
> > + * nothing will move it out of the swap cache.
> > + * Return: true or false.
> > + */
> > +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> > +{
> > + pgoff_t offset = swp_offset(entry);
> > +
> > + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > + if (unlikely(!folio_test_swapcache(folio)))
> > + return false;
> > + if (unlikely(swp_type(entry) != swp_type(folio->swap)))
> > + return false;
> > + return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
> > +}
> > +
> > void show_swap_cache_info(void);
> > void *get_shadow_from_swap_cache(swp_entry_t entry);
> > int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > @@ -144,6 +167,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> > return 0;
> > }
> >
> > +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> > +{
> > + return false;
> > +}
> > +
> > static inline void show_swap_cache_info(void)
> > {
> > }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index ff9eb761a103..be0d96494dc1 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -70,10 +70,12 @@ void show_swap_cache_info(void)
> > }
> >
> > /*
> > - * Lookup a swap entry in the swap cache. A found folio will be returned
> > - * unlocked and with its refcount incremented.
> > + * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> > *
> > - * Caller must lock the swap device or hold a reference to keep it valid.
> > + * A found folio will be returned unlocked and with its refcount increased.
> > + *
> > + * Context: Caller must ensure @entry is valid and pin the swap device, also
> Is the "pin" the same as "lock the swap device or hold a reference"?
> Not sure why you changed that comment to "pin".
Yes, it's the same thing. We don't actually lock the device though; the
device can be pinned by a refcount (get_swap_device) or by locking
anything that references the device (locking the PTL of a PTE that
contains a swap entry pointing to the device, or locking a swap cache
folio of a swap entry that points to the device). So I just used the
word "pin".
I added some comments in mm/swap.h in later commits about what "pin"
means.
>
> It seems to me that you want to add the comment for the return value check.
> Is that it?
Right, the caller has to check the folio before use, so I'm trying to
document this convention.
> > + * check the returned folio after locking it (e.g. folio_swap_contains).
> > */
> > struct folio *swap_cache_get_folio(swp_entry_t entry)
> > {
> > @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > for (;;) {
> > int err;
> >
> > - /* Check the swap cache in case the folio is already there */
> > + /*
> > + * Check the swap cache first, if a cached folio is found,
> > + * return it unlocked. The caller will lock and check it.
> > + */
> > folio = swap_cache_get_folio(entry);
> > if (folio)
> > goto got_folio;
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 4b8ab2cb49ca..12f2580ebe8d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > * Offset could point to the middle of a large folio, or folio
> > * may no longer point to the expected offset before it's locked.
> > */
> > - entry = folio->swap;
> > - if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> > + if (!folio_contains_swap(folio, entry)) {
> > folio_unlock(folio);
> > folio_put(folio);
> > goto again;
> > }
> > + entry = folio->swap;
>
> Can you also check this as well? The "goto again" will have entries
> not assigned compared to previously.
> Too late for me to think straight now if that will cause a problem.
Oh, thanks for pointing this part out. This patch is correct; it's the
original behaviour that is not correct. If the folio is no longer
valid (the if check here failed), updating `entry` beforehand could
lead to a wrong lookup in the next attempt with `goto again`, which
could reclaim an unrelated folio. It's a trivial issue though, it
might only marginally slow things down. Maybe I should make a separate
patch to fix this issue first in case anyone wants to backport it.
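In other words, the fix keeps @entry (and @offset) untouched until the
folio is confirmed, roughly like this (sketch based on the hunk above,
assuming @entry is still the caller's original entry at this point):

    again:
            folio = ...;    /* swap cache lookup for the original @entry */
            folio_lock(folio);
            if (!folio_contains_swap(folio, entry)) {
                    /* Unrelated folio: retry with @entry still intact. */
                    folio_unlock(folio);
                    folio_put(folio);
                    goto again;
            }
            /* Only now is it safe to switch to the folio's own entry. */
            entry = folio->swap;
            offset = swp_offset(entry);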
>
> To be continued.
>
> Chris
>
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-27 3:47 ` Baoquan He
@ 2025-08-27 17:44 ` Chris Li
2025-08-27 23:46 ` Baoquan He
` (2 more replies)
0 siblings, 3 replies; 90+ messages in thread
From: Chris Li @ 2025-08-27 17:44 UTC (permalink / raw)
To: Baoquan He
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Tue, Aug 26, 2025 at 8:47 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 08/23/25 at 03:20am, Kairui Song wrote:
> ......
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 223b40f2d37e..7b3efaa51624 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -15,6 +15,8 @@ extern int page_cluster;
> > #define swap_entry_order(order) 0
> > #endif
> >
> > +extern struct swap_info_struct *swap_info[];
> > +
> > /*
> > * We use this to track usage of a cluster. A cluster is a block of swap disk
> > * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> > @@ -53,9 +55,28 @@ enum swap_cluster_flags {
> > #include <linux/swapops.h> /* for swp_offset */
> > #include <linux/blk_types.h> /* for bio_end_io_t */
> >
> > +/*
> > + * Callers of all swp_* helpers here must ensure the entry is valid, and
> > + * pin the swap device by reference or in other ways.
> > + */
> > +static inline struct swap_info_struct *swp_type_info(int type)
> > +{
> > + struct swap_info_struct *si;
> > +
> > + si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
> > + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> > + return si;
> > +}
> > +
> > +static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> > +{
> > + return swp_type_info(swp_type(entry));
> > +}
>
> swp_type_info() is only used by swp_info() in the whole series, can we
> open code it in swp_info()?
BTW, off topic here. I really don't like the "_info" suffix. Anything
you can put into a C struct is by definition some kind of information.
The same goes for the "_struct" suffix: anything defined by a struct
is a struct, no need to say so.
"struct swap_info_struct" has both of these unnecessary words. It
should be something like "struct swap_file" or "struct swap_device".
But renaming it is too invasive to the code base and would mess up the
git annotation history.
Oh well.
Chris
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-27 17:44 ` Chris Li
@ 2025-08-27 23:46 ` Baoquan He
2025-08-30 2:38 ` Chris Li
2025-09-02 6:01 ` Barry Song
2025-09-03 9:28 ` David Hildenbrand
2 siblings, 1 reply; 90+ messages in thread
From: Baoquan He @ 2025-08-27 23:46 UTC (permalink / raw)
To: Chris Li
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On 08/27/25 at 10:44am, Chris Li wrote:
> On Tue, Aug 26, 2025 at 8:47 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 08/23/25 at 03:20am, Kairui Song wrote:
> > ......
> > > diff --git a/mm/swap.h b/mm/swap.h
> > > index 223b40f2d37e..7b3efaa51624 100644
> > > --- a/mm/swap.h
> > > +++ b/mm/swap.h
> > > @@ -15,6 +15,8 @@ extern int page_cluster;
> > > #define swap_entry_order(order) 0
> > > #endif
> > >
> > > +extern struct swap_info_struct *swap_info[];
> > > +
> > > /*
> > > * We use this to track usage of a cluster. A cluster is a block of swap disk
> > > * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> > > @@ -53,9 +55,28 @@ enum swap_cluster_flags {
> > > #include <linux/swapops.h> /* for swp_offset */
> > > #include <linux/blk_types.h> /* for bio_end_io_t */
> > >
> > > +/*
> > > + * Callers of all swp_* helpers here must ensure the entry is valid, and
> > > + * pin the swap device by reference or in other ways.
> > > + */
> > > +static inline struct swap_info_struct *swp_type_info(int type)
> > > +{
> > > + struct swap_info_struct *si;
> > > +
> > > + si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
> > > + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> > > + return si;
> > > +}
> > > +
> > > +static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> > > +{
> > > + return swp_type_info(swp_type(entry));
> > > +}
> >
> > swp_type_info() is only used by swp_info() in the whole series, can we
> > open code it in swp_info()?
>
> BTW, off topic here. I really don't like the "_info" suffix. Anything
> you can put into a C struct by definition is some kind of information.
> Same to the _struct. Anything defined by a struct is a struct. Don't
> need to say that.
> The "struct swap_info_struct" gets two of the unnecessary words. It
> should be something like "struct swap_file" or "struct swap_device".
> Renaming it is too invasive to the code base and it will mess up the
> git annotation history.
I agree. I searched for _info_struct in the current code and only found
swap_info_struct, ax25_info_struct and vm86plus_info_struct. The latter
two appear in very few lines of code. Maybe we can rename it later when
things are all done, and 'struct swap_cluster_info' too.
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
2025-08-27 2:47 ` Chris Li
2025-08-27 3:52 ` Baoquan He
@ 2025-08-28 3:20 ` Baolin Wang
2025-09-01 23:50 ` Barry Song
` (2 subsequent siblings)
5 siblings, 0 replies; 90+ messages in thread
From: Baolin Wang @ 2025-08-28 3:20 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang, Johannes Weiner,
David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 2025/8/23 03:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Always use swap_cache_get_folio for swap cache folio look up. The reason
> we are not using it in all places is that it also updates the readahead
> info, and some callsites want to avoid that.
>
> So decouple readahead update with swap cache lookup into a standalone
> helper, let the caller call the readahead update helper if that's
> needed. And convert all swap cache lookups to use swap_cache_get_folio.
>
> After this commit, there are only three special cases for accessing swap
> cache space now: huge memory splitting, migration and shmem replacing,
> because they need to lock the Xarray. Following commits will wrap their
> accesses to the swap cache too with special helpers.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
LGTM. With the commit message updated:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-27 14:35 ` Kairui Song
@ 2025-08-28 3:41 ` Baolin Wang
2025-08-28 18:05 ` Kairui Song
2025-08-30 1:53 ` Chris Li
1 sibling, 1 reply; 90+ messages in thread
From: Baolin Wang @ 2025-08-28 3:41 UTC (permalink / raw)
To: Kairui Song, Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang, Johannes Weiner,
David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 2025/8/27 22:35, Kairui Song wrote:
> On Wed, Aug 27, 2025 at 4:21 PM Chris Li <chrisl@kernel.org> wrote:
>>
>> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index e9d0d2784cd5..b4d39f2a1e0a 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>> count_vm_event(PGMAJFAULT);
>>> count_memcg_event_mm(fault_mm, PGMAJFAULT);
>>> }
>>> - } else {
>>> - swap_update_readahead(folio, NULL, 0);
>>
>> Also this update readahead move to later might have a similar problem.
>> All the bail out in the move will lose the readahead status update.
>>
>> The readahead deed is already done. Missing the status update seems
>> incorrect.
>
> Thanks for the detailed review.
>
> The only change I wanted here is that swap readahead update should be
> done after checking the folio still corresponds to the swap entry
> triggering the swapin. That should have slight to none effect compared
> to before considering the extremely tiny time window. We are only
> following the convention more strictly.
>
> In theory it might even help to reduce false updates: if the folio no
> longer corresponds to the swap entry, we are hitting an unrelated
> folio, doing a readahead update will either mislead vma readahead's
> address hint, or could clean up the readahead flag of an unrelated
> folio without actually using it. If the folio does get hit in the
> future, due to the missing readahead flag, the statistic will go
> wrong.
Yes, that’s what I thought as well.
By the way, can we do it right all at once in patch 1 (I mean the shmem
changes)?
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-28 3:41 ` Baolin Wang
@ 2025-08-28 18:05 ` Kairui Song
0 siblings, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-28 18:05 UTC (permalink / raw)
To: Baolin Wang
Cc: Chris Li, linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Thu, Aug 28, 2025 at 11:41 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/8/27 22:35, Kairui Song wrote:
> > On Wed, Aug 27, 2025 at 4:21 PM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> >>
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index e9d0d2784cd5..b4d39f2a1e0a 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>> count_vm_event(PGMAJFAULT);
> >>> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> >>> }
> >>> - } else {
> >>> - swap_update_readahead(folio, NULL, 0);
> >>
> >> Also this update readahead move to later might have a similar problem.
> >> All the bail out in the move will lose the readahead status update.
> >>
> >> The readahead deed is already done. Missing the status update seems
> >> incorrect.
> >
> > Thanks for the detailed review.
> >
> > The only change I wanted here is that swap readahead update should be
> > done after checking the folio still corresponds to the swap entry
> > triggering the swapin. That should have slight to none effect compared
> > to before considering the extremely tiny time window. We are only
> > following the convention more strictly.
> >
> > In theory it might even help to reduce false updates: if the folio no
> > longer corresponds to the swap entry, we are hitting an unrelated
> > folio, doing a readahead update will either mislead vma readahead's
> > address hint, or could clean up the readahead flag of an unrelated
> > folio without actually using it. If the folio does get hit in the
> > future, due to the missing readahead flag, the statistic will go
> > wrong.
>
> Yes, that’s what I thought as well.
>
> By the way, can we do it right all at once in patch 1 (I mean the shmem
> changes)?
Hi Baolin,
Yeah, it's OK to do so, but it is a slight behaviour change. Currently
patch 1 has zero behaviour change, so maybe just leave it in this
patch, where we sanitize all the swap cache conventions at once.
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-27 13:44 ` Kairui Song
@ 2025-08-30 1:42 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 1:42 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, LKML
Hi Kairui,
Sorry for the late reply. I have been super busy this week.
On Wed, Aug 27, 2025 at 6:44 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Aug 27, 2025 at 4:06 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 10ef528a5f44..9ca8e1873c6e 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4661,12 +4661,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > goto out;
> > >
> > > folio = swap_cache_get_folio(entry);
> > > - if (folio) {
> > > - swap_update_readahead(folio, vma, vmf->address);
> > > - page = folio_file_page(folio, swp_offset(entry));
> > > - }
> > > swapcache = folio;
> > > -
> >
> > Can simplify as:
> > folio = swapcache = swap_cache_get_folio(entry);
>
> checkpatch.pl is unhappy about it:
>
> checkpatch: multiple assignments should be avoided
Ah, never mind then. I actually like multiple assignments but I can
see checkpatch wants to ban it.
>
> >
> > > if (!folio) {
> > > if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
> > > __swap_count(entry) == 1) {
> > > @@ -4735,20 +4730,13 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > ret = VM_FAULT_MAJOR;
> > > count_vm_event(PGMAJFAULT);
> > > count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
> > > - page = folio_file_page(folio, swp_offset(entry));
> > > - } else if (PageHWPoison(page)) {
> > > - /*
> > > - * hwpoisoned dirty swapcache pages are kept for killing
> > > - * owner processes (which may be unknown at hwpoison time)
> > > - */
> > > - ret = VM_FAULT_HWPOISON;
> > > - goto out_release;
> >
> > Here you move the HWPosion(page) bail out from before taking the page
> > lock to after the page lock. The HWPosion page should be able to bail
> > out without taking the lock.
> >
> >
> > > }
> > >
> > > ret |= folio_lock_or_retry(folio, vmf);
> > > if (ret & VM_FAULT_RETRY)
> > > goto out_release;
> > >
> > > + page = folio_file_page(folio, swp_offset(entry));
> > > if (swapcache) {
> > > /*
> > > * Make sure folio_free_swap() or swapoff did not release the
> > > @@ -4757,10 +4745,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > * swapcache, we need to check that the page's swap has not
> > > * changed.
> > > */
> > > - if (unlikely(!folio_test_swapcache(folio) ||
> > > - page_swap_entry(page).val != entry.val))
> > > + if (!folio_contains_swap(folio, entry))
> > > goto out_page;
> > >
> > > + if (PageHWPoison(page)) {
> > > + /*
> > > + * hwpoisoned dirty swapcache pages are kept for killing
> > > + * owner processes (which may be unknown at hwpoison time)
> > > + */
> > > + ret = VM_FAULT_HWPOISON;
> > > + goto out_page;
> >
> > It seems you bail out with the page still locked, that seems like a bug to me.
>
> I think it's the original behaviour that is kind of fragile. The
> returned folio of swap_cache_get_folio is unstable unless locked, so
> the folio could have been removed from swap cache or marked by some
> other threads as Poisoned. So the PageHWPoison here could be tested
> against an unrelated page, which leads to false positive PageHWPoison
> results. We have encountered several similar bugs due to using folios
> returned by swap cache lookup without lock & checking first.
>
> So checking HWPoison after locking is actually safer.
How do I verify that the code can handle the HWPoison check moving from
before the page lock to after it? I haven't followed the HWPoison path
very much. I am still wondering how the HWPoison code handles the page
that was previously checked unlocked and is now checked locked, without
any additional code change.
If you want this behavior, I strongly suggest you split this portion
of the change out, ideally outside this 9-patch series if you can
afford to do so.
The patch description mentions nothing about this kind of subtle
behavior change, so as a reviewer I am caught off guard by it. At the
very least the commit message should mention it, and explain why the
change is needed and why it is safe.
Having very subtle behavior changes buried in a large code
restructuring change is the worst case. Splitting it out will reduce
the reviewer's workload in reasoning about the behavior change, and
make your series easier to review.
Does that make sense?
Chris
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-27 14:35 ` Kairui Song
2025-08-28 3:41 ` Baolin Wang
@ 2025-08-30 1:53 ` Chris Li
2025-08-30 15:15 ` Kairui Song
2025-09-01 18:17 ` Kairui Song
1 sibling, 2 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 1:53 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Wed, Aug 27, 2025 at 7:36 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Wed, Aug 27, 2025 at 4:21 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index e9d0d2784cd5..b4d39f2a1e0a 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > > count_vm_event(PGMAJFAULT);
> > > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > > }
> > > - } else {
> > > - swap_update_readahead(folio, NULL, 0);
> >
> > Also this update readahead move to later might have a similar problem.
> > All the bail out in the move will lose the readahead status update.
> >
> > The readahead deed is already done. Missing the status update seems
> > incorrect.
>
> Thanks for the detailed review.
>
> The only change I wanted here is that swap readahead update should be
> done after checking the folio still corresponds to the swap entry
> triggering the swapin. That should have slight to none effect compared
> to before considering the extremely tiny time window. We are only
> following the convention more strictly.
>
> In theory it might even help to reduce false updates: if the folio no
> longer corresponds to the swap entry, we are hitting an unrelated
> folio, doing a readahead update will either mislead vma readahead's
> address hint, or could clean up the readahead flag of an unrelated
> folio without actually using it. If the folio does get hit in the
> future, due to the missing readahead flag, the statistic will go
> wrong.
So skipping the readahead stats update is the correct and better
behavior. I suggest you split that out as a separate patch with
appropriate comments about it too. It also makes it easier to bisect
if that kind of subtle change, considered safe at the time, turns out
to cause a problem. That does not happen very often, but it has
happened before.
> > > /*
> > > - * Lookup a swap entry in the swap cache. A found folio will be returned
> > > - * unlocked and with its refcount incremented.
> > > + * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> > > *
> > > - * Caller must lock the swap device or hold a reference to keep it valid.
> > > + * A found folio will be returned unlocked and with its refcount increased.
> > > + *
> > > + * Context: Caller must ensure @entry is valid and pin the swap device, also
> > Is the "pin" the same as "lock the swap device or hold a reference"?
> > Not sure why you changed that comment to "pin".
>
> Yes it's the same thing. We don't lock the device though, the device
> can be pinned by the refcounf (get_swap_device) or locking anything
> that is referencing the device (locking PTL the a PTE that contains an
> swap entry pointing to the device, or locking a swap cache folio of a
> swap entry that points to the device). So I juse used the word "pin".
> I added some comments in mm/swap.h in later commits about what the
> "pin" means.
In that case why not reuse the previous comment keeping "lock the swap
device or hold a reference" instead of "pin"?
> > It seems to me that you want to add the comment for the return value check.
> > Is that it?
>
> Right, the caller has to check the folio before use, so I'm trying to
> document this convention.
Again, I recommend reducing the unnecessary impact on the code to make
it more obvious what you actually changed. I spent quite some time
trying to figure out what you were trying to accomplish with the
comments.
> > > + * check the returned folio after locking it (e.g. folio_swap_contains).
> > > */
> > > struct folio *swap_cache_get_folio(swp_entry_t entry)
> > > {
> > > @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > for (;;) {
> > > int err;
> > >
> > > - /* Check the swap cache in case the folio is already there */
> > > + /*
> > > + * Check the swap cache first, if a cached folio is found,
> > > + * return it unlocked. The caller will lock and check it.
> > > + */
> > > folio = swap_cache_get_folio(entry);
> > > if (folio)
> > > goto got_folio;
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 4b8ab2cb49ca..12f2580ebe8d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > > * Offset could point to the middle of a large folio, or folio
> > > * may no longer point to the expected offset before it's locked.
> > > */
> > > - entry = folio->swap;
> > > - if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> > > + if (!folio_contains_swap(folio, entry)) {
> > > folio_unlock(folio);
> > > folio_put(folio);
> > > goto again;
> > > }
> > > + entry = folio->swap;
> >
> > Can you also check this as well? The "goto again" will have entries
> > not assigned compared to previously.
> > Too late for me to think straight now if that will cause a problem.
>
> Oh, thanks for pointing this part out. This patch is correct, it's the
> original behaviour that is not correct. If the folio is no longer
> valid (the if check here failed), changing the `entry` value before
> could lead to a wrong look in the next attempt with `goto again`. That
> could lead to reclaim of an unrelated folio. It's a trivial issue
> though, only might marginally slow down the performance. Maybe I
> should make a seperate patch to fix this issue first in case anyone
> wants to backport it.
Thanks for the explanation. Please do split this subtle behavior
change out, with an appropriate commit message documenting the change,
why it is safe, and why it is the better behavior.
Thanks
Chris
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
@ 2025-08-30 1:54 ` Baoquan He
2025-08-30 3:40 ` Chris Li
2025-08-30 3:34 ` Chris Li
` (2 subsequent siblings)
3 siblings, 1 reply; 90+ messages in thread
From: Baoquan He @ 2025-08-30 1:54 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On 08/23/25 at 03:20am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
......snip...
> diff --git a/mm/swap.h b/mm/swap.h
> index 7b3efaa51624..4af42bc2cd72 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
......snip...
> +/*
> + * All swap cache helpers below require the caller to ensure the swap entries
> + * are valid and pin the device. This can be guaranteed by:
> + * - get_swap_device: this ensures a single entry is valid and increases the
> + * swap device's refcount.
> + * - Locking a folio in the swap cache: this ensures the folio won't be freed
> + * from the swap cache, stabilizes its entries, and the swap device.
> + * - Locking anything referencing the swap entry: e.g. locking the PTL that
> + * protects swap entries in the page table, so they won't be freed.
> + */
> +extern struct folio *swap_cache_get_folio(swp_entry_t entry);
> +extern void *swap_cache_get_shadow(swp_entry_t entry);
> +extern int swap_cache_add_folio(swp_entry_t entry,
> + struct folio *folio, void **shadow);
> +extern void swap_cache_del_folio(struct folio *folio);
> +/* Below helpers also require the caller to lock the swap cluster. */
> +extern void __swap_cache_del_folio(swp_entry_t entry,
> + struct folio *folio, void *shadow);
> +extern void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> + swp_entry_t entry, struct folio *old,
> + struct folio *new);
> +extern void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
> +
> void show_swap_cache_info(void);
> -void *get_shadow_from_swap_cache(swp_entry_t entry);
> -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> - gfp_t gfp, void **shadowp);
> -void __delete_from_swap_cache(struct folio *folio,
> - swp_entry_t entry, void *shadow);
> -void delete_from_swap_cache(struct folio *folio);
> -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> - unsigned long end);
> void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> -struct folio *swap_cache_get_folio(swp_entry_t entry);
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> struct swap_iocb **plug);
I would put this function renaming change into a separate standalone patch,
so this key patch can focus on introducing the swap table.
> @@ -235,6 +283,33 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>
> #else /* CONFIG_SWAP */
> struct swap_iocb;
> +
> +static inline struct swap_cluster_info *swap_cluster_lock(
> + struct swap_info_struct *si, pgoff_t offset, bool irq)
> +{
> + return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> + struct folio *folio)
> +{
> + return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> + struct folio *folio)
> +{
> + return NULL;
> +}
> +
> +static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> +{
> +}
> +
> +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> +{
> +}
> +
> static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> {
> return NULL;
> @@ -252,11 +327,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
> return NULL;
> }
>
> -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> -{
> - return 0;
> -}
> -
> static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> {
> return false;
> @@ -298,28 +368,27 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> return NULL;
> }
>
> -static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
> +static inline void *swap_cache_get_shadow(swp_entry_t end)
> {
> return NULL;
> }
>
> -static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> - gfp_t gfp_mask, void **shadowp)
> +static inline int swap_cache_add_folio(swp_entry_t end, struct folio *folio, void **shadow)
> {
> - return -1;
> + return -EINVAL;
> }
>
> -static inline void __delete_from_swap_cache(struct folio *folio,
> - swp_entry_t entry, void *shadow)
> +static inline void swap_cache_del_folio(struct folio *folio)
> {
> }
>
> -static inline void delete_from_swap_cache(struct folio *folio)
> +static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
> {
> }
>
> -static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
> - unsigned long end)
> +static inline void __swap_cache_replace_folio(
> + struct swap_cluster_info *ci, swp_entry_t entry,
> + struct folio *old, struct folio *new)
> {
> }
>
> @@ -354,7 +423,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> static inline pgoff_t folio_index(struct folio *folio)
> {
> if (unlikely(folio_test_swapcache(folio)))
> - return swap_cache_index(folio->swap);
> + return swp_offset(folio->swap);
> return folio->index;
> }
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 721ff1a5e73a..c0342024b4a8 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -23,6 +23,7 @@
> #include <linux/huge_mm.h>
> #include <linux/shmem_fs.h>
> #include "internal.h"
> +#include "swap_table.h"
> #include "swap.h"
>
> /*
> @@ -36,8 +37,11 @@ static const struct address_space_operations swap_aops = {
> #endif
> };
>
> -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
> -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
> +/* swap_space is set read-only, as the swap cache is handled by the swap table */
> +struct address_space swap_space __ro_after_init = {
> + .a_ops = &swap_aops,
> +};
> +
> static bool enable_vma_readahead __read_mostly = true;
>
> #define SWAP_RA_ORDER_CEILING 5
> @@ -69,7 +73,7 @@ void show_swap_cache_info(void)
> printk("Total swap = %lukB\n", K(total_swap_pages));
> }
>
> -/*
> +/**
> * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> *
> * A found folio will be returned unlocked and with its refcount increased.
> @@ -79,155 +83,179 @@ void show_swap_cache_info(void)
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> - struct folio *folio = filemap_get_folio(swap_address_space(entry),
> - swap_cache_index(entry));
> - if (!IS_ERR(folio))
> - return folio;
> + unsigned long swp_tb;
> + struct folio *folio;
> +
> + for (;;) {
> + swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> + if (!swp_tb_is_folio(swp_tb))
> + return NULL;
> + folio = swp_tb_to_folio(swp_tb);
> + if (folio_try_get(folio))
> + return folio;
> + }
> +
> return NULL;
> }
>
> -void *get_shadow_from_swap_cache(swp_entry_t entry)
> +/**
> + * swap_cache_get_shadow - Lookup a shadow in the swap cache.
> + *
> + * Context: Caller must ensure @entry is valid and pin the swap device.
> + */
> +void *swap_cache_get_shadow(swp_entry_t entry)
> {
> - struct address_space *address_space = swap_address_space(entry);
> - pgoff_t idx = swap_cache_index(entry);
> - void *shadow;
> + unsigned long swp_tb;
> +
> + swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> + if (swp_tb_is_shadow(swp_tb))
> + return swp_tb_to_shadow(swp_tb);
>
> - shadow = xa_load(&address_space->i_pages, idx);
> - if (xa_is_value(shadow))
> - return shadow;
> return NULL;
> }
>
> -/*
> - * add_to_swap_cache resembles filemap_add_folio on swapper_space,
> - * but sets SwapCache flag and 'swap' instead of mapping and index.
> +/**
> + * swap_cache_add_folio - add a folio into the swap cache.
> + *
> + * The folio will be used for swapin or swapout of swap entries
> + * starting with @entry. May fail due to race.
> + *
> + * Context: Caller must ensure @entry is valid and pin the swap device.
> */
> -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> - gfp_t gfp, void **shadowp)
> +int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadowp)
> {
> - struct address_space *address_space = swap_address_space(entry);
> - pgoff_t idx = swap_cache_index(entry);
> - XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> - unsigned long i, nr = folio_nr_pages(folio);
> - void *old;
> -
> - xas_set_update(&xas, workingset_update_node);
> -
> - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> - VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> - VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
> + unsigned long exist;
> + void *shadow = NULL;
> + struct swap_cluster_info *ci;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> +
> + ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
> + ci_start = swp_cluster_offset(entry);
> + ci_end = ci_start + nr_pages;
> + ci_off = ci_start;
> + do {
> + exist = __swap_table_get(ci, ci_off);
> + if (unlikely(swp_tb_is_folio(exist)))
> + goto fail;
> + if (swp_tb_is_shadow(exist))
> + shadow = swp_tb_to_shadow(exist);
> + } while (++ci_off < ci_end);
> +
> + ci_off = ci_start;
> + do {
> + __swap_table_set_folio(ci, ci_off, folio);
> + } while (++ci_off < ci_end);
>
> - folio_ref_add(folio, nr);
> + folio_ref_add(folio, nr_pages);
> folio_set_swapcache(folio);
> folio->swap = entry;
> + swap_cluster_unlock(ci);
>
> - do {
> - xas_lock_irq(&xas);
> - xas_create_range(&xas);
> - if (xas_error(&xas))
> - goto unlock;
> - for (i = 0; i < nr; i++) {
> - VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
> - if (shadowp) {
> - old = xas_load(&xas);
> - if (xa_is_value(old))
> - *shadowp = old;
> - }
> - xas_store(&xas, folio);
> - xas_next(&xas);
> - }
> - address_space->nrpages += nr;
> - __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
> - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
> -unlock:
> - xas_unlock_irq(&xas);
> - } while (xas_nomem(&xas, gfp));
> -
> - if (!xas_error(&xas))
> - return 0;
> + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
>
> - folio_clear_swapcache(folio);
> - folio_ref_sub(folio, nr);
> - return xas_error(&xas);
> + if (shadowp)
> + *shadowp = shadow;
> + return 0;
> +fail:
> + swap_cluster_unlock(ci);
> + return -EEXIST;
> }
>
> /*
> - * This must be called only on folios that have
> - * been verified to be in the swap cache.
> + * Caller must ensure the folio is in the swap cache and locked,
> + * and must also hold the swap cluster lock.
> */
> -void __delete_from_swap_cache(struct folio *folio,
> - swp_entry_t entry, void *shadow)
> +void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio,
> + void *shadow)
> {
> - struct address_space *address_space = swap_address_space(entry);
> - int i;
> - long nr = folio_nr_pages(folio);
> - pgoff_t idx = swap_cache_index(entry);
> - XA_STATE(xas, &address_space->i_pages, idx);
> -
> - xas_set_update(&xas, workingset_update_node);
> -
> - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> - VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
> - VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
> -
> - for (i = 0; i < nr; i++) {
> - void *entry = xas_store(&xas, shadow);
> - VM_BUG_ON_PAGE(entry != folio, entry);
> - xas_next(&xas);
> - }
> + unsigned long exist;
> + struct swap_cluster_info *ci;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
> +
> + ci = swp_offset_cluster(swp_info(entry), swp_offset(entry));
> + ci_start = swp_cluster_offset(entry);
> + ci_end = ci_start + nr_pages;
> + ci_off = ci_start;
> + do {
> + exist = __swap_table_get(ci, ci_off);
> + VM_WARN_ON_ONCE(swp_tb_to_folio(exist) != folio);
> + /* If shadow is NULL, we set an empty shadow */
> + __swap_table_set_shadow(ci, ci_off, shadow);
> + } while (++ci_off < ci_end);
> +
> folio->swap.val = 0;
> folio_clear_swapcache(folio);
> - address_space->nrpages -= nr;
> - __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
> - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
> + node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
> }
>
> /*
> - * This must be called only on folios that have
> - * been verified to be in the swap cache and locked.
> - * It will never put the folio into the free list,
> - * the caller has a reference on the folio.
> + * Replace an old folio in the swap cache with a new one. The caller must
> + * hold the cluster lock and set the new folio's entry and flags.
> */
> -void delete_from_swap_cache(struct folio *folio)
> +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> + struct folio *old, struct folio *new)
> +{
> + unsigned int ci_off = swp_cluster_offset(entry);
> + unsigned long nr_pages = folio_nr_pages(new);
> + unsigned int ci_end = ci_off + nr_pages;
> +
> + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> + do {
> + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> + __swap_table_set_folio(ci, ci_off, new);
> + } while (++ci_off < ci_end);
> +
> + /*
> + * If the old folio is partially replaced (e.g., splitting a large
> + * folio, the old folio is shrunk in place, and new split sub folios
> + * are added to cache), ensure the new folio doesn't overlap it.
> + */
> + if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> + folio_order(old) != folio_order(new)) {
> + ci_off = swp_cluster_offset(old->swap);
> + ci_end = ci_off + folio_nr_pages(old);
> + while (ci_off++ < ci_end)
> + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> + }
> +}
> +
> +void swap_cache_del_folio(struct folio *folio)
> {
> + struct swap_cluster_info *ci;
> swp_entry_t entry = folio->swap;
> - struct address_space *address_space = swap_address_space(entry);
>
> - xa_lock_irq(&address_space->i_pages);
> - __delete_from_swap_cache(folio, entry, NULL);
> - xa_unlock_irq(&address_space->i_pages);
> + ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
> + __swap_cache_del_folio(entry, folio, NULL);
> + swap_cluster_unlock(ci);
>
> put_swap_folio(folio, entry);
> folio_ref_sub(folio, folio_nr_pages(folio));
> }
>
> -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> - unsigned long end)
> +void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
> {
> - unsigned long curr = begin;
> - void *old;
> -
> - for (;;) {
> - swp_entry_t entry = swp_entry(type, curr);
> - unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
> - struct address_space *address_space = swap_address_space(entry);
> - XA_STATE(xas, &address_space->i_pages, index);
> -
> - xas_set_update(&xas, workingset_update_node);
> -
> - xa_lock_irq(&address_space->i_pages);
> - xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
> - if (!xa_is_value(old))
> - continue;
> - xas_store(&xas, NULL);
> - }
> - xa_unlock_irq(&address_space->i_pages);
> + struct swap_cluster_info *ci = swp_cluster(entry);
> + unsigned int ci_off = swp_cluster_offset(entry), ci_end;
>
> - /* search the next swapcache until we meet end */
> - curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
> - if (curr > end)
> - break;
> - }
> + ci_end = ci_off + nr_ents;
> + do {
> + WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> + __swap_table_init_null(ci, ci_off);
> + } while (++ci_off < ci_end);
> }
>
> /*
> @@ -292,8 +320,7 @@ static inline bool swap_use_vma_readahead(void)
> /*
> * Update the readahead statistics of a vma or globally.
> */
> -void swap_update_readahead(struct folio *folio,
> - struct vm_area_struct *vma,
> +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> unsigned long addr)
> {
> bool readahead, vma_ra = swap_use_vma_readahead();
> @@ -387,7 +414,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> goto put_and_return;
>
> /*
> - * We might race against __delete_from_swap_cache(), and
> + * We might race against __swap_cache_del_folio(), and
> * stumble across a swap_map entry whose SWAP_HAS_CACHE
> * has not yet been cleared. Or race against another
> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> @@ -405,8 +432,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
> goto fail_unlock;
>
> - /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
> + if (swap_cache_add_folio(entry, new_folio, &shadow))
> goto fail_unlock;
>
> memcg1_swapin(entry, 1);
> @@ -572,11 +598,11 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> end_offset = si->max - 1;
>
> blk_start_plug(&plug);
> - for (offset = start_offset; offset <= end_offset ; offset++) {
> + for (offset = start_offset; offset <= end_offset; offset++) {
> /* Ok, do the async read-ahead now */
> folio = __read_swap_cache_async(
> - swp_entry(swp_type(entry), offset),
> - gfp_mask, mpol, ilx, &page_allocated, false);
> + swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
> + &page_allocated, false);
> if (!folio)
> continue;
> if (page_allocated) {
> @@ -600,41 +626,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> return folio;
> }
>
> -int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> -{
> - struct address_space *spaces, *space;
> - unsigned int i, nr;
> -
> - nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> - spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
> - if (!spaces)
> - return -ENOMEM;
> - for (i = 0; i < nr; i++) {
> - space = spaces + i;
> - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> - atomic_set(&space->i_mmap_writable, 0);
> - space->a_ops = &swap_aops;
> - /* swap cache doesn't use writeback related tags */
> - mapping_set_no_writeback_tags(space);
> - }
> - nr_swapper_spaces[type] = nr;
> - swapper_spaces[type] = spaces;
> -
> - return 0;
> -}
> -
> -void exit_swap_address_space(unsigned int type)
> -{
> - int i;
> - struct address_space *spaces = swapper_spaces[type];
> -
> - for (i = 0; i < nr_swapper_spaces[type]; i++)
> - VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
> - kvfree(spaces);
> - nr_swapper_spaces[type] = 0;
> - swapper_spaces[type] = NULL;
> -}
> -
> static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
> unsigned long *end)
> {
> @@ -807,7 +798,7 @@ static const struct attribute_group swap_attr_group = {
> .attrs = swap_attrs,
> };
>
> -static int __init swap_init_sysfs(void)
> +static int __init swap_init(void)
> {
> int err;
> struct kobject *swap_kobj;
> @@ -822,11 +813,13 @@ static int __init swap_init_sysfs(void)
> pr_err("failed to register swap group\n");
> goto delete_obj;
> }
> + /* swap_space is set RO after init, so do it here before init ends. */
> + mapping_set_no_writeback_tags(&swap_space);
> return 0;
>
> delete_obj:
> kobject_put(swap_kobj);
> return err;
> }
> -subsys_initcall(swap_init_sysfs);
> +subsys_initcall(swap_init);
> #endif
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> new file mode 100644
> index 000000000000..ed9676547071
> --- /dev/null
> +++ b/mm/swap_table.h
> @@ -0,0 +1,106 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _MM_SWAP_TABLE_H
> +#define _MM_SWAP_TABLE_H
> +
> +#include "swap.h"
> +
> +/*
> + * A swap table entry represents the status of a swap slot on a swap
> + * (physical or virtual) device. The swap table in each cluster is a
> + * 1:1 map of the swap slots in this cluster.
> + *
> + * Each swap table entry could be a pointer (folio), a XA_VALUE
> + * (shadow), or NULL.
> + */
> +
> +/*
> + * Helpers for casting one type of info into a swap table entry.
> + */
> +static inline unsigned long null_to_swp_tb(void)
> +{
> + BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
> + return 0;
> +}
> +
> +static inline unsigned long folio_to_swp_tb(struct folio *folio)
> +{
> + BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
> + return (unsigned long)folio;
> +}
> +
> +static inline unsigned long shadow_swp_to_tb(void *shadow)
> +{
> + BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
> + BITS_PER_BYTE * sizeof(unsigned long));
> + VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
> + return (unsigned long)shadow;
> +}
> +
> +/*
> + * Helpers for swap table entry type checking.
> + */
> +static inline bool swp_tb_is_null(unsigned long swp_tb)
> +{
> + return !swp_tb;
> +}
> +
> +static inline bool swp_tb_is_folio(unsigned long swp_tb)
> +{
> + return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
> +}
> +
> +static inline bool swp_tb_is_shadow(unsigned long swp_tb)
> +{
> + return xa_is_value((void *)swp_tb);
> +}
> +
> +/*
> + * Helpers for retrieving info from swap table.
> + */
> +static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
> +{
> + VM_WARN_ON(!swp_tb_is_folio(swp_tb));
> + return (void *)swp_tb;
> +}
> +
> +static inline void *swp_tb_to_shadow(unsigned long swp_tb)
> +{
> + VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
> + return (void *)swp_tb;
> +}
> +
> +/*
> + * Helpers for accessing or modifying the swap table of a cluster,
> + * the swap cluster must be locked.
> + */
> +static inline void __swap_table_set(struct swap_cluster_info *ci,
> + unsigned int off, unsigned long swp_tb)
> +{
> + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> + atomic_long_set(&ci->table[off], swp_tb);
> +}
> +
> +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> + unsigned int off)
> +{
> + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> + return atomic_long_read(&ci->table[off]);
> +}
> +
> +static inline void __swap_table_set_folio(struct swap_cluster_info *ci,
> + unsigned int off, struct folio *folio)
> +{
> + __swap_table_set(ci, off, folio_to_swp_tb(folio));
> +}
> +
> +static inline void __swap_table_set_shadow(struct swap_cluster_info *ci,
> + unsigned int off, void *shadow)
> +{
> + __swap_table_set(ci, off, shadow_swp_to_tb(shadow));
> +}
> +
> +static inline void __swap_table_init_null(struct swap_cluster_info *ci, unsigned int off)
> +{
> + __swap_table_set(ci, off, null_to_swp_tb());
> +}
> +#endif
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 85606fbebf0f..df68b5e242a6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -46,6 +46,7 @@
> #include <asm/tlbflush.h>
> #include <linux/swapops.h>
> #include <linux/swap_cgroup.h>
> +#include "swap_table.h"
> #include "internal.h"
> #include "swap.h"
>
> @@ -268,7 +269,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> if (!need_reclaim)
> goto out_unlock;
>
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> folio_set_dirty(folio);
> ret = nr_pages;
> out_unlock:
> @@ -422,6 +423,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
> return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> }
>
> +static int swap_table_alloc_table(struct swap_cluster_info *ci)
> +{
> + WARN_ON(ci->table);
> + ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> + if (!ci->table)
> + return -ENOMEM;
> + return 0;
> +}
> +
> +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> +{
> + unsigned int ci_off;
> + unsigned long swp_tb;
> +
> + if (!ci->table)
> + return;
> +
> + for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> + swp_tb = __swap_table_get(ci, ci_off);
> + if (!swp_tb_is_null(swp_tb))
> + pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> + swp_tb);
> + }
> +
> + kfree(ci->table);
> + ci->table = NULL;
> +}
> +
> static void move_cluster(struct swap_info_struct *si,
> struct swap_cluster_info *ci, struct list_head *list,
> enum swap_cluster_flags new_flags)
> @@ -704,6 +733,25 @@ static bool cluster_scan_range(struct swap_info_struct *si,
> return true;
> }
>
> +/*
> + * Currently, the swap table is not used for count tracking,
> + * just do a sanity check to ensure nothing went wrong.
> + */
> +static void cluster_table_check(struct swap_cluster_info *ci,
> + unsigned int start, unsigned int nr)
> +{
> + unsigned int ci_off = start % SWAPFILE_CLUSTER;
> + unsigned int ci_end = ci_off + nr;
> + unsigned long swp_tb;
> +
> + if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> + do {
> + swp_tb = __swap_table_get(ci, ci_off);
> + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
> + } while (++ci_off < ci_end);
> + }
> +}
> +
> static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
> unsigned int start, unsigned char usage,
> unsigned int order)
> @@ -723,6 +771,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> ci->order = order;
>
> memset(si->swap_map + start, usage, nr_pages);
> + cluster_table_check(ci, start, nr_pages);
> swap_range_alloc(si, nr_pages);
> ci->count += nr_pages;
>
> @@ -1100,8 +1149,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
> static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> unsigned int nr_entries)
> {
> - unsigned long begin = offset;
> - unsigned long end = offset + nr_entries - 1;
> + unsigned long start = offset, end = offset + nr_entries - 1;
And this kind of cleanup or code style adjustment, if added here, will
distract people from focusing on the swap table introduction.
> void (*swap_slot_free_notify)(struct block_device *, unsigned long);
> unsigned int i;
>
> @@ -1125,7 +1173,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> swap_slot_free_notify(si->bdev, offset);
> offset++;
> }
> - clear_shadow_from_swap_cache(si->type, begin, end);
> + __swap_cache_clear_shadow(swp_entry(si->type, start), nr_entries);
>
> /*
> * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
> @@ -1282,15 +1330,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> if (!entry.val)
> return -ENOMEM;
>
> - /*
> - * XArray node allocations from PF_MEMALLOC contexts could
> - * completely exhaust the page allocator. __GFP_NOMEMALLOC
> - * stops emergency reserves from being allocated.
> - *
> - * TODO: this could cause a theoretical memory reclaim
> - * deadlock in the swap out path.
> - */
> - if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
> + if (swap_cache_add_folio(entry, folio, NULL))
> goto out_free;
>
> return 0;
> @@ -1557,6 +1597,7 @@ static void swap_entries_free(struct swap_info_struct *si,
>
> mem_cgroup_uncharge_swap(entry, nr_pages);
> swap_range_free(si, offset, nr_pages);
> + cluster_table_check(ci, offset, nr_pages);
>
> if (!ci->count)
> free_cluster(si, ci);
> @@ -1760,7 +1801,7 @@ bool folio_free_swap(struct folio *folio)
> if (folio_swapped(folio))
> return false;
>
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> folio_set_dirty(folio);
> return true;
> }
> @@ -2634,6 +2675,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
> }
> }
>
> +static void free_cluster_info(struct swap_cluster_info *cluster_info,
> + unsigned long maxpages)
> +{
> + int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> +
> + if (!cluster_info)
> + return;
> + for (i = 0; i < nr_clusters; i++)
> + swap_cluster_free_table(&cluster_info[i]);
> + kvfree(cluster_info);
> +}
> +
> /*
> * Called after swap device's reference count is dead, so
> * neither scan nor allocation will use it.
> @@ -2768,12 +2821,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>
> swap_file = p->swap_file;
> p->swap_file = NULL;
> - p->max = 0;
> swap_map = p->swap_map;
> p->swap_map = NULL;
> zeromap = p->zeromap;
> p->zeromap = NULL;
> cluster_info = p->cluster_info;
> + free_cluster_info(cluster_info, p->max);
> + p->max = 0;
> p->cluster_info = NULL;
> spin_unlock(&p->lock);
> spin_unlock(&swap_lock);
> @@ -2784,10 +2838,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> p->global_cluster = NULL;
> vfree(swap_map);
> kvfree(zeromap);
> - kvfree(cluster_info);
> /* Destroy swap account information */
> swap_cgroup_swapoff(p->type);
> - exit_swap_address_space(p->type);
>
> inode = mapping->host;
>
> @@ -3171,8 +3223,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> if (!cluster_info)
> goto err;
>
> - for (i = 0; i < nr_clusters; i++)
> + for (i = 0; i < nr_clusters; i++) {
> spin_lock_init(&cluster_info[i].lock);
> + if (swap_table_alloc_table(&cluster_info[i]))
> + goto err_free;
> + }
>
> if (!(si->flags & SWP_SOLIDSTATE)) {
> si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> @@ -3233,9 +3288,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> }
>
> return cluster_info;
> -
> err_free:
> - kvfree(cluster_info);
> + free_cluster_info(cluster_info, maxpages);
> err:
> return ERR_PTR(err);
> }
> @@ -3429,13 +3483,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> }
> }
>
> - error = init_swap_address_space(si->type, maxpages);
> - if (error)
> - goto bad_swap_unlock_inode;
> -
> error = zswap_swapon(si->type, maxpages);
> if (error)
> - goto free_swap_address_space;
> + goto bad_swap_unlock_inode;
>
> /*
> * Flush any pending IO and dirty mappings before we start using this
> @@ -3470,8 +3520,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> goto out;
> free_swap_zswap:
> zswap_swapoff(si->type);
> -free_swap_address_space:
> - exit_swap_address_space(si->type);
> bad_swap_unlock_inode:
> inode_unlock(inode);
> bad_swap:
> @@ -3486,7 +3534,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> spin_unlock(&swap_lock);
> vfree(swap_map);
> kvfree(zeromap);
> - kvfree(cluster_info);
> + if (cluster_info)
> + free_cluster_info(cluster_info, maxpages);
> if (inced_nr_rotate_swap)
> atomic_dec(&nr_rotate_swap);
> if (swap_file)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b0afd7f41a22..1ed3cf9dac4e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> {
> int refcount;
> void *shadow = NULL;
> + struct swap_cluster_info *ci;
>
> BUG_ON(!folio_test_locked(folio));
> BUG_ON(mapping != folio_mapping(folio));
>
> - if (!folio_test_swapcache(folio))
> + if (folio_test_swapcache(folio)) {
> + ci = swap_cluster_lock_by_folio_irq(folio);
> + } else {
> spin_lock(&mapping->host->i_lock);
> - xa_lock_irq(&mapping->i_pages);
> + xa_lock_irq(&mapping->i_pages);
> + }
> +
> /*
> * The non racy check for a busy folio.
> *
> @@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
>
> if (reclaimed && !mapping_exiting(mapping))
> shadow = workingset_eviction(folio, target_memcg);
> - __delete_from_swap_cache(folio, swap, shadow);
> + __swap_cache_del_folio(swap, folio, shadow);
> memcg1_swapout(folio, swap);
> - xa_unlock_irq(&mapping->i_pages);
> + swap_cluster_unlock_irq(ci);
> put_swap_folio(folio, swap);
> } else {
> void (*free_folio)(struct folio *);
> @@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> return 1;
>
> cannot_free:
> - xa_unlock_irq(&mapping->i_pages);
> - if (!folio_test_swapcache(folio))
> + if (folio_test_swapcache(folio)) {
> + swap_cluster_unlock_irq(ci);
> + } else {
> + xa_unlock_irq(&mapping->i_pages);
> spin_unlock(&mapping->host->i_lock);
> + }
> return 0;
> }
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ee443b317ac7..c869859eec77 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1166,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
> out:
> if (ret && ret != -EEXIST) {
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> folio_unlock(folio);
> }
> folio_put(folio);
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers
2025-08-22 19:20 ` [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
@ 2025-08-30 2:31 ` Chris Li
2025-09-02 5:53 ` Barry Song
2025-09-02 10:20 ` David Hildenbrand
2 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 2:31 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
A patch with no functional change is easier to review :-)
Acked-by: Chris Li <chrisl@kernel.org>
Chris
On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No feature change. Move cluster-related definitions and helpers to
> mm/swap.h, tidy them up, and add a "swap_" prefix to the cluster
> lock/unlock helpers so they can be used outside of the swap files.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/swap.h | 34 ---------------
> mm/swap.h | 63 ++++++++++++++++++++++++++++
> mm/swapfile.c | 99 ++++++++++++++------------------------------
> 3 files changed, 93 insertions(+), 103 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c2da85cb7fe7..20efd9a34034 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -235,40 +235,6 @@ enum {
> /* Special value in each swap_map continuation */
> #define SWAP_CONT_MAX 0x7f /* Max count */
>
> -/*
> - * We use this to track usage of a cluster. A cluster is a block of swap disk
> - * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> - * free clusters are organized into a list. We fetch an entry from the list to
> - * get a free cluster.
> - *
> - * The flags field determines if a cluster is free. This is
> - * protected by cluster lock.
> - */
> -struct swap_cluster_info {
> - spinlock_t lock; /*
> - * Protect swap_cluster_info fields
> - * other than list, and swap_info_struct->swap_map
> - * elements corresponding to the swap cluster.
> - */
> - u16 count;
> - u8 flags;
> - u8 order;
> - struct list_head list;
> -};
> -
> -/* All on-list cluster must have a non-zero flag. */
> -enum swap_cluster_flags {
> - CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
> - CLUSTER_FLAG_FREE,
> - CLUSTER_FLAG_NONFULL,
> - CLUSTER_FLAG_FRAG,
> - /* Clusters with flags above are allocatable */
> - CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
> - CLUSTER_FLAG_FULL,
> - CLUSTER_FLAG_DISCARD,
> - CLUSTER_FLAG_MAX,
> -};
> -
> /*
> * The first page in the swap file is the swap header, which is always marked
> * bad to prevent it from being allocated as an entry. This also prevents the
> diff --git a/mm/swap.h b/mm/swap.h
> index bb2adbfd64a9..223b40f2d37e 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -7,10 +7,73 @@ struct swap_iocb;
>
> extern int page_cluster;
>
> +#ifdef CONFIG_THP_SWAP
> +#define SWAPFILE_CLUSTER HPAGE_PMD_NR
> +#define swap_entry_order(order) (order)
> +#else
> +#define SWAPFILE_CLUSTER 256
> +#define swap_entry_order(order) 0
> +#endif
> +
> +/*
> + * We use this to track usage of a cluster. A cluster is a block of swap disk
> + * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> + * free clusters are organized into a list. We fetch an entry from the list to
> + * get a free cluster.
> + *
> + * The flags field determines if a cluster is free. This is
> + * protected by cluster lock.
> + */
> +struct swap_cluster_info {
> + spinlock_t lock; /*
> + * Protect swap_cluster_info fields
> + * other than list, and swap_info_struct->swap_map
> + * elements corresponding to the swap cluster.
> + */
> + u16 count;
> + u8 flags;
> + u8 order;
> + struct list_head list;
> +};
> +
> +/* All on-list cluster must have a non-zero flag. */
> +enum swap_cluster_flags {
> + CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
> + CLUSTER_FLAG_FREE,
> + CLUSTER_FLAG_NONFULL,
> + CLUSTER_FLAG_FRAG,
> + /* Clusters with flags above are allocatable */
> + CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
> + CLUSTER_FLAG_FULL,
> + CLUSTER_FLAG_DISCARD,
> + CLUSTER_FLAG_MAX,
> +};
> +
> #ifdef CONFIG_SWAP
> #include <linux/swapops.h> /* for swp_offset */
> #include <linux/blk_types.h> /* for bio_end_io_t */
>
> +static inline struct swap_cluster_info *swp_offset_cluster(
> + struct swap_info_struct *si, pgoff_t offset)
> +{
> + return &si->cluster_info[offset / SWAPFILE_CLUSTER];
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock(
> + struct swap_info_struct *si,
> + unsigned long offset)
> +{
> + struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
> +
> + spin_lock(&ci->lock);
> + return ci;
> +}
> +
> +static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> +{
> + spin_unlock(&ci->lock);
> +}
> +
> /* linux/mm/page_io.c */
> int sio_pool_init(void);
> struct swap_iocb;
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 12f2580ebe8d..618cf4333a3d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -58,9 +58,6 @@ static void swap_entries_free(struct swap_info_struct *si,
> static void swap_range_alloc(struct swap_info_struct *si,
> unsigned int nr_entries);
> static bool folio_swapcache_freeable(struct folio *folio);
> -static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> - unsigned long offset);
> -static inline void unlock_cluster(struct swap_cluster_info *ci);
>
> static DEFINE_SPINLOCK(swap_lock);
> static unsigned int nr_swapfiles;
> @@ -259,9 +256,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> * swap_map is HAS_CACHE only, which means the slots have no page table
> * reference or pending writeback, and can't be allocated to others.
> */
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
> need_reclaim = swap_only_has_cache(si, offset, nr_pages);
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> if (!need_reclaim)
> goto out_unlock;
>
> @@ -386,20 +383,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> }
> }
>
> -#ifdef CONFIG_THP_SWAP
> -#define SWAPFILE_CLUSTER HPAGE_PMD_NR
> -
> -#define swap_entry_order(order) (order)
> -#else
> -#define SWAPFILE_CLUSTER 256
> -
> -/*
> - * Define swap_entry_order() as constant to let compiler to optimize
> - * out some code if !CONFIG_THP_SWAP
> - */
> -#define swap_entry_order(order) 0
> -#endif
> -#define LATENCY_LIMIT 256
> +#define LATENCY_LIMIT 256
>
> static inline bool cluster_is_empty(struct swap_cluster_info *info)
> {
> @@ -426,34 +410,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
> return ci - si->cluster_info;
> }
>
> -static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
> - unsigned long offset)
> -{
> - return &si->cluster_info[offset / SWAPFILE_CLUSTER];
> -}
> -
> static inline unsigned int cluster_offset(struct swap_info_struct *si,
> struct swap_cluster_info *ci)
> {
> return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> }
>
> -static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> - unsigned long offset)
> -{
> - struct swap_cluster_info *ci;
> -
> - ci = offset_to_cluster(si, offset);
> - spin_lock(&ci->lock);
> -
> - return ci;
> -}
> -
> -static inline void unlock_cluster(struct swap_cluster_info *ci)
> -{
> - spin_unlock(&ci->lock);
> -}
> -
> static void move_cluster(struct swap_info_struct *si,
> struct swap_cluster_info *ci, struct list_head *list,
> enum swap_cluster_flags new_flags)
> @@ -809,7 +771,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> }
> out:
> relocate_cluster(si, ci);
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> if (si->flags & SWP_SOLIDSTATE) {
> this_cpu_write(percpu_swap_cluster.offset[order], next);
> this_cpu_write(percpu_swap_cluster.si[order], si);
> @@ -876,7 +838,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> if (ci->flags == CLUSTER_FLAG_NONE)
> relocate_cluster(si, ci);
>
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> if (to_scan <= 0)
> break;
> }
> @@ -915,7 +877,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> if (offset == SWAP_ENTRY_INVALID)
> goto new_cluster;
>
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
> /* Cluster could have been used by another order */
> if (cluster_is_usable(ci, order)) {
> if (cluster_is_empty(ci))
> @@ -923,7 +885,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> found = alloc_swap_scan_cluster(si, ci, offset,
> order, usage);
> } else {
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> }
> if (found)
> goto done;
> @@ -1204,7 +1166,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
> if (!si || !offset || !get_swap_device_info(si))
> return false;
>
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
> if (cluster_is_usable(ci, order)) {
> if (cluster_is_empty(ci))
> offset = cluster_offset(si, ci);
> @@ -1212,7 +1174,7 @@ static bool swap_alloc_fast(swp_entry_t *entry,
> if (found)
> *entry = swp_entry(si->type, found);
> } else {
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> }
>
> put_swap_device(si);
> @@ -1480,14 +1442,14 @@ static void swap_entries_put_cache(struct swap_info_struct *si,
> unsigned long offset = swp_offset(entry);
> struct swap_cluster_info *ci;
>
> - ci = lock_cluster(si, offset);
> - if (swap_only_has_cache(si, offset, nr))
> + ci = swap_cluster_lock(si, offset);
> + if (swap_only_has_cache(si, offset, nr)) {
> swap_entries_free(si, ci, entry, nr);
> - else {
> + } else {
> for (int i = 0; i < nr; i++, entry.val++)
> swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
> }
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> }
>
> static bool swap_entries_put_map(struct swap_info_struct *si,
> @@ -1505,7 +1467,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
> if (count != 1 && count != SWAP_MAP_SHMEM)
> goto fallback;
>
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
> if (!swap_is_last_map(si, offset, nr, &has_cache)) {
> goto locked_fallback;
> }
> @@ -1514,21 +1476,20 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
> else
> for (i = 0; i < nr; i++)
> WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
>
> return has_cache;
>
> fallback:
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
> locked_fallback:
> for (i = 0; i < nr; i++, entry.val++) {
> count = swap_entry_put_locked(si, ci, entry, 1);
> if (count == SWAP_HAS_CACHE)
> has_cache = true;
> }
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> return has_cache;
> -
> }
>
> /*
> @@ -1578,7 +1539,7 @@ static void swap_entries_free(struct swap_info_struct *si,
> unsigned char *map_end = map + nr_pages;
>
> /* It should never free entries across different clusters */
> - VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
> + VM_BUG_ON(ci != swp_offset_cluster(si, offset + nr_pages - 1));
> VM_BUG_ON(cluster_is_empty(ci));
> VM_BUG_ON(ci->count < nr_pages);
>
> @@ -1653,9 +1614,9 @@ bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
> struct swap_cluster_info *ci;
> int count;
>
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
> count = swap_count(si->swap_map[offset]);
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> return !!count;
> }
>
> @@ -1678,7 +1639,7 @@ int swp_swapcount(swp_entry_t entry)
>
> offset = swp_offset(entry);
>
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
>
> count = swap_count(si->swap_map[offset]);
> if (!(count & COUNT_CONTINUED))
> @@ -1701,7 +1662,7 @@ int swp_swapcount(swp_entry_t entry)
> n *= (SWAP_CONT_MAX + 1);
> } while (tmp_count & COUNT_CONTINUED);
> out:
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> return count;
> }
>
> @@ -1716,7 +1677,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
> int i;
> bool ret = false;
>
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
> if (nr_pages == 1) {
> if (swap_count(map[roffset]))
> ret = true;
> @@ -1729,7 +1690,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
> }
> }
> unlock_out:
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> return ret;
> }
>
> @@ -2662,8 +2623,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
> BUG_ON(si->flags & SWP_WRITEOK);
>
> for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
> - ci = lock_cluster(si, offset);
> - unlock_cluster(ci);
> + ci = swap_cluster_lock(si, offset);
> + swap_cluster_unlock(ci);
> }
> }
>
> @@ -3579,7 +3540,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
> offset = swp_offset(entry);
> VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
> VM_WARN_ON(usage == 1 && nr > 1);
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
>
> err = 0;
> for (i = 0; i < nr; i++) {
> @@ -3634,7 +3595,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
> }
>
> unlock_out:
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> return err;
> }
>
> @@ -3733,7 +3694,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
>
> offset = swp_offset(entry);
>
> - ci = lock_cluster(si, offset);
> + ci = swap_cluster_lock(si, offset);
>
> count = swap_count(si->swap_map[offset]);
>
> @@ -3793,7 +3754,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
> out_unlock_cont:
> spin_unlock(&si->cont_lock);
> out:
> - unlock_cluster(ci);
> + swap_cluster_unlock(ci);
> put_swap_device(si);
> outer:
> if (page)
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-27 23:46 ` Baoquan He
@ 2025-08-30 2:38 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 2:38 UTC (permalink / raw)
To: Baoquan He
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Wed, Aug 27, 2025 at 4:46 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 08/27/25 at 10:44am, Chris Li wrote:
> > On Tue, Aug 26, 2025 at 8:47 PM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > On 08/23/25 at 03:20am, Kairui Song wrote:
> > > ......
> > > > diff --git a/mm/swap.h b/mm/swap.h
> > > > index 223b40f2d37e..7b3efaa51624 100644
> > > > --- a/mm/swap.h
> > > > +++ b/mm/swap.h
> > > > @@ -15,6 +15,8 @@ extern int page_cluster;
> > > > #define swap_entry_order(order) 0
> > > > #endif
> > > >
> > > > +extern struct swap_info_struct *swap_info[];
> > > > +
> > > > /*
> > > > * We use this to track usage of a cluster. A cluster is a block of swap disk
> > > > * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
> > > > @@ -53,9 +55,28 @@ enum swap_cluster_flags {
> > > > #include <linux/swapops.h> /* for swp_offset */
> > > > #include <linux/blk_types.h> /* for bio_end_io_t */
> > > >
> > > > +/*
> > > > + * Callers of all swp_* helpers here must ensure the entry is valid, and
> > > > + * pin the swap device by reference or in other ways.
> > > > + */
> > > > +static inline struct swap_info_struct *swp_type_info(int type)
> > > > +{
> > > > + struct swap_info_struct *si;
> > > > +
> > > > + si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
> > > > + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> > > > + return si;
> > > > +}
> > > > +
> > > > +static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> > > > +{
> > > > + return swp_type_info(swp_type(entry));
> > > > +}
> > >
> > > swp_type_info() is only used by swp_info() in the whole series, can we
> > > open code it in swp_info()?
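For illustration, folding it in would presumably look like the sketch below
(untested, just merging the two helpers quoted above):

static inline struct swap_info_struct *swp_info(swp_entry_t entry)
{
	struct swap_info_struct *si;

	si = READ_ONCE(swap_info[swp_type(entry)]); /* rcu_dereference() */
	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
	return si;
}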
> >
> > BTW, off topic here. I really don't like the "_info" suffix. Anything
> > you can put into a C struct is, by definition, some kind of information.
> > The same goes for the _struct suffix: anything defined by a struct is a struct. Don't
> > need to say that.
> > The "struct swap_info_struct" gets two of the unnecessary words. It
> > should be something like "struct swap_file" or "struct swap_device".
> > Renaming it is too invasive to the code base and it will mess up the
> > git annotation history.
>
> I agree. I searched for _info_struct in the current code and only found
> swap_info_struct, ax25_info_struct, and vm86plus_info_struct; the latter
> two appear in very few lines of code. Maybe we can rename it later when
> things are all done. And 'struct swap_cluster_info' too.
Agreed, but that might impact the later parts of Kairui's patch series.
Let's wait until things stabilize a bit.
Again, patches with no functional change are easier to review.
Acked-by: Chris Li <chrisl@kernel.org>
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio
2025-08-25 9:45 ` Kairui Song
@ 2025-08-30 2:41 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 2:41 UTC (permalink / raw)
To: Kairui Song
Cc: Baolin Wang, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Mon, Aug 25, 2025 at 2:46 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Aug 25, 2025 at 11:09 AM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> >
> > On 2025/8/23 03:20, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Shmem may replace a folio in the swap cache if the cached one doesn't
> > > fit the swapin's GFP zone. When doing so, shmem has already double
> > > checked that the swap cache folio is locked, still has the swap cache
> > > flag set, and contains the wanted swap entry. So it is impossible to
> > > fail due to an Xarray mismatch. There is even a comment for that.
> > >
> > > Delete the defensive error handling path, and add a WARN_ON instead:
> > > if that happened, something has broken the basic principle of how the
> > > swap cache works, we should catch and fix that.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/shmem.c | 28 +++-------------------------
> > > 1 file changed, 3 insertions(+), 25 deletions(-)
> > >
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index b4d39f2a1e0a..e03793cc5169 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -2158,35 +2158,13 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> > > /* Swap cache still stores N entries instead of a high-order entry */
> > > xa_lock_irq(&swap_mapping->i_pages);
> > > for (i = 0; i < nr_pages; i++) {
> > > - void *item = xas_load(&xas);
> > > -
> > > - if (item != old) {
> > > - error = -ENOENT;
> > > - break;
> > > - }
> > > -
> > > - xas_store(&xas, new);
> > > + WARN_ON_ONCE(xas_store(&xas, new));
> > > xas_next(&xas);
> > > }
> > > - if (!error) {
> > > - mem_cgroup_replace_folio(old, new);
> > > - shmem_update_stats(new, nr_pages);
> > > - shmem_update_stats(old, -nr_pages);
> > > - }
> >
> > It looks like the shmem statistics update was mistakenly deleted?
>
> Ah, you are right, I'll need to add it back. I somehow misread this as
> the error handling path. I need to add it back and just drop the !error
> check.
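If I read that right, the loop keeps the WARN_ON_ONCE() store and the stats
update simply becomes unconditional, roughly like this (untested sketch
reusing the lines from the hunk above):

	for (i = 0; i < nr_pages; i++) {
		WARN_ON_ONCE(xas_store(&xas, new));
		xas_next(&xas);
	}
	mem_cgroup_replace_folio(old, new);
	shmem_update_stats(new, nr_pages);
	shmem_update_stats(old, -nr_pages);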
+1, I will wait for your next version then. Otherwise the patch looks
fine to me.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
2025-08-30 1:54 ` Baoquan He
@ 2025-08-30 3:34 ` Chris Li
2025-08-30 16:52 ` Kairui Song
2025-09-02 9:55 ` Barry Song
2025-09-03 11:41 ` David Hildenbrand
3 siblings, 1 reply; 90+ messages in thread
From: Chris Li @ 2025-08-30 3:34 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Introduce the basic swap table infrastructure, which for now is just a
> fixed-size flat array inside each swap cluster, with access wrappers.
>
> Each cluster contains a swap table of 512 entries. Each table entry is
> an opaque atomic long. It can be one of three types: a shadow type
> (XA_VALUE), a folio type (pointer), or NULL.
>
> In this first step, it only supports storing a folio or shadow, and it
> is a drop-in replacement for the current swap cache. Convert all swap
> cache users to use the new set of APIs. Chris Li has been suggesting
> using a new infrastructure for swap cache for better performance, and
> that idea combined well with the swap table as the new backing
> structure. Now the lock contention range is reduced to 2M clusters,
> which is much smaller than the 64M address_space. And we can also drop
> the multiple address_space design.
>
> All the internal work is done with the swap_cache_get_* helpers. Swap
> cache lookup is still lock-less as before, and the helpers' contexts are
> the same as those of the original swap cache helpers. They still require
> a pin on the swap device to prevent the backing data from being freed.
>
> Swap cache updates are now protected by the swap cluster lock
> instead of the Xarray lock. This is mostly handled internally, but new
> __swap_cache_* helpers require the caller to lock the cluster. So, a
> few new cluster access and locking helpers are also introduced.
>
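In other words, callers end up looking roughly like the sketch below. This
is illustrative only, with made-up function names, based on the helper
declarations in this patch rather than code taken from the series:

/* Lock-less lookup: pin the device, look up, then unpin. */
static struct folio *example_lookup(swp_entry_t entry)
{
	struct swap_info_struct *si;
	struct folio *folio;

	si = get_swap_device(entry);	/* validates entry, pins the device */
	if (!si)
		return NULL;
	folio = swap_cache_get_folio(entry);	/* lock-less, takes a ref */
	put_swap_device(si);
	return folio;
}

/* Update path: __swap_cache_* callers hold the cluster lock themselves. */
static void example_del(struct folio *folio)
{
	struct swap_cluster_info *ci;
	swp_entry_t entry = folio->swap;

	ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
	__swap_cache_del_folio(entry, folio, NULL);
	swap_cluster_unlock(ci);
}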
> A fully cluster-based unified swap table can be implemented on top
> of this to take care of all count tracking and synchronization work,
> with dynamic allocation. It should reduce the memory usage while
> making the performance even better.
>
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> MAINTAINERS | 1 +
> include/linux/swap.h | 2 -
> mm/filemap.c | 2 +-
> mm/huge_memory.c | 16 +--
> mm/memory-failure.c | 2 +-
> mm/memory.c | 2 +-
> mm/migrate.c | 28 ++--
> mm/shmem.c | 26 ++--
> mm/swap.h | 151 +++++++++++++++------
> mm/swap_state.c | 315 +++++++++++++++++++++----------------------
> mm/swap_table.h | 106 +++++++++++++++
> mm/swapfile.c | 105 +++++++++++----
> mm/vmscan.c | 20 ++-
> mm/zswap.c | 2 +-
> 14 files changed, 500 insertions(+), 278 deletions(-)
> create mode 100644 mm/swap_table.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b6f7c6939ff8..b78adfb3c7f0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16214,6 +16214,7 @@ F: include/linux/swapops.h
> F: mm/page_io.c
> F: mm/swap.c
> F: mm/swap.h
> +F: mm/swap_table.h
> F: mm/swap_state.c
> F: mm/swapfile.c
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index cb59c13fef42..7455df9bf340 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -470,8 +470,6 @@ extern int __swap_count(swp_entry_t entry);
> extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
> extern int swp_swapcount(swp_entry_t entry);
> struct backing_dev_info;
> -extern int init_swap_address_space(unsigned int type, unsigned long nr_pages);
> -extern void exit_swap_address_space(unsigned int type);
> extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
> sector_t swap_folio_sector(struct folio *folio);
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index e4a5a46db89b..1fd0565b56e4 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -4504,7 +4504,7 @@ static void filemap_cachestat(struct address_space *mapping,
> * invalidation, so there might not be
> * a shadow in the swapcache (yet).
> */
> - shadow = get_shadow_from_swap_cache(swp);
> + shadow = swap_cache_get_shadow(swp);
> if (!shadow)
> goto resched;
> }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a47cd3bb649..209580d395a1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3721,7 +3721,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> /* Prevent deferred_split_scan() touching ->_refcount */
> spin_lock(&ds_queue->split_queue_lock);
> if (folio_ref_freeze(folio, 1 + extra_pins)) {
> - struct address_space *swap_cache = NULL;
> + struct swap_cluster_info *swp_ci = NULL;
Not real review feedback, just a pure nitpick:
swp_ci reads strangely to me. How about "cluster" or just "ci"?
> struct lruvec *lruvec;
> int expected_refs;
>
> @@ -3765,8 +3765,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> goto fail;
> }
>
> - swap_cache = swap_address_space(folio->swap);
> - xa_lock(&swap_cache->i_pages);
> + swp_ci = swap_cluster_lock_by_folio(folio);
> }
>
> /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> @@ -3798,10 +3797,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> * Anonymous folio with swap cache.
> * NOTE: shmem in swap cache is not supported yet.
> */
> - if (swap_cache) {
> - __xa_store(&swap_cache->i_pages,
> - swap_cache_index(new_folio->swap),
> - new_folio, 0);
> + if (swp_ci) {
> + __swap_cache_replace_folio(swp_ci, new_folio->swap,
> + folio, new_folio);
> continue;
> }
>
> @@ -3836,8 +3834,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
> unlock_page_lruvec(lruvec);
>
> - if (swap_cache)
> - xa_unlock(&swap_cache->i_pages);
> + if (swp_ci)
> + swap_cluster_unlock(swp_ci);
> } else {
> spin_unlock(&ds_queue->split_queue_lock);
> ret = -EAGAIN;
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index c15ffee7d32b..bb92d0c72aec 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1199,7 +1199,7 @@ static int me_swapcache_clean(struct page_state *ps, struct page *p)
> struct folio *folio = page_folio(p);
> int ret;
>
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
>
> ret = delete_from_lru_cache(folio) ? MF_FAILED : MF_RECOVERED;
> folio_unlock(folio);
> diff --git a/mm/memory.c b/mm/memory.c
> index 9ca8e1873c6e..f81bf06e6ff5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4696,7 +4696,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>
> memcg1_swapin(entry, nr_pages);
>
> - shadow = get_shadow_from_swap_cache(entry);
> + shadow = swap_cache_get_shadow(entry);
> if (shadow)
> workingset_refault(folio, shadow);
>
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 8e435a078fc3..74db32caba2d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -563,10 +563,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> struct folio *newfolio, struct folio *folio, int expected_count)
> {
> XA_STATE(xas, &mapping->i_pages, folio_index(folio));
> + struct swap_cluster_info *swp_ci = NULL;
> struct zone *oldzone, *newzone;
> int dirty;
> long nr = folio_nr_pages(folio);
> - long entries, i;
>
> if (!mapping) {
> /* Take off deferred split queue while frozen and memcg set */
> @@ -592,9 +592,16 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> oldzone = folio_zone(folio);
> newzone = folio_zone(newfolio);
>
> - xas_lock_irq(&xas);
> + if (folio_test_swapcache(folio))
> + swp_ci = swap_cluster_lock_by_folio_irq(folio);
> + else
> + xas_lock_irq(&xas);
> +
> if (!folio_ref_freeze(folio, expected_count)) {
> - xas_unlock_irq(&xas);
> + if (swp_ci)
> + swap_cluster_unlock(swp_ci);
> + else
> + xas_unlock_irq(&xas);
> return -EAGAIN;
> }
>
> @@ -615,9 +622,6 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> if (folio_test_swapcache(folio)) {
> folio_set_swapcache(newfolio);
> newfolio->private = folio_get_private(folio);
> - entries = nr;
> - } else {
> - entries = 1;
> }
>
> /* Move dirty while folio refs frozen and newfolio not yet exposed */
> @@ -627,11 +631,10 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> folio_set_dirty(newfolio);
> }
>
> - /* Swap cache still stores N entries instead of a high-order entry */
> - for (i = 0; i < entries; i++) {
> + if (folio_test_swapcache(folio))
> + __swap_cache_replace_folio(swp_ci, folio->swap, folio, newfolio);
> + else
> xas_store(&xas, newfolio);
> - xas_next(&xas);
> - }
>
> /*
> * Drop cache reference from old folio by unfreezing
> @@ -640,8 +643,11 @@ static int __folio_migrate_mapping(struct address_space *mapping,
> */
> folio_ref_unfreeze(folio, expected_count - nr);
>
> - xas_unlock(&xas);
> /* Leave irq disabled to prevent preemption while updating stats */
> + if (swp_ci)
> + swap_cluster_unlock(swp_ci);
> + else
> + xas_unlock(&xas);
>
> /*
> * If moved to a different zone then also account
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e03793cc5169..f088115cf209 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1698,13 +1698,13 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> }
>
> /*
> - * The delete_from_swap_cache() below could be left for
> + * The swap_cache_del_folio() below could be left for
> * shrink_folio_list()'s folio_free_swap() to dispose of;
> * but I'm a little nervous about letting this folio out of
> * shmem_writeout() in a hybrid half-tmpfs-half-swap state
> * e.g. folio_mapping(folio) might give an unexpected answer.
> */
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> goto redirty;
> }
> if (nr_pages > 1)
> @@ -2082,7 +2082,7 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
> new->swap = entry;
>
> memcg1_swapin(entry, nr_pages);
> - shadow = get_shadow_from_swap_cache(entry);
> + shadow = swap_cache_get_shadow(entry);
> if (shadow)
> workingset_refault(new, shadow);
> folio_add_lru(new);
> @@ -2120,13 +2120,11 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> struct shmem_inode_info *info, pgoff_t index,
> struct vm_area_struct *vma)
> {
> + struct swap_cluster_info *ci;
> struct folio *new, *old = *foliop;
> swp_entry_t entry = old->swap;
> - struct address_space *swap_mapping = swap_address_space(entry);
> - pgoff_t swap_index = swap_cache_index(entry);
> - XA_STATE(xas, &swap_mapping->i_pages, swap_index);
> int nr_pages = folio_nr_pages(old);
> - int error = 0, i;
> + int error = 0;
>
> /*
> * We have arrived here because our zones are constrained, so don't
> @@ -2155,13 +2153,9 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
> new->swap = entry;
> folio_set_swapcache(new);
>
> - /* Swap cache still stores N entries instead of a high-order entry */
> - xa_lock_irq(&swap_mapping->i_pages);
> - for (i = 0; i < nr_pages; i++) {
> - WARN_ON_ONCE(xas_store(&xas, new));
> - xas_next(&xas);
> - }
> - xa_unlock_irq(&swap_mapping->i_pages);
> + ci = swap_cluster_lock_by_folio_irq(old);
> + __swap_cache_replace_folio(ci, entry, old, new);
> + swap_cluster_unlock(ci);
>
> folio_add_lru(new);
> *foliop = new;
> @@ -2198,7 +2192,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
> nr_pages = folio_nr_pages(folio);
> folio_wait_writeback(folio);
> if (!skip_swapcache)
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> /*
> * Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
> * won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
> @@ -2438,7 +2432,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> folio->swap.val = 0;
> swapcache_clear(si, swap, nr_pages);
> } else {
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> }
> folio_mark_dirty(folio);
> swap_free_nr(swap, nr_pages);
> diff --git a/mm/swap.h b/mm/swap.h
> index 7b3efaa51624..4af42bc2cd72 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -2,6 +2,7 @@
> #ifndef _MM_SWAP_H
> #define _MM_SWAP_H
>
> +#include <linux/atomic.h> /* for atomic_long_t */
> struct mempolicy;
> struct swap_iocb;
>
> @@ -35,6 +36,7 @@ struct swap_cluster_info {
> u16 count;
> u8 flags;
> u8 order;
> + atomic_long_t *table; /* Swap table entries, see mm/swap_table.h */
> struct list_head list;
> };
>
> @@ -80,22 +82,62 @@ static inline struct swap_cluster_info *swp_offset_cluster(
> return &si->cluster_info[offset / SWAPFILE_CLUSTER];
> }
>
> -static inline struct swap_cluster_info *swap_cluster_lock(
> - struct swap_info_struct *si,
> - unsigned long offset)
> +static inline struct swap_cluster_info *swp_cluster(swp_entry_t entry)
> +{
> + return swp_offset_cluster(swp_info(entry), swp_offset(entry));
> +}
> +
> +static inline unsigned int swp_cluster_offset(swp_entry_t entry)
> +{
> + return swp_offset(entry) % SWAPFILE_CLUSTER;
> +}
> +
> +/*
> + * Lock the swap cluster of the given offset. The caller must ensure the swap
> + * offset is valid and that the following accesses won't go beyond the locked
> + * cluster. swap_cluster_lock_by_folio is preferred when possible
> + */
> +static __always_inline struct swap_cluster_info *__swap_cluster_lock(
> + struct swap_info_struct *si, unsigned long offset, bool irq)
> {
> struct swap_cluster_info *ci = swp_offset_cluster(si, offset);
>
> VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
> - spin_lock(&ci->lock);
> + if (irq)
> + spin_lock_irq(&ci->lock);
> + else
> + spin_lock(&ci->lock);
> return ci;
> }
> +#define swap_cluster_lock(si, off) __swap_cluster_lock(si, off, false)
> +
> +/*
> + * Lock the swap cluster that holds a folio's swap entries. Caller needs to lock
> + * the folio and ensure it's in the swap cache, and only touch the folio's swap
> + * entries. A folio's entries are always in one cluster, and holding the folio
> + * lock ensures it won't be freed from the swap cache, hence stabilizing the device.
> + */
> +static inline struct swap_cluster_info *__swap_cluster_lock_by_folio(
> + struct folio *folio, bool irq)
> +{
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
> + return __swap_cluster_lock(swp_info(folio->swap),
> + swp_offset(folio->swap), irq);
> +}
> +#define swap_cluster_lock_by_folio(folio) __swap_cluster_lock_by_folio(folio, false)
> +#define swap_cluster_lock_by_folio_irq(folio) __swap_cluster_lock_by_folio(folio, true)
>
> static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> {
> spin_unlock(&ci->lock);
> }
>
> +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> +{
> + spin_unlock_irq(&ci->lock);
> +}
> +
> /* linux/mm/page_io.c */
> int sio_pool_init(void);
> struct swap_iocb;
> @@ -115,10 +157,11 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
> #define SWAP_ADDRESS_SPACE_SHIFT 14
> #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
> #define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
> -extern struct address_space *swapper_spaces[];
> -#define swap_address_space(entry) \
> - (&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> - >> SWAP_ADDRESS_SPACE_SHIFT])
> +extern struct address_space swap_space __ro_after_init;
> +static inline struct address_space *swap_address_space(swp_entry_t entry)
> +{
> + return &swap_space;
> +}
>
> /*
> * Return the swap device position of the swap entry.
> @@ -128,15 +171,6 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
> return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
> }
>
> -/*
> - * Return the swap cache index of the swap entry.
> - */
> -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> -{
> - BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
> - return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> -}
> -
> /**
> * folio_contains_swap - Does this folio contain this swap entry?
> * @folio: The folio.
> @@ -160,17 +194,31 @@ static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
> }
>
> +/*
> + * All swap cache helpers below require the caller to ensure the swap entries
> + * are valid and pin the device. This can be guaranteed by:
> + * - get_swap_device: this ensures a single entry is valid and increases the
> + * swap device's refcount.
> + * - Locking a folio in the swap cache: this ensures the folio won't be freed
> + * from the swap cache, stabilizes its entries, and the swap device.
> + * - Locking anything referencing the swap entry: e.g. locking the PTL that
> + * protects swap entries in the page table, so they won't be freed.
> + */
> +extern struct folio *swap_cache_get_folio(swp_entry_t entry);
> +extern void *swap_cache_get_shadow(swp_entry_t entry);
> +extern int swap_cache_add_folio(swp_entry_t entry,
> + struct folio *folio, void **shadow);
> +extern void swap_cache_del_folio(struct folio *folio);
> +/* Below helpers also require the caller to lock the swap cluster. */
> +extern void __swap_cache_del_folio(swp_entry_t entry,
> + struct folio *folio, void *shadow);
> +extern void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> + swp_entry_t entry, struct folio *old,
> + struct folio *new);
> +extern void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
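Side note for readers of this new contract: below is a minimal caller
sketch of the first rule above. It is illustrative only and not part of
the patch; the function name lookup_example is made up and error handling
is trimmed:

	static struct folio *lookup_example(swp_entry_t entry)
	{
		struct swap_info_struct *si;
		struct folio *folio = NULL;

		/* get_swap_device() checks the entry and pins the device */
		si = get_swap_device(entry);
		if (si) {
			folio = swap_cache_get_folio(entry);
			put_swap_device(si);
		}
		return folio;
	}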
> +
> void show_swap_cache_info(void);
> -void *get_shadow_from_swap_cache(swp_entry_t entry);
> -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> - gfp_t gfp, void **shadowp);
> -void __delete_from_swap_cache(struct folio *folio,
> - swp_entry_t entry, void *shadow);
> -void delete_from_swap_cache(struct folio *folio);
> -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> - unsigned long end);
> void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> -struct folio *swap_cache_get_folio(swp_entry_t entry);
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> struct swap_iocb **plug);
> @@ -235,6 +283,33 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
>
> #else /* CONFIG_SWAP */
> struct swap_iocb;
> +
> +static inline struct swap_cluster_info *swap_cluster_lock(
> + struct swap_info_struct *si, pgoff_t offset, bool irq)
> +{
> + return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> + struct folio *folio)
> +{
> + return NULL;
> +}
> +
> +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> + struct folio *folio)
> +{
> + return NULL;
> +}
> +
> +static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> +{
> +}
> +
> +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> +{
> +}
> +
> static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> {
> return NULL;
> @@ -252,11 +327,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
> return NULL;
> }
>
> -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> -{
> - return 0;
> -}
> -
> static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> {
> return false;
> @@ -298,28 +368,27 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> return NULL;
> }
>
> -static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
> +static inline void *swap_cache_get_shadow(swp_entry_t end)
> {
> return NULL;
> }
>
> -static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> - gfp_t gfp_mask, void **shadowp)
> +static inline int swap_cache_add_folio(swp_entry_t end, struct folio *folio, void **shadow)
> {
> - return -1;
> + return -EINVAL;
> }
>
> -static inline void __delete_from_swap_cache(struct folio *folio,
> - swp_entry_t entry, void *shadow)
> +static inline void swap_cache_del_folio(struct folio *folio)
> {
> }
>
> -static inline void delete_from_swap_cache(struct folio *folio)
> +static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
> {
> }
>
> -static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
> - unsigned long end)
> +static inline void __swap_cache_replace_folio(
> + struct swap_cluster_info *ci, swp_entry_t entry,
> + struct folio *old, struct folio *new)
> {
> }
>
> @@ -354,7 +423,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> static inline pgoff_t folio_index(struct folio *folio)
> {
> if (unlikely(folio_test_swapcache(folio)))
> - return swap_cache_index(folio->swap);
> + return swp_offset(folio->swap);
> return folio->index;
> }
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 721ff1a5e73a..c0342024b4a8 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -23,6 +23,7 @@
> #include <linux/huge_mm.h>
> #include <linux/shmem_fs.h>
> #include "internal.h"
> +#include "swap_table.h"
> #include "swap.h"
>
> /*
> @@ -36,8 +37,11 @@ static const struct address_space_operations swap_aops = {
> #endif
> };
>
> -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
> -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
> +/* swap_space is read-only after init; swap cache is handled by the swap table */
> +struct address_space swap_space __ro_after_init = {
> + .a_ops = &swap_aops,
> +};
> +
> static bool enable_vma_readahead __read_mostly = true;
>
> #define SWAP_RA_ORDER_CEILING 5
> @@ -69,7 +73,7 @@ void show_swap_cache_info(void)
> printk("Total swap = %lukB\n", K(total_swap_pages));
> }
>
> -/*
> +/**
> * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> *
> * A found folio will be returned unlocked and with its refcount increased.
> @@ -79,155 +83,179 @@ void show_swap_cache_info(void)
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> - struct folio *folio = filemap_get_folio(swap_address_space(entry),
> - swap_cache_index(entry));
> - if (!IS_ERR(folio))
> - return folio;
> + unsigned long swp_tb;
> + struct folio *folio;
> +
> + for (;;) {
> + swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> + if (!swp_tb_is_folio(swp_tb))
> + return NULL;
> + folio = swp_tb_to_folio(swp_tb);
> + if (folio_try_get(folio))
> + return folio;
> + }
> +
> return NULL;
> }
>
> -void *get_shadow_from_swap_cache(swp_entry_t entry)
> +/**
> + * swap_cache_get_shadow - Lookup a shadow in the swap cache.
> + *
> + * Context: Caller must ensure @entry is valid and pin the swap device.
> + */
> +void *swap_cache_get_shadow(swp_entry_t entry)
> {
> - struct address_space *address_space = swap_address_space(entry);
> - pgoff_t idx = swap_cache_index(entry);
> - void *shadow;
> + unsigned long swp_tb;
> +
> + swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> + if (swp_tb_is_shadow(swp_tb))
> + return swp_tb_to_shadow(swp_tb);
>
> - shadow = xa_load(&address_space->i_pages, idx);
> - if (xa_is_value(shadow))
> - return shadow;
> return NULL;
> }
>
> -/*
> - * add_to_swap_cache resembles filemap_add_folio on swapper_space,
> - * but sets SwapCache flag and 'swap' instead of mapping and index.
> +/**
> + * swap_cache_add_folio - add a folio into the swap cache.
> + *
> + * The folio will be used for swapin or swapout of swap entries
> + * starting with @entry. May fail due to race.
> + *
> + * Context: Caller must ensure @entry is valid and pin the swap device.
> */
> -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> - gfp_t gfp, void **shadowp)
> +int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadowp)
> {
> - struct address_space *address_space = swap_address_space(entry);
> - pgoff_t idx = swap_cache_index(entry);
> - XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> - unsigned long i, nr = folio_nr_pages(folio);
> - void *old;
> -
> - xas_set_update(&xas, workingset_update_node);
> -
> - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> - VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> - VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
> + unsigned long exist;
> + void *shadow = NULL;
> + struct swap_cluster_info *ci;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> +
> + ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
> + ci_start = swp_cluster_offset(entry);
> + ci_end = ci_start + nr_pages;
> + ci_off = ci_start;
> + do {
> + exist = __swap_table_get(ci, ci_off);
> + if (unlikely(swp_tb_is_folio(exist)))
> + goto fail;
> + if (swp_tb_is_shadow(exist))
> + shadow = swp_tb_to_shadow(exist);
> + } while (++ci_off < ci_end);
> +
> + ci_off = ci_start;
> + do {
> + __swap_table_set_folio(ci, ci_off, folio);
> + } while (++ci_off < ci_end);
>
> - folio_ref_add(folio, nr);
> + folio_ref_add(folio, nr_pages);
> folio_set_swapcache(folio);
> folio->swap = entry;
> + swap_cluster_unlock(ci);
>
> - do {
> - xas_lock_irq(&xas);
> - xas_create_range(&xas);
> - if (xas_error(&xas))
> - goto unlock;
> - for (i = 0; i < nr; i++) {
> - VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
> - if (shadowp) {
> - old = xas_load(&xas);
> - if (xa_is_value(old))
> - *shadowp = old;
> - }
> - xas_store(&xas, folio);
> - xas_next(&xas);
> - }
> - address_space->nrpages += nr;
> - __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
> - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
> -unlock:
> - xas_unlock_irq(&xas);
> - } while (xas_nomem(&xas, gfp));
> -
> - if (!xas_error(&xas))
> - return 0;
> + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
>
> - folio_clear_swapcache(folio);
> - folio_ref_sub(folio, nr);
> - return xas_error(&xas);
> + if (shadowp)
> + *shadowp = shadow;
> + return 0;
> +fail:
> + swap_cluster_unlock(ci);
> + return -EEXIST;
> }
>
> /*
> - * This must be called only on folios that have
> - * been verified to be in the swap cache.
> + * Caller must ensure the folio is in the swap cache and locked,
> + * also lock the swap cluster.
> */
> -void __delete_from_swap_cache(struct folio *folio,
> - swp_entry_t entry, void *shadow)
> +void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio,
> + void *shadow)
> {
> - struct address_space *address_space = swap_address_space(entry);
> - int i;
> - long nr = folio_nr_pages(folio);
> - pgoff_t idx = swap_cache_index(entry);
> - XA_STATE(xas, &address_space->i_pages, idx);
> -
> - xas_set_update(&xas, workingset_update_node);
> -
> - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> - VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
> - VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
> -
> - for (i = 0; i < nr; i++) {
> - void *entry = xas_store(&xas, shadow);
> - VM_BUG_ON_PAGE(entry != folio, entry);
> - xas_next(&xas);
> - }
> + unsigned long exist;
> + struct swap_cluster_info *ci;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
> +
> + ci = swp_offset_cluster(swp_info(entry), swp_offset(entry));
> + ci_start = swp_cluster_offset(entry);
> + ci_end = ci_start + nr_pages;
> + ci_off = ci_start;
> + do {
> + exist = __swap_table_get(ci, ci_off);
> + VM_WARN_ON_ONCE(swp_tb_to_folio(exist) != folio);
> +		/* If shadow is NULL, we set an empty shadow */
> + __swap_table_set_shadow(ci, ci_off, shadow);
> + } while (++ci_off < ci_end);
> +
> folio->swap.val = 0;
> folio_clear_swapcache(folio);
> - address_space->nrpages -= nr;
> - __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
> - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
> + node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
> }
>
> /*
> - * This must be called only on folios that have
> - * been verified to be in the swap cache and locked.
> - * It will never put the folio into the free list,
> - * the caller has a reference on the folio.
> + * Replace an old folio in the swap cache with a new one. The caller must
> + * hold the cluster lock and set the new folio's entry and flags.
> */
> -void delete_from_swap_cache(struct folio *folio)
> +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> + struct folio *old, struct folio *new)
> +{
> + unsigned int ci_off = swp_cluster_offset(entry);
> + unsigned long nr_pages = folio_nr_pages(new);
> + unsigned int ci_end = ci_off + nr_pages;
> +
> + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> + do {
> + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> + __swap_table_set_folio(ci, ci_off, new);
I recall that in my original experimental swap cache replacement patch I
used an atomic compare-exchange somewhere. It has been a while. Is there a
reason not to use atomic cmpxchg() here, or is that coming in a later part
of the series?
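To make the question concrete, something like the following is what I had
in mind for the loop body, purely as an untested sketch reusing the table
helpers this patch introduces (not a request to change anything, just
checking the reasoning):

	long old_tb = folio_to_swp_tb(old);

	/* cluster lock is held, so the exchange is expected to succeed */
	WARN_ON_ONCE(atomic_long_cmpxchg(&ci->table[ci_off], old_tb,
					 folio_to_swp_tb(new)) != old_tb);

That would also catch the slot changing under us, instead of only checking
the old value right before the plain set.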
> + } while (++ci_off < ci_end);
> +
> + /*
> + * If the old folio is partially replaced (e.g., splitting a large
> + * folio, the old folio is shrunk in place, and new split sub folios
> + * are added to cache), ensure the new folio doesn't overlap it.
> + */
> + if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> + folio_order(old) != folio_order(new)) {
> + ci_off = swp_cluster_offset(old->swap);
> + ci_end = ci_off + folio_nr_pages(old);
> + while (ci_off++ < ci_end)
> + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
Will this cause the swap cache to replace less than the full folio range
of swap entries?
Setting a folio in the swap cache should atomically cover the full range
of swap entries. If someone races and sets a partial range, I suspect the
operation should fail and undo the partial set. I recall some xarray bugs
related to exactly this kind of atomic behavior were accidentally fixed by
one of your earlier patches.
I want to make sure a similar bug does not happen here, so it is worth
double-checking the atomic folio-set behavior.
Looks good to me otherwise. Just waiting for confirmation of the swap
cache's atomic set behavior.
Chris
> + }
> +}
> +
> +void swap_cache_del_folio(struct folio *folio)
> {
> + struct swap_cluster_info *ci;
> swp_entry_t entry = folio->swap;
> - struct address_space *address_space = swap_address_space(entry);
>
> - xa_lock_irq(&address_space->i_pages);
> - __delete_from_swap_cache(folio, entry, NULL);
> - xa_unlock_irq(&address_space->i_pages);
> + ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
> + __swap_cache_del_folio(entry, folio, NULL);
> + swap_cluster_unlock(ci);
>
> put_swap_folio(folio, entry);
> folio_ref_sub(folio, folio_nr_pages(folio));
> }
>
> -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> - unsigned long end)
> +void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
> {
> - unsigned long curr = begin;
> - void *old;
> -
> - for (;;) {
> - swp_entry_t entry = swp_entry(type, curr);
> - unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
> - struct address_space *address_space = swap_address_space(entry);
> - XA_STATE(xas, &address_space->i_pages, index);
> -
> - xas_set_update(&xas, workingset_update_node);
> -
> - xa_lock_irq(&address_space->i_pages);
> - xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
> - if (!xa_is_value(old))
> - continue;
> - xas_store(&xas, NULL);
> - }
> - xa_unlock_irq(&address_space->i_pages);
> + struct swap_cluster_info *ci = swp_cluster(entry);
> + unsigned int ci_off = swp_cluster_offset(entry), ci_end;
>
> - /* search the next swapcache until we meet end */
> - curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
> - if (curr > end)
> - break;
> - }
> + ci_end = ci_off + nr_ents;
> + do {
> + WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> + __swap_table_init_null(ci, ci_off);
> + } while (++ci_off < ci_end);
> }
>
> /*
> @@ -292,8 +320,7 @@ static inline bool swap_use_vma_readahead(void)
> /*
> * Update the readahead statistics of a vma or globally.
> */
> -void swap_update_readahead(struct folio *folio,
> - struct vm_area_struct *vma,
> +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> unsigned long addr)
> {
> bool readahead, vma_ra = swap_use_vma_readahead();
> @@ -387,7 +414,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> goto put_and_return;
>
> /*
> - * We might race against __delete_from_swap_cache(), and
> + * We might race against __swap_cache_del_folio(), and
> * stumble across a swap_map entry whose SWAP_HAS_CACHE
> * has not yet been cleared. Or race against another
> * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> @@ -405,8 +432,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
> goto fail_unlock;
>
> - /* May fail (-ENOMEM) if XArray node allocation failed. */
> - if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
> + if (swap_cache_add_folio(entry, new_folio, &shadow))
It feels so good that we can no longer get ENOMEM here. The swap table
page is already allocated when the entry is allocated from the swap
allocator.
> goto fail_unlock;
>
> memcg1_swapin(entry, 1);
> @@ -572,11 +598,11 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> end_offset = si->max - 1;
>
> blk_start_plug(&plug);
> - for (offset = start_offset; offset <= end_offset ; offset++) {
> + for (offset = start_offset; offset <= end_offset; offset++) {
> /* Ok, do the async read-ahead now */
> folio = __read_swap_cache_async(
> - swp_entry(swp_type(entry), offset),
> - gfp_mask, mpol, ilx, &page_allocated, false);
> + swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
> + &page_allocated, false);
> if (!folio)
> continue;
> if (page_allocated) {
> @@ -600,41 +626,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> return folio;
> }
>
> -int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> -{
> - struct address_space *spaces, *space;
> - unsigned int i, nr;
> -
> - nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> - spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
> - if (!spaces)
> - return -ENOMEM;
> - for (i = 0; i < nr; i++) {
> - space = spaces + i;
> - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> - atomic_set(&space->i_mmap_writable, 0);
> - space->a_ops = &swap_aops;
> - /* swap cache doesn't use writeback related tags */
> - mapping_set_no_writeback_tags(space);
> - }
> - nr_swapper_spaces[type] = nr;
> - swapper_spaces[type] = spaces;
> -
> - return 0;
> -}
> -
> -void exit_swap_address_space(unsigned int type)
> -{
> - int i;
> - struct address_space *spaces = swapper_spaces[type];
> -
> - for (i = 0; i < nr_swapper_spaces[type]; i++)
> - VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
> - kvfree(spaces);
> - nr_swapper_spaces[type] = 0;
> - swapper_spaces[type] = NULL;
> -}
> -
> static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
> unsigned long *end)
> {
> @@ -807,7 +798,7 @@ static const struct attribute_group swap_attr_group = {
> .attrs = swap_attrs,
> };
>
> -static int __init swap_init_sysfs(void)
> +static int __init swap_init(void)
> {
> int err;
> struct kobject *swap_kobj;
> @@ -822,11 +813,13 @@ static int __init swap_init_sysfs(void)
> pr_err("failed to register swap group\n");
> goto delete_obj;
> }
> + /* swap_space is set RO after init, so do it here before init ends. */
> + mapping_set_no_writeback_tags(&swap_space);
> return 0;
>
> delete_obj:
> kobject_put(swap_kobj);
> return err;
> }
> -subsys_initcall(swap_init_sysfs);
> +subsys_initcall(swap_init);
> #endif
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> new file mode 100644
> index 000000000000..ed9676547071
> --- /dev/null
> +++ b/mm/swap_table.h
> @@ -0,0 +1,106 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _MM_SWAP_TABLE_H
> +#define _MM_SWAP_TABLE_H
> +
> +#include "swap.h"
> +
> +/*
> + * A swap table entry represents the status of a swap slot on a swap
> + * (physical or virtual) device. The swap table in each cluster is a
> + * 1:1 map of the swap slots in this cluster.
> + *
> + * Each swap table entry could be a pointer (folio), a XA_VALUE
> + * (shadow), or NULL.
> + */
> +
> +/*
> + * Helpers for casting one type of info into a swap table entry.
> + */
> +static inline unsigned long null_to_swp_tb(void)
> +{
> + BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
> + return 0;
> +}
> +
> +static inline unsigned long folio_to_swp_tb(struct folio *folio)
> +{
> + BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
> + return (unsigned long)folio;
> +}
> +
> +static inline unsigned long shadow_swp_to_tb(void *shadow)
> +{
> + BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
> + BITS_PER_BYTE * sizeof(unsigned long));
> + VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
> + return (unsigned long)shadow;
> +}
> +
> +/*
> + * Helpers for swap table entry type checking.
> + */
> +static inline bool swp_tb_is_null(unsigned long swp_tb)
> +{
> + return !swp_tb;
> +}
> +
> +static inline bool swp_tb_is_folio(unsigned long swp_tb)
> +{
> + return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
> +}
> +
> +static inline bool swp_tb_is_shadow(unsigned long swp_tb)
> +{
> + return xa_is_value((void *)swp_tb);
> +}
> +
> +/*
> + * Helpers for retrieving info from swap table.
> + */
> +static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
> +{
> + VM_WARN_ON(!swp_tb_is_folio(swp_tb));
> + return (void *)swp_tb;
> +}
> +
> +static inline void *swp_tb_to_shadow(unsigned long swp_tb)
> +{
> + VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
> + return (void *)swp_tb;
> +}
> +
> +/*
> + * Helpers for accessing or modifying the swap table of a cluster,
> + * the swap cluster must be locked.
> + */
> +static inline void __swap_table_set(struct swap_cluster_info *ci,
> + unsigned int off, unsigned long swp_tb)
> +{
> + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> + atomic_long_set(&ci->table[off], swp_tb);
> +}
> +
> +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> + unsigned int off)
> +{
> + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> + return atomic_long_read(&ci->table[off]);
> +}
> +
> +static inline void __swap_table_set_folio(struct swap_cluster_info *ci,
> + unsigned int off, struct folio *folio)
> +{
> + __swap_table_set(ci, off, folio_to_swp_tb(folio));
> +}
> +
> +static inline void __swap_table_set_shadow(struct swap_cluster_info *ci,
> + unsigned int off, void *shadow)
> +{
> + __swap_table_set(ci, off, shadow_swp_to_tb(shadow));
> +}
> +
> +static inline void __swap_table_init_null(struct swap_cluster_info *ci, unsigned int off)
> +{
> + __swap_table_set(ci, off, null_to_swp_tb());
> +}
> +#endif
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 85606fbebf0f..df68b5e242a6 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -46,6 +46,7 @@
> #include <asm/tlbflush.h>
> #include <linux/swapops.h>
> #include <linux/swap_cgroup.h>
> +#include "swap_table.h"
> #include "internal.h"
> #include "swap.h"
>
> @@ -268,7 +269,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> if (!need_reclaim)
> goto out_unlock;
>
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> folio_set_dirty(folio);
> ret = nr_pages;
> out_unlock:
> @@ -422,6 +423,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
> return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> }
>
> +static int swap_table_alloc_table(struct swap_cluster_info *ci)
> +{
> + WARN_ON(ci->table);
> + ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> + if (!ci->table)
> + return -ENOMEM;
> + return 0;
> +}
> +
> +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> +{
> + unsigned int ci_off;
> + unsigned long swp_tb;
> +
> + if (!ci->table)
> + return;
> +
> + for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> + swp_tb = __swap_table_get(ci, ci_off);
> + if (!swp_tb_is_null(swp_tb))
> + pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> + swp_tb);
> + }
> +
> + kfree(ci->table);
> + ci->table = NULL;
> +}
> +
> static void move_cluster(struct swap_info_struct *si,
> struct swap_cluster_info *ci, struct list_head *list,
> enum swap_cluster_flags new_flags)
> @@ -704,6 +733,25 @@ static bool cluster_scan_range(struct swap_info_struct *si,
> return true;
> }
>
> +/*
> + * Currently, the swap table is not used for count tracking,
> + * just do a sanity check to ensure nothing went wrong.
> + */
> +static void cluster_table_check(struct swap_cluster_info *ci,
> + unsigned int start, unsigned int nr)
> +{
> + unsigned int ci_off = start % SWAPFILE_CLUSTER;
> + unsigned int ci_end = ci_off + nr;
> + unsigned long swp_tb;
> +
> + if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> + do {
> + swp_tb = __swap_table_get(ci, ci_off);
> + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
> + } while (++ci_off < ci_end);
> + }
> +}
> +
> static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
> unsigned int start, unsigned char usage,
> unsigned int order)
> @@ -723,6 +771,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> ci->order = order;
>
> memset(si->swap_map + start, usage, nr_pages);
> + cluster_table_check(ci, start, nr_pages);
> swap_range_alloc(si, nr_pages);
> ci->count += nr_pages;
>
> @@ -1100,8 +1149,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
> static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> unsigned int nr_entries)
> {
> - unsigned long begin = offset;
> - unsigned long end = offset + nr_entries - 1;
> + unsigned long start = offset, end = offset + nr_entries - 1;
> void (*swap_slot_free_notify)(struct block_device *, unsigned long);
> unsigned int i;
>
> @@ -1125,7 +1173,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> swap_slot_free_notify(si->bdev, offset);
> offset++;
> }
> - clear_shadow_from_swap_cache(si->type, begin, end);
> + __swap_cache_clear_shadow(swp_entry(si->type, start), nr_entries);
>
> /*
> * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
> @@ -1282,15 +1330,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> if (!entry.val)
> return -ENOMEM;
>
> - /*
> - * XArray node allocations from PF_MEMALLOC contexts could
> - * completely exhaust the page allocator. __GFP_NOMEMALLOC
> - * stops emergency reserves from being allocated.
> - *
> - * TODO: this could cause a theoretical memory reclaim
> - * deadlock in the swap out path.
> - */
> - if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
> + if (swap_cache_add_folio(entry, folio, NULL))
> goto out_free;
>
> return 0;
> @@ -1557,6 +1597,7 @@ static void swap_entries_free(struct swap_info_struct *si,
>
> mem_cgroup_uncharge_swap(entry, nr_pages);
> swap_range_free(si, offset, nr_pages);
> + cluster_table_check(ci, offset, nr_pages);
>
> if (!ci->count)
> free_cluster(si, ci);
> @@ -1760,7 +1801,7 @@ bool folio_free_swap(struct folio *folio)
> if (folio_swapped(folio))
> return false;
>
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> folio_set_dirty(folio);
> return true;
> }
> @@ -2634,6 +2675,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
> }
> }
>
> +static void free_cluster_info(struct swap_cluster_info *cluster_info,
> + unsigned long maxpages)
> +{
> + int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> +
> + if (!cluster_info)
> + return;
> + for (i = 0; i < nr_clusters; i++)
> + swap_cluster_free_table(&cluster_info[i]);
> + kvfree(cluster_info);
> +}
> +
> /*
> * Called after swap device's reference count is dead, so
> * neither scan nor allocation will use it.
> @@ -2768,12 +2821,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>
> swap_file = p->swap_file;
> p->swap_file = NULL;
> - p->max = 0;
> swap_map = p->swap_map;
> p->swap_map = NULL;
> zeromap = p->zeromap;
> p->zeromap = NULL;
> cluster_info = p->cluster_info;
> + free_cluster_info(cluster_info, p->max);
> + p->max = 0;
> p->cluster_info = NULL;
> spin_unlock(&p->lock);
> spin_unlock(&swap_lock);
> @@ -2784,10 +2838,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> p->global_cluster = NULL;
> vfree(swap_map);
> kvfree(zeromap);
> - kvfree(cluster_info);
> /* Destroy swap account information */
> swap_cgroup_swapoff(p->type);
> - exit_swap_address_space(p->type);
>
> inode = mapping->host;
>
> @@ -3171,8 +3223,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> if (!cluster_info)
> goto err;
>
> - for (i = 0; i < nr_clusters; i++)
> + for (i = 0; i < nr_clusters; i++) {
> spin_lock_init(&cluster_info[i].lock);
> + if (swap_table_alloc_table(&cluster_info[i]))
> + goto err_free;
> + }
>
> if (!(si->flags & SWP_SOLIDSTATE)) {
> si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> @@ -3233,9 +3288,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> }
>
> return cluster_info;
> -
> err_free:
> - kvfree(cluster_info);
> + free_cluster_info(cluster_info, maxpages);
> err:
> return ERR_PTR(err);
> }
> @@ -3429,13 +3483,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> }
> }
>
> - error = init_swap_address_space(si->type, maxpages);
> - if (error)
> - goto bad_swap_unlock_inode;
> -
> error = zswap_swapon(si->type, maxpages);
> if (error)
> - goto free_swap_address_space;
> + goto bad_swap_unlock_inode;
>
> /*
> * Flush any pending IO and dirty mappings before we start using this
> @@ -3470,8 +3520,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> goto out;
> free_swap_zswap:
> zswap_swapoff(si->type);
> -free_swap_address_space:
> - exit_swap_address_space(si->type);
> bad_swap_unlock_inode:
> inode_unlock(inode);
> bad_swap:
> @@ -3486,7 +3534,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> spin_unlock(&swap_lock);
> vfree(swap_map);
> kvfree(zeromap);
> - kvfree(cluster_info);
> + if (cluster_info)
> + free_cluster_info(cluster_info, maxpages);
> if (inced_nr_rotate_swap)
> atomic_dec(&nr_rotate_swap);
> if (swap_file)
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index b0afd7f41a22..1ed3cf9dac4e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> {
> int refcount;
> void *shadow = NULL;
> + struct swap_cluster_info *ci;
>
> BUG_ON(!folio_test_locked(folio));
> BUG_ON(mapping != folio_mapping(folio));
>
> - if (!folio_test_swapcache(folio))
> + if (folio_test_swapcache(folio)) {
> + ci = swap_cluster_lock_by_folio_irq(folio);
A single-statement branch does not require "{".
> + } else {
> spin_lock(&mapping->host->i_lock);
> - xa_lock_irq(&mapping->i_pages);
> + xa_lock_irq(&mapping->i_pages);
> + }
> +
> /*
> * The non racy check for a busy folio.
> *
> @@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
>
> if (reclaimed && !mapping_exiting(mapping))
> shadow = workingset_eviction(folio, target_memcg);
> - __delete_from_swap_cache(folio, swap, shadow);
> + __swap_cache_del_folio(swap, folio, shadow);
> memcg1_swapout(folio, swap);
> - xa_unlock_irq(&mapping->i_pages);
> + swap_cluster_unlock_irq(ci);
> put_swap_folio(folio, swap);
> } else {
> void (*free_folio)(struct folio *);
> @@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> return 1;
>
> cannot_free:
> - xa_unlock_irq(&mapping->i_pages);
> - if (!folio_test_swapcache(folio))
> + if (folio_test_swapcache(folio)) {
> + swap_cluster_unlock_irq(ci);
> + } else {
> + xa_unlock_irq(&mapping->i_pages);
> spin_unlock(&mapping->host->i_lock);
> + }
> return 0;
> }
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index ee443b317ac7..c869859eec77 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1166,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
> out:
> if (ret && ret != -EEXIST) {
> - delete_from_swap_cache(folio);
> + swap_cache_del_folio(folio);
> folio_unlock(folio);
> }
> folio_put(folio);
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-30 1:54 ` Baoquan He
@ 2025-08-30 3:40 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 3:40 UTC (permalink / raw)
To: Baoquan He
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Fri, Aug 29, 2025 at 6:55 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 08/23/25 at 03:20am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> ......snip...
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 7b3efaa51624..4af42bc2cd72 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> ......snip...
> > +/*
> > + * All swap cache helpers below require the caller to ensure the swap entries
> > + * are valid and pin the device. This can be guaranteed by:
> > + * - get_swap_device: this ensures a single entry is valid and increases the
> > + * swap device's refcount.
> > + * - Locking a folio in the swap cache: this ensures the folio won't be freed
> > + * from the swap cache, stabilizes its entries, and the swap device.
> > + * - Locking anything referencing the swap entry: e.g. locking the PTL that
> > + * protects swap entries in the page table, so they won't be freed.
> > + */
> > +extern struct folio *swap_cache_get_folio(swp_entry_t entry);
> > +extern void *swap_cache_get_shadow(swp_entry_t entry);
> > +extern int swap_cache_add_folio(swp_entry_t entry,
> > + struct folio *folio, void **shadow);
> > +extern void swap_cache_del_folio(struct folio *folio);
> > +/* Below helpers also require the caller to lock the swap cluster. */
> > +extern void __swap_cache_del_folio(swp_entry_t entry,
> > + struct folio *folio, void *shadow);
> > +extern void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> > + swp_entry_t entry, struct folio *old,
> > + struct folio *new);
> > +extern void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);
> > +
> > void show_swap_cache_info(void);
> > -void *get_shadow_from_swap_cache(swp_entry_t entry);
> > -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > - gfp_t gfp, void **shadowp);
> > -void __delete_from_swap_cache(struct folio *folio,
> > - swp_entry_t entry, void *shadow);
> > -void delete_from_swap_cache(struct folio *folio);
> > -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > - unsigned long end);
> > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> > -struct folio *swap_cache_get_folio(swp_entry_t entry);
> > struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > struct vm_area_struct *vma, unsigned long addr,
> > struct swap_iocb **plug);
>
> I would move this function renaming into a separate standalone patch,
> so this key patch can focus on introducing the swap table.
>
> > @@ -235,6 +283,33 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> >
> > #else /* CONFIG_SWAP */
> > struct swap_iocb;
> > +
> > +static inline struct swap_cluster_info *swap_cluster_lock(
> > + struct swap_info_struct *si, pgoff_t offset, bool irq)
> > +{
> > + return NULL;
> > +}
> > +
> > +static inline struct swap_cluster_info *swap_cluster_lock_by_folio(
> > + struct folio *folio)
> > +{
> > + return NULL;
> > +}
> > +
> > +static inline struct swap_cluster_info *swap_cluster_lock_by_folio_irq(
> > + struct folio *folio)
> > +{
> > + return NULL;
> > +}
> > +
> > +static inline void swap_cluster_unlock(struct swap_cluster_info *ci)
> > +{
> > +}
> > +
> > +static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
> > +{
> > +}
> > +
> > static inline struct swap_info_struct *swp_info(swp_entry_t entry)
> > {
> > return NULL;
> > @@ -252,11 +327,6 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
> > return NULL;
> > }
> >
> > -static inline pgoff_t swap_cache_index(swp_entry_t entry)
> > -{
> > - return 0;
> > -}
> > -
> > static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> > {
> > return false;
> > @@ -298,28 +368,27 @@ static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> > return NULL;
> > }
> >
> > -static inline void *get_shadow_from_swap_cache(swp_entry_t entry)
> > +static inline void *swap_cache_get_shadow(swp_entry_t end)
> > {
> > return NULL;
> > }
> >
> > -static inline int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > - gfp_t gfp_mask, void **shadowp)
> > +static inline int swap_cache_add_folio(swp_entry_t end, struct folio *folio, void **shadow)
> > {
> > - return -1;
> > + return -EINVAL;
> > }
> >
> > -static inline void __delete_from_swap_cache(struct folio *folio,
> > - swp_entry_t entry, void *shadow)
> > +static inline void swap_cache_del_folio(struct folio *folio)
> > {
> > }
> >
> > -static inline void delete_from_swap_cache(struct folio *folio)
> > +static inline void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio, void *shadow)
> > {
> > }
> >
> > -static inline void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > - unsigned long end)
> > +static inline void __swap_cache_replace_folio(
> > + struct swap_cluster_info *ci, swp_entry_t entry,
> > + struct folio *old, struct folio *new)
> > {
> > }
> >
> > @@ -354,7 +423,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > static inline pgoff_t folio_index(struct folio *folio)
> > {
> > if (unlikely(folio_test_swapcache(folio)))
> > - return swap_cache_index(folio->swap);
> > + return swp_offset(folio->swap);
> > return folio->index;
> > }
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 721ff1a5e73a..c0342024b4a8 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -23,6 +23,7 @@
> > #include <linux/huge_mm.h>
> > #include <linux/shmem_fs.h>
> > #include "internal.h"
> > +#include "swap_table.h"
> > #include "swap.h"
> >
> > /*
> > @@ -36,8 +37,11 @@ static const struct address_space_operations swap_aops = {
> > #endif
> > };
> >
> > -struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly;
> > -static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly;
> > +/* swap_space is read-only after init; swap cache is handled by the swap table */
> > +struct address_space swap_space __ro_after_init = {
> > + .a_ops = &swap_aops,
> > +};
> > +
> > static bool enable_vma_readahead __read_mostly = true;
> >
> > #define SWAP_RA_ORDER_CEILING 5
> > @@ -69,7 +73,7 @@ void show_swap_cache_info(void)
> > printk("Total swap = %lukB\n", K(total_swap_pages));
> > }
> >
> > -/*
> > +/**
> > * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> > *
> > * A found folio will be returned unlocked and with its refcount increased.
> > @@ -79,155 +83,179 @@ void show_swap_cache_info(void)
> > */
> > struct folio *swap_cache_get_folio(swp_entry_t entry)
> > {
> > - struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > - swap_cache_index(entry));
> > - if (!IS_ERR(folio))
> > - return folio;
> > + unsigned long swp_tb;
> > + struct folio *folio;
> > +
> > + for (;;) {
> > + swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> > + if (!swp_tb_is_folio(swp_tb))
> > + return NULL;
> > + folio = swp_tb_to_folio(swp_tb);
> > + if (folio_try_get(folio))
> > + return folio;
> > + }
> > +
> > return NULL;
> > }
> >
> > -void *get_shadow_from_swap_cache(swp_entry_t entry)
> > +/**
> > + * swap_cache_get_shadow - Lookup a shadow in the swap cache.
> > + *
> > + * Context: Caller must ensure @entry is valid and pin the swap device.
> > + */
> > +void *swap_cache_get_shadow(swp_entry_t entry)
> > {
> > - struct address_space *address_space = swap_address_space(entry);
> > - pgoff_t idx = swap_cache_index(entry);
> > - void *shadow;
> > + unsigned long swp_tb;
> > +
> > + swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> > + if (swp_tb_is_shadow(swp_tb))
> > + return swp_tb_to_shadow(swp_tb);
> >
> > - shadow = xa_load(&address_space->i_pages, idx);
> > - if (xa_is_value(shadow))
> > - return shadow;
> > return NULL;
> > }
> >
> > -/*
> > - * add_to_swap_cache resembles filemap_add_folio on swapper_space,
> > - * but sets SwapCache flag and 'swap' instead of mapping and index.
> > +/**
> > + * swap_cache_add_folio - add a folio into the swap cache.
> > + *
> > + * The folio will be used for swapin or swapout of swap entries
> > + * starting with @entry. May fail due to race.
> > + *
> > + * Context: Caller must ensure @entry is valid and pin the swap device.
> > */
> > -int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> > - gfp_t gfp, void **shadowp)
> > +int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadowp)
> > {
> > - struct address_space *address_space = swap_address_space(entry);
> > - pgoff_t idx = swap_cache_index(entry);
> > - XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
> > - unsigned long i, nr = folio_nr_pages(folio);
> > - void *old;
> > -
> > - xas_set_update(&xas, workingset_update_node);
> > -
> > - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> > - VM_BUG_ON_FOLIO(folio_test_swapcache(folio), folio);
> > - VM_BUG_ON_FOLIO(!folio_test_swapbacked(folio), folio);
> > + unsigned long exist;
> > + void *shadow = NULL;
> > + struct swap_cluster_info *ci;
> > + unsigned int ci_start, ci_off, ci_end;
> > + unsigned long nr_pages = folio_nr_pages(folio);
> > +
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> > +
> > + ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
> > + ci_start = swp_cluster_offset(entry);
> > + ci_end = ci_start + nr_pages;
> > + ci_off = ci_start;
> > + do {
> > + exist = __swap_table_get(ci, ci_off);
> > + if (unlikely(swp_tb_is_folio(exist)))
> > + goto fail;
> > + if (swp_tb_is_shadow(exist))
> > + shadow = swp_tb_to_shadow(exist);
> > + } while (++ci_off < ci_end);
> > +
> > + ci_off = ci_start;
> > + do {
> > + __swap_table_set_folio(ci, ci_off, folio);
> > + } while (++ci_off < ci_end);
> >
> > - folio_ref_add(folio, nr);
> > + folio_ref_add(folio, nr_pages);
> > folio_set_swapcache(folio);
> > folio->swap = entry;
> > + swap_cluster_unlock(ci);
> >
> > - do {
> > - xas_lock_irq(&xas);
> > - xas_create_range(&xas);
> > - if (xas_error(&xas))
> > - goto unlock;
> > - for (i = 0; i < nr; i++) {
> > - VM_BUG_ON_FOLIO(xas.xa_index != idx + i, folio);
> > - if (shadowp) {
> > - old = xas_load(&xas);
> > - if (xa_is_value(old))
> > - *shadowp = old;
> > - }
> > - xas_store(&xas, folio);
> > - xas_next(&xas);
> > - }
> > - address_space->nrpages += nr;
> > - __node_stat_mod_folio(folio, NR_FILE_PAGES, nr);
> > - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr);
> > -unlock:
> > - xas_unlock_irq(&xas);
> > - } while (xas_nomem(&xas, gfp));
> > -
> > - if (!xas_error(&xas))
> > - return 0;
> > + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> > + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> >
> > - folio_clear_swapcache(folio);
> > - folio_ref_sub(folio, nr);
> > - return xas_error(&xas);
> > + if (shadowp)
> > + *shadowp = shadow;
> > + return 0;
> > +fail:
> > + swap_cluster_unlock(ci);
> > + return -EEXIST;
> > }
> >
> > /*
> > - * This must be called only on folios that have
> > - * been verified to be in the swap cache.
> > + * Caller must ensure the folio is in the swap cache and locked,
> > + * also lock the swap cluster.
> > */
> > -void __delete_from_swap_cache(struct folio *folio,
> > - swp_entry_t entry, void *shadow)
> > +void __swap_cache_del_folio(swp_entry_t entry, struct folio *folio,
> > + void *shadow)
> > {
> > - struct address_space *address_space = swap_address_space(entry);
> > - int i;
> > - long nr = folio_nr_pages(folio);
> > - pgoff_t idx = swap_cache_index(entry);
> > - XA_STATE(xas, &address_space->i_pages, idx);
> > -
> > - xas_set_update(&xas, workingset_update_node);
> > -
> > - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> > - VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
> > - VM_BUG_ON_FOLIO(folio_test_writeback(folio), folio);
> > -
> > - for (i = 0; i < nr; i++) {
> > - void *entry = xas_store(&xas, shadow);
> > - VM_BUG_ON_PAGE(entry != folio, entry);
> > - xas_next(&xas);
> > - }
> > + unsigned long exist;
> > + struct swap_cluster_info *ci;
> > + unsigned int ci_start, ci_off, ci_end;
> > + unsigned long nr_pages = folio_nr_pages(folio);
> > +
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
> > +
> > + ci = swp_offset_cluster(swp_info(entry), swp_offset(entry));
> > + ci_start = swp_cluster_offset(entry);
> > + ci_end = ci_start + nr_pages;
> > + ci_off = ci_start;
> > + do {
> > + exist = __swap_table_get(ci, ci_off);
> > + VM_WARN_ON_ONCE(swp_tb_to_folio(exist) != folio);
> > +		/* If shadow is NULL, we set an empty shadow */
> > + __swap_table_set_shadow(ci, ci_off, shadow);
> > + } while (++ci_off < ci_end);
> > +
> > folio->swap.val = 0;
> > folio_clear_swapcache(folio);
> > - address_space->nrpages -= nr;
> > - __node_stat_mod_folio(folio, NR_FILE_PAGES, -nr);
> > - __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr);
> > + node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
> > + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
> > }
> >
> > /*
> > - * This must be called only on folios that have
> > - * been verified to be in the swap cache and locked.
> > - * It will never put the folio into the free list,
> > - * the caller has a reference on the folio.
> > + * Replace an old folio in the swap cache with a new one. The caller must
> > + * hold the cluster lock and set the new folio's entry and flags.
> > */
> > -void delete_from_swap_cache(struct folio *folio)
> > +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> > + struct folio *old, struct folio *new)
> > +{
> > + unsigned int ci_off = swp_cluster_offset(entry);
> > + unsigned long nr_pages = folio_nr_pages(new);
> > + unsigned int ci_end = ci_off + nr_pages;
> > +
> > + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> > + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> > + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> > + do {
> > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > + __swap_table_set_folio(ci, ci_off, new);
> > + } while (++ci_off < ci_end);
> > +
> > + /*
> > + * If the old folio is partially replaced (e.g., splitting a large
> > + * folio, the old folio is shrunk in place, and new split sub folios
> > + * are added to cache), ensure the new folio doesn't overlap it.
> > + */
> > + if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> > + folio_order(old) != folio_order(new)) {
> > + ci_off = swp_cluster_offset(old->swap);
> > + ci_end = ci_off + folio_nr_pages(old);
> > + for (; ci_off < ci_end; ci_off++)
> > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > + }
> > +}
> > +
> > +void swap_cache_del_folio(struct folio *folio)
> > {
> > + struct swap_cluster_info *ci;
> > swp_entry_t entry = folio->swap;
> > - struct address_space *address_space = swap_address_space(entry);
> >
> > - xa_lock_irq(&address_space->i_pages);
> > - __delete_from_swap_cache(folio, entry, NULL);
> > - xa_unlock_irq(&address_space->i_pages);
> > + ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
> > + __swap_cache_del_folio(entry, folio, NULL);
> > + swap_cluster_unlock(ci);
> >
> > put_swap_folio(folio, entry);
> > folio_ref_sub(folio, folio_nr_pages(folio));
> > }
> >
> > -void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > - unsigned long end)
> > +void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents)
> > {
> > - unsigned long curr = begin;
> > - void *old;
> > -
> > - for (;;) {
> > - swp_entry_t entry = swp_entry(type, curr);
> > - unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
> > - struct address_space *address_space = swap_address_space(entry);
> > - XA_STATE(xas, &address_space->i_pages, index);
> > -
> > - xas_set_update(&xas, workingset_update_node);
> > -
> > - xa_lock_irq(&address_space->i_pages);
> > - xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
> > - if (!xa_is_value(old))
> > - continue;
> > - xas_store(&xas, NULL);
> > - }
> > - xa_unlock_irq(&address_space->i_pages);
> > + struct swap_cluster_info *ci = swp_cluster(entry);
> > + unsigned int ci_off = swp_cluster_offset(entry), ci_end;
> >
> > - /* search the next swapcache until we meet end */
> > - curr = ALIGN((curr + 1), SWAP_ADDRESS_SPACE_PAGES);
> > - if (curr > end)
> > - break;
> > - }
> > + ci_end = ci_off + nr_ents;
> > + do {
> > + WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> > + __swap_table_init_null(ci, ci_off);
> > + } while (++ci_off < ci_end);
> > }
> >
> > /*
> > @@ -292,8 +320,7 @@ static inline bool swap_use_vma_readahead(void)
> > /*
> > * Update the readahead statistics of a vma or globally.
> > */
> > -void swap_update_readahead(struct folio *folio,
> > - struct vm_area_struct *vma,
> > +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> > unsigned long addr)
> > {
> > bool readahead, vma_ra = swap_use_vma_readahead();
> > @@ -387,7 +414,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > goto put_and_return;
> >
> > /*
> > - * We might race against __delete_from_swap_cache(), and
> > + * We might race against __swap_cache_del_folio(), and
> > * stumble across a swap_map entry whose SWAP_HAS_CACHE
> > * has not yet been cleared. Or race against another
> > * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> > @@ -405,8 +432,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
> > goto fail_unlock;
> >
> > - /* May fail (-ENOMEM) if XArray node allocation failed. */
> > - if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
> > + if (swap_cache_add_folio(entry, new_folio, &shadow))
> > goto fail_unlock;
> >
> > memcg1_swapin(entry, 1);
> > @@ -572,11 +598,11 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > end_offset = si->max - 1;
> >
> > blk_start_plug(&plug);
> > - for (offset = start_offset; offset <= end_offset ; offset++) {
> > + for (offset = start_offset; offset <= end_offset; offset++) {
> > /* Ok, do the async read-ahead now */
> > folio = __read_swap_cache_async(
> > - swp_entry(swp_type(entry), offset),
> > - gfp_mask, mpol, ilx, &page_allocated, false);
> > + swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
> > + &page_allocated, false);
> > if (!folio)
> > continue;
> > if (page_allocated) {
> > @@ -600,41 +626,6 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> > return folio;
> > }
> >
> > -int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> > -{
> > - struct address_space *spaces, *space;
> > - unsigned int i, nr;
> > -
> > - nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> > - spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL);
> > - if (!spaces)
> > - return -ENOMEM;
> > - for (i = 0; i < nr; i++) {
> > - space = spaces + i;
> > - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ);
> > - atomic_set(&space->i_mmap_writable, 0);
> > - space->a_ops = &swap_aops;
> > - /* swap cache doesn't use writeback related tags */
> > - mapping_set_no_writeback_tags(space);
> > - }
> > - nr_swapper_spaces[type] = nr;
> > - swapper_spaces[type] = spaces;
> > -
> > - return 0;
> > -}
> > -
> > -void exit_swap_address_space(unsigned int type)
> > -{
> > - int i;
> > - struct address_space *spaces = swapper_spaces[type];
> > -
> > - for (i = 0; i < nr_swapper_spaces[type]; i++)
> > - VM_WARN_ON_ONCE(!mapping_empty(&spaces[i]));
> > - kvfree(spaces);
> > - nr_swapper_spaces[type] = 0;
> > - swapper_spaces[type] = NULL;
> > -}
> > -
> > static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start,
> > unsigned long *end)
> > {
> > @@ -807,7 +798,7 @@ static const struct attribute_group swap_attr_group = {
> > .attrs = swap_attrs,
> > };
> >
> > -static int __init swap_init_sysfs(void)
> > +static int __init swap_init(void)
> > {
> > int err;
> > struct kobject *swap_kobj;
> > @@ -822,11 +813,13 @@ static int __init swap_init_sysfs(void)
> > pr_err("failed to register swap group\n");
> > goto delete_obj;
> > }
> > + /* swap_space is set RO after init, so do it here before init ends. */
> > + mapping_set_no_writeback_tags(&swap_space);
> > return 0;
> >
> > delete_obj:
> > kobject_put(swap_kobj);
> > return err;
> > }
> > -subsys_initcall(swap_init_sysfs);
> > +subsys_initcall(swap_init);
> > #endif
> > diff --git a/mm/swap_table.h b/mm/swap_table.h
> > new file mode 100644
> > index 000000000000..ed9676547071
> > --- /dev/null
> > +++ b/mm/swap_table.h
> > @@ -0,0 +1,106 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _MM_SWAP_TABLE_H
> > +#define _MM_SWAP_TABLE_H
> > +
> > +#include "swap.h"
> > +
> > +/*
> > + * A swap table entry represents the status of a swap slot on a swap
> > + * (physical or virtual) device. The swap table in each cluster is a
> > + * 1:1 map of the swap slots in this cluster.
> > + *
> > + * Each swap table entry could be a pointer (folio), a XA_VALUE
> > + * (shadow), or NULL.
> > + */
> > +
> > +/*
> > + * Helpers for casting one type of info into a swap table entry.
> > + */
> > +static inline unsigned long null_to_swp_tb(void)
> > +{
> > + BUILD_BUG_ON(sizeof(unsigned long) != sizeof(atomic_long_t));
> > + return 0;
> > +}
> > +
> > +static inline unsigned long folio_to_swp_tb(struct folio *folio)
> > +{
> > + BUILD_BUG_ON(sizeof(unsigned long) != sizeof(void *));
> > + return (unsigned long)folio;
> > +}
> > +
> > +static inline unsigned long shadow_swp_to_tb(void *shadow)
> > +{
> > + BUILD_BUG_ON((BITS_PER_XA_VALUE + 1) !=
> > + BITS_PER_BYTE * sizeof(unsigned long));
> > + VM_WARN_ON_ONCE(shadow && !xa_is_value(shadow));
> > + return (unsigned long)shadow;
> > +}
> > +
> > +/*
> > + * Helpers for swap table entry type checking.
> > + */
> > +static inline bool swp_tb_is_null(unsigned long swp_tb)
> > +{
> > + return !swp_tb;
> > +}
> > +
> > +static inline bool swp_tb_is_folio(unsigned long swp_tb)
> > +{
> > + return !xa_is_value((void *)swp_tb) && !swp_tb_is_null(swp_tb);
> > +}
> > +
> > +static inline bool swp_tb_is_shadow(unsigned long swp_tb)
> > +{
> > + return xa_is_value((void *)swp_tb);
> > +}
> > +
> > +/*
> > + * Helpers for retrieving info from swap table.
> > + */
> > +static inline struct folio *swp_tb_to_folio(unsigned long swp_tb)
> > +{
> > + VM_WARN_ON(!swp_tb_is_folio(swp_tb));
> > + return (void *)swp_tb;
> > +}
> > +
> > +static inline void *swp_tb_to_shadow(unsigned long swp_tb)
> > +{
> > + VM_WARN_ON(!swp_tb_is_shadow(swp_tb));
> > + return (void *)swp_tb;
> > +}
> > +
> > +/*
> > + * Helpers for accessing or modifying the swap table of a cluster,
> > + * the swap cluster must be locked.
> > + */
> > +static inline void __swap_table_set(struct swap_cluster_info *ci,
> > + unsigned int off, unsigned long swp_tb)
> > +{
> > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > + atomic_long_set(&ci->table[off], swp_tb);
> > +}
> > +
> > +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> > + unsigned int off)
> > +{
> > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > + return atomic_long_read(&ci->table[off]);
> > +}
> > +
> > +static inline void __swap_table_set_folio(struct swap_cluster_info *ci,
> > + unsigned int off, struct folio *folio)
> > +{
> > + __swap_table_set(ci, off, folio_to_swp_tb(folio));
> > +}
> > +
> > +static inline void __swap_table_set_shadow(struct swap_cluster_info *ci,
> > + unsigned int off, void *shadow)
> > +{
> > + __swap_table_set(ci, off, shadow_swp_to_tb(shadow));
> > +}
> > +
> > +static inline void __swap_table_init_null(struct swap_cluster_info *ci, unsigned int off)
> > +{
> > + __swap_table_set(ci, off, null_to_swp_tb());
> > +}
> > +#endif
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 85606fbebf0f..df68b5e242a6 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -46,6 +46,7 @@
> > #include <asm/tlbflush.h>
> > #include <linux/swapops.h>
> > #include <linux/swap_cgroup.h>
> > +#include "swap_table.h"
> > #include "internal.h"
> > #include "swap.h"
> >
> > @@ -268,7 +269,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > if (!need_reclaim)
> > goto out_unlock;
> >
> > - delete_from_swap_cache(folio);
> > + swap_cache_del_folio(folio);
> > folio_set_dirty(folio);
> > ret = nr_pages;
> > out_unlock:
> > @@ -422,6 +423,34 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
> > return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> > }
> >
> > +static int swap_table_alloc_table(struct swap_cluster_info *ci)
> > +{
> > + WARN_ON(ci->table);
> > + ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> > + if (!ci->table)
> > + return -ENOMEM;
> > + return 0;
> > +}
> > +
> > +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> > +{
> > + unsigned int ci_off;
> > + unsigned long swp_tb;
> > +
> > + if (!ci->table)
> > + return;
> > +
> > + for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> > + swp_tb = __swap_table_get(ci, ci_off);
> > + if (!swp_tb_is_null(swp_tb))
> > + pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> > + swp_tb);
> > + }
> > +
> > + kfree(ci->table);
> > + ci->table = NULL;
> > +}
> > +
> > static void move_cluster(struct swap_info_struct *si,
> > struct swap_cluster_info *ci, struct list_head *list,
> > enum swap_cluster_flags new_flags)
> > @@ -704,6 +733,25 @@ static bool cluster_scan_range(struct swap_info_struct *si,
> > return true;
> > }
> >
> > +/*
> > + * Currently, the swap table is not used for count tracking, so
> > + * just do a sanity check to ensure nothing went wrong.
> > + */
> > +static void cluster_table_check(struct swap_cluster_info *ci,
> > + unsigned int start, unsigned int nr)
> > +{
> > + unsigned int ci_off = start % SWAPFILE_CLUSTER;
> > + unsigned int ci_end = ci_off + nr;
> > + unsigned long swp_tb;
> > +
> > + if (IS_ENABLED(CONFIG_DEBUG_VM)) {
> > + do {
> > + swp_tb = __swap_table_get(ci, ci_off);
> > + VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
> > + } while (++ci_off < ci_end);
> > + }
> > +}
> > +
> > static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
> > unsigned int start, unsigned char usage,
> > unsigned int order)
> > @@ -723,6 +771,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> > ci->order = order;
> >
> > memset(si->swap_map + start, usage, nr_pages);
> > + cluster_table_check(ci, start, nr_pages);
> > swap_range_alloc(si, nr_pages);
> > ci->count += nr_pages;
> >
> > @@ -1100,8 +1149,7 @@ static void swap_range_alloc(struct swap_info_struct *si,
> > static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> > unsigned int nr_entries)
> > {
> > - unsigned long begin = offset;
> > - unsigned long end = offset + nr_entries - 1;
> > + unsigned long start = offset, end = offset + nr_entries - 1;
>
> And this kind of cleanup or code style adjustment, when added here, will
> distract people from focusing on the swap table introduction.
+1. The "begin" to "start" change is not necessary. The other reason
to nitpick is that I might be the one who wrote the original "begin"; it
sounds like a word I would use to pair with "end". I do recall writing
"begin" and "end" somewhere before, though I forget whether it was in
this function. Pure nitpick anyway; the rename itself is trivial. Just
try not to do unnecessary renames.
> > void (*swap_slot_free_notify)(struct block_device *, unsigned long);
> > unsigned int i;
> >
> > @@ -1125,7 +1173,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> > swap_slot_free_notify(si->bdev, offset);
> > offset++;
> > }
> > - clear_shadow_from_swap_cache(si->type, begin, end);
> > + __swap_cache_clear_shadow(swp_entry(si->type, start), nr_entries);
> >
> > /*
> > * Make sure that try_to_unuse() observes si->inuse_pages reaching 0
> > @@ -1282,15 +1330,7 @@ int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> > if (!entry.val)
> > return -ENOMEM;
> >
> > - /*
> > - * XArray node allocations from PF_MEMALLOC contexts could
> > - * completely exhaust the page allocator. __GFP_NOMEMALLOC
> > - * stops emergency reserves from being allocated.
> > - *
> > - * TODO: this could cause a theoretical memory reclaim
> > - * deadlock in the swap out path.
> > - */
> > - if (add_to_swap_cache(folio, entry, gfp | __GFP_NOMEMALLOC, NULL))
> > + if (swap_cache_add_folio(entry, folio, NULL))
> > goto out_free;
> >
> > return 0;
> > @@ -1557,6 +1597,7 @@ static void swap_entries_free(struct swap_info_struct *si,
> >
> > mem_cgroup_uncharge_swap(entry, nr_pages);
> > swap_range_free(si, offset, nr_pages);
> > + cluster_table_check(ci, offset, nr_pages);
> >
> > if (!ci->count)
> > free_cluster(si, ci);
> > @@ -1760,7 +1801,7 @@ bool folio_free_swap(struct folio *folio)
> > if (folio_swapped(folio))
> > return false;
> >
> > - delete_from_swap_cache(folio);
> > + swap_cache_del_folio(folio);
> > folio_set_dirty(folio);
> > return true;
> > }
> > @@ -2634,6 +2675,18 @@ static void wait_for_allocation(struct swap_info_struct *si)
> > }
> > }
> >
> > +static void free_cluster_info(struct swap_cluster_info *cluster_info,
> > + unsigned long maxpages)
> > +{
> > + int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> > +
> > + if (!cluster_info)
> > + return;
> > + for (i = 0; i < nr_clusters; i++)
> > + swap_cluster_free_table(&cluster_info[i]);
> > + kvfree(cluster_info);
> > +}
> > +
> > /*
> > * Called after swap device's reference count is dead, so
> > * neither scan nor allocation will use it.
> > @@ -2768,12 +2821,13 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> >
> > swap_file = p->swap_file;
> > p->swap_file = NULL;
> > - p->max = 0;
> > swap_map = p->swap_map;
> > p->swap_map = NULL;
> > zeromap = p->zeromap;
> > p->zeromap = NULL;
> > cluster_info = p->cluster_info;
> > + free_cluster_info(cluster_info, p->max);
> > + p->max = 0;
> > p->cluster_info = NULL;
> > spin_unlock(&p->lock);
> > spin_unlock(&swap_lock);
> > @@ -2784,10 +2838,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> > p->global_cluster = NULL;
> > vfree(swap_map);
> > kvfree(zeromap);
> > - kvfree(cluster_info);
> > /* Destroy swap account information */
> > swap_cgroup_swapoff(p->type);
> > - exit_swap_address_space(p->type);
> >
> > inode = mapping->host;
> >
> > @@ -3171,8 +3223,11 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> > if (!cluster_info)
> > goto err;
> >
> > - for (i = 0; i < nr_clusters; i++)
> > + for (i = 0; i < nr_clusters; i++) {
> > spin_lock_init(&cluster_info[i].lock);
> > + if (swap_table_alloc_table(&cluster_info[i]))
> > + goto err_free;
> > + }
> >
> > if (!(si->flags & SWP_SOLIDSTATE)) {
> > si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> > @@ -3233,9 +3288,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> > }
> >
> > return cluster_info;
> > -
> > err_free:
> > - kvfree(cluster_info);
> > + free_cluster_info(cluster_info, maxpages);
> > err:
> > return ERR_PTR(err);
> > }
> > @@ -3429,13 +3483,9 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> > }
> > }
> >
> > - error = init_swap_address_space(si->type, maxpages);
> > - if (error)
> > - goto bad_swap_unlock_inode;
> > -
> > error = zswap_swapon(si->type, maxpages);
> > if (error)
> > - goto free_swap_address_space;
> > + goto bad_swap_unlock_inode;
> >
> > /*
> > * Flush any pending IO and dirty mappings before we start using this
> > @@ -3470,8 +3520,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> > goto out;
> > free_swap_zswap:
> > zswap_swapoff(si->type);
> > -free_swap_address_space:
> > - exit_swap_address_space(si->type);
> > bad_swap_unlock_inode:
> > inode_unlock(inode);
> > bad_swap:
> > @@ -3486,7 +3534,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
> > spin_unlock(&swap_lock);
> > vfree(swap_map);
> > kvfree(zeromap);
> > - kvfree(cluster_info);
> > + if (cluster_info)
> > + free_cluster_info(cluster_info, maxpages);
> > if (inced_nr_rotate_swap)
> > atomic_dec(&nr_rotate_swap);
> > if (swap_file)
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index b0afd7f41a22..1ed3cf9dac4e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -730,13 +730,18 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > {
> > int refcount;
> > void *shadow = NULL;
> > + struct swap_cluster_info *ci;
> >
> > BUG_ON(!folio_test_locked(folio));
> > BUG_ON(mapping != folio_mapping(folio));
> >
> > - if (!folio_test_swapcache(folio))
> > + if (folio_test_swapcache(folio)) {
> > + ci = swap_cluster_lock_by_folio_irq(folio);
> > + } else {
> > spin_lock(&mapping->host->i_lock);
> > - xa_lock_irq(&mapping->i_pages);
> > + xa_lock_irq(&mapping->i_pages);
> > + }
> > +
> > /*
> > * The non racy check for a busy folio.
> > *
> > @@ -776,9 +781,9 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> >
> > if (reclaimed && !mapping_exiting(mapping))
> > shadow = workingset_eviction(folio, target_memcg);
> > - __delete_from_swap_cache(folio, swap, shadow);
> > + __swap_cache_del_folio(swap, folio, shadow);
> > memcg1_swapout(folio, swap);
> > - xa_unlock_irq(&mapping->i_pages);
> > + swap_cluster_unlock_irq(ci);
> > put_swap_folio(folio, swap);
> > } else {
> > void (*free_folio)(struct folio *);
> > @@ -816,9 +821,12 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
> > return 1;
> >
> > cannot_free:
> > - xa_unlock_irq(&mapping->i_pages);
> > - if (!folio_test_swapcache(folio))
> > + if (folio_test_swapcache(folio)) {
> > + swap_cluster_unlock_irq(ci);
> > + } else {
> > + xa_unlock_irq(&mapping->i_pages);
> > spin_unlock(&mapping->host->i_lock);
> > + }
> > return 0;
> > }
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index ee443b317ac7..c869859eec77 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1166,7 +1166,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >
> > out:
> > if (ret && ret != -EEXIST) {
> > - delete_from_swap_cache(folio);
> > + swap_cache_del_folio(folio);
> > folio_unlock(folio);
> > }
> > folio_put(folio);
> > --
> > 2.51.0
> >
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
2025-08-22 19:20 ` [PATCH 7/9] mm, swap: remove contention workaround for swap cache Kairui Song
@ 2025-08-30 4:07 ` Chris Li
2025-08-30 15:24 ` Kairui Song
2025-09-02 10:06 ` Barry Song
1 sibling, 1 reply; 90+ messages in thread
From: Chris Li @ 2025-08-30 4:07 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, kernel test robot
Hi Kairui,
It feels so good to remove that 64M swap cache space. Thank you for
making it happen.
A few nitpicks follow. I am fine with it as is as well.
Acked-by: Chris Li <chrisl@kernel.org>
Chris
On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cluster setup will try to shuffle the clusters on initialization.
> It was helpful to avoid contention for the swap cache space. The cluster
> size (2M) was much smaller than each swap cache space (64M), so shuffling
> the cluster means the allocator will try to allocate swap slots that are
> in different swap cache spaces for each CPU, reducing the chance of two
> CPUs using the same swap cache space, and hence reducing the contention.
>
> Now that the swap cache is managed by swap clusters, this shuffle is
> pointless. Just remove it, and clean up the related macros.
>
> This should also improve the HDD swap performance as shuffling IO is a
> bad idea for HDD, and now the shuffling is gone.
Do you have any numbers to prove that :-)? Last time, the swap
allocator stress testing destroyed two of my SAS drives dedicated to
testing, so I am not very keen on running the HDD swap stress test.
The HDD swap stress tests are super slow to run; they take ages.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap.h | 4 ----
> mm/swapfile.c | 32 ++++++++------------------------
> mm/zswap.c | 7 +++++--
> 3 files changed, 13 insertions(+), 30 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 4af42bc2cd72..ce3ec62cc05e 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -153,10 +153,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
> void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
>
> /* linux/mm/swap_state.c */
> -/* One swap address space for each 64M swap space */
> -#define SWAP_ADDRESS_SPACE_SHIFT 14
> -#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
> -#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
> extern struct address_space swap_space __ro_after_init;
> static inline struct address_space *swap_address_space(swp_entry_t entry)
> {
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index df68b5e242a6..0c8001c99f30 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -3203,21 +3203,14 @@ static int setup_swap_map(struct swap_info_struct *si,
> return 0;
> }
>
> -#define SWAP_CLUSTER_INFO_COLS \
> - DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
> -#define SWAP_CLUSTER_SPACE_COLS \
> - DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
> -#define SWAP_CLUSTER_COLS \
> - max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
> -
> static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> union swap_header *swap_header,
> unsigned long maxpages)
> {
> unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> struct swap_cluster_info *cluster_info;
> - unsigned long i, j, idx;
> int err = -ENOMEM;
> + unsigned long i;
Nitpick: This line location change is not necessary.
>
> cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
> if (!cluster_info)
> @@ -3266,22 +3259,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> INIT_LIST_HEAD(&si->frag_clusters[i]);
> }
>
> - /*
> - * Reduce false cache line sharing between cluster_info and
> - * sharing same address space.
> - */
> - for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
> - for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> - struct swap_cluster_info *ci;
> - idx = i * SWAP_CLUSTER_COLS + j;
> - ci = cluster_info + idx;
> - if (idx >= nr_clusters)
> - continue;
> - if (ci->count) {
> - ci->flags = CLUSTER_FLAG_NONFULL;
> - list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> - continue;
> - }
> + for (i = 0; i < nr_clusters; i++) {
> + struct swap_cluster_info *ci = &cluster_info[i];
struct swap_cluster_info *ci = cluster_info + i;
looks simpler. Pure nitpick and personal preference; you don't have to
follow it.
> +
> + if (ci->count) {
> + ci->flags = CLUSTER_FLAG_NONFULL;
> + list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> + } else {
> ci->flags = CLUSTER_FLAG_FREE;
> list_add_tail(&ci->list, &si->free_clusters);
> }
> diff --git a/mm/zswap.c b/mm/zswap.c
> index c869859eec77..c0a9be14a725 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -237,10 +237,13 @@ static bool zswap_has_pool;
> * helpers and fwd declarations
> **********************************/
>
> +/* One swap address space for each 64M swap space */
> +#define ZSWAP_ADDRESS_SPACE_SHIFT 14
> +#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
> static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> {
> return &zswap_trees[swp_type(swp)][swp_offset(swp)
> - >> SWAP_ADDRESS_SPACE_SHIFT];
> + >> ZSWAP_ADDRESS_SPACE_SHIFT];
> }
>
> #define zswap_pool_debug(msg, p) \
> @@ -1771,7 +1774,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
> struct xarray *trees, *tree;
> unsigned int nr, i;
>
> - nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> + nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
> trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
> if (!trees) {
> pr_err("alloc failed, zswap disabled for swap type %d\n", type);
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-08-22 19:20 ` [PATCH 8/9] mm, swap: implement dynamic allocation of swap table Kairui Song
@ 2025-08-30 4:17 ` Chris Li
2025-09-02 11:15 ` Barry Song
1 sibling, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 4:17 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
Acked-by: Chris Li <chrisl@kernel.org>
Chris
PS, this version already has my feedback incorporated.
On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now the swap table is cluster based, which means a free cluster can free
> its table, since no one should modify it.
>
> There could be speculative readers, like swap cache lookups; protect
> them by making the table RCU safe. All swap tables should be filled with
> null entries before being freed, so such readers will either see a NULL
> pointer or a null-filled table that is being lazily freed.
>
> On allocation, allocate the table when a cluster is used by any order.
>
> This way, we can reduce the memory usage of a large swap device
> significantly.
>
> This idea to dynamically release unused swap cluster data was initially
> suggested by Chris Li while proposing the cluster swap allocator and
> I found it suits the swap table idea very well.
>
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap.h | 2 +-
> mm/swap_state.c | 9 ++-
> mm/swap_table.h | 32 +++++++-
> mm/swapfile.c | 202 ++++++++++++++++++++++++++++++++++++++----------
> 4 files changed, 197 insertions(+), 48 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index ce3ec62cc05e..ee33733027f4 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -36,7 +36,7 @@ struct swap_cluster_info {
> u16 count;
> u8 flags;
> u8 order;
> - atomic_long_t *table; /* Swap table entries, see mm/swap_table.h */
> + atomic_long_t __rcu *table; /* Swap table entries, see mm/swap_table.h */
> struct list_head list;
> };
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index c0342024b4a8..a0120d822fbe 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -87,7 +87,8 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
> struct folio *folio;
>
> for (;;) {
> - swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> + swp_tb = swap_table_get(swp_cluster(entry),
> + swp_cluster_offset(entry));
> if (!swp_tb_is_folio(swp_tb))
> return NULL;
> folio = swp_tb_to_folio(swp_tb);
> @@ -107,10 +108,9 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> {
> unsigned long swp_tb;
>
> - swp_tb = __swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> + swp_tb = swap_table_get(swp_cluster(entry), swp_cluster_offset(entry));
> if (swp_tb_is_shadow(swp_tb))
> return swp_tb_to_shadow(swp_tb);
> -
> return NULL;
> }
>
> @@ -135,6 +135,9 @@ int swap_cache_add_folio(swp_entry_t entry, struct folio *folio, void **shadowp)
> VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
>
> ci = swap_cluster_lock(swp_info(entry), swp_offset(entry));
> + if (unlikely(!ci->table))
> + goto fail;
> +
> ci_start = swp_cluster_offset(entry);
> ci_end = ci_start + nr_pages;
> ci_off = ci_start;
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index ed9676547071..4e97513b11ef 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -2,8 +2,15 @@
> #ifndef _MM_SWAP_TABLE_H
> #define _MM_SWAP_TABLE_H
>
> +#include <linux/rcupdate.h>
> +#include <linux/atomic.h>
> #include "swap.h"
>
> +/* A typical flat array in each cluster as swap table */
> +struct swap_table {
> + atomic_long_t entries[SWAPFILE_CLUSTER];
> +};
> +
> /*
> * A swap table entry represents the status of a swap slot on a swap
> * (physical or virtual) device. The swap table in each cluster is a
> @@ -76,15 +83,36 @@ static inline void *swp_tb_to_shadow(unsigned long swp_tb)
> static inline void __swap_table_set(struct swap_cluster_info *ci,
> unsigned int off, unsigned long swp_tb)
> {
> + atomic_long_t *table = rcu_dereference_protected(ci->table, true);
> +
> + lockdep_assert_held(&ci->lock);
> VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> - atomic_long_set(&ci->table[off], swp_tb);
> + atomic_long_set(&table[off], swp_tb);
> }
>
> static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> unsigned int off)
> {
> + atomic_long_t *table;
> +
> VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> - return atomic_long_read(&ci->table[off]);
> + table = rcu_dereference_check(ci->table, lockdep_is_held(&ci->lock));
> +
> + return atomic_long_read(&table[off]);
> +}
> +
> +static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
> + unsigned int off)
> +{
> + atomic_long_t *table;
> + unsigned long swp_tb;
> +
> + rcu_read_lock();
> + table = rcu_dereference(ci->table);
> + swp_tb = table ? atomic_long_read(&table[off]) : null_to_swp_tb();
> + rcu_read_unlock();
> +
> + return swp_tb;
> }
>
> static inline void __swap_table_set_folio(struct swap_cluster_info *ci,
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0c8001c99f30..00651e947eb2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -105,6 +105,8 @@ static DEFINE_SPINLOCK(swap_avail_lock);
>
> struct swap_info_struct *swap_info[MAX_SWAPFILES];
>
> +static struct kmem_cache *swap_table_cachep;
> +
> static DEFINE_MUTEX(swapon_mutex);
>
> static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
> @@ -402,10 +404,17 @@ static inline bool cluster_is_discard(struct swap_cluster_info *info)
> return info->flags == CLUSTER_FLAG_DISCARD;
> }
>
> +static inline bool cluster_table_is_alloced(struct swap_cluster_info *ci)
> +{
> + return rcu_dereference_protected(ci->table, lockdep_is_held(&ci->lock));
> +}
> +
> static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
> {
> if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
> return false;
> + if (!cluster_table_is_alloced(ci))
> + return false;
> if (!order)
> return true;
> return cluster_is_empty(ci) || order == ci->order;
> @@ -423,32 +432,98 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
> return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> }
>
> -static int swap_table_alloc_table(struct swap_cluster_info *ci)
> +static void swap_cluster_free_table(struct swap_cluster_info *ci)
> {
> - WARN_ON(ci->table);
> - ci->table = kzalloc(sizeof(unsigned long) * SWAPFILE_CLUSTER, GFP_KERNEL);
> - if (!ci->table)
> - return -ENOMEM;
> - return 0;
> + unsigned int ci_off;
> + struct swap_table *table;
> +
> + /* Only empty cluster's table is allow to be freed */
> + lockdep_assert_held(&ci->lock);
> + VM_WARN_ON_ONCE(!cluster_is_empty(ci));
> + for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++)
> + VM_WARN_ON_ONCE(!swp_tb_is_null(__swap_table_get(ci, ci_off)));
> + table = (void *)rcu_dereference_protected(ci->table, true);
> + rcu_assign_pointer(ci->table, NULL);
> +
> + kmem_cache_free(swap_table_cachep, table);
> }
>
> -static void swap_cluster_free_table(struct swap_cluster_info *ci)
> +/*
> > + * Allocating a swap table may need to sleep, which can lead to migration,
> > + * so attempt an atomic allocation first, then fall back and handle the
> > + * potential race.
> + */
> +static struct swap_cluster_info *
> +swap_cluster_alloc_table(struct swap_info_struct *si,
> + struct swap_cluster_info *ci,
> + int order)
> {
> - unsigned int ci_off;
> - unsigned long swp_tb;
> + struct swap_cluster_info *pcp_ci;
> + struct swap_table *table;
> + unsigned long offset;
>
> - if (!ci->table)
> - return;
> + /*
> + * Only cluster isolation from the allocator does table allocation.
> + * Swap allocator uses a percpu cluster and holds the local lock.
> + */
> + lockdep_assert_held(&ci->lock);
> + lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
> +
> + table = kmem_cache_zalloc(swap_table_cachep,
> + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> + if (table) {
> + rcu_assign_pointer(ci->table, table);
> + return ci;
> + }
> +
> + /*
> + * Try a sleep allocation. Each isolated free cluster may cause
> + * a sleep allocation, but there is a limited number of them, so
> + * the potential recursive allocation should be limited.
> + */
> + spin_unlock(&ci->lock);
> + if (!(si->flags & SWP_SOLIDSTATE))
> + spin_unlock(&si->global_cluster_lock);
> + local_unlock(&percpu_swap_cluster.lock);
> + table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
>
> - for (ci_off = 0; ci_off < SWAPFILE_CLUSTER; ci_off++) {
> - swp_tb = __swap_table_get(ci, ci_off);
> - if (!swp_tb_is_null(swp_tb))
> - pr_err_once("swap: unclean swap space on swapoff: 0x%lx",
> - swp_tb);
> + local_lock(&percpu_swap_cluster.lock);
> + if (!(si->flags & SWP_SOLIDSTATE))
> + spin_lock(&si->global_cluster_lock);
> + /*
> + * Back to atomic context. First, check if we migrated to a new
> + * CPU with a usable percpu cluster. If so, try using that instead.
> + * No need to check it for the spinning device, as swap is
> + * serialized by the global lock on them.
> + *
> + * The is_usable check is a bit rough, but ensures order 0 success.
> + */
> + offset = this_cpu_read(percpu_swap_cluster.offset[order]);
> + if ((si->flags & SWP_SOLIDSTATE) && offset) {
> + pcp_ci = swap_cluster_lock(si, offset);
> + if (cluster_is_usable(pcp_ci, order) &&
> + pcp_ci->count < SWAPFILE_CLUSTER) {
> + ci = pcp_ci;
> + goto free_table;
> + }
> + swap_cluster_unlock(pcp_ci);
> }
>
> - kfree(ci->table);
> - ci->table = NULL;
> + if (!table)
> + return NULL;
> +
> + spin_lock(&ci->lock);
> + /* Nothing should have touched the dangling empty cluster. */
> + if (WARN_ON_ONCE(cluster_table_is_alloced(ci)))
> + goto free_table;
> +
> + rcu_assign_pointer(ci->table, table);
> + return ci;
> +
> +free_table:
> + if (table)
> + kmem_cache_free(swap_table_cachep, table);
> + return ci;
> }
>
> static void move_cluster(struct swap_info_struct *si,
> @@ -480,7 +555,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>
> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - lockdep_assert_held(&ci->lock);
> + swap_cluster_free_table(ci);
> move_cluster(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> ci->order = 0;
> }
> @@ -495,15 +570,11 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> * this returns NULL for an non-empty list.
> */
> static struct swap_cluster_info *isolate_lock_cluster(
> - struct swap_info_struct *si, struct list_head *list)
> + struct swap_info_struct *si, struct list_head *list, int order)
> {
> - struct swap_cluster_info *ci, *ret = NULL;
> + struct swap_cluster_info *ci, *found = NULL;
>
> spin_lock(&si->lock);
> -
> - if (unlikely(!(si->flags & SWP_WRITEOK)))
> - goto out;
> -
> list_for_each_entry(ci, list, list) {
> if (!spin_trylock(&ci->lock))
> continue;
> @@ -515,13 +586,19 @@ static struct swap_cluster_info *isolate_lock_cluster(
>
> list_del(&ci->list);
> ci->flags = CLUSTER_FLAG_NONE;
> - ret = ci;
> + found = ci;
> break;
> }
> -out:
> spin_unlock(&si->lock);
>
> - return ret;
> + if (found && !cluster_table_is_alloced(found)) {
> + /* Only an empty free cluster's swap table can be freed. */
> + VM_WARN_ON_ONCE(list != &si->free_clusters);
> + VM_WARN_ON_ONCE(!cluster_is_empty(found));
> + return swap_cluster_alloc_table(si, found, order);
> + }
> +
> + return found;
> }
>
> /*
> @@ -654,17 +731,27 @@ static void relocate_cluster(struct swap_info_struct *si,
> * added to free cluster list and its usage counter will be increased by 1.
> * Only used for initialization.
> */
> -static void inc_cluster_info_page(struct swap_info_struct *si,
> +static int inc_cluster_info_page(struct swap_info_struct *si,
> struct swap_cluster_info *cluster_info, unsigned long page_nr)
> {
> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> + struct swap_table *table;
> struct swap_cluster_info *ci;
>
> ci = cluster_info + idx;
> + if (!ci->table) {
> + table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
> + if (!table)
> + return -ENOMEM;
> + rcu_assign_pointer(ci->table, table);
> + }
> +
> ci->count++;
>
> VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
> VM_BUG_ON(ci->flags);
> +
> + return 0;
> }
>
> static bool cluster_reclaim_range(struct swap_info_struct *si,
> @@ -845,7 +932,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
> unsigned int found = SWAP_ENTRY_INVALID;
>
> do {
> - struct swap_cluster_info *ci = isolate_lock_cluster(si, list);
> + struct swap_cluster_info *ci = isolate_lock_cluster(si, list, order);
> unsigned long offset;
>
> if (!ci)
> @@ -870,7 +957,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> if (force)
> to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
>
> - while ((ci = isolate_lock_cluster(si, &si->full_clusters))) {
> + while ((ci = isolate_lock_cluster(si, &si->full_clusters, 0))) {
> offset = cluster_offset(si, ci);
> end = min(si->max, offset + SWAPFILE_CLUSTER);
> to_scan--;
> @@ -1018,6 +1105,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> done:
> if (!(si->flags & SWP_SOLIDSTATE))
> spin_unlock(&si->global_cluster_lock);
> +
> return found;
> }
>
> @@ -1885,7 +1973,13 @@ swp_entry_t get_swap_page_of_type(int type)
> /* This is called for allocating swap entry, not cache */
> if (get_swap_device_info(si)) {
> if (si->flags & SWP_WRITEOK) {
> + /*
> > + * Grab the local lock to be compliant
> > + * with swap table allocation.
> + */
> + local_lock(&percpu_swap_cluster.lock);
> offset = cluster_alloc_swap_entry(si, 0, 1);
> + local_unlock(&percpu_swap_cluster.lock);
> if (offset) {
> entry = swp_entry(si->type, offset);
> atomic_long_dec(&nr_swap_pages);
> @@ -2678,12 +2772,21 @@ static void wait_for_allocation(struct swap_info_struct *si)
> static void free_cluster_info(struct swap_cluster_info *cluster_info,
> unsigned long maxpages)
> {
> + struct swap_cluster_info *ci;
> int i, nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
>
> if (!cluster_info)
> return;
> - for (i = 0; i < nr_clusters; i++)
> - swap_cluster_free_table(&cluster_info[i]);
> + for (i = 0; i < nr_clusters; i++) {
> + ci = cluster_info + i;
> + /* Cluster with bad marks count will have a remaining table */
> + spin_lock(&ci->lock);
> + if (rcu_dereference_protected(ci->table, true)) {
> + ci->count = 0;
> + swap_cluster_free_table(ci);
> + }
> + spin_unlock(&ci->lock);
> + }
> kvfree(cluster_info);
> }
>
> @@ -2719,6 +2822,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> struct address_space *mapping;
> struct inode *inode;
> struct filename *pathname;
> + unsigned int maxpages;
> int err, found = 0;
>
> if (!capable(CAP_SYS_ADMIN))
> @@ -2825,8 +2929,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> p->swap_map = NULL;
> zeromap = p->zeromap;
> p->zeromap = NULL;
> + maxpages = p->max;
> cluster_info = p->cluster_info;
> - free_cluster_info(cluster_info, p->max);
> p->max = 0;
> p->cluster_info = NULL;
> spin_unlock(&p->lock);
> @@ -2838,6 +2942,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
> p->global_cluster = NULL;
> vfree(swap_map);
> kvfree(zeromap);
> + free_cluster_info(cluster_info, maxpages);
> /* Destroy swap account information */
> swap_cgroup_swapoff(p->type);
>
> @@ -3216,11 +3321,8 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> if (!cluster_info)
> goto err;
>
> - for (i = 0; i < nr_clusters; i++) {
> + for (i = 0; i < nr_clusters; i++)
> spin_lock_init(&cluster_info[i].lock);
> - if (swap_table_alloc_table(&cluster_info[i]))
> - goto err_free;
> - }
>
> if (!(si->flags & SWP_SOLIDSTATE)) {
> si->global_cluster = kmalloc(sizeof(*si->global_cluster),
> @@ -3239,16 +3341,23 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> * See setup_swap_map(): header page, bad pages,
> * and the EOF part of the last cluster.
> */
> - inc_cluster_info_page(si, cluster_info, 0);
> + err = inc_cluster_info_page(si, cluster_info, 0);
> + if (err)
> + goto err;
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
>
> if (page_nr >= maxpages)
> continue;
> - inc_cluster_info_page(si, cluster_info, page_nr);
> + err = inc_cluster_info_page(si, cluster_info, page_nr);
> + if (err)
> + goto err;
> + }
> + for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++) {
> + err = inc_cluster_info_page(si, cluster_info, i);
> + if (err)
> + goto err;
> }
> - for (i = maxpages; i < round_up(maxpages, SWAPFILE_CLUSTER); i++)
> - inc_cluster_info_page(si, cluster_info, i);
>
> INIT_LIST_HEAD(&si->free_clusters);
> INIT_LIST_HEAD(&si->full_clusters);
> @@ -3962,6 +4071,15 @@ static int __init swapfile_init(void)
>
> swapfile_maximum_size = arch_max_swapfile_size();
>
> + /*
> > + * Once a cluster is freed, its swap table content is read-only,
> > + * and all swap cache readers (swap_cache_*) verify the content
> > + * before use. So it's safe to use an RCU slab here.
> + */
> + swap_table_cachep = kmem_cache_create("swap_table",
> + sizeof(struct swap_table),
> + 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
> +
> #ifdef CONFIG_MIGRATION
> if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
> swap_migration_ad_supported = true;
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 9/9] mm, swap: use a single page for swap table when the size fits
2025-08-22 19:20 ` [PATCH 9/9] mm, swap: use a single page for swap table when the size fits Kairui Song
@ 2025-08-30 4:23 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 4:23 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
Acked-by: Chris Li <chrisl@kernel.org>
Chris
On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We have a cluster size of 512 slots. Each slot consumes 8 bytes in the
> swap table, so the swap table of each cluster is exactly one page (4K).
>
> If that condition is true, allocate one page directly and disable the slab
> cache to reduce the memory usage of the swap table and avoid fragmentation.
>
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap_table.h | 2 ++
> mm/swapfile.c | 50 ++++++++++++++++++++++++++++++++++++++++---------
> 2 files changed, 43 insertions(+), 9 deletions(-)
>
> diff --git a/mm/swap_table.h b/mm/swap_table.h
> index 4e97513b11ef..984474e37dd7 100644
> --- a/mm/swap_table.h
> +++ b/mm/swap_table.h
> @@ -11,6 +11,8 @@ struct swap_table {
> atomic_long_t entries[SWAPFILE_CLUSTER];
> };
>
> +#define SWP_TABLE_USE_PAGE (sizeof(struct swap_table) == PAGE_SIZE)
> +
> /*
> * A swap table entry represents the status of a swap slot on a swap
> * (physical or virtual) device. The swap table in each cluster is a
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 00651e947eb2..7539ee26d59a 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -432,6 +432,38 @@ static inline unsigned int cluster_offset(struct swap_info_struct *si,
> return cluster_index(si, ci) * SWAPFILE_CLUSTER;
> }
>
> +static struct swap_table *swap_table_alloc(gfp_t gfp)
> +{
> + struct folio *folio;
> +
> + if (!SWP_TABLE_USE_PAGE)
> + return kmem_cache_zalloc(swap_table_cachep, gfp);
> +
> + folio = folio_alloc(gfp | __GFP_ZERO, 0);
> + if (folio)
> + return folio_address(folio);
> + return NULL;
> +}
> +
> +static void swap_table_free_folio_rcu_cb(struct rcu_head *head)
> +{
> + struct folio *folio;
> +
> + folio = page_folio(container_of(head, struct page, rcu_head));
> + folio_put(folio);
> +}
> +
> +static void swap_table_free(struct swap_table *table)
> +{
> + if (!SWP_TABLE_USE_PAGE) {
> + kmem_cache_free(swap_table_cachep, table);
> + return;
> + }
> +
> + call_rcu(&(folio_page(virt_to_folio(table), 0)->rcu_head),
> + swap_table_free_folio_rcu_cb);
> +}
> +
> static void swap_cluster_free_table(struct swap_cluster_info *ci)
> {
> unsigned int ci_off;
> @@ -445,7 +477,7 @@ static void swap_cluster_free_table(struct swap_cluster_info *ci)
> table = (void *)rcu_dereference_protected(ci->table, true);
> rcu_assign_pointer(ci->table, NULL);
>
> - kmem_cache_free(swap_table_cachep, table);
> + swap_table_free(table);
> }
>
> /*
> @@ -469,8 +501,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
> lockdep_assert_held(&ci->lock);
> lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock);
>
> - table = kmem_cache_zalloc(swap_table_cachep,
> - __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> + table = swap_table_alloc(__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> if (table) {
> rcu_assign_pointer(ci->table, table);
> return ci;
> @@ -485,7 +516,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
> if (!(si->flags & SWP_SOLIDSTATE))
> spin_unlock(&si->global_cluster_lock);
> local_unlock(&percpu_swap_cluster.lock);
> - table = kmem_cache_zalloc(swap_table_cachep, __GFP_HIGH | GFP_KERNEL);
> + table = swap_table_alloc(__GFP_HIGH | GFP_KERNEL);
>
> local_lock(&percpu_swap_cluster.lock);
> if (!(si->flags & SWP_SOLIDSTATE))
> @@ -522,7 +553,7 @@ swap_cluster_alloc_table(struct swap_info_struct *si,
>
> free_table:
> if (table)
> - kmem_cache_free(swap_table_cachep, table);
> + swap_table_free(table);
> return ci;
> }
>
> @@ -740,7 +771,7 @@ static int inc_cluster_info_page(struct swap_info_struct *si,
>
> ci = cluster_info + idx;
> if (!ci->table) {
> - table = kmem_cache_zalloc(swap_table_cachep, GFP_KERNEL);
> + table = swap_table_alloc(GFP_KERNEL);
> if (!table)
> return -ENOMEM;
> rcu_assign_pointer(ci->table, table);
> @@ -4076,9 +4107,10 @@ static int __init swapfile_init(void)
> * only, and all swap cache readers (swap_cache_*) verifies
> * the content before use. So it's safe to use RCU slab here.
> */
> - swap_table_cachep = kmem_cache_create("swap_table",
> - sizeof(struct swap_table),
> - 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
> + if (!SWP_TABLE_USE_PAGE)
> + swap_table_cachep = kmem_cache_create("swap_table",
> + sizeof(struct swap_table),
> + 0, SLAB_PANIC | SLAB_TYPESAFE_BY_RCU, NULL);
>
> #ifdef CONFIG_MIGRATION
> if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I)
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
` (9 preceding siblings ...)
2025-08-26 22:00 ` [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Chris Li
@ 2025-08-30 5:44 ` Chris Li
10 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 5:44 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
Hi Kairui,
I have given your series one pass of review already and acked a portion
of it. I expect some clarification or updates on the rest.
I especially want to double-check the swap cache path that atomically
sets a range of swap entries to a folio.
I want to make sure this bug does not happen with the swap table:
https://lore.kernel.org/linux-mm/5bee194c-9cd3-47e7-919b-9f352441f855@kernel.dk/
I just double-checked, and the swap table should be fine in this regard.
That bug is triggered by a memory allocation failure in the middle of
inserting a folio. The swap table is already populated by the time the
swap entry is allocated and handed out to the caller, and we do no
memory allocation when inserting a folio into the swap cache, which is
a good thing. We should not have that bug.
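To illustrate (a rough sketch using the helpers introduced in this
series, not the exact code; the function name here is made up): by the
time a folio is added, the cluster's table already exists, so the
insert path is only a loop of atomic stores, with nothing to allocate
and nothing to unwind halfway.

static void sketch_cache_insert(struct swap_cluster_info *ci, swp_entry_t entry,
                                struct folio *folio)
{
        unsigned int ci_off = swp_cluster_offset(entry);
        unsigned int ci_end = ci_off + folio_nr_pages(folio);

        /* Caller holds ci->lock; the table was allocated with the cluster. */
        do {
                /* Just an atomic_long_set() into the pre-allocated table. */
                __swap_table_set_folio(ci, ci_off, folio);
        } while (++ci_off < ci_end);
}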
I would also like an extra pair of eyes on those subtle behavior-change
patches, which I expect you to split out in the next version.
I will need to go through the split-out subtle patches one more time as
well. On this pass I only caught the behavior changes; I have not yet
had a chance to reason through whether those changes are indeed fine.
If you can defer those split-out patches, that will save me some time
reasoning about them in the next round. Your call.
Oh, I also want to write a design document for the swap table idea. I
will send it your way to incorporate into your next version of the
series.
Thanks for the great work! I am very excited about this.
Chris
On Fri, Aug 22, 2025 at 12:20 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This is the first phase of the bigger series implementing basic
> infrastructures for the Swap Table idea proposed at the LSF/MM/BPF
> topic "Integrate swap cache, swap maps with swap allocator" [1].
>
> This phase I contains 9 patches, introduces the swap table infrastructure
> and uses it as the swap cache backend. By doing so, we have up to ~5-20%
> performance gain in throughput, RPS or build time for benchmark and
> workload tests. This is based on Chris Li's idea of using cluster size
> atomic arrays to implement swap cache. It has less contention on the swap
> cache access. The cluster size is much finer-grained than the 64M address
> space split, which is removed in this phase I. It also unifies and cleans
> up the swap code base.
>
> Each swap cluster will dynamically allocate the swap table, which is an
> atomic array to cover every swap slot in the cluster. It replaces the swap
> cache back by Xarray. In phase I, the static allocated swap_map still
> co-exists with the swap table. The memory usage is about the same as the
> original on average. A few exception test cases show about 1% higher in
> memory usage. In the following phases of the series, swap_map will merge
> into the swap table without additional memory allocation. It will result
> in net memory reduction compared to the original swap cache.
>
> Testing has shown that phase I has a significant performance improvement
> from 8c/1G ARM machine to 48c96t/128G x86_64 servers in many practical
> workloads.
>
> The full picture with a summary can be found at [2]. An older bigger
> series of 28 patches is posted at [3].
>
> vm-scability test:
> ==================
> Test with:
> usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap)
> Before: After:
> System time: 220.86s 160.42s (-27.36%)
> Throughput: 4775.18 MB/s 6381.43 MB/s (+33.63%)
> Free latency: 174492 us 132122 us (+24.28%)
>
> usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> PMEM as swap)
> Before: After:
> System time: 355.23s 295.28s (-16.87%)
> Throughput: 4659.89 MB/s 5765.80 MB/s (+23.73%)
> Free latency: 500417 us 477098 us (-4.66%)
>
> This shows an improvement of more than 20% in most readings.
>
> Build kernel test:
> ==================
> Building kernel with defconfig on tmpfs with ZSWAP / ZRAM is looking
> good. The results below show a test matrix using different memory
> pressure and setups. Tests are done with shmem as filesystem and
> using the same build config, measuring sys and real time in seconds
> (user time is almost identical as expected):
>
> -j<NR> / Mem | Sys before / after | Real before / after
> Using 16G ZRAM with memcg limit:
> 12 / 256M | 6475 / 6232 -3.75% | 814 / 793 -2.58%
> 24 / 384M | 5904 / 5560 -5.82% | 413 / 397 -3.87%
> 48 / 768M | 4762 / 4242 -10.9% | 187 / 179 -4.27%
> With 64k folio:
> 24 / 512M | 4196 / 4062 -3.19% | 325 / 319 -1.84%
> 48 / 1G | 3622 / 3544 -2.15% | 148 / 146 -1.37%
> With ZSWAP with 3G memcg (using higher limit due to kmem account):
> 48 / 3G | 605 / 571 -5.61% | 81 / 79 -2.47%
>
> For extremely high global memory pressure, using ZSWAP with 32G
> NVMEs in a 48c VM that has 4G memory globally and no memcg limit (system
> components take up about 1.5G, so the pressure is high), using make -j48:
>
> Before: sys time: 2061.72s real time: 135.61s
> After: sys time: 1990.96s (-3.43%) real time: 134.03s (-1.16%)
>
> All cases are faster, and no regression even under heavy global
> memory pressure.
>
> Redis / Valkey bench:
> =====================
> The test machine is a ARM64 VM with 1.5G memory, redis is set to
> use 2.5G memory:
>
> Testing with:
> redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get
>
> no BGSAVE with BGSAVE
> Before: 433015.08 RPS 271421.15 RPS
> After: 431537.61 RPS (-0.34%) 290441.79 RPS (+7.0%)
>
> Testing with:
> redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get
> no BGSAVE with BGSAVE
> Before: 446339.45 RPS 274845.19 RPS
> After: 442697.29 RPS (-0.81%) 293053.59 RPS (+6.6%)
>
> With BGSAVE enabled, most Redis memory will have a swap count > 1, so the
> swap cache is heavily in use. We can see a >5% performance gain. No BGSAVE
> is very slightly slower (<1%) due to the higher memory pressure from the
> co-existence of swap_map and the swap table. This will be optimized into a
> net gain, with up to a 20% gain in the BGSAVE case, in the following phases.
>
> Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1]
> Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2]
> Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
>
> Kairui Song (9):
> mm, swap: use unified helper for swap cache look up
> mm, swap: always lock and check the swap cache folio before use
> mm, swap: rename and move some swap cluster definition and helpers
> mm, swap: tidy up swap device and cluster info helpers
> mm/shmem, swap: remove redundant error handling for replacing folio
> mm, swap: use the swap table for the swap cache and switch API
> mm, swap: remove contention workaround for swap cache
> mm, swap: implement dynamic allocation of swap table
> mm, swap: use a single page for swap table when the size fits
>
> MAINTAINERS | 1 +
> include/linux/swap.h | 42 ----
> mm/filemap.c | 2 +-
> mm/huge_memory.c | 16 +-
> mm/memory-failure.c | 2 +-
> mm/memory.c | 30 +--
> mm/migrate.c | 28 +--
> mm/mincore.c | 3 +-
> mm/page_io.c | 12 +-
> mm/shmem.c | 56 ++----
> mm/swap.h | 268 +++++++++++++++++++++----
> mm/swap_state.c | 404 +++++++++++++++++++-------------------
> mm/swap_table.h | 136 +++++++++++++
> mm/swapfile.c | 456 ++++++++++++++++++++++++++++---------------
> mm/userfaultfd.c | 5 +-
> mm/vmscan.c | 20 +-
> mm/zswap.c | 9 +-
> 17 files changed, 954 insertions(+), 536 deletions(-)
> create mode 100644 mm/swap_table.h
>
> ---
>
> I was trying some new tools like b4 for branch management, and it seems
> a draft version was sent out by accident, but it appears to have been
> rejected. I'm not sure if anyone saw a duplicated or malformed email.
> If so, please accept my apology and use this series for review,
> discussion, or merging.
>
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-30 1:53 ` Chris Li
@ 2025-08-30 15:15 ` Kairui Song
2025-08-30 17:17 ` Chris Li
2025-09-01 18:17 ` Kairui Song
1 sibling, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-08-30 15:15 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 30, 2025 at 9:54 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Aug 27, 2025 at 7:36 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Wed, Aug 27, 2025 at 4:21 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index e9d0d2784cd5..b4d39f2a1e0a 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > > > count_vm_event(PGMAJFAULT);
> > > > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > > > }
> > > > - } else {
> > > > - swap_update_readahead(folio, NULL, 0);
> > >
> > > Also, moving this readahead update to later might have a similar problem.
> > > All the bail-outs after the move will lose the readahead status update.
> > >
> > > The readahead deed is already done. Missing the status update seems
> > > incorrect.
> >
> > Thanks for the detailed review.
> >
> > The only change I wanted here is that the swap readahead update should
> > be done after checking that the folio still corresponds to the swap
> > entry triggering the swapin. That should have little to no effect
> > compared to before, considering the extremely tiny time window. We are
> > only following the convention more strictly.
> >
> > In theory it might even help to reduce false updates: if the folio no
> > longer corresponds to the swap entry, we are hitting an unrelated
> > folio, so doing a readahead update will either mislead the vma
> > readahead's address hint, or clear the readahead flag of an unrelated
> > folio without actually using it. If that folio does get hit in the
> > future, due to the missing readahead flag, the statistics will go
> > wrong.
>
> So the missing readahead stats update behavior is the correct and
> better behavior. I suggest you split that out as a separate patch with
> appropriate comments about it too. It is also easier to bisect the
> commit if that kind of subtle change, which is considered safe, turns
> out to cause a problem. That does not happen very often, but it has
> happened before.
Yes, I'm planning to split one patch out for the readahead change.
> > > > /*
> > > > - * Lookup a swap entry in the swap cache. A found folio will be returned
> > > > - * unlocked and with its refcount incremented.
> > > > + * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> > > > *
> > > > - * Caller must lock the swap device or hold a reference to keep it valid.
> > > > + * A found folio will be returned unlocked and with its refcount increased.
> > > > + *
> > > > + * Context: Caller must ensure @entry is valid and pin the swap device, also
> > > Is the "pin" the same as "lock the swap device or hold a reference"?
> > > Not sure why you changed that comment to "pin".
> >
> > Yes, it's the same thing. We don't lock the device though; the device
> > can be pinned by the refcount (get_swap_device) or by locking anything
> > that is referencing the device (locking the PTL of a PTE that contains
> > a swap entry pointing to the device, or locking a swap cache folio of a
> > swap entry that points to the device). So I just used the word "pin".
> > I added some comments in mm/swap.h in later commits about what the
> > "pin" means.
>
> In that case why not reuse the previous comment keeping "lock the swap
> device or hold a reference" instead of "pin"?
I'm worried that the sentence "lock the swap device" is kind of fuzzy;
people may misunderstand it as needing to hold si->lock. Actually
they only need to hold si->users or lock anything that references the
device. It's not wrong, but it sounds like overkill.
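
To make it concrete, the convention I have in mind is roughly the
following (a minimal sketch only; the helper names are the ones used in
this series, the wrapper function name is made up, and error handling is
trimmed):

    /* Made-up wrapper name, purely for illustration. */
    static struct folio *swapin_get_locked_folio(swp_entry_t entry)
    {
            struct swap_info_struct *si;
            struct folio *folio;

            si = get_swap_device(entry);          /* pin: grabs a reference, fails if swapoff won */
            if (!si)
                    return NULL;
            folio = swap_cache_get_folio(entry);  /* lock-less lookup, takes a folio ref on hit */
            if (folio) {
                    folio_lock(folio);
                    if (!folio_contains_swap(folio, entry)) {  /* folio may have been reused meanwhile */
                            folio_unlock(folio);
                            folio_put(folio);
                            folio = NULL;
                    }
            }
            put_swap_device(si);
            return folio;                         /* locked and verified to still back @entry, or NULL */
    }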
>
> > > It seems to me that you want to add the comment for the return value check.
> > > Is that it?
> >
> > Right, the caller has to check the folio before use, so I'm trying to
> > document this convention.
>
> Again, I recommend reducing the unnecessary impact on the code to make
> it more obvious what you actually changed. I spent quite some time
> there trying to figure out what you were trying to accomplish with the
> comments.
>
> > > > + * check the returned folio after locking it (e.g. folio_swap_contains).
> > > > */
> > > > struct folio *swap_cache_get_folio(swp_entry_t entry)
> > > > {
> > > > @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > > for (;;) {
> > > > int err;
> > > >
> > > > - /* Check the swap cache in case the folio is already there */
> > > > + /*
> > > > + * Check the swap cache first, if a cached folio is found,
> > > > + * return it unlocked. The caller will lock and check it.
> > > > + */
> > > > folio = swap_cache_get_folio(entry);
> > > > if (folio)
> > > > goto got_folio;
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index 4b8ab2cb49ca..12f2580ebe8d 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> > > > @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > > > * Offset could point to the middle of a large folio, or folio
> > > > * may no longer point to the expected offset before it's locked.
> > > > */
> > > > - entry = folio->swap;
> > > > - if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> > > > + if (!folio_contains_swap(folio, entry)) {
> > > > folio_unlock(folio);
> > > > folio_put(folio);
> > > > goto again;
> > > > }
> > > > + entry = folio->swap;
> > >
> > > Can you also check this as well? The "goto again" will have entries
> > > not assigned compared to previously.
> > > Too late for me to think straight now if that will cause a problem.
> >
> > Oh, thanks for pointing this part out. This patch is correct; it's the
> > original behaviour that is not correct. If the folio is no longer
> > valid (the if check here failed), changing the `entry` value beforehand
> > could lead to a wrong lookup in the next attempt with `goto again`. That
> > could lead to reclaim of an unrelated folio. It's a trivial issue
> > though; it might only marginally slow down performance. Maybe I
> > should make a separate patch to fix this issue first in case anyone
> > wants to backport it.
>
> Thanks for the explanation, please do split this subtle behavior
> change out with appropriate commit messages documenting your change,
> why it is safe and the better behavior.
>
> Thanks
Thanks for the review. I think separating two patches (one for
__try_to_reclaim_swap and one for readahead) out of this one should be
good enough and make everyone happy; overall the code is still the
same.
>
> Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
2025-08-30 4:07 ` Chris Li
@ 2025-08-30 15:24 ` Kairui Song
2025-08-31 15:54 ` Kairui Song
2025-08-31 20:04 ` Chris Li
0 siblings, 2 replies; 90+ messages in thread
From: Kairui Song @ 2025-08-30 15:24 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, kernel test robot
On Sat, Aug 30, 2025 at 1:03 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Kairui,
>
> It feels so good to remove that 64M swap cache space. Thank you for
> making it happen.
>
> Some nitpick follows. I am fine as is as well.
>
> Acked-by: Chris Li <chrisl@kernel.org>
Thanks.
>
> Chris
>
> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Swap cluster setup will try to shuffle the clusters on initialization.
> > It was helpful to avoid contention for the swap cache space. The cluster
> > size (2M) was much smaller than each swap cache space (64M), so shuffling
> > the cluster means the allocator will try to allocate swap slots that are
> > in different swap cache spaces for each CPU, reducing the chance of two
> > CPUs using the same swap cache space, and hence reducing the contention.
> >
> > Now, swap cache is managed by swap clusters, this shuffle is pointless.
> > Just remove it, and clean up related macros.
> >
> > This should also improve the HDD swap performance as shuffling IO is a
> > bad idea for HDD, and now the shuffling is gone.
>
> Did you have any numbers to prove that? :-) Last time, the swap
> allocator stress testing already destroyed two of my SAS drives
> dedicated to testing, so I am not very keen on running the HDD swap
> stress test. The HDD swap stress tests are super slow to run; they take
> ages.
I did some test months before, removing the cluster shuffle did help.
I didn't test it again this time, only did some stress test. Doing
performance test on HDD is really not a good experience as my HDD
drives are too old so a long running test kills them easily.
And I couldn't find any other factor that is causing a serial HDD IO
regression, maybe the bot can help verify. If this doesn't help, we'll
think of something else. But I don't think HDD based SWAP will ever
have a practical good performance as they are terrible at rand read...
Anyway, let me try again with HDD today, maybe I'll get some useful data.
>
> >
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swap.h | 4 ----
> > mm/swapfile.c | 32 ++++++++------------------------
> > mm/zswap.c | 7 +++++--
> > 3 files changed, 13 insertions(+), 30 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 4af42bc2cd72..ce3ec62cc05e 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -153,10 +153,6 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
> > void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
> >
> > /* linux/mm/swap_state.c */
> > -/* One swap address space for each 64M swap space */
> > -#define SWAP_ADDRESS_SPACE_SHIFT 14
> > -#define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT)
> > -#define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1)
> > extern struct address_space swap_space __ro_after_init;
> > static inline struct address_space *swap_address_space(swp_entry_t entry)
> > {
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index df68b5e242a6..0c8001c99f30 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -3203,21 +3203,14 @@ static int setup_swap_map(struct swap_info_struct *si,
> > return 0;
> > }
> >
> > -#define SWAP_CLUSTER_INFO_COLS \
> > - DIV_ROUND_UP(L1_CACHE_BYTES, sizeof(struct swap_cluster_info))
> > -#define SWAP_CLUSTER_SPACE_COLS \
> > - DIV_ROUND_UP(SWAP_ADDRESS_SPACE_PAGES, SWAPFILE_CLUSTER)
> > -#define SWAP_CLUSTER_COLS \
> > - max_t(unsigned int, SWAP_CLUSTER_INFO_COLS, SWAP_CLUSTER_SPACE_COLS)
> > -
> > static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> > union swap_header *swap_header,
> > unsigned long maxpages)
> > {
> > unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> > struct swap_cluster_info *cluster_info;
> > - unsigned long i, j, idx;
> > int err = -ENOMEM;
> > + unsigned long i;
>
> Nitpick: This line location change is not necessary.
>
> >
> > cluster_info = kvcalloc(nr_clusters, sizeof(*cluster_info), GFP_KERNEL);
> > if (!cluster_info)
> > @@ -3266,22 +3259,13 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
> > INIT_LIST_HEAD(&si->frag_clusters[i]);
> > }
> >
> > - /*
> > - * Reduce false cache line sharing between cluster_info and
> > - * sharing same address space.
> > - */
> > - for (j = 0; j < SWAP_CLUSTER_COLS; j++) {
> > - for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> > - struct swap_cluster_info *ci;
> > - idx = i * SWAP_CLUSTER_COLS + j;
> > - ci = cluster_info + idx;
> > - if (idx >= nr_clusters)
> > - continue;
> > - if (ci->count) {
> > - ci->flags = CLUSTER_FLAG_NONFULL;
> > - list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> > - continue;
> > - }
> > + for (i = 0; i < nr_clusters; i++) {
> > + struct swap_cluster_info *ci = &cluster_info[i];
>
> struct swap_cluster_info *ci = cluster_info + i;
> looks simpler. Pure nitpick and personal preference, you don't have to
> follow it.
>
> > +
> > + if (ci->count) {
> > + ci->flags = CLUSTER_FLAG_NONFULL;
> > + list_add_tail(&ci->list, &si->nonfull_clusters[0]);
> > + } else {
> > ci->flags = CLUSTER_FLAG_FREE;
> > list_add_tail(&ci->list, &si->free_clusters);
> > }
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index c869859eec77..c0a9be14a725 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -237,10 +237,13 @@ static bool zswap_has_pool;
> > * helpers and fwd declarations
> > **********************************/
> >
> > +/* One swap address space for each 64M swap space */
> > +#define ZSWAP_ADDRESS_SPACE_SHIFT 14
> > +#define ZSWAP_ADDRESS_SPACE_PAGES (1 << ZSWAP_ADDRESS_SPACE_SHIFT)
> > static inline struct xarray *swap_zswap_tree(swp_entry_t swp)
> > {
> > return &zswap_trees[swp_type(swp)][swp_offset(swp)
> > - >> SWAP_ADDRESS_SPACE_SHIFT];
> > + >> ZSWAP_ADDRESS_SPACE_SHIFT];
> > }
> >
> > #define zswap_pool_debug(msg, p) \
> > @@ -1771,7 +1774,7 @@ int zswap_swapon(int type, unsigned long nr_pages)
> > struct xarray *trees, *tree;
> > unsigned int nr, i;
> >
> > - nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES);
> > + nr = DIV_ROUND_UP(nr_pages, ZSWAP_ADDRESS_SPACE_PAGES);
> > trees = kvcalloc(nr, sizeof(*tree), GFP_KERNEL);
> > if (!trees) {
> > pr_err("alloc failed, zswap disabled for swap type %d\n", type);
> > --
> > 2.51.0
> >
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-30 3:34 ` Chris Li
@ 2025-08-30 16:52 ` Kairui Song
2025-08-31 1:00 ` Chris Li
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-08-30 16:52 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 30, 2025 at 11:43 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Introduce basic swap table infrastructures, which are now just a
> > fixed-sized flat array inside each swap cluster, with access wrappers.
> >
> > Each cluster contains a swap table of 512 entries. Each table entry is
> > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > a folio type (pointer), or NULL.
> >
> > In this first step, it only supports storing a folio or shadow, and it
> > is a drop-in replacement for the current swap cache. Convert all swap
> > cache users to use the new sets of APIs. Chris Li has been suggesting
> > using a new infrastructure for swap cache for better performance, and
> > that idea combined well with the swap table as the new backing
> > structure. Now the lock contention range is reduced to 2M clusters,
> > which is much smaller than the 64M address_space. And we can also drop
> > the multiple address_space design.
> >
> > All the internal work is done with the swap_cache_get_* helpers. Swap
> > cache lookups are still lock-less as before, and the helpers' contexts
> > are the same as for the original swap cache helpers. They still require
> > a pin on the swap device to prevent the backing data from being freed.
> >
> > Swap cache updates are now protected by the swap cluster lock
> > instead of the Xarray lock. This is mostly handled internally, but new
> > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > few new cluster access and locking helpers are also introduced.
> >
> > A fully cluster-based unified swap table can be implemented on top
> > of this to take care of all count tracking and synchronization work,
> > with dynamic allocation. It should reduce the memory usage while
> > making the performance even better.
> >
> > Co-developed-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > /*
> > - * This must be called only on folios that have
> > - * been verified to be in the swap cache and locked.
> > - * It will never put the folio into the free list,
> > - * the caller has a reference on the folio.
> > + * Replace an old folio in the swap cache with a new one. The caller must
> > + * hold the cluster lock and set the new folio's entry and flags.
> > */
> > -void delete_from_swap_cache(struct folio *folio)
> > +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> > + struct folio *old, struct folio *new)
> > +{
> > + unsigned int ci_off = swp_cluster_offset(entry);
> > + unsigned long nr_pages = folio_nr_pages(new);
> > + unsigned int ci_end = ci_off + nr_pages;
> > +
> > + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> > + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> > + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> > + do {
> > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > + __swap_table_set_folio(ci, ci_off, new);
>
> I recall that in my original experimental swap cache replacement patch I
> used an atomic compare-exchange somewhere. It has been a while. Is there
> a reason not to use atomic cmpxchg(), or is that in a later part of
> the series?
For now all swap table modifications are protected by ci lock, extra
atomic / cmpxchg is not needed.
We might be able to make use of cmpxchg in later phases. e.g. when
locking a folio is enough to ensure the final consistency of swap
count, cmpxchg can be used as a fast path to increase the swap count.
We can't do that now as the swap count is still managed by swap_map,
not swap table. And swap allocation / dup does not have a clear
definition about how they interact with folios, and range operations
all need the ci lock... We might be able to figure out a stable way
to handle range operations too once we sort out how folios interact
with SWAP in a later phase, I tried that in the previous long series
and this part seems doable.
I'm not sure if that will benefit a lot, or will it make it more
complex for the high order swap table to be implemented. The cluster
lock is already very fine grained. We can do some experiments in the
future to verify it.
But the good thing is in either case, this is on the right path :)
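
Just to sketch the kind of fast path I have in mind for a later phase
(everything below other than __swap_table_get is hypothetical, nothing
like it exists in this series yet):

    unsigned long old, new;

    do {
            old = __swap_table_get(ci, ci_off);
            /* hypothetical predicate: entry is in a simple counted form */
            if (!swp_tb_is_counted(old))
                    return swap_dup_slow(ci, ci_off);  /* hypothetical: fall back to the ci lock */
            new = swp_tb_count_inc(old);               /* hypothetical encoding helper */
    } while (!__swap_table_try_cmpxchg(ci, ci_off, old, new));  /* hypothetical wrapper */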
> > + } while (++ci_off < ci_end);
> > +
> > + /*
> > + * If the old folio is partially replaced (e.g., splitting a large
> > + * folio, the old folio is shrunk in place, and new split sub folios
> > + * are added to cache), ensure the new folio doesn't overlap it.
> > + */
> > + if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> > + folio_order(old) != folio_order(new)) {
> > + ci_off = swp_cluster_offset(old->swap);
> > + ci_end = ci_off + folio_nr_pages(old);
> > + while (ci_off++ < ci_end)
> > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
>
> Will this cause the swap cache to replace less than full folio range
> of the swap entry in range?
> The swap cache set folio should atomically set the full range of swap
> entries. If there is some one race to set some partial range. I
> suspect it should fail and undo the particle set. I recall there are
> some bugs on xarray accidentally fixed by one of your patches related
> to that kind of atomic behavior.
>
> I want to make sure a similar bug does not happen here.
>
> It is worthwhile to double check if the atomic folio set behavior.
Right, some callers that hold the ci lock by themselves (migration /
huge_mm split) have to ensure they do the folio replacement in a
correct way by themselves.
This is the same story for Xarray. These callers just used to hold the
xa lock and manipulate the xarray directly: e.g. split generates new
folios, new sub folios have to be added to swap cache in the right
place to override the old folio. The behavior is the same before /
after this commit, I just added a sanity check here to ensure nothing
went wrong, only to make it more reliable by adding checks in the
debug build.
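
For illustration only (this is not the real huge_mm code, just the shape
of it), a split path with the cluster lock already held conceptually does:

    /*
     * Each new sub folio, with its ->swap already pointing at its own
     * sub-range, is installed over the old large folio slot by slot.
     */
    for (i = 0; i < nr_new; i++)
            __swap_cache_replace_folio(ci, new_folios[i]->swap,
                                       old_folio, new_folios[i]);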
I checked the logic here multiple times and tested it on multiple
kernel versions that have slightly different code for huge_mm split,
all went well.
>
> Looks good to me otherwise. Just waiting for confirmation of the swap
> cache atomic set behavior.
>
> Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-30 15:15 ` Kairui Song
@ 2025-08-30 17:17 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-30 17:17 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 30, 2025 at 8:16 AM Kairui Song <ryncsn@gmail.com> wrote:
> > So the missing readahead stats update behavior is the correct and
> > better behavior. I suggest you split that out as a separate patch with
> > appropriate comments about it too. It is also easier to bisect the
> > commit if that kind of subtle change, which is considered safe, turns
> > out to cause a problem. That does not happen very often, but it has
> > happened before.
>
> Yes, I'm planning to split one patch out for the readahead change.
Ack.
>
> > > > > /*
> > > > > - * Lookup a swap entry in the swap cache. A found folio will be returned
> > > > > - * unlocked and with its refcount incremented.
> > > > > + * swap_cache_get_folio - Lookup a swap entry in the swap cache.
> > > > > *
> > > > > - * Caller must lock the swap device or hold a reference to keep it valid.
> > > > > + * A found folio will be returned unlocked and with its refcount increased.
> > > > > + *
> > > > > + * Context: Caller must ensure @entry is valid and pin the swap device, also
> > > > Is the "pin" the same as "lock the swap device or hold a reference"?
> > > > Not sure why you changed that comment to "pin".
> > >
> > > Yes, it's the same thing. We don't lock the device though; the device
> > > can be pinned by the refcount (get_swap_device) or by locking anything
> > > that is referencing the device (locking the PTL of a PTE that contains
> > > a swap entry pointing to the device, or locking a swap cache folio of a
> > > swap entry that points to the device). So I just used the word "pin".
> > > I added some comments in mm/swap.h in later commits about what the
> > > "pin" means.
> >
> > In that case why not reuse the previous comment keeping "lock the swap
> > device or hold a reference" instead of "pin"?
>
> I'm worried that the sentence "lock the swap device" is kind of fuzzy;
> people may misunderstand it as needing to hold si->lock. Actually
> they only need to hold si->users or lock anything that references the
> device. It's not wrong, but it sounds like overkill.
What you just told me is a lot more useful than what was previously there.
Try to incorporate that into the comments, e.g. instead of "lock the
swap device", say "hold si->users or any other lock" or something like
that.
> > > > It seems to me that you want to add the comment for the return value check.
> > > > Is that it?
> > >
> > > Right, the caller has to check the folio before use, so I'm trying to
> > > document this convention.
> >
> > Again, I recommend reducing the unnecessary impact on the code to make
> > it more obvious what you actually changed. I spent quite some time
> > there trying to figure out what you were trying to accomplish with the
> > comments.
> >
> > > > > + * check the returned folio after locking it (e.g. folio_swap_contains).
> > > > > */
> > > > > struct folio *swap_cache_get_folio(swp_entry_t entry)
> > > > > {
> > > > > @@ -338,7 +340,10 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > > > for (;;) {
> > > > > int err;
> > > > >
> > > > > - /* Check the swap cache in case the folio is already there */
> > > > > + /*
> > > > > + * Check the swap cache first, if a cached folio is found,
> > > > > + * return it unlocked. The caller will lock and check it.
> > > > > + */
> > > > > folio = swap_cache_get_folio(entry);
> > > > > if (folio)
> > > > > goto got_folio;
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index 4b8ab2cb49ca..12f2580ebe8d 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > > > > @@ -240,12 +240,12 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > > > > * Offset could point to the middle of a large folio, or folio
> > > > > * may no longer point to the expected offset before it's locked.
> > > > > */
> > > > > - entry = folio->swap;
> > > > > - if (offset < swp_offset(entry) || offset >= swp_offset(entry) + nr_pages) {
> > > > > + if (!folio_contains_swap(folio, entry)) {
> > > > > folio_unlock(folio);
> > > > > folio_put(folio);
> > > > > goto again;
> > > > > }
> > > > > + entry = folio->swap;
> > > >
> > > > Can you also check this as well? The "goto again" will have entries
> > > > not assigned compared to previously.
> > > > Too late for me to think straight now if that will cause a problem.
> > >
> > > Oh, thanks for pointing this part out. This patch is correct; it's the
> > > original behaviour that is not correct. If the folio is no longer
> > > valid (the if check here failed), changing the `entry` value beforehand
> > > could lead to a wrong lookup in the next attempt with `goto again`. That
> > > could lead to reclaim of an unrelated folio. It's a trivial issue
> > > though; it might only marginally slow down performance. Maybe I
> > > should make a separate patch to fix this issue first in case anyone
> > > wants to backport it.
> >
> > Thanks for the explanation, please do split this subtle behavior
> > change out with appropriate commit messages documenting your change,
> > why it is safe and the better behavior.
> >
> > Thanks
>
> Thanks for the review. I think separating two patches (one for
> __try_to_reclaim_swap and one for readahead) out of this one should be
> good enough and make everyone happy; overall the code is still the
> same.
It is your call. I am happy to review them the same. It might take me
more time to reason about it and slightly delay your series merge to
mm-unstable, that is all.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-30 16:52 ` Kairui Song
@ 2025-08-31 1:00 ` Chris Li
2025-09-02 11:51 ` Kairui Song
0 siblings, 1 reply; 90+ messages in thread
From: Chris Li @ 2025-08-31 1:00 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 30, 2025 at 9:53 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 30, 2025 at 11:43 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Introduce basic swap table infrastructures, which are now just a
> > > fixed-sized flat array inside each swap cluster, with access wrappers.
> > >
> > > Each cluster contains a swap table of 512 entries. Each table entry is
> > > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > > a folio type (pointer), or NULL.
> > >
> > > In this first step, it only supports storing a folio or shadow, and it
> > > is a drop-in replacement for the current swap cache. Convert all swap
> > > cache users to use the new sets of APIs. Chris Li has been suggesting
> > > using a new infrastructure for swap cache for better performance, and
> > > that idea combined well with the swap table as the new backing
> > > structure. Now the lock contention range is reduced to 2M clusters,
> > > which is much smaller than the 64M address_space. And we can also drop
> > > the multiple address_space design.
> > >
> > > All the internal work is done with the swap_cache_get_* helpers. Swap
> > > cache lookups are still lock-less as before, and the helpers' contexts
> > > are the same as for the original swap cache helpers. They still require
> > > a pin on the swap device to prevent the backing data from being freed.
> > >
> > > Swap cache updates are now protected by the swap cluster lock
> > > instead of the Xarray lock. This is mostly handled internally, but new
> > > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > > few new cluster access and locking helpers are also introduced.
> > >
> > > A fully cluster-based unified swap table can be implemented on top
> > > of this to take care of all count tracking and synchronization work,
> > > with dynamic allocation. It should reduce the memory usage while
> > > making the performance even better.
> > >
> > > Co-developed-by: Chris Li <chrisl@kernel.org>
> > > Signed-off-by: Chris Li <chrisl@kernel.org>
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > /*
> > > - * This must be called only on folios that have
> > > - * been verified to be in the swap cache and locked.
> > > - * It will never put the folio into the free list,
> > > - * the caller has a reference on the folio.
> > > + * Replace an old folio in the swap cache with a new one. The caller must
> > > + * hold the cluster lock and set the new folio's entry and flags.
> > > */
> > > -void delete_from_swap_cache(struct folio *folio)
> > > +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> > > + struct folio *old, struct folio *new)
> > > +{
> > > + unsigned int ci_off = swp_cluster_offset(entry);
> > > + unsigned long nr_pages = folio_nr_pages(new);
> > > + unsigned int ci_end = ci_off + nr_pages;
> > > +
> > > + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> > > + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> > > + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> > > + do {
> > > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > > + __swap_table_set_folio(ci, ci_off, new);
> >
> > I recall that in my original experimental swap cache replacement patch I
> > used an atomic compare-exchange somewhere. It has been a while. Is there
> > a reason not to use atomic cmpxchg(), or is that in a later part of
> > the series?
>
> For now all swap table modifications are protected by ci lock, extra
> atomic / cmpxchg is not needed.
>
> We might be able to make use of cmpxchg in later phases. e.g. when
> locking a folio is enough to ensure the final consistency of swap
> count, cmpxchg can be used as a fast path to increase the swap count.
You did not get what I am asking. Let me clarify.
I mean even if we keep the ci lock and do not change that locking
requirement part, in the above code, why can't we use cmpxchg() to
make sure that we only ever overwrite in the form "old" -> "new"?
I am not saying we need to do the lockless part here.
I mean in the possible sequence:

WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
    // still "old" here, no warning issued
// another CPU races and writes "old" to "old2" because of a bug
__swap_table_set_folio(ci, ci_off, new);
    // now "new" overwrites "old2" without any warning
This has the typical race where you check for the value "old", then you
write the value "new". But what if "old" changes to "old2" before you
overwrite it with "new"?
You overwrite "old2" silently.
I mean to catch that.
Using cmpxchg will make sure we only change "old" -> "new". We can then
catch the buggy situation above, where "old2" would be overwritten by "new".
Also, when we find out the entry there is "old2" and not "old", is WARN_ONCE enough?
I also want to discuss what we should do if we did catch "old2"
there in the swap cache instead of "old".
I feel that continuing with WARN_ONCE might not be good enough. It
will let the data corruption propagate.
Should we roll back the new value and fail the swap cache folio set
function to avoid the possible data corruption?
if we found "old2", The new guy can't set the folio to the new value.
Deal with that error. Will that avoid data corruption? Not being able
to make forward progress is still much better than forward progress
with data corruption.
I just don't want silent overwritten values we aren't expecting.
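
Roughly what I am picturing, hand-waving the details (the encoding helper
and the try_cmpxchg wrapper below are made up, only the idea matters):

    old_tb = folio_to_swp_tb(old);        /* made-up encoding helper */
    new_tb = folio_to_swp_tb(new);
    /* only install "new" if the slot still holds "old" */
    if (!__swap_table_try_cmpxchg(ci, ci_off, old_tb, new_tb)) {  /* made-up wrapper */
            WARN_ON_ONCE(1);
            return -EBUSY;  /* fail the replace instead of corrupting data */
    }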
> We can't do that now as the swap count is still managed by swap_map,
> not swap table. And swap allocation / dup does not have a clear
> definition about how they interact with folios, and range operations
> all need the ci lock... We might be able to figure out a stable way
> to handle range operations too once we sort out how folios interact
> with SWAP in a later phase, I tried that in the previous long series
> and this part seems doable.
See above, I don't mean to change the locking logic here, only to assert
that the previous value is "old" when overwriting it with "new".
>
> I'm not sure if that will benefit a lot, or will it make it more
> complex for the high order swap table to be implemented. The cluster
> lock is already very fine grained. We can do some experiments in the
> future to verify it.
I am not asking to change the locking logic, see above.
I feel that we should always use cmpxchg to assign values to the swap
table, just to be paranoid. It is data corruption we are risking here.
> But the good thing is in either case, this is on the right path :)
>
> > > + } while (++ci_off < ci_end);
> > > +
> > > + /*
> > > + * If the old folio is partially replaced (e.g., splitting a large
> > > + * folio, the old folio is shrunk in place, and new split sub folios
> > > + * are added to cache), ensure the new folio doesn't overlap it.
> > > + */
> > > + if (IS_ENABLED(CONFIG_DEBUG_VM) &&
> > > + folio_order(old) != folio_order(new)) {
> > > + ci_off = swp_cluster_offset(old->swap);
> > > + ci_end = ci_off + folio_nr_pages(old);
> > > + while (ci_off++ < ci_end)
> > > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> >
> > Will this cause the swap cache to replace less than full folio range
> > of the swap entry in range?
> > The swap cache set folio should atomically set the full range of swap
> > entries. If there is some one race to set some partial range. I
> > suspect it should fail and undo the particle set. I recall there are
> > some bugs on xarray accidentally fixed by one of your patches related
> > to that kind of atomic behavior.
> >
> > I want to make sure a similar bug does not happen here.
> >
> > It is worthwhile to double check if the atomic folio set behavior.
>
> Right, some callers that hold the ci lock by themselves (migration /
> huge_mm split) have to ensure they do the folio replacement in a
> correct way by themselves.
>
> This is the same story for Xarray. These callers just used to hold the
> xa lock and manipulate the xarray directly: e.g. split generates new
> folios, new sub folios have to be added to swap cache in the right
> place to override the old folio. The behavior is the same before /
> after this commit, I just added a sanity check here to ensure nothing
> went wrong, only to make it more reliable by adding checks in the
> debug build.
>
> I checked the logic here multiple times and tested it on multiple
> kernel versions that have slightly different code for huge_mm split,
> all went well.
Thanks for the double checking.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
2025-08-30 15:24 ` Kairui Song
@ 2025-08-31 15:54 ` Kairui Song
2025-08-31 20:06 ` Chris Li
2025-08-31 20:04 ` Chris Li
1 sibling, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-08-31 15:54 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, kernel test robot
On Sat, Aug 30, 2025 at 11:24 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 30, 2025 at 1:03 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Kairui,
> >
> > It feels so good to remove that 64M swap cache space. Thank you for
> > making it happen.
> >
> > Some nitpick follows. I am fine as is as well.
> >
> > Acked-by: Chris Li <chrisl@kernel.org>
>
> Thanks.
>
> >
> > Chris
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Swap cluster setup will try to shuffle the clusters on initialization.
> > > It was helpful to avoid contention for the swap cache space. The cluster
> > > size (2M) was much smaller than each swap cache space (64M), so shuffling
> > > the cluster means the allocator will try to allocate swap slots that are
> > > in different swap cache spaces for each CPU, reducing the chance of two
> > > CPUs using the same swap cache space, and hence reducing the contention.
> > >
> > > Now, swap cache is managed by swap clusters, this shuffle is pointless.
> > > Just remove it, and clean up related macros.
> > >
> > > This should also improve the HDD swap performance as shuffling IO is a
> > > bad idea for HDD, and now the shuffling is gone.
> >
> > Did you have any numbers to prove that? :-) Last time, the swap
> > allocator stress testing already destroyed two of my SAS drives
> > dedicated to testing, so I am not very keen on running the HDD swap
> > stress test. The HDD swap stress tests are super slow to run; they take
> > ages.
>
> I did some test months before, removing the cluster shuffle did help.
> I didn't test it again this time, only did some stress test. Doing
> performance test on HDD is really not a good experience as my HDD
> drives are too old so a long running test kills them easily.
>
> And I couldn't find any other factor that is causing a serial HDD IO
> regression, maybe the bot can help verify. If this doesn't help, we'll
> think of something else. But I don't think HDD based SWAP will ever
> have a practical good performance as they are terrible at rand read...
>
> Anyway, let me try again with HDD today, maybe I'll get some useful data.
So I tried to run some HDD test for many rounds, basically doing the
test in the URL below manually. Test is done using nr_task = 8. The
HDD swap partition size is 8G.
Do the preparation following:
https://github.com/intel/lkp-tests/blob/master/setup/swapin_setup
(Make usemem hold 8G memory and push them to swap)
And do the test with:
https://github.com/intel/lkp-tests/blob/master/programs/swapin/run
(Use SIGUSR1 to make usemem to read its memory and swapin)
Before this patch:
Test run 1:
1073741824 bytes / 878662493 usecs = 1193 KB/s
33019 usecs to free memory
1073741824 bytes / 891315681 usecs = 1176 KB/s
35144 usecs to free memory
1073741824 bytes / 898801090 usecs = 1166 KB/s
36305 usecs to free memory
1073741824 bytes / 925899753 usecs = 1132 KB/s
20498 usecs to free memory
1073741824 bytes / 927522592 usecs = 1130 KB/s
34397 usecs to free memory
1073741824 bytes / 928164994 usecs = 1129 KB/s
35908 usecs to free memory
1073741824 bytes / 929890294 usecs = 1127 KB/s
35014 usecs to free memory
1073741824 bytes / 929997808 usecs = 1127 KB/s
30491 usecs to free memory
test done
Test run 2:
1073741824 bytes / 771932432 usecs = 1358 KB/s
31194 usecs to free memory
1073741824 bytes / 788739551 usecs = 1329 KB/s
25714 usecs to free memory
1073741824 bytes / 795853979 usecs = 1317 KB/s
33809 usecs to free memory
1073741824 bytes / 798019211 usecs = 1313 KB/s
32019 usecs to free memory
1073741824 bytes / 798771141 usecs = 1312 KB/s
31689 usecs to free memory
1073741824 bytes / 800384757 usecs = 1310 KB/s
32622 usecs to free memory
1073741824 bytes / 800822764 usecs = 1309 KB/s
1073741824 bytes / 800882227 usecs = 1309 KB/s
32789 usecs to free memory
30577 usecs to free memory
test done
Test run 3:
1073741824 bytes / 775202370 usecs = 1352 KB/s
31832 usecs to free memory
1073741824 bytes / 777618372 usecs = 1348 KB/s
30172 usecs to free memory
1073741824 bytes / 778180006 usecs = 1347 KB/s
32482 usecs to free memory
1073741824 bytes / 778521023 usecs = 1346 KB/s
30188 usecs to free memory
1073741824 bytes / 779207791 usecs = 1345 KB/s
29364 usecs to free memory
1073741824 bytes / 780753200 usecs = 1343 KB/s
29860 usecs to free memory
1073741824 bytes / 781078362 usecs = 1342 KB/s
30449 usecs to free memory
1073741824 bytes / 781224993 usecs = 1342 KB/s
19557 usecs to free memory
test done
After this patch:
Test run 1:
1073741824 bytes / 569803736 usecs = 1840 KB/s
29032 usecs to free memory
1073741824 bytes / 573718349 usecs = 1827 KB/s
30399 usecs to free memory
1073741824 bytes / 592070142 usecs = 1771 KB/s
31896 usecs to free memory
1073741824 bytes / 593484694 usecs = 1766 KB/s
30650 usecs to free memory
1073741824 bytes / 596693866 usecs = 1757 KB/s
31582 usecs to free memory
1073741824 bytes / 597359263 usecs = 1755 KB/s
26436 usecs to free memory
1073741824 bytes / 598339187 usecs = 1752 KB/s
30697 usecs to free memory
1073741824 bytes / 598674138 usecs = 1751 KB/s
29791 usecs to free memory
test done
Test run 2:
1073741824 bytes / 578821803 usecs = 1811 KB/s
28433 usecs to free memory
1073741824 bytes / 584262760 usecs = 1794 KB/s
28565 usecs to free memory
1073741824 bytes / 586118970 usecs = 1789 KB/s
27365 usecs to free memory
1073741824 bytes / 589159154 usecs = 1779 KB/s
42645 usecs to free memory
1073741824 bytes / 593487980 usecs = 1766 KB/s
28684 usecs to free memory
1073741824 bytes / 606025290 usecs = 1730 KB/s
28974 usecs to free memory
1073741824 bytes / 607547362 usecs = 1725 KB/s
33221 usecs to free memory
1073741824 bytes / 607882511 usecs = 1724 KB/s
31393 usecs to free memory
test done
Test run 3:
1073741824 bytes / 487637856 usecs = 2150 KB/s
28022 usecs to free memory
1073741824 bytes / 491211037 usecs = 2134 KB/s
28229 usecs to free memory
1073741824 bytes / 527698561 usecs = 1987 KB/s
30265 usecs to free memory
1073741824 bytes / 531719920 usecs = 1972 KB/s
30373 usecs to free memory
1073741824 bytes / 532555758 usecs = 1968 KB/s
30019 usecs to free memory
1073741824 bytes / 532942789 usecs = 1967 KB/s
29354 usecs to free memory
1073741824 bytes / 540793872 usecs = 1938 KB/s
32703 usecs to free memory
1073741824 bytes / 541343777 usecs = 1936 KB/s
33428 usecs to free memory
test done
It seems to match the ~33% swapin.throughput regression reported by
the bot, it's about ~40% faster with this patch applied. I'll add this
test result to V2.
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
2025-08-30 15:24 ` Kairui Song
2025-08-31 15:54 ` Kairui Song
@ 2025-08-31 20:04 ` Chris Li
1 sibling, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-31 20:04 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, kernel test robot
On Sat, Aug 30, 2025 at 8:25 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 30, 2025 at 1:03 PM Chris Li <chrisl@kernel.org> wrote:
> >
> > Hi Kairui,
> >
> > It feels so good to remove that 64M swap cache space. Thank you for
> > making it happen.
> >
> > Some nitpick follows. I am fine as is as well.
> >
> > Acked-by: Chris Li <chrisl@kernel.org>
>
> Thanks.
>
> >
> > Chris
> >
> > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Swap cluster setup will try to shuffle the clusters on initialization.
> > > It was helpful to avoid contention for the swap cache space. The cluster
> > > size (2M) was much smaller than each swap cache space (64M), so shuffling
> > > the cluster means the allocator will try to allocate swap slots that are
> > > in different swap cache spaces for each CPU, reducing the chance of two
> > > CPUs using the same swap cache space, and hence reducing the contention.
> > >
> > > Now, swap cache is managed by swap clusters, this shuffle is pointless.
> > > Just remove it, and clean up related macros.
> > >
> > > This should also improve the HDD swap performance as shuffling IO is a
> > > bad idea for HDD, and now the shuffling is gone.
> >
> > Did you have any numbers to prove that? :-) Last time, the swap
> > allocator stress testing already destroyed two of my SAS drives
> > dedicated to testing, so I am not very keen on running the HDD swap
> > stress test. The HDD swap stress tests are super slow to run; they take
> > ages.
>
> I did some test months before, removing the cluster shuffle did help.
> I didn't test it again this time, only did some stress test. Doing
> performance test on HDD is really not a good experience as my HDD
> drives are too old so a long running test kills them easily.
>
> And I couldn't find any other factor that is causing a serial HDD IO
> regression, maybe the bot can help verify. If this doesn't help, we'll
> think of something else. But I don't think HDD based SWAP will ever
> have a practical good performance as they are terrible at rand read...
>
> Anyway, let me try again with HDD today, maybe I'll get some useful data.
I am pulling your leg about the HDD number :-) I know that number is
hard to get, and I was counting on you ending up with a dead HDD on the
way to getting it. Evil Chris, that is.
I think we don't have to make claims like "HDD is faster". Just say
HDD might or might not swap faster, because we haven't really tested it.
People who care about HDD swap can test it themselves and report back to
the mailing list.
I think we should use an SSD to simulate HDD; testing that the HDD code
works in the simulated mode, with no crash and no data corruption, is
good enough.
HDD swap is so slow that most people don't really care. I asked at LPC
a while back, and nobody there was using HDD swap seriously. At most it
is "just in case my SSD swap overflows, allow it to overflow to HDD".
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
2025-08-31 15:54 ` Kairui Song
@ 2025-08-31 20:06 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-08-31 20:06 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, kernel test robot
On Sun, Aug 31, 2025 at 8:55 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> So I tried to run some HDD test for many rounds, basically doing the
> test in the URL below manually. Test is done using nr_task = 8. The
> HDD swap partition size is 8G.
>
> Do the preparation following:
> https://github.com/intel/lkp-tests/blob/master/setup/swapin_setup
> (Make usemem hold 8G memory and push them to swap)
>
> And do the test with:
> https://github.com/intel/lkp-tests/blob/master/programs/swapin/run
> (Use SIGUSR1 to make usemem to read its memory and swapin)
>
> Before this patch:
> Test run 1:
> 1073741824 bytes / 878662493 usecs = 1193 KB/s
> 33019 usecs to free memory
> 1073741824 bytes / 891315681 usecs = 1176 KB/s
> 35144 usecs to free memory
> 1073741824 bytes / 898801090 usecs = 1166 KB/s
> 36305 usecs to free memory
> 1073741824 bytes / 925899753 usecs = 1132 KB/s
> 20498 usecs to free memory
> 1073741824 bytes / 927522592 usecs = 1130 KB/s
> 34397 usecs to free memory
> 1073741824 bytes / 928164994 usecs = 1129 KB/s
> 35908 usecs to free memory
> 1073741824 bytes / 929890294 usecs = 1127 KB/s
> 35014 usecs to free memory
> 1073741824 bytes / 929997808 usecs = 1127 KB/s
> 30491 usecs to free memory
> test done
>
> Test run 2:
> 1073741824 bytes / 771932432 usecs = 1358 KB/s
> 31194 usecs to free memory
> 1073741824 bytes / 788739551 usecs = 1329 KB/s
> 25714 usecs to free memory
> 1073741824 bytes / 795853979 usecs = 1317 KB/s
> 33809 usecs to free memory
> 1073741824 bytes / 798019211 usecs = 1313 KB/s
> 32019 usecs to free memory
> 1073741824 bytes / 798771141 usecs = 1312 KB/s
> 31689 usecs to free memory
> 1073741824 bytes / 800384757 usecs = 1310 KB/s
> 32622 usecs to free memory
> 1073741824 bytes / 800822764 usecs = 1309 KB/s
> 1073741824 bytes / 800882227 usecs = 1309 KB/s
> 32789 usecs to free memory
> 30577 usecs to free memory
> test done
>
> Test run 3:
> 1073741824 bytes / 775202370 usecs = 1352 KB/s
> 31832 usecs to free memory
> 1073741824 bytes / 777618372 usecs = 1348 KB/s
> 30172 usecs to free memory
> 1073741824 bytes / 778180006 usecs = 1347 KB/s
> 32482 usecs to free memory
> 1073741824 bytes / 778521023 usecs = 1346 KB/s
> 30188 usecs to free memory
> 1073741824 bytes / 779207791 usecs = 1345 KB/s
> 29364 usecs to free memory
> 1073741824 bytes / 780753200 usecs = 1343 KB/s
> 29860 usecs to free memory
> 1073741824 bytes / 781078362 usecs = 1342 KB/s
> 30449 usecs to free memory
> 1073741824 bytes / 781224993 usecs = 1342 KB/s
> 19557 usecs to free memory
> test done
>
>
> After this patch:
> Test run 1:
> 1073741824 bytes / 569803736 usecs = 1840 KB/s
> 29032 usecs to free memory
> 1073741824 bytes / 573718349 usecs = 1827 KB/s
> 30399 usecs to free memory
> 1073741824 bytes / 592070142 usecs = 1771 KB/s
> 31896 usecs to free memory
> 1073741824 bytes / 593484694 usecs = 1766 KB/s
> 30650 usecs to free memory
> 1073741824 bytes / 596693866 usecs = 1757 KB/s
> 31582 usecs to free memory
> 1073741824 bytes / 597359263 usecs = 1755 KB/s
> 26436 usecs to free memory
> 1073741824 bytes / 598339187 usecs = 1752 KB/s
> 30697 usecs to free memory
> 1073741824 bytes / 598674138 usecs = 1751 KB/s
> 29791 usecs to free memory
> test done
>
> Test run 2:
> 1073741824 bytes / 578821803 usecs = 1811 KB/s
> 28433 usecs to free memory
> 1073741824 bytes / 584262760 usecs = 1794 KB/s
> 28565 usecs to free memory
> 1073741824 bytes / 586118970 usecs = 1789 KB/s
> 27365 usecs to free memory
> 1073741824 bytes / 589159154 usecs = 1779 KB/s
> 42645 usecs to free memory
> 1073741824 bytes / 593487980 usecs = 1766 KB/s
> 28684 usecs to free memory
> 1073741824 bytes / 606025290 usecs = 1730 KB/s
> 28974 usecs to free memory
> 1073741824 bytes / 607547362 usecs = 1725 KB/s
> 33221 usecs to free memory
> 1073741824 bytes / 607882511 usecs = 1724 KB/s
> 31393 usecs to free memory
> test done
>
> Test run 3:
> 1073741824 bytes / 487637856 usecs = 2150 KB/s
> 28022 usecs to free memory
> 1073741824 bytes / 491211037 usecs = 2134 KB/s
> 28229 usecs to free memory
> 1073741824 bytes / 527698561 usecs = 1987 KB/s
> 30265 usecs to free memory
> 1073741824 bytes / 531719920 usecs = 1972 KB/s
> 30373 usecs to free memory
> 1073741824 bytes / 532555758 usecs = 1968 KB/s
> 30019 usecs to free memory
> 1073741824 bytes / 532942789 usecs = 1967 KB/s
> 29354 usecs to free memory
> 1073741824 bytes / 540793872 usecs = 1938 KB/s
> 32703 usecs to free memory
> 1073741824 bytes / 541343777 usecs = 1936 KB/s
> 33428 usecs to free memory
> test done
>
> It seems to match the ~33% swapin.throughput regression reported by
> the bot, it's about ~40% faster with this patch applied. I'll add this
> test result to V2.
Oh, wow, you do have the HDD numbers, congrats. Now we can make the
claim with numbers.
I hope you did not cripple an HDD to get them.
Thanks.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-30 1:53 ` Chris Li
2025-08-30 15:15 ` Kairui Song
@ 2025-09-01 18:17 ` Kairui Song
2025-09-01 21:10 ` Chris Li
1 sibling, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-09-01 18:17 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 30, 2025 at 9:54 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Wed, Aug 27, 2025 at 7:36 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Wed, Aug 27, 2025 at 4:21 PM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index e9d0d2784cd5..b4d39f2a1e0a 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -2379,8 +2379,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > > > count_vm_event(PGMAJFAULT);
> > > > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > > > }
> > > > - } else {
> > > > - swap_update_readahead(folio, NULL, 0);
> > >
> > > Also, moving this readahead update to later might have a similar problem.
> > > All the bail-outs after the move will lose the readahead status update.
> > >
> > > The readahead deed is already done. Missing the status update seems
> > > incorrect.
> >
> > Thanks for the detailed review.
> >
> > The only change I wanted here is that the swap readahead update should
> > be done after checking that the folio still corresponds to the swap
> > entry triggering the swapin. That should have little to no effect
> > compared to before, considering the extremely tiny time window. We are
> > only following the convention more strictly.
> >
> > In theory it might even help to reduce false updates: if the folio no
> > longer corresponds to the swap entry, we are hitting an unrelated
> > folio, so doing a readahead update will either mislead the vma
> > readahead's address hint, or clear the readahead flag of an unrelated
> > folio without actually using it. If that folio does get hit in the
> > future, due to the missing readahead flag, the statistics will go
> > wrong.
>
> So the missing readahead stats update behavior is the correct and
> better behavior. I suggest you split that out as a separate patch with
> appropriate comments about it too. It is also easier to bisect the
> commit if that kind of subtle change, which is considered safe, turns
> out to cause a problem. That does not happen very often, but it has
> happened before.
>
Hmm, on second thought, maybe we should keep it as it is for now.
I just realized that moving swap_update_readahead after folio_lock is
about more than just ensuring the folio is still valid. It will also cause
every swapin to do a readahead update. Previously, only a cache-hit
swapin would do a swap readahead update.
I did some tests and didn't see any measurable performance difference
between putting it before or after the folio_lock. But changing it for
no good reason doesn't seem like a good idea after all.
So I think I'll keep it before the folio_lock. There is no evidence of
which strategy is better, so just keep the current behaviour.
Calling swap_update_readahead even if the swap cache folio
is already invalidated is not really harmful; the only thing it does
that may affect the folio is the folio_test_clear_readahead call in
it, and we have been doing that for years with no problem. Calling
swap_update_readahead for every folio might not be a good idea.
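
So the ordering I am keeping is roughly this (a simplified sketch of the
shmem path, not the exact code):

    folio = swap_cache_get_folio(entry);
    if (folio)
            swap_update_readahead(folio, NULL, 0);  /* cache hit only, as today */
    /* ... otherwise allocate and read the folio in ... */
    folio_lock(folio);
    if (!folio_contains_swap(folio, entry)) {
            /* the folio was reused, bail out and let the caller retry */
            folio_unlock(folio);
            folio_put(folio);
    }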
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-09-01 18:17 ` Kairui Song
@ 2025-09-01 21:10 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-09-01 21:10 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Mon, Sep 1, 2025 at 11:18 AM Kairui Song <ryncsn@gmail.com> wrote:
> > So the missing readahead stats update behavior is the correct and
> > better behavior. I suggest you split that out as a separate patch with
> > appropriate comments about it too. It is also easier to bisect the
> > commit if that kind of subtle change, which is considered safe,
> > turns out to cause a problem. That does not happen very often, but
> > it has happened before.
> >
>
> Hmm, on second thought, maybe we should keep it as it is for now.
>
> I just realized that moving swap_update_readahead after folio_lock is
> more than just ensuring the folio is still valid. It will also cause
> every swapin to do a readahead update. Previously, only a cache-hit
> swapin would do a swap readahead update.
>
> I did some tests and didn't see any measurable performance difference
> between putting it before or after the folio_lock. But changing it for
> no good reason doesn't seem like a good idea after all.
>
> So I think I'll keep it before the folio_lock. There is no evidence
> that either strategy is better, so just keep the current behaviour.
>
> Calling swap_update_readahead even if the swap cache folio
> is already invalidated is not really harmful; the only thing it does
> that may affect the folio is the folio_test_clear_readahead call in
> it, and we have been doing that for years with no problem. Calling
> swap_update_readahead for every folio, however, might not be a good idea.
Thanks for the update. That is what I originally felt as well. This
detour code path caught me by surprise, and I needed to spend extra
time on it. I don't see it as necessary for the phase I goal of
merging the swap table.
We can still add it later as a separate patch if we want.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
` (2 preceding siblings ...)
2025-08-28 3:20 ` Baolin Wang
@ 2025-09-01 23:50 ` Barry Song
2025-09-02 6:12 ` Kairui Song
2025-09-02 10:06 ` David Hildenbrand
2025-09-02 10:10 ` David Hildenbrand
5 siblings, 1 reply; 90+ messages in thread
From: Barry Song @ 2025-09-01 23:50 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 23, 2025 at 3:20 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Always use swap_cache_get_folio for swap cache folio look up. The reason
> we are not using it in all places is that it also updates the readahead
> info, and some callsites want to avoid that.
>
> So decouple readahead update with swap cache lookup into a standalone
> helper, let the caller call the readahead update helper if that's
> needed. And convert all swap cache lookups to use swap_cache_get_folio.
>
> After this commit, there are only three special cases for accessing swap
> cache space now: huge memory splitting, migration and shmem replacing,
> because they need to lock the Xarray. Following commits will wrap their
> accesses to the swap cache too with special helpers.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Nice! This has cleaned up the confusing mix of using
swap_cache_get_folio with VMA, filemap_get_entry,
swap_cache_get_folio without VMA, and filemap_get_folio.
Reviewed-by: Barry Song <baohua@kernel.org>
Do we have any potential "dropbehind" cases for anonymous folios?
I guess not for now.
__filemap_get_folio()
{
if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
folio_clear_dropbehind(folio);
}
Can we mention something about it in the changelog?
Best regards
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
2025-08-27 6:13 ` Chris Li
2025-08-27 7:03 ` Chris Li
@ 2025-09-02 5:40 ` Barry Song
2025-09-02 10:18 ` David Hildenbrand
3 siblings, 0 replies; 90+ messages in thread
From: Barry Song @ 2025-09-02 5:40 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
> +/**
> + * folio_contains_swap - Does this folio contain this swap entry?
> + * @folio: The folio.
> + * @entry: The swap entry to check against.
> + *
> + * Swap version of folio_contains()
> + *
> + * Context: The caller should have the folio locked to ensure
> + * nothing will move it out of the swap cache.
> + * Return: true or false.
> + */
> +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> +{
> + pgoff_t offset = swp_offset(entry);
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio); ?
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers
2025-08-22 19:20 ` [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
2025-08-30 2:31 ` Chris Li
@ 2025-09-02 5:53 ` Barry Song
2025-09-02 10:20 ` David Hildenbrand
2 siblings, 0 replies; 90+ messages in thread
From: Barry Song @ 2025-09-02 5:53 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> No feature change, move cluster related definitions and helpers to
> mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock
> helpers, so they can be used outside of swap files.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-27 17:44 ` Chris Li
2025-08-27 23:46 ` Baoquan He
@ 2025-09-02 6:01 ` Barry Song
2025-09-03 9:28 ` David Hildenbrand
2 siblings, 0 replies; 90+ messages in thread
From: Barry Song @ 2025-09-02 6:01 UTC (permalink / raw)
To: Chris Li
Cc: Baoquan He, Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Thu, Aug 28, 2025 at 1:45 AM Chris Li <chrisl@kernel.org> wrote:
>
>
> BTW, off topic here. I really don't like the "_info" suffix. Anything
> you can put into a C struct is by definition some kind of information.
> The same goes for _struct: anything defined by a struct is a struct;
> there's no need to say that.
> "struct swap_info_struct" has both of these unnecessary words. It
> should be something like "struct swap_file" or "struct swap_device".
> Renaming it would be too invasive to the code base, though, and it
> would mess up the git annotation history.
Also, `sis` and `si` are being mixed up all the time.
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-22 19:20 ` [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-08-27 3:47 ` Baoquan He
@ 2025-09-02 6:02 ` Barry Song
2025-09-02 13:33 ` David Hildenbrand
2 siblings, 0 replies; 90+ messages in thread
From: Barry Song @ 2025-09-02 6:02 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> swp_swap_info is the most commonly used helper for retrieving swap info.
> It has an internal check that may lead to a NULL return value, but
> almost none of its caller checks the return value, making the internal
> check pointless. In fact, most of these callers already ensured the
> entry is valid and never expect a NULL value.
>
> Tidy this up and shorten the name. If the caller can make sure the
> swap entry/type is valid and the device is pinned, use the new introduced
> swp_info/swp_type_info instead. They have more debug sanity checks and
> lower overhead as they are inlined.
>
> Callers that may expect a NULL value should use
> swp_get_info/swp_type_get_info instead.
>
> No feature change. The rearranged codes should have had no effect, or
> they should have been hitting NULL de-ref bugs already. Some new sanity
> checks have been added to the debug build to catch potential misuse.
> And the new helpers will be used by swap cache when working with locked
> swap cache folios, as a locked swap cache ensures the entries are valid
> and stable.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-09-01 23:50 ` Barry Song
@ 2025-09-02 6:12 ` Kairui Song
2025-09-02 6:52 ` Chris Li
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-09-02 6:12 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 9:14 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Aug 23, 2025 at 3:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Always use swap_cache_get_folio for swap cache folio look up. The reason
> > we are not using it in all places is that it also updates the readahead
> > info, and some callsites want to avoid that.
> >
> > So decouple readahead update with swap cache lookup into a standalone
> > helper, let the caller call the readahead update helper if that's
> > needed. And convert all swap cache lookups to use swap_cache_get_folio.
> >
> > After this commit, there are only three special cases for accessing swap
> > cache space now: huge memory splitting, migration and shmem replacing,
> > because they need to lock the Xarray. Following commits will wrap their
> > accesses to the swap cache too with special helpers.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Nice! This has cleaned up the confusing mix of using
> swap_cache_get_folio with VMA, filemap_get_entry,
> swap_cache_get_folio without VMA, and filemap_get_folio.
>
> Reviewed-by: Barry Song <baohua@kernel.org>
Thanks!
>
> Do we have any potential "dropbehind" cases for anonymous folios?
> I guess not for now.
>
Right, dropbehind doesn't apply to anon yet.
> __filemap_get_folio()
> {
> if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
> folio_clear_dropbehind(folio);
> }
>
> Can we mention something about it in the changelog?
I can add some words about it in the commit message. One can easily
tell that if we want dropbehind for anon, swap_cache_get_folio will be
the right place to handle the related logic.
Now that swap_cache_get_folio is the only place for swap cache lookup,
and since in the next phase we'll make the swap cache layer the
unified way to do swap synchronization and never bypass it, dropbehind
may become easier to do too.
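Roughly, something like this (purely illustrative; anon dropbehind
does not exist yet, so the dropbehind branch below is hypothetical and
not part of this series):

/*
 * Sketch only: where anon dropbehind handling could slot into the
 * unified lookup helper, mirroring __filemap_get_folio(). The
 * dropbehind branch is hypothetical.
 */
struct folio *swap_cache_get_folio(swp_entry_t entry)
{
	struct folio *folio = filemap_get_folio(swap_address_space(entry),
						swap_cache_index(entry));

	if (IS_ERR(folio))
		return NULL;
	if (folio_test_dropbehind(folio))	/* hypothetical for anon */
		folio_clear_dropbehind(folio);
	return folio;
}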
>
> Best regards
> Barry
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-09-02 6:12 ` Kairui Song
@ 2025-09-02 6:52 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-09-02 6:52 UTC (permalink / raw)
To: Kairui Song
Cc: Barry Song, linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Mon, Sep 1, 2025 at 11:13 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 9:14 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Aug 23, 2025 at 3:20 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Always use swap_cache_get_folio for swap cache folio look up. The reason
> > > we are not using it in all places is that it also updates the readahead
> > > info, and some callsites want to avoid that.
> > >
> > > So decouple readahead update with swap cache lookup into a standalone
> > > helper, let the caller call the readahead update helper if that's
> > > needed. And convert all swap cache lookups to use swap_cache_get_folio.
> > >
> > > After this commit, there are only three special cases for accessing swap
> > > cache space now: huge memory splitting, migration and shmem replacing,
> > > because they need to lock the Xarray. Following commits will wrap their
> > > accesses to the swap cache too with special helpers.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> >
> > Nice! This has cleaned up the confusing mix of using
> > swap_cache_get_folio with VMA, filemap_get_entry,
> > swap_cache_get_folio without VMA, and filemap_get_folio.
> >
> > Reviewed-by: Barry Song <baohua@kernel.org>
>
> Thanks!
>
> >
> > Do we have any potential "dropbehind" cases for anonymous folios?
> > I guess not for now.
> >
>
> Right, dropbehind doesn't apply to anon yet.
>
> > __filemap_get_folio()
> > {
> > if (folio_test_dropbehind(folio) && !(fgp_flags & FGP_DONTCACHE))
> > folio_clear_dropbehind(folio);
> > }
> >
> > Can we mention something about it in the changelog?
>
> I can add some words about it in the commit message. One can easily
> tell that if we want dropbehind for anon, swap_cache_get_folio will be
> the right place to handle the related logic.
>
> Now that swap_cache_get_folio is the only place for swap cache lookup,
> and since in the next phase we'll make the swap cache layer the
> unified way to do swap synchronization and never bypass it, dropbehind
> may become easier to do too.
Thanks for the cleaning up and unified swap cache synchronization.
Really appreciate it.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
2025-08-30 1:54 ` Baoquan He
2025-08-30 3:34 ` Chris Li
@ 2025-09-02 9:55 ` Barry Song
2025-09-02 11:58 ` Kairui Song
2025-09-03 11:41 ` David Hildenbrand
3 siblings, 1 reply; 90+ messages in thread
From: Barry Song @ 2025-09-02 9:55 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
> +
> +/*
> + * Helpers for accessing or modifying the swap table of a cluster,
> + * the swap cluster must be locked.
> + */
> +static inline void __swap_table_set(struct swap_cluster_info *ci,
> + unsigned int off, unsigned long swp_tb)
> +{
> + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> + atomic_long_set(&ci->table[off], swp_tb);
> +}
> +
> +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> + unsigned int off)
> +{
> + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> + return atomic_long_read(&ci->table[off]);
> +}
> +
Why should this use atomic_long instead of just WRITE_ONCE and
READ_ONCE?
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
` (3 preceding siblings ...)
2025-09-01 23:50 ` Barry Song
@ 2025-09-02 10:06 ` David Hildenbrand
2025-09-02 12:32 ` Chris Li
2025-09-02 16:38 ` Kairui Song
2025-09-02 10:10 ` David Hildenbrand
5 siblings, 2 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-02 10:06 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Always use swap_cache_get_folio for swap cache folio look up. The reason
> we are not using it in all places is that it also updates the readahead
> info, and some callsites want to avoid that.
>
> So decouple readahead update with swap cache lookup into a standalone
> helper, let the caller call the readahead update helper if that's
> needed. And convert all swap cache lookups to use swap_cache_get_folio.
>
> After this commit, there are only three special cases for accessing swap
> cache space now: huge memory splitting, migration and shmem replacing,
> because they need to lock the Xarray. Following commits will wrap their
> accesses to the swap cache too with special helpers.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 6 ++-
> mm/mincore.c | 3 +-
> mm/shmem.c | 4 +-
> mm/swap.h | 13 +++++--
> mm/swap_state.c | 99 +++++++++++++++++++++++-------------------------
> mm/swapfile.c | 11 +++---
> mm/userfaultfd.c | 5 +--
> 7 files changed, 72 insertions(+), 69 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index d9de6c056179..10ef528a5f44 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> if (unlikely(!si))
> goto out;
>
> - folio = swap_cache_get_folio(entry, vma, vmf->address);
> - if (folio)
> + folio = swap_cache_get_folio(entry);
> + if (folio) {
> + swap_update_readahead(folio, vma, vmf->address);
> page = folio_file_page(folio, swp_offset(entry));
> + }
> swapcache = folio;
>
> if (!folio) {
> diff --git a/mm/mincore.c b/mm/mincore.c
> index 2f3e1816a30d..8ec4719370e1 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
> if (!si)
> return 0;
> }
> - folio = filemap_get_entry(swap_address_space(entry),
> - swap_cache_index(entry));
> + folio = swap_cache_get_folio(entry);
> if (shmem)
> put_swap_device(si);
> /* The swap cache space contains either folio, shadow or NULL */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 13cc51df3893..e9d0d2784cd5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> }
>
> /* Look it up and read it in.. */
> - folio = swap_cache_get_folio(swap, NULL, 0);
> + folio = swap_cache_get_folio(swap);
> if (!folio) {
> if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> /* Direct swapin skipping swap cache & readahead */
> @@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> count_vm_event(PGMAJFAULT);
> count_memcg_event_mm(fault_mm, PGMAJFAULT);
> }
> + } else {
> + swap_update_readahead(folio, NULL, 0);
> }
>
> if (order > folio_order(folio)) {
> diff --git a/mm/swap.h b/mm/swap.h
> index 1ae44d4193b1..efb6d7ff9f30 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
> void clear_shadow_from_swap_cache(int type, unsigned long begin,
> unsigned long end);
> void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr);
> +struct folio *swap_cache_get_folio(swp_entry_t entry);
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> struct swap_iocb **plug);
> @@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> struct mempolicy *mpol, pgoff_t ilx);
> struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
> struct vm_fault *vmf);
> +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> + unsigned long addr);
>
> static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> @@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> return NULL;
> }
>
> +static inline void swap_update_readahead(struct folio *folio,
> + struct vm_area_struct *vma, unsigned long addr)
> +{
> +}
> +
> static inline int swap_writeout(struct folio *folio,
> struct swap_iocb **swap_plug)
> {
> @@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
> {
> }
>
> -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr)
> +static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> return NULL;
> }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 99513b74b5d8..ff9eb761a103 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
> printk("Total swap = %lukB\n", K(total_swap_pages));
> }
>
> +/*
While at it, proper kerneldoc?
/**
etc.
Also documenting that it will only return a valid folio pointer or NULL
> + * Lookup a swap entry in the swap cache. A found folio will be returned
> + * unlocked and with its refcount incremented.
> + *
> + * Caller must lock the swap device or hold a reference to keep it valid.
> + */
> +struct folio *swap_cache_get_folio(swp_entry_t entry)
> +{
> + struct folio *folio = filemap_get_folio(swap_address_space(entry),
> + swap_cache_index(entry));
> + if (!IS_ERR(folio))
> + return folio;
> + return NULL;
Maybe better as (avoid one !)
if (IS_ERR(folio))
return NULL;
return folio;
or simply
return IS_ERR(folio) ? NULL : folio.
> +}
> +
> void *get_shadow_from_swap_cache(swp_entry_t entry)
> {
> struct address_space *address_space = swap_address_space(entry);
> @@ -273,54 +288,40 @@ static inline bool swap_use_vma_readahead(void)
> }
>
> /*
> - * Lookup a swap entry in the swap cache. A found folio will be returned
> - * unlocked and with its refcount incremented - we rely on the kernel
> - * lock getting page table operations atomic even if we drop the folio
> - * lock before returning.
> - *
> - * Caller must lock the swap device or hold a reference to keep it valid.
> + * Update the readahead statistics of a vma or globally.
> */
> -struct folio *swap_cache_get_folio(swp_entry_t entry,
> - struct vm_area_struct *vma, unsigned long addr)
This also sounds like a good kerneldoc candidate :)
In particular, documenting that it is valid to pass in vma == NULL (in
which case the addr is ignored).
> +void swap_update_readahead(struct folio *folio,
> + struct vm_area_struct *vma,
> + unsigned long addr)
> {
Apart from that LGTM.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 7/9] mm, swap: remove contention workaround for swap cache
2025-08-22 19:20 ` [PATCH 7/9] mm, swap: remove contention workaround for swap cache Kairui Song
2025-08-30 4:07 ` Chris Li
@ 2025-09-02 10:06 ` Barry Song
1 sibling, 0 replies; 90+ messages in thread
From: Barry Song @ 2025-09-02 10:06 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel, kernel test robot
On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Swap cluster setup will try to shuffle the clusters on initialization.
> It was helpful to avoid contention for the swap cache space. The cluster
> size (2M) was much smaller than each swap cache space (64M), so shuffling
> the cluster means the allocator will try to allocate swap slots that are
> in different swap cache spaces for each CPU, reducing the chance of two
> CPUs using the same swap cache space, and hence reducing the contention.
>
> Now, swap cache is managed by swap clusters, this shuffle is pointless.
> Just remove it, and clean up related macros.
>
> This should also improve the HDD swap performance as shuffling IO is a
> bad idea for HDD, and now the shuffling is gone.
>
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202504241621.f27743ec-lkp@intel.com
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
Reviewed-by: Barry Song <baohua@kernel.org>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
` (4 preceding siblings ...)
2025-09-02 10:06 ` David Hildenbrand
@ 2025-09-02 10:10 ` David Hildenbrand
2025-09-02 17:13 ` Kairui Song
5 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-09-02 10:10 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Always use swap_cache_get_folio for swap cache folio look up. The reason
> we are not using it in all places is that it also updates the readahead
> info, and some callsites want to avoid that.
>
> So decouple readahead update with swap cache lookup into a standalone
> helper, let the caller call the readahead update helper if that's
> needed. And convert all swap cache lookups to use swap_cache_get_folio.
>
> After this commit, there are only three special cases for accessing swap
> cache space now: huge memory splitting, migration and shmem replacing,
> because they need to lock the Xarray. Following commits will wrap their
> accesses to the swap cache too with special helpers.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> +void swap_update_readahead(struct folio *folio,
> + struct vm_area_struct *vma,
> + unsigned long addr)
> {
Oh, one thing. Regarding recent const-correctness discussions, "folio"
should probably be const here.
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
` (2 preceding siblings ...)
2025-09-02 5:40 ` Barry Song
@ 2025-09-02 10:18 ` David Hildenbrand
2025-09-02 10:21 ` David Hildenbrand
2025-09-02 12:46 ` Chris Li
3 siblings, 2 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-02 10:18 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Swap cache lookup is lockless, it only increases the reference count
> of the returned folio. That's not enough to ensure a folio is stable in
> the swap cache, so the folio could be removed from the swap cache at any
> time. The caller always has to lock and check the folio before use.
>
> Document this as a comment, and introduce a helper for swap cache folio
> verification with proper sanity checks.
>
> Also, sanitize all current users to use this convention, and use the new
> helper when possible for easier debugging. Some existing callers won't
> cause any major problem right now, only trivial issues like incorrect
> readahead statistic (swapin) or wasted loop (swapoff). It's better to
> always follow this convention to make things robust.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
[...]
> +/**
> + * folio_contains_swap - Does this folio contain this swap entry?
> + * @folio: The folio.
> + * @entry: The swap entry to check against.
> + *
> + * Swap version of folio_contains()
> + *
> + * Context: The caller should have the folio locked to ensure
> + * nothing will move it out of the swap cache.
> + * Return: true or false.
> + */
I appreciate the kerneldoc.
Intuitively, this should be called "..._swap_entry".
But I wonder if "contains" is really the right term to use here. It's
more like that a swap entry "belongs to" (was assigned to) a folio, right?
Sure, we store the information in the folio, but the "contains" is a bit
weird.
folio_matches_swp_entry() maybe?
> +static inline bool folio_contains_swap(struct folio *folio, swp_entry_t entry)
> +{
const struct folio *
> + pgoff_t offset = swp_offset(entry);
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> + if (unlikely(!folio_test_swapcache(folio)))
> + return false;
> + if (unlikely(swp_type(entry) != swp_type(folio->swap)))
> + return false;
> + return offset - swp_offset(folio->swap) < folio_nr_pages(folio);
> +}
> +
> void show_swap_cache_info(void);
> void *get_shadow_from_swap_cache(swp_entry_t entry);
> int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
> @@ -144,6 +167,11 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry)
> return 0;
> }
>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers
2025-08-22 19:20 ` [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
2025-08-30 2:31 ` Chris Li
2025-09-02 5:53 ` Barry Song
@ 2025-09-02 10:20 ` David Hildenbrand
2025-09-02 12:50 ` Chris Li
2 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-09-02 10:20 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> No feature change, move cluster related definitions and helpers to
> mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock
> helpers, so they can be used outside of swap files.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
[...]
>
> -#ifdef CONFIG_THP_SWAP
> -#define SWAPFILE_CLUSTER HPAGE_PMD_NR
> -
> -#define swap_entry_order(order) (order)
> -#else
> -#define SWAPFILE_CLUSTER 256
> -
> -/*
> - * Define swap_entry_order() as constant to let compiler to optimize
> - * out some code if !CONFIG_THP_SWAP
> - */
> -#define swap_entry_order(order) 0
> -#endif
> -#define LATENCY_LIMIT 256
> +#define LATENCY_LIMIT 256
No need to touch that line IMHO.
I enjoy the new function names.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-09-02 10:18 ` David Hildenbrand
@ 2025-09-02 10:21 ` David Hildenbrand
2025-09-02 12:46 ` Chris Li
1 sibling, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-02 10:21 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 02.09.25 12:18, David Hildenbrand wrote:
> On 22.08.25 21:20, Kairui Song wrote:
>> From: Kairui Song <kasong@tencent.com>
>>
>> Swap cache lookup is lockless, it only increases the reference count
>> of the returned folio. That's not enough to ensure a folio is stable in
>> the swap cache, so the folio could be removed from the swap cache at any
>> time. The caller always has to lock and check the folio before use.
>>
>> Document this as a comment, and introduce a helper for swap cache folio
>> verification with proper sanity checks.
>>
>> Also, sanitize all current users to use this convention, and use the new
>> helper when possible for easier debugging. Some existing callers won't
>> cause any major problem right now, only trivial issues like incorrect
>> readahead statistic (swapin) or wasted loop (swapoff). It's better to
>> always follow this convention to make things robust.
>>
>> Signed-off-by: Kairui Song <kasong@tencent.com>
>> ---
>
> [...]
>
>> +/**
>> + * folio_contains_swap - Does this folio contain this swap entry?
>> + * @folio: The folio.
>> + * @entry: The swap entry to check against.
>> + *
>> + * Swap version of folio_contains()
>> + *
>> + * Context: The caller should have the folio locked to ensure
>> + * nothing will move it out of the swap cache.
>> + * Return: true or false.
>> + */
>
> I appreciate the kerneldoc.
>
> Intuitively, this should be called "..._swap_entry".
>
> But I wonder if "contains" is really the right term to use here. It's
> more like that a swap entry "belongs to" (was assigned to) a folio, right?
>
> Sure, we store the information in the folio, but the "contains" is a bit
> weird.
>
> folio_matches_swp_entry() maybe?
folio_matches_swap_entry() is what I wanted to say :)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-08-22 19:20 ` [PATCH 8/9] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-08-30 4:17 ` Chris Li
@ 2025-09-02 11:15 ` Barry Song
2025-09-02 13:17 ` Chris Li
1 sibling, 1 reply; 90+ messages in thread
From: Barry Song @ 2025-09-02 11:15 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now swap table is cluster based, which means free clusters can free its
> table since no one should modify it.
>
> There could be speculative readers, like swap cache look up, protect
> them by making them RCU safe. All swap table should be filled with null
> entries before free, so such readers will either see a NULL pointer or
> a null filled table being lazy freed.
>
> On allocation, allocate the table when a cluster is used by any order.
>
Might be a silly question.
Just curious—what happens if the allocation fails? Does the swap-out
operation also fail? We sometimes encounter strange issues when memory is
very limited, especially if the reclamation path itself needs to allocate
memory.
Assume a case where we want to swap out a folio using clusterN. We then
attempt to swap out the following folios with the same clusterN. But if
the allocation of the swap_table keeps failing, what will happen?
> This way, we can reduce the memory usage of large swap device
> significantly.
>
> This idea to dynamically release unused swap cluster data was initially
> suggested by Chris Li while proposing the cluster swap allocator and
> I found it suits the swap table idea very well.
>
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-31 1:00 ` Chris Li
@ 2025-09-02 11:51 ` Kairui Song
0 siblings, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-09-02 11:51 UTC (permalink / raw)
To: Chris Li
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Sun, Aug 31, 2025 at 9:00 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Sat, Aug 30, 2025 at 9:53 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Sat, Aug 30, 2025 at 11:43 AM Chris Li <chrisl@kernel.org> wrote:
> > >
> > > On Fri, Aug 22, 2025 at 12:21 PM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > Introduce basic swap table infrastructures, which are now just a
> > > > fixed-sized flat array inside each swap cluster, with access wrappers.
> > > >
> > > > Each cluster contains a swap table of 512 entries. Each table entry is
> > > > an opaque atomic long. It could be in 3 types: a shadow type (XA_VALUE),
> > > > a folio type (pointer), or NULL.
> > > >
> > > > In this first step, it only supports storing a folio or shadow, and it
> > > > is a drop-in replacement for the current swap cache. Convert all swap
> > > > cache users to use the new sets of APIs. Chris Li has been suggesting
> > > > using a new infrastructure for swap cache for better performance, and
> > > > that idea combined well with the swap table as the new backing
> > > > structure. Now the lock contention range is reduced to 2M clusters,
> > > > which is much smaller than the 64M address_space. And we can also drop
> > > > the multiple address_space design.
> > > >
> > > > All the internal works are done with swap_cache_get_* helpers. Swap
> > > > cache lookup is still lock-less like before, and the helper's contexts
> > > > are same with original swap cache helpers. They still require a pin
> > > > on the swap device to prevent the backing data from being freed.
> > > >
> > > > Swap cache updates are now protected by the swap cluster lock
> > > > instead of the Xarray lock. This is mostly handled internally, but new
> > > > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > > > few new cluster access and locking helpers are also introduced.
> > > >
> > > > A fully cluster-based unified swap table can be implemented on top
> > > > of this to take care of all count tracking and synchronization work,
> > > > with dynamic allocation. It should reduce the memory usage while
> > > > making the performance even better.
> > > >
> > > > Co-developed-by: Chris Li <chrisl@kernel.org>
> > > > Signed-off-by: Chris Li <chrisl@kernel.org>
> > > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > > ---
> > > > /*
> > > > - * This must be called only on folios that have
> > > > - * been verified to be in the swap cache and locked.
> > > > - * It will never put the folio into the free list,
> > > > - * the caller has a reference on the folio.
> > > > + * Replace an old folio in the swap cache with a new one. The caller must
> > > > + * hold the cluster lock and set the new folio's entry and flags.
> > > > */
> > > > -void delete_from_swap_cache(struct folio *folio)
> > > > +void __swap_cache_replace_folio(struct swap_cluster_info *ci, swp_entry_t entry,
> > > > + struct folio *old, struct folio *new)
> > > > +{
> > > > + unsigned int ci_off = swp_cluster_offset(entry);
> > > > + unsigned long nr_pages = folio_nr_pages(new);
> > > > + unsigned int ci_end = ci_off + nr_pages;
> > > > +
> > > > + VM_WARN_ON_ONCE(entry.val != new->swap.val);
> > > > + VM_WARN_ON_ONCE(!folio_test_locked(old) || !folio_test_locked(new));
> > > > + VM_WARN_ON_ONCE(!folio_test_swapcache(old) || !folio_test_swapcache(new));
> > > > + do {
> > > > + WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old);
> > > > + __swap_table_set_folio(ci, ci_off, new);
> > >
> > > I recall that in my original experimental swap cache replacement patch
> > > I used an atomic compare exchange somewhere. It has been a while. Is
> > > there a reason not to use atomic cmpxchg(), or is that in a later part
> > > of the series?
> >
> > For now all swap table modifications are protected by ci lock, extra
> > atomic / cmpxchg is not needed.
> >
> > We might be able to make use of cmpxchg in later phases. e.g. when
> > locking a folio is enough to ensure the final consistency of swap
> > count, cmpxchg can be used as a fast path to increase the swap count.
>
> You did not get what I am asking. Let me clarify.
>
> I mean even if we keep the ci lock and don't change that locking
> requirement part: in the above code, why can't we use cmpxchg to
> make sure that we only ever overwrite "old" -> "new"?
> I am not saying we need to do the lockless part here.
>
> I mean in the possible sequence
> WARN_ON_ONCE(swp_tb_to_folio(__swap_table_get(ci, ci_off)) != old); //
> still "old" here, not warning issued
> /// another CPU race writes "old" to "old2" because of a bug.
> __swap_table_set_folio(ci, ci_off, new); // now "new" overwrites
> "old2" without warning.
>
> This has the typical race: you check that the value is "old", then you
> write the value "new". But what if "old" changes to "old2" before you
> overwrite it with "new"?
> You overwrite "old2" silently.
>
> I mean to catch that.
>
> Using cmpxchg will make sure we only change "old" -> "new", so we can
> catch the buggy situation above instead of silently overwriting "old2" with "new".
> Also, when we find that the entry there is "old2" and not "old", is WARN_ONCE enough?
>
> I also want to discuss what we should do if we did catch the "old2"
> there in the swap cache instead of "old".
> I feel that continuing with just WARN_ONCE might not be good enough. It
> will let the data corruption propagate.
>
> Should we roll back the new value and fail the swap cache folio set
> function to avoid the possible data corruption?
> If we find "old2", the new caller can't set the folio to the new value
> and has to deal with that error. Will that avoid data corruption? Not being able
> to make forward progress is still much better than forward progress
> with data corruption.
>
> I just don't want silent overwritten values we aren't expecting.
Right, I just thought this through. If we are super cautious
during the early phase, we can have more non-debug checks for
potential bugs.
There are currently three places that modify the swap table: replace
(huge_mm, migration, shmem), insert (swapin / swapout), and delete. I
checked the details; basically, in all cases there is no way to
roll back. Once the data is somehow corrupted, any operation could go
in the wrong direction.
So yeah, let me add some more checks.
I'll slightly adjust swap_cache_add_folio too. In this V1,
swap_cache_add_folio is designed to allow races and returns an int for
potential conflicts. But it should never fail in V1, because there is
currently no racing caller: we still rely on SWAP_HAS_CACHE to pin
slots before installing the swap cache. We will kill this ugly dance
very soon in phase 3 (phase 2, which removes the SYNC_IO swapin, is an
important step). I used this version of swap_cache_add_folio from
later phases just to make things easier later. So in V1, let's make it
WARN/BUG if any conflicting folio exists and always return void;
that's safer for catching potential bugs. I'll change
swap_cache_add_folio to allow the race again in a later phase.
For other places, I think a relaxed xchg with WARN/BUG should be just fine.
Later phases can also use something like a CONFIG_DEBUG_VM_SWAP to
wrap these, after things are verified to be stable.
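For example, a minimal sketch of the kind of check I have in mind for
the replace path (folio_to_swp_tb() below is an assumed encoding
helper, named here only for illustration):

/*
 * Sketch only: replace "old" with "new" under the cluster lock, but
 * refuse to silently overwrite an unexpected value.
 */
static void __swap_table_replace_checked(struct swap_cluster_info *ci,
					 unsigned int off,
					 struct folio *old, struct folio *new)
{
	long old_tb = folio_to_swp_tb(old);	/* assumed encoding helper */
	long new_tb = folio_to_swp_tb(new);
	long cur_tb;

	/* Relaxed ordering is fine here, the cluster lock is held */
	cur_tb = atomic_long_xchg_relaxed(&ci->table[off], new_tb);

	/* Someone raced and stored something unexpected: complain loudly */
	WARN_ON_ONCE(cur_tb != old_tb);
}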
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-09-02 9:55 ` Barry Song
@ 2025-09-02 11:58 ` Kairui Song
2025-09-02 23:44 ` Barry Song
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-09-02 11:58 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 6:46 PM Barry Song <21cnbao@gmail.com> wrote:
>
> > +
> > +/*
> > + * Helpers for accessing or modifying the swap table of a cluster,
> > + * the swap cluster must be locked.
> > + */
> > +static inline void __swap_table_set(struct swap_cluster_info *ci,
> > + unsigned int off, unsigned long swp_tb)
> > +{
> > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > + atomic_long_set(&ci->table[off], swp_tb);
> > +}
> > +
> > +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> > + unsigned int off)
> > +{
> > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > + return atomic_long_read(&ci->table[off]);
> > +}
> > +
>
> Why should this use atomic_long instead of just WRITE_ONCE and
> READ_ONCE?
Hi Barry,
That's a very good question. There are multiple reasons. I wanted to
wrap all access to the swap table to ensure there is no non-atomic
access, since it's almost always wrong to read a folio or shadow value
from it non-atomically, and users should never access swap tables
directly without the wrapper helpers. Also, as Chris suggested in
another reply, we can use atomic operations to catch potential issues
easily too.
Most importantly, later phases can make use of things like
atomic_cmpxchg as a fast path to update the swap count of a swap
entry. That's a bit hard to explain for now; the short summary is that
the swap table will use a single atomic for both count and folio
tracking, and we'll clean up the folio workflow with swap, so it
should be possible to get final consistency of the swap count by
simply locking the folio, and doing atomic_cmpxchg on the swap table
with the folio locked will be safe.
For now, using atomic doesn't bring any overhead or complexity, and it
only makes it easier to implement other code. So I think it should be good.
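Just to illustrate the direction (the swp_tb_* count helpers below are
made up for this illustration and are not part of this series), the
fast path could look roughly like:

/*
 * Sketch only: bump the swap count stored in the same atomic word
 * without taking the cluster lock, assuming the folio is locked.
 */
static bool swap_table_try_inc_count(struct swap_cluster_info *ci,
				     unsigned int off)
{
	long old = atomic_long_read(&ci->table[off]);
	long new;

	do {
		if (swp_tb_count_is_max(old))	/* hypothetical helper */
			return false;		/* fall back to the locked slow path */
		new = swp_tb_count_inc(old);	/* hypothetical helper */
	} while (!atomic_long_try_cmpxchg(&ci->table[off], &old, new));

	return true;
}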
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-09-02 10:06 ` David Hildenbrand
@ 2025-09-02 12:32 ` Chris Li
2025-09-02 13:18 ` David Hildenbrand
2025-09-02 16:38 ` Kairui Song
1 sibling, 1 reply; 90+ messages in thread
From: Chris Li @ 2025-09-02 12:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi,
Baolin Wang, Ying Huang, Johannes Weiner, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 3:07 AM David Hildenbrand <david@redhat.com> wrote:
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 99513b74b5d8..ff9eb761a103 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
> > printk("Total swap = %lukB\n", K(total_swap_pages));
> > }
> >
> > +/*
>
> While at it, proper kerneldoc?
Agree, add the kerneldoc while we are at it. Those are important APIs
for interacting with the swap table.
BTW, I already submitted a design doc for the swap table to Kairui,
which will show up as the first patch in the V2 series.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-09-02 10:18 ` David Hildenbrand
2025-09-02 10:21 ` David Hildenbrand
@ 2025-09-02 12:46 ` Chris Li
2025-09-02 13:27 ` Kairui Song
1 sibling, 1 reply; 90+ messages in thread
From: Chris Li @ 2025-09-02 12:46 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi,
Baolin Wang, Ying Huang, Johannes Weiner, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 3:18 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.25 21:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Swap cache lookup is lockless, it only increases the reference count
> > of the returned folio. That's not enough to ensure a folio is stable in
> > the swap cache, so the folio could be removed from the swap cache at any
> > time. The caller always has to lock and check the folio before use.
> >
> > Document this as a comment, and introduce a helper for swap cache folio
> > verification with proper sanity checks.
> >
> > Also, sanitize all current users to use this convention, and use the new
> > helper when possible for easier debugging. Some existing callers won't
> > cause any major problem right now, only trivial issues like incorrect
> > readahead statistic (swapin) or wasted loop (swapoff). It's better to
> > always follow this convention to make things robust.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
>
> [...]
>
> > +/**
> > + * folio_contains_swap - Does this folio contain this swap entry?
> > + * @folio: The folio.
> > + * @entry: The swap entry to check against.
> > + *
> > + * Swap version of folio_contains()
> > + *
> > + * Context: The caller should have the folio locked to ensure
> > + * nothing will move it out of the swap cache.
> > + * Return: true or false.
> > + */
>
> I appreciate the kerneldoc.
>
> Intuitively, this should be called "..._swap_entry".
>
> But I wonder if "contains" is really the right term to use here. It's
> more like that a swap entry "belongs to" (was assigned to) a folio, right?
Right, in the other design doc I use the word "binding" for the
relationship between folio and swap entry. As if it is a binding
contract, your folio data goes and only goes here. There is no owning
relationship. Other folios might want to compete and win over the
binding contract as well (the race in swap in).
> Sure, we store the information in the folio, but the "contains" is a bit
> weird.
>
> folio_matches_swp_entry() maybe?
Yes, I like the name folio_match_swap_entry() you suggested in the
other email as well.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers
2025-09-02 10:20 ` David Hildenbrand
@ 2025-09-02 12:50 ` Chris Li
0 siblings, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-09-02 12:50 UTC (permalink / raw)
To: David Hildenbrand
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi,
Baolin Wang, Ying Huang, Johannes Weiner, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 3:21 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.25 21:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > No feature change, move cluster related definitions and helpers to
> > mm/swap.h, also tidy up and add a "swap_" prefix for cluster lock/unlock
> > helpers, so they can be used outside of swap files.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
>
> [...]
>
> >
> > -#ifdef CONFIG_THP_SWAP
> > -#define SWAPFILE_CLUSTER HPAGE_PMD_NR
> > -
> > -#define swap_entry_order(order) (order)
> > -#else
> > -#define SWAPFILE_CLUSTER 256
> > -
> > -/*
> > - * Define swap_entry_order() as constant to let compiler to optimize
> > - * out some code if !CONFIG_THP_SWAP
> > - */
> > -#define swap_entry_order(order) 0
> > -#endif
> > -#define LATENCY_LIMIT 256
> > +#define LATENCY_LIMIT 256
>
> No need to touch that line IMHO.
>
>
> I enjoy the new function names.
I enjoy that naming convention too, wink wink.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-09-02 11:15 ` Barry Song
@ 2025-09-02 13:17 ` Chris Li
2025-09-02 16:57 ` Kairui Song
2025-09-02 23:31 ` Barry Song
0 siblings, 2 replies; 90+ messages in thread
From: Chris Li @ 2025-09-02 13:17 UTC (permalink / raw)
To: Barry Song
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 4:15 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Now swap table is cluster based, which means free clusters can free its
> > table since no one should modify it.
> >
> > There could be speculative readers, like swap cache look up, protect
> > them by making them RCU safe. All swap table should be filled with null
> > entries before free, so such readers will either see a NULL pointer or
> > a null filled table being lazy freed.
> >
> > On allocation, allocate the table when a cluster is used by any order.
> >
>
> Might be a silly question.
>
> Just curious—what happens if the allocation fails? Does the swap-out
> operation also fail? We sometimes encounter strange issues when memory is
> very limited, especially if the reclamation path itself needs to allocate
> memory.
>
> Assume a case where we want to swap out a folio using clusterN. We then
> attempt to swap out the following folios with the same clusterN. But if
> the allocation of the swap_table keeps failing, what will happen?
I think this is the same behavior as XArray node allocation with no memory.
The swap allocator will fail to isolate this cluster and gets a NULL
ci pointer as the return value. The swap allocator will then try other
cluster lists, e.g. non_full, fragment, etc.
If all of them fail, folio_alloc_swap() will return -ENOMEM, which
will propagate back to the swap-out attempt and then the shrink folio
list, which will put this page back on the LRU.
The shrink folio list either frees enough memory (happy path) or fails
to free enough memory, which will cause an OOM kill.
I believe XArray would previously also return -ENOMEM when inserting a
pointer while unable to allocate a node to hold that pointer. It has
the same error propagation path. We did not change that.
Chris
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-09-02 12:32 ` Chris Li
@ 2025-09-02 13:18 ` David Hildenbrand
0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-02 13:18 UTC (permalink / raw)
To: Chris Li
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi,
Baolin Wang, Ying Huang, Johannes Weiner, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On 02.09.25 14:32, Chris Li wrote:
> On Tue, Sep 2, 2025 at 3:07 AM David Hildenbrand <david@redhat.com> wrote:
>>> diff --git a/mm/swap_state.c b/mm/swap_state.c
>>> index 99513b74b5d8..ff9eb761a103 100644
>>> --- a/mm/swap_state.c
>>> +++ b/mm/swap_state.c
>>> @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
>>> printk("Total swap = %lukB\n", K(total_swap_pages));
>>> }
>>>
>>> +/*
>>
>> While at it, proper kerneldoc?
>
> Agree, add the kerneldoc while we are at it. Those are important API
> to interact with the swap table.
>
> BTW, I already submitted a design doc for swap table to Kairui, which
> will show up as the first patch in the V2 series
Nice!
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use
2025-09-02 12:46 ` Chris Li
@ 2025-09-02 13:27 ` Kairui Song
0 siblings, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-09-02 13:27 UTC (permalink / raw)
To: Chris Li
Cc: David Hildenbrand, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Baoquan He, Nhat Pham, Kemeng Shi,
Baolin Wang, Ying Huang, Johannes Weiner, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 9:03 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Sep 2, 2025 at 3:18 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 22.08.25 21:20, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Swap cache lookup is lockless, it only increases the reference count
> > > of the returned folio. That's not enough to ensure a folio is stable in
> > > the swap cache, so the folio could be removed from the swap cache at any
> > > time. The caller always has to lock and check the folio before use.
> > >
> > > Document this as a comment, and introduce a helper for swap cache folio
> > > verification with proper sanity checks.
> > >
> > > Also, sanitize all current users to use this convention, and use the new
> > > helper when possible for easier debugging. Some existing callers won't
> > > cause any major problem right now, only trivial issues like incorrect
> > > readahead statistic (swapin) or wasted loop (swapoff). It's better to
> > > always follow this convention to make things robust.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> >
> > [...]
> >
> > > +/**
> > > + * folio_contains_swap - Does this folio contain this swap entry?
> > > + * @folio: The folio.
> > > + * @entry: The swap entry to check against.
> > > + *
> > > + * Swap version of folio_contains()
> > > + *
> > > + * Context: The caller should have the folio locked to ensure
> > > + * nothing will move it out of the swap cache.
> > > + * Return: true or false.
> > > + */
> >
> > I appreciate the kerneldoc.
> >
> > Intuitively, this should be called "..._swap_entry".
> >
> > But I wonder if "contains" is really the right term to use here. It's
> > more like that a swap entry "belongs to" (was assigned to) a folio, right?
>
> Right, in the other design doc I use the word "binding" for the
> relationship between folio and swap entry. As if it is a binding
> contract, your folio data goes and only goes here. There is no owning
> relationship. Other folios might want to compete and win over the
> binding contract as well (the race in swap in).
>
> > Sure, we store the information in the folio, but the "contains" is a bit
> > weird.
> >
> > folio_matches_swp_entry() maybe?
>
> Yes, I like the name folio_match_swap_entry() you suggested in the
> other email as well.
I like this name too. The `folio_contains_swap` name comes from
`folio_contains`, as it's just the swap version of it. I also
found the name a bit strange since they are different things, but I had no
better idea. Thanks for the suggestion.
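For reference, a minimal sketch of what the renamed helper could look
like, assuming the folio stores its first swap entry in folio->swap as
the current code does (illustrative only, not necessarily the exact
code in the series):

    static inline bool folio_matches_swap_entry(struct folio *folio,
                                                swp_entry_t entry)
    {
            swp_entry_t folio_entry = folio->swap;
            unsigned long nr_pages = folio_nr_pages(folio);

            VM_WARN_ON_ONCE(!folio_test_locked(folio));
            if (!folio_test_swapcache(folio))
                    return false;
            if (swp_type(folio_entry) != swp_type(entry))
                    return false;
            /* The folio covers nr_pages contiguous swap slots. */
            return swp_offset(entry) - swp_offset(folio_entry) < nr_pages;
    }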
>
> Chris
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-22 19:20 ` [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-08-27 3:47 ` Baoquan He
2025-09-02 6:02 ` Barry Song
@ 2025-09-02 13:33 ` David Hildenbrand
2025-09-02 15:03 ` Kairui Song
2 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-09-02 13:33 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> swp_swap_info is the most commonly used helper for retrieving swap info.
> It has an internal check that may lead to a NULL return value, but
> almost none of its caller checks the return value, making the internal
> check pointless. In fact, most of these callers already ensured the
> entry is valid and never expect a NULL value.
>
> Tidy this up and shorten the name.
Shorter != better. But yes, "swp_swap" was a mess.
> If the caller can make sure the
> swap entry/type is valid and the device is pinned, use the new introduced
> swp_info/swp_type_info instead. They have more debug sanity checks and
> lower overhead as they are inlined.
>
> Callers that may expect a NULL value should use
> swp_get_info/swp_type_get_info instead.
High-level comments:
1) I hate the "swp" vs. "swap". Is that a valuable distinction or could
we just convert it to "swap" as we touch it?
You're converting swap_type_to_swap_info() to swp_type_to_swap_info(),
and I am not sure if that is the right direction :)
2) Can we just call it "swap_entry" when we work on a swap entry and
"swap_type" when we work on a swap type in the function name?
swp_info() is a rather bad function name.
3) I am not sure about "to" -> "get". "to" is much more readable in that
context and consistent.
4) swp_info[] vs. swap_info() gah.
I would just have done:
swap_type_to_info(int type)
__swap_type_to_info(int type)
swap_entry_to_info(swp_entry_t entry)
__swap_entry_to_info(swp_entry_t entry)
__ are the expert functions where we don't expect NULL.
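As a rough sketch of that pairing (illustrative only; it assumes the
swap_info[] array is visible to the header, as this series already
makes it):

    static inline struct swap_info_struct *swap_type_to_info(int type)
    {
            /* May return NULL, e.g. for a type that never had a device. */
            if (type >= MAX_SWAPFILES)
                    return NULL;
            return READ_ONCE(swap_info[type]);
    }

    static inline struct swap_info_struct *__swap_type_to_info(int type)
    {
            struct swap_info_struct *si = READ_ONCE(swap_info[type]);

            /* "Expert" variant: the caller guarantees a valid, pinned device. */
            VM_WARN_ON_ONCE(!si);
            return si;
    }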
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-09-02 13:33 ` David Hildenbrand
@ 2025-09-02 15:03 ` Kairui Song
2025-09-03 8:11 ` David Hildenbrand
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-09-02 15:03 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On Tue, Sep 2, 2025 at 10:14 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.25 21:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > swp_swap_info is the most commonly used helper for retrieving swap info.
> > It has an internal check that may lead to a NULL return value, but
> > almost none of its caller checks the return value, making the internal
> > check pointless. In fact, most of these callers already ensured the
> > entry is valid and never expect a NULL value.
> >
> > Tidy this up and shorten the name.
>
> Shorter != better. But yes, "swp_swap" was a mess.
>
> > If the caller can make sure the
> > swap entry/type is valid and the device is pinned, use the new introduced
> > swp_info/swp_type_info instead. They have more debug sanity checks and
> > lower overhead as they are inlined.
> >
> > Callers that may expect a NULL value should use
> > swp_get_info/swp_type_get_info instead.
>
> High-level comments:
>
> 1) I hate the "swp" vs. "swap". Is that a valuable distinction or could
> we just convert it to "swap" as we touch it?
Totally agree. I was just blindly following the old style. It's kind
of confusing indeed.
>
> You're converting swap_type_to_swap_info() to swp_type_to_swap_info(),
> and I am not sure if that is the right direction :)
>
>
> 2) Can we just call it "swap_entry" when we work on a swap entry and
> "swap_type" when we work on a swap type in the function name?
>
> swp_info() is a rather bad function name.
>
>
> 3) I am not sure about "to" -> "get". "to" is much more readable in that
> context and consistent.
>
>
> 4) swp_info[] vs. swap_info() gah.
>
>
> I would just have done:
>
> swap_type_to_info(int type)
> __swap_type_to_info(int type)
> swap_entry_to_info(swp_entry_t entry)
> __swap_entry_to_info(swp_entry_t entry)
>
> __ are the expert functions where we don't expect NULL.
>
Thanks a lot for the suggestions! I also like the idea of using "__"
to separate the non-NULL version a lot, and it implies the caller has to be
careful.
My concern was that names will get very long in later commits
following this convention, which is also the reason I wanted to shorten
them here.
A lot of swap-related operations will be cluster based, so it will be
very common to get the offset or the swap cluster from a swap entry.
We will end up having a really long name like
__swap_entry_to_cluster_offset (convert a swap entry to its offset inside a
cluster).
Since we already have the swap entry type called `swp_entry_t` and
helpers like `swp_offset` and `swp_swap_info` that convert an entry to
other swap things, I thought that anything converting a swap entry /
offset to something else was named `swp_*`.
Maybe that was a bad practice; we can fix it while at it, or at least no
longer introduce more confusing names.
I can follow this suggested style. Would it be a good idea to have the
following set of helpers?
For swap cluster and swap device (swap_info_struct):
swap_type_to_info(int)
__swap_type_to_info(int)
swap_entry_to_info(swp_entry_t)
__swap_entry_to_info(swp_entry_t)
__swap_offset_to_cluster(struct swap_info_struct *, pgoff_t)
__swap_entry_to_cluster(swp_entry_t)
And for offsets, we still use:
swp_offset() (Existing helper)
swp_cluster_offset()
Now all swp_* helpers are pure arithmetic operations (we just renamed
swp_swap_info, which seems to be the only exception). Is this better?
I'm open to suggestions as I'm really bad at naming things :)
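As an illustration of the layering above, here is a hedged sketch of how
the cluster helpers could be built on top of the entry helpers, assuming
the flat per-device cluster_info[] array the current code keeps (not the
exact code of the series):

    static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
    {
            struct swap_info_struct *si = __swap_entry_to_info(entry);

            /* Pure arithmetic: every SWAPFILE_CLUSTER slots form one cluster. */
            return &si->cluster_info[swp_offset(entry) / SWAPFILE_CLUSTER];
    }

    static inline unsigned int swp_cluster_offset(swp_entry_t entry)
    {
            /* Offset of the entry inside its cluster's swap table. */
            return swp_offset(entry) % SWAPFILE_CLUSTER;
    }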
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-09-02 10:06 ` David Hildenbrand
2025-09-02 12:32 ` Chris Li
@ 2025-09-02 16:38 ` Kairui Song
1 sibling, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-09-02 16:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On Tue, Sep 2, 2025 at 7:22 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.25 21:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Always use swap_cache_get_folio for swap cache folio look up. The reason
> > we are not using it in all places is that it also updates the readahead
> > info, and some callsites want to avoid that.
> >
> > So decouple readahead update with swap cache lookup into a standalone
> > helper, let the caller call the readahead update helper if that's
> > needed. And convert all swap cache lookups to use swap_cache_get_folio.
> >
> > After this commit, there are only three special cases for accessing swap
> > cache space now: huge memory splitting, migration and shmem replacing,
> > because they need to lock the Xarray. Following commits will wrap their
> > accesses to the swap cache too with special helpers.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 6 ++-
> > mm/mincore.c | 3 +-
> > mm/shmem.c | 4 +-
> > mm/swap.h | 13 +++++--
> > mm/swap_state.c | 99 +++++++++++++++++++++++-------------------------
> > mm/swapfile.c | 11 +++---
> > mm/userfaultfd.c | 5 +--
> > 7 files changed, 72 insertions(+), 69 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index d9de6c056179..10ef528a5f44 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4660,9 +4660,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > if (unlikely(!si))
> > goto out;
> >
> > - folio = swap_cache_get_folio(entry, vma, vmf->address);
> > - if (folio)
> > + folio = swap_cache_get_folio(entry);
> > + if (folio) {
> > + swap_update_readahead(folio, vma, vmf->address);
> > page = folio_file_page(folio, swp_offset(entry));
> > + }
> > swapcache = folio;
> >
> > if (!folio) {
> > diff --git a/mm/mincore.c b/mm/mincore.c
> > index 2f3e1816a30d..8ec4719370e1 100644
> > --- a/mm/mincore.c
> > +++ b/mm/mincore.c
> > @@ -76,8 +76,7 @@ static unsigned char mincore_swap(swp_entry_t entry, bool shmem)
> > if (!si)
> > return 0;
> > }
> > - folio = filemap_get_entry(swap_address_space(entry),
> > - swap_cache_index(entry));
> > + folio = swap_cache_get_folio(entry);
> > if (shmem)
> > put_swap_device(si);
> > /* The swap cache space contains either folio, shadow or NULL */
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 13cc51df3893..e9d0d2784cd5 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2354,7 +2354,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > }
> >
> > /* Look it up and read it in.. */
> > - folio = swap_cache_get_folio(swap, NULL, 0);
> > + folio = swap_cache_get_folio(swap);
> > if (!folio) {
> > if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
> > /* Direct swapin skipping swap cache & readahead */
> > @@ -2379,6 +2379,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> > count_vm_event(PGMAJFAULT);
> > count_memcg_event_mm(fault_mm, PGMAJFAULT);
> > }
> > + } else {
> > + swap_update_readahead(folio, NULL, 0);
> > }
> >
> > if (order > folio_order(folio)) {
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 1ae44d4193b1..efb6d7ff9f30 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -62,8 +62,7 @@ void delete_from_swap_cache(struct folio *folio);
> > void clear_shadow_from_swap_cache(int type, unsigned long begin,
> > unsigned long end);
> > void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> > -struct folio *swap_cache_get_folio(swp_entry_t entry,
> > - struct vm_area_struct *vma, unsigned long addr);
> > +struct folio *swap_cache_get_folio(swp_entry_t entry);
> > struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > struct vm_area_struct *vma, unsigned long addr,
> > struct swap_iocb **plug);
> > @@ -74,6 +73,8 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> > struct mempolicy *mpol, pgoff_t ilx);
> > struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
> > struct vm_fault *vmf);
> > +void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> > + unsigned long addr);
> >
> > static inline unsigned int folio_swap_flags(struct folio *folio)
> > {
> > @@ -159,6 +160,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
> > return NULL;
> > }
> >
> > +static inline void swap_update_readahead(struct folio *folio,
> > + struct vm_area_struct *vma, unsigned long addr)
> > +{
> > +}
> > +
> > static inline int swap_writeout(struct folio *folio,
> > struct swap_iocb **swap_plug)
> > {
> > @@ -169,8 +175,7 @@ static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entr
> > {
> > }
> >
> > -static inline struct folio *swap_cache_get_folio(swp_entry_t entry,
> > - struct vm_area_struct *vma, unsigned long addr)
> > +static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> > {
> > return NULL;
> > }
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 99513b74b5d8..ff9eb761a103 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -69,6 +69,21 @@ void show_swap_cache_info(void)
> > printk("Total swap = %lukB\n", K(total_swap_pages));
> > }
> >
> > +/*
>
> While at it, proper kerneldoc?
>
> /**
>
> etc.
>
> Also documenting that it will only return a valid folio pointer or NULL
Good suggestion. I added some kerneldoc in a later commit for this
function; doing it earlier here is better.
>
> > + * Lookup a swap entry in the swap cache. A found folio will be returned
> > + * unlocked and with its refcount incremented.
> > + *
> > + * Caller must lock the swap device or hold a reference to keep it valid.
> > + */
> > +struct folio *swap_cache_get_folio(swp_entry_t entry)
> > +{
> > + struct folio *folio = filemap_get_folio(swap_address_space(entry),
> > + swap_cache_index(entry));
> > + if (!IS_ERR(folio))
> > + return folio;
> > + return NULL;
>
> Maybe better as (avoid one !)
>
> if (IS_ERR(folio))
> return NULL;
> return folio;
>
> or simply
>
> return IS_ERR(folio) ? NULL : folio.
>
> > +}
> > +
> > void *get_shadow_from_swap_cache(swp_entry_t entry)
> > {
> > struct address_space *address_space = swap_address_space(entry);
> > @@ -273,54 +288,40 @@ static inline bool swap_use_vma_readahead(void)
> > }
> >
> > /*
> > - * Lookup a swap entry in the swap cache. A found folio will be returned
> > - * unlocked and with its refcount incremented - we rely on the kernel
> > - * lock getting page table operations atomic even if we drop the folio
> > - * lock before returning.
> > - *
> > - * Caller must lock the swap device or hold a reference to keep it valid.
> > + * Update the readahead statistics of a vma or globally.
> > */
> > -struct folio *swap_cache_get_folio(swp_entry_t entry,
> > - struct vm_area_struct *vma, unsigned long addr)
>
> This also sounds like a good kerneldoc candidate :)
>
> In particular, documenting that it is valid to pass in vma == NULL (in
> which case the addr is ignored).
Agree, I forgot this one, will add some doc.
>
> > +void swap_update_readahead(struct folio *folio,
> > + struct vm_area_struct *vma,
> > + unsigned long addr)
> > {
>
>
> Apart from that LGTM.
Thanks!
>
> --
> Cheers
>
> David / dhildenb
>
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-09-02 13:17 ` Chris Li
@ 2025-09-02 16:57 ` Kairui Song
2025-09-02 23:31 ` Barry Song
1 sibling, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-09-02 16:57 UTC (permalink / raw)
To: Chris Li
Cc: Barry Song, linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 9:20 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Sep 2, 2025 at 4:15 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Now swap table is cluster based, which means free clusters can free its
> > > table since no one should modify it.
> > >
> > > There could be speculative readers, like swap cache look up, protect
> > > them by making them RCU safe. All swap table should be filled with null
> > > entries before free, so such readers will either see a NULL pointer or
> > > a null filled table being lazy freed.
> > >
> > > On allocation, allocate the table when a cluster is used by any order.
> > >
> >
> > Might be a silly question.
> >
> > Just curious—what happens if the allocation fails? Does the swap-out
> > operation also fail? We sometimes encounter strange issues when memory is
> > very limited, especially if the reclamation path itself needs to allocate
> > memory.
> >
> > Assume a case where we want to swap out a folio using clusterN. We then
> > attempt to swap out the following folios with the same clusterN. But if
> > the allocation of the swap_table keeps failing, what will happen?
>
> I think this is the same behavior as the XArray allocation node with no memory.
> The swap allocator will fail to isolate this cluster, it gets a NULL
> ci pointer as return value. The swap allocator will try other cluster
> lists, e.g. non_full, fragment etc.
> If all of them fail, the folio_alloc_swap() will return -ENOMEM. Which
> will propagate back to the try to swap out, then the shrink folio
> list. It will put this page back to the LRU.
>
> The shrink folio list either free enough memory (happy path) or not
> able to free enough memory and it will cause an OOM kill.
>
> I believe previously XArray will also return -ENOMEM at insert a
> pointer and not be able to allocate a node to hold that pointer. It has
> the same error propagation path. We did not change that.
Yes, exactly. The overall behaviour is the same.
The allocation is only needed when a CPU's local swap cluster is
drained and the swap allocator needs a new cluster. But after the previous
patch [1], many swap devices will prefer the nonfull list, so the chance
that we need a swap table allocation is lower.
If it fails to allocate a swap table for a new cluster, it will try to
fall back to frag / reclaim-full. Only if all lists are drained may
folio_alloc_swap fail with -ENOMEM, and the caller (LRU shrink) will
either try to reclaim some other page or fail with OOM.
I think the fallback of nonfull / free / frag / reclaim-full might
even be helpful to avoid swapout failure under heavy pressure. I
don't have data for that though, but I did run many tests with heavy
pressure and didn't see any issue.
Link: https://lore.kernel.org/linux-mm/20250812-swap-scan-list-v3-0-6d73504d267b@kernel.org/
[1]
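To make that fallback order concrete, a simplified sketch of the control
flow (the helper names here are hypothetical, not the actual
mm/swapfile.c functions; only the list names follow the current cluster
lists):

    static bool cluster_alloc_sketch(struct swap_info_struct *si, int order,
                                     swp_entry_t *entry)
    {
            /*
             * Each step can fail either because the list is empty or because
             * the swap table for a freshly isolated cluster could not be
             * allocated; in that case, move on to the next list.
             */
            if (alloc_from_percpu_cluster(si, order, entry))        /* hypothetical */
                    return true;
            if (alloc_from_list(si, &si->nonfull_clusters[order], order, entry))
                    return true;
            if (alloc_from_list(si, &si->free_clusters, order, entry))
                    return true;
            if (alloc_from_list(si, &si->frag_clusters[order], order, entry))
                    return true;
            if (alloc_from_full_reclaim(si, order, entry))          /* hypothetical */
                    return true;
            /* All lists drained: folio_alloc_swap() returns -ENOMEM. */
            return false;
    }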
>
> Chris
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-09-02 10:10 ` David Hildenbrand
@ 2025-09-02 17:13 ` Kairui Song
2025-09-03 8:00 ` David Hildenbrand
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-09-02 17:13 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On Tue, Sep 2, 2025 at 6:13 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.25 21:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Always use swap_cache_get_folio for swap cache folio look up. The reason
> > we are not using it in all places is that it also updates the readahead
> > info, and some callsites want to avoid that.
> >
> > So decouple readahead update with swap cache lookup into a standalone
> > helper, let the caller call the readahead update helper if that's
> > needed. And convert all swap cache lookups to use swap_cache_get_folio.
> >
> > After this commit, there are only three special cases for accessing swap
> > cache space now: huge memory splitting, migration and shmem replacing,
> > because they need to lock the Xarray. Following commits will wrap their
> > accesses to the swap cache too with special helpers.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
>
>
>
> > +void swap_update_readahead(struct folio *folio,
> > + struct vm_area_struct *vma,
> > + unsigned long addr)
> > {
>
> Oh, one thing. Regarding recent const-correctness discussions, "folio"
> should probably be const here.
>
Not here, swap_update_readahead does folio_test_clear_readahead so...
I'll try add const to other places where I see the folio is const,
thanks for the info!
> --
> Cheers
>
> David / dhildenb
>
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-09-02 13:17 ` Chris Li
2025-09-02 16:57 ` Kairui Song
@ 2025-09-02 23:31 ` Barry Song
2025-09-03 2:13 ` Kairui Song
2025-09-03 12:35 ` Chris Li
1 sibling, 2 replies; 90+ messages in thread
From: Barry Song @ 2025-09-02 23:31 UTC (permalink / raw)
To: Chris Li
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Wed, Sep 3, 2025 at 1:17 AM Chris Li <chrisl@kernel.org> wrote:
>
> On Tue, Sep 2, 2025 at 4:15 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Now swap table is cluster based, which means free clusters can free its
> > > table since no one should modify it.
> > >
> > > There could be speculative readers, like swap cache look up, protect
> > > them by making them RCU safe. All swap table should be filled with null
> > > entries before free, so such readers will either see a NULL pointer or
> > > a null filled table being lazy freed.
> > >
> > > On allocation, allocate the table when a cluster is used by any order.
> > >
> >
> > Might be a silly question.
> >
> > Just curious—what happens if the allocation fails? Does the swap-out
> > operation also fail? We sometimes encounter strange issues when memory is
> > very limited, especially if the reclamation path itself needs to allocate
> > memory.
> >
> > Assume a case where we want to swap out a folio using clusterN. We then
> > attempt to swap out the following folios with the same clusterN. But if
> > the allocation of the swap_table keeps failing, what will happen?
>
> I think this is the same behavior as the XArray allocation node with no memory.
> The swap allocator will fail to isolate this cluster, it gets a NULL
> ci pointer as return value. The swap allocator will try other cluster
> lists, e.g. non_full, fragment etc.
What I’m actually concerned about is that we keep iterating on this
cluster. If we try others, that sounds good.
> If all of them fail, the folio_alloc_swap() will return -ENOMEM. Which
> will propagate back to the try to swap out, then the shrink folio
> list. It will put this page back to the LRU.
>
> The shrink folio list either free enough memory (happy path) or not
> able to free enough memory and it will cause an OOM kill.
>
> I believe previously XArray will also return -ENOMEM at insert a
> pointer and not be able to allocate a node to hold that pointer. It has
> the same error propagation path. We did not change that.
Yes, I agree there was an -ENOMEM, but the difference is that we
are allocating much larger now :-)
One option is to organize every 4 or 8 swap slots into a group for
allocating or freeing the swap table. This way, we avoid the worst
case where a single unfreed slot consumes a whole swap table, and
the allocation size also becomes smaller. However, it’s unclear
whether the memory savings justify the added complexity and effort.
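Purely as a sketch of that idea (the names and sizes below are made up,
it is not part of this series), the per-cluster table could be split
into independently allocated sub-tables:

    #define SWAP_SUBTABLE_SHIFT     3
    #define SWAP_SUBTABLE_SIZE      (1 << SWAP_SUBTABLE_SHIFT)     /* 8 slots */
    #define SWAP_NR_SUBTABLES       (SWAPFILE_CLUSTER / SWAP_SUBTABLE_SIZE)

    struct swap_subtable {
            atomic_long_t entries[SWAP_SUBTABLE_SIZE];
    };

    struct swap_cluster_table {
            /*
             * Allocated on first use and freed once all of its slots are
             * empty, so a single long-lived slot pins at most 8 entries
             * instead of a whole 512-entry table.
             */
            struct swap_subtable __rcu *sub[SWAP_NR_SUBTABLES];
    };

The trade-off would be one extra RCU-protected dereference per lookup
and 64 extra pointers per cluster.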
Anyway, I’m glad to see the current swap_table moving towards merge
and look forward to running it on various devices. This should help
us see if it causes any real issues.
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-09-02 11:58 ` Kairui Song
@ 2025-09-02 23:44 ` Barry Song
2025-09-03 2:12 ` Kairui Song
0 siblings, 1 reply; 90+ messages in thread
From: Barry Song @ 2025-09-02 23:44 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Sep 2, 2025 at 6:46 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > +
> > > +/*
> > > + * Helpers for accessing or modifying the swap table of a cluster,
> > > + * the swap cluster must be locked.
> > > + */
> > > +static inline void __swap_table_set(struct swap_cluster_info *ci,
> > > + unsigned int off, unsigned long swp_tb)
> > > +{
> > > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > > + atomic_long_set(&ci->table[off], swp_tb);
> > > +}
> > > +
> > > +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> > > + unsigned int off)
> > > +{
> > > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > > + return atomic_long_read(&ci->table[off]);
> > > +}
> > > +
> >
> > Why should this use atomic_long instead of just WRITE_ONCE and
> > READ_ONCE?
>
> Hi Barry,
>
> That's a very good question. There are multiple reasons: I wanted to
> wrap all access to the swap table to ensure there is no non-atomic
> access, since it's almost always wrong to read a folio or shadow value
> non-atomically from it. And users should never access swap tables
> directly without the wrapper helpers. And in another reply, as Chris
> suggested, we can use atomic operations to catch potential issues
> easily too.
I still find it odd that for writing we have the si_cluster lock,
but for reading a long, atomic operations don’t seem to provide
valid protection against anything. For example, you’re still
checking folio_lock and folio_test_swapcache() in such cases.
>
> And most importantly, later phases can make use of things like
> atomic_cmpxchg as a fast path to update the swap count of a swap
> entry. That's a bit hard to explain for now, short summary is the swap
> table will be using a single atomic for both count and folio tracking,
> and we'll clean up the folio workflow with swap, so it should be
> possible to get a final consistency of swap count by simply locking
> the folio, and doing atomic_cmpxchg on swap table with folio locked
> will be safe.
I’m still missing this part: if the long stores a folio pointer,
how could it further save the swap_count?
>
> For now using atomic doesn't bring any overhead or complexity, only
> make it easier to implement other code. So I think it should be good.
I guess it depends on the architecture. On some arches, it might
require irq_disable plus a spinlock.
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-09-02 23:44 ` Barry Song
@ 2025-09-03 2:12 ` Kairui Song
2025-09-03 2:31 ` Barry Song
0 siblings, 1 reply; 90+ messages in thread
From: Kairui Song @ 2025-09-03 2:12 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, LKML
Barry Song <21cnbao@gmail.com> 于 2025年9月3日周三 07:44写道:
>
> On Tue, Sep 2, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Tue, Sep 2, 2025 at 6:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > +
> > > > +/*
> > > > + * Helpers for accessing or modifying the swap table of a cluster,
> > > > + * the swap cluster must be locked.
> > > > + */
> > > > +static inline void __swap_table_set(struct swap_cluster_info *ci,
> > > > + unsigned int off, unsigned long swp_tb)
> > > > +{
> > > > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > > > + atomic_long_set(&ci->table[off], swp_tb);
> > > > +}
> > > > +
> > > > +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> > > > + unsigned int off)
> > > > +{
> > > > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > > > + return atomic_long_read(&ci->table[off]);
> > > > +}
> > > > +
> > >
> > > Why should this use atomic_long instead of just WRITE_ONCE and
> > > READ_ONCE?
> >
> > Hi Barry,
> >
> > That's a very good question. There are multiple reasons: I wanted to
> > wrap all access to the swap table to ensure there is no non-atomic
> > access, since it's almost always wrong to read a folio or shadow value
> > non-atomically from it. And users should never access swap tables
> > directly without the wrapper helpers. And in another reply, as Chris
> > suggested, we can use atomic operations to catch potential issues
> > easily too.
>
> I still find it odd that for writing we have the si_cluster lock,
> but for reading a long, atomic operations don’t seem to provide
> valid protection against anything. For example, you’re still
> checking folio_lock and folio_test_swapcache() in such cases.
>
>
> >
> > And most importantly, later phases can make use of things like
> > atomic_cmpxchg as a fast path to update the swap count of a swap
> > entry. That's a bit hard to explain for now, short summary is the swap
> > table will be using a single atomic for both count and folio tracking,
> > and we'll clean up the folio workflow with swap, so it should be
> > possible to get an final consistency of swap count by simply locking
> > the folio, and doing atomic_cmpxchg on swap table with folio locked
> > will be safe.
>
> I’m still missing this part: if the long stores a folio pointer,
> how could it further save the swap_count?
We use the PFN here; it works very well, saves more memory, and the
performance is very good, tested using the 28-patch series which has
already implemented this:
https://lore.kernel.org/linux-mm/20250514201729.48420-25-ryncsn@gmail.com/
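A hedged sketch of that packing, just to show the direction (the bit
layout and helper names below are invented for illustration, not the
ones the later series uses):

    #define SWP_TB_COUNT_BITS       8
    #define SWP_TB_COUNT_MASK       ((1UL << SWP_TB_COUNT_BITS) - 1)

    static inline unsigned long swp_tb_pack(unsigned long pfn, unsigned int count)
    {
            return (pfn << SWP_TB_COUNT_BITS) | (count & SWP_TB_COUNT_MASK);
    }

    static inline struct folio *swp_tb_folio(unsigned long swp_tb)
    {
            return pfn_folio(swp_tb >> SWP_TB_COUNT_BITS);
    }

    static inline bool swp_tb_try_inc_count(atomic_long_t *slot)
    {
            unsigned long old = atomic_long_read(slot);

            /*
             * With the folio locked, the PFN half cannot change under us,
             * so one cmpxchg is enough to bump the count without taking
             * the cluster lock. A saturated count falls back to a slow path.
             */
            if ((old & SWP_TB_COUNT_MASK) == SWP_TB_COUNT_MASK)
                    return false;
            return atomic_long_cmpxchg(slot, old, old + 1) == old;
    }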
>
> >
> > For now using atomic doesn't bring any overhead or complexity, only
> > make it easier to implement other code. So I think it should be good.
>
> I guess it depends on the architecture. On some arches, it might
> require irq_disable plus a spinlock.
If an arch can't provide atomic access to a plain long, then that
justifies the usage of atomics here even more. The read has to be
atomic since swap cache lookup is lockless, so the write should be
atomic too.
Xchg / cmpxchg is a bit more complex on some arches, but they are optional
for the swap table anyway. We can use them only on arches that provide
better performance with atomics. I believe most arches do. As for the xchg
debug check, it can be dropped once we are confident enough that there
is no hidden bug.
>
> Thanks
> Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-09-02 23:31 ` Barry Song
@ 2025-09-03 2:13 ` Kairui Song
2025-09-03 12:35 ` Chris Li
1 sibling, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-09-03 2:13 UTC (permalink / raw)
To: Barry Song
Cc: Chris Li, linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, LKML
Barry Song <21cnbao@gmail.com> 于 2025年9月3日周三 08:03写道:
>
> On Wed, Sep 3, 2025 at 1:17 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Sep 2, 2025 at 4:15 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > Now swap table is cluster based, which means free clusters can free its
> > > > table since no one should modify it.
> > > >
> > > > There could be speculative readers, like swap cache look up, protect
> > > > them by making them RCU safe. All swap table should be filled with null
> > > > entries before free, so such readers will either see a NULL pointer or
> > > > a null filled table being lazy freed.
> > > >
> > > > On allocation, allocate the table when a cluster is used by any order.
> > > >
> > >
> > > Might be a silly question.
> > >
> > > Just curious—what happens if the allocation fails? Does the swap-out
> > > operation also fail? We sometimes encounter strange issues when memory is
> > > very limited, especially if the reclamation path itself needs to allocate
> > > memory.
> > >
> > > Assume a case where we want to swap out a folio using clusterN. We then
> > > attempt to swap out the following folios with the same clusterN. But if
> > > the allocation of the swap_table keeps failing, what will happen?
> >
> > I think this is the same behavior as the XArray allocation node with no memory.
> > The swap allocator will fail to isolate this cluster, it gets a NULL
> > ci pointer as return value. The swap allocator will try other cluster
> > lists, e.g. non_full, fragment etc.
>
> What I’m actually concerned about is that we keep iterating on this
> cluster. If we try others, that sounds good.
>
> > If all of them fail, the folio_alloc_swap() will return -ENOMEM. Which
> > will propagate back to the try to swap out, then the shrink folio
> > list. It will put this page back to the LRU.
> >
> > The shrink folio list either free enough memory (happy path) or not
> > able to free enough memory and it will cause an OOM kill.
> >
> > I believe previously XArray will also return -ENOMEM at insert a
> > pointer and not be able to allocate a node to hold that pointer. It has
> > the same error propagation path. We did not change that.
>
> Yes, I agree there was an -ENOMEM, but the difference is that we
> are allocating much larger now :-)
>
> One option is to organize every 4 or 8 swap slots into a group for
> allocating or freeing the swap table. This way, we avoid the worst
> case where a single unfreed slot consumes a whole swap table, and
> the allocation size also becomes smaller. However, it’s unclear
> whether the memory savings justify the added complexity and effort.
>
> Anyway, I’m glad to see the current swap_table moving towards merge
> and look forward to running it on various devices. This should help
> us see if it causes any real issues.
Thanks for the insightful review.
I do plan to implement a shrinker to compact the swap table of idle /
full clusters when under pressure. It will be done at the very end.
Things will be much cleaner by then, so it will be easier to do. And
currently the memory usage already seems quite good.
>>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-09-03 2:12 ` Kairui Song
@ 2025-09-03 2:31 ` Barry Song
0 siblings, 0 replies; 90+ messages in thread
From: Barry Song @ 2025-09-03 2:31 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, David Hildenbrand, Yosry Ahmed, Lorenzo Stoakes,
Zi Yan, LKML
On Wed, Sep 3, 2025 at 2:12 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> Barry Song <21cnbao@gmail.com> 于 2025年9月3日周三 07:44写道:
> >
> > On Tue, Sep 2, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Tue, Sep 2, 2025 at 6:46 PM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > > +
> > > > > +/*
> > > > > + * Helpers for accessing or modifying the swap table of a cluster,
> > > > > + * the swap cluster must be locked.
> > > > > + */
> > > > > +static inline void __swap_table_set(struct swap_cluster_info *ci,
> > > > > + unsigned int off, unsigned long swp_tb)
> > > > > +{
> > > > > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > > > > + atomic_long_set(&ci->table[off], swp_tb);
> > > > > +}
> > > > > +
> > > > > +static inline unsigned long __swap_table_get(struct swap_cluster_info *ci,
> > > > > + unsigned int off)
> > > > > +{
> > > > > + VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
> > > > > + return atomic_long_read(&ci->table[off]);
> > > > > +}
> > > > > +
> > > >
> > > > Why should this use atomic_long instead of just WRITE_ONCE and
> > > > READ_ONCE?
> > >
> > > Hi Barry,
> > >
> > > That's a very good question. There are multiple reasons: I wanted to
> > > wrap all access to the swap table to ensure there is no non-atomic
> > > access, since it's almost always wrong to read a folio or shadow value
> > > non-atomically from it. And users should never access swap tables
> > > directly without the wrapper helpers. And in another reply, as Chris
> > > suggested, we can use atomic operations to catch potential issues
> > > easily too.
> >
> > I still find it odd that for writing we have the si_cluster lock,
> > but for reading a long, atomic operations don’t seem to provide
> > valid protection against anything. For example, you’re still
> > checking folio_lock and folio_test_swapcache() in such cases.
> >
> >
> > >
> > > And most importantly, later phases can make use of things like
> > > atomic_cmpxchg as a fast path to update the swap count of a swap
> > > entry. That's a bit hard to explain for now, short summary is the swap
> > > table will be using a single atomic for both count and folio tracking,
> > > and we'll clean up the folio workflow with swap, so it should be
> > > possible to get a final consistency of swap count by simply locking
> > > the folio, and doing atomic_cmpxchg on swap table with folio locked
> > > will be safe.
> >
> > I’m still missing this part: if the long stores a folio pointer,
> > how could it further save the swap_count?
>
> We use PFN here, it works very well, saves more memory and the
> performance is very good, tested using the 28 series patch which have
> already implemented this:
> https://lore.kernel.org/linux-mm/20250514201729.48420-25-ryncsn@gmail.com/
Alright, I see. With the PFN, we already have the folio.
>
> >
> > >
> > > For now using atomic doesn't bring any overhead or complexity, only
> > > make it easier to implement other code. So I think it should be good.
> >
> > I guess it depends on the architecture. On some arches, it might
> > require irq_disable plus a spinlock.
>
> If an arch can't provide atomic for basic access to a long, then that
> justified the usage of atomic here even more.. The read has to be
> atomic since swap cache lookup is lockless, so the write should be
> atomic too.
I actually confused atomic64 with atomic_long. After double-checking, I
found that on almost all architectures, atomic_long_set/read are effectively
WRITE_ONCE and READ_ONCE. However, many architectures override these in the
common header file. This seems like a spot worth cleaning up for those
architectures.
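Roughly what the generic fallback boils down to on most 64-bit
architectures (a simplified sketch; the real definitions are spread
across asm-generic and the per-arch headers):

    static __always_inline long sketch_atomic_long_read(const atomic_long_t *v)
    {
            /* Effectively a READ_ONCE of the wrapped counter. */
            return READ_ONCE(v->counter);
    }

    static __always_inline void sketch_atomic_long_set(atomic_long_t *v, long i)
    {
            /* Effectively a WRITE_ONCE of the wrapped counter. */
            WRITE_ONCE(v->counter, i);
    }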
>
> Xchg / cmpxchg is a bit more complex on some arches, they are optional
> in the swap table anyway. We can use them only on arches that provide
> better performance with atomic. I believe most arches do. For the xchg
> debug check, it can be dropped once we are confident enough that there
> is no hidden bug.
Thanks
Barry
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 1/9] mm, swap: use unified helper for swap cache look up
2025-09-02 17:13 ` Kairui Song
@ 2025-09-03 8:00 ` David Hildenbrand
0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-03 8:00 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 02.09.25 19:13, Kairui Song wrote:
> On Tue, Sep 2, 2025 at 6:13 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 22.08.25 21:20, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> Always use swap_cache_get_folio for swap cache folio look up. The reason
>>> we are not using it in all places is that it also updates the readahead
>>> info, and some callsites want to avoid that.
>>>
>>> So decouple readahead update with swap cache lookup into a standalone
>>> helper, let the caller call the readahead update helper if that's
>>> needed. And convert all swap cache lookups to use swap_cache_get_folio.
>>>
>>> After this commit, there are only three special cases for accessing swap
>>> cache space now: huge memory splitting, migration and shmem replacing,
>>> because they need to lock the Xarray. Following commits will wrap their
>>> accesses to the swap cache too with special helpers.
>>>
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>
>>
>>
>>> +void swap_update_readahead(struct folio *folio,
>>> + struct vm_area_struct *vma,
>>> + unsigned long addr)
>>> {
>>
>> Oh, one thing. Regarding recent const-correctness discussions, "folio"
>> should probably be const here.
>>
>
> Not here, swap_update_readahead does folio_test_clear_readahead so...
>
Ah, makes sense!
> I'll try add const to other places where I see the folio is const,
> thanks for the info!
Thanks!
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-09-02 15:03 ` Kairui Song
@ 2025-09-03 8:11 ` David Hildenbrand
0 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-03 8:11 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 02.09.25 17:03, Kairui Song wrote:
> On Tue, Sep 2, 2025 at 10:14 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 22.08.25 21:20, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> swp_swap_info is the most commonly used helper for retrieving swap info.
>>> It has an internal check that may lead to a NULL return value, but
>>> almost none of its caller checks the return value, making the internal
>>> check pointless. In fact, most of these callers already ensured the
>>> entry is valid and never expect a NULL value.
>>>
>>> Tidy this up and shorten the name.
>>
>> Shorter != better. But yes, "swp_swap" was a mess.
>>
>>> If the caller can make sure the
>>> swap entry/type is valid and the device is pinned, use the new introduced
>>> swp_info/swp_type_info instead. They have more debug sanity checks and
>>> lower overhead as they are inlined.
>>>
>>> Callers that may expect a NULL value should use
>>> swp_get_info/swp_type_get_info instead.
>>
>> High-level comments:
>>
>> 1) I hate the "swp" vs. "swap". Is that a valuable distinction or could
>> we just convert it to "swap" as we touch it?
>
> Totally agree. I was just blindly following the old style. It's kind
> of confusing indeed.
... and not a lot of space saved :)
>
>>
>> You're converting swap_type_to_swap_info() to swp_type_to_swap_info(),
>> and I am not sure if that is the right direction :)
>>
>>
>> 2) Can we just call it "swap_entry" when we work on a swap entry and
>> "swap_type" when we work on a swap type in the function name?
>>
>> swp_info() is a rather bad function name.
>>
>>
>> 3) I am not sure about "to" -> "get". "to" is much more readable in that
>> context and consistent.
>>
>>
>> 4) swp_info[] vs. swap_info() gah.
>>
>>
>> I would just have done:
>>
>> swap_type_to_info(int type)
>> __swap_type_to_info(int type)
>> swap_entry_to_info(swp_entry_t entry)
>> __swap_entry_to_info(swp_entry_t entry)
>>
>> __ are the expert functions where we don't expect NULL.
>>
>
> Thanks a lot for the suggestions! I also like the idea of using "__"
> to separate the non-NULL version a lot, and it implies the caller has to be
> careful.
Right, it's the "pro" version :)
>
> My concern was that names will be getting very long in later commits
> following this convention. Which is also the reason I want to shorten
> them here.
>
> A lot of swap-related operations will be cluster based, so it will be
> very common to get offset or the swap cluster from a swap entry.
> We will end up having a really long name like
> __swap_entry_to_cluster_offset (convert swap entry to offset inside a
> cluster).
That's a perfectly fine length though :)
>
> Since we already have the swap entry type called `swp_entry_t` and
> helpers like `swp_offset` and `swp_swap_info` that convert an entry to
> other swap things, so I thought that anything converts swap entry /
> offset to others are named `swp_*`.
Yeah, I think that's just bad historical baggage we should clean up at
some point.
>
> Maybe a bad practice here, we can fix it while at it, or at least no
> longer introduce more confusing names.
>
> I can follow this suggested style, will it be a good idea if we have
> following set of helpers?
>
> For swap cluster and swap device (swap_info_struct):
> swap_type_to_info(int)
> __swap_type_to_info(int)
> swap_entry_to_info(swp_entry_t)
> __swap_entry_to_info(swp_entry_t)
> __swap_offset_to_cluster(struct swap_info_struct *, pgoff_t)
> __swap_entry_to_cluster(swp_entry_t)
Looks great to me, but let's hear other opinions.
>
> And for offsets, we still use:
> swp_offset() (Existing helper)
Yeah, there's also "swp_type" and "swp_offset_pfn". They really only
extract basic properties of the entry, so they are a bit special.
I think we should call them "swap_entry_offset", "swap_entry_type" and
"swap_entry_pfn".
Now, that's not something I would expect in your series.
> swp_cluster_offset()
That one could later become swap_entry_cluster_offset()
>
> Now all swp_* helpers are pure arithmetic operations (we just renamed
> swp_swap_info which seems the only exception). Is this better?
I'm already happy once we name+document the new functions properly.
I could probably live with "swp_cluster_offset" for the time being :)
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio
2025-08-22 19:20 ` [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
2025-08-25 3:02 ` Baolin Wang
@ 2025-09-03 8:25 ` David Hildenbrand
1 sibling, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-03 8:25 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Shmem may replace a folio in the swap cache if the cached one doesn't
> fit the swapin's GFP zone. When doing so, shmem has already double
> checked that the swap cache folio is locked, still has the swap cache
> flag set, and contains the wanted swap entry. So it is impossible to
> fail due to an Xarray mismatch. There is even a comment for that.
>
> Delete the defensive error handling path, and add a WARN_ON instead:
> if that happened, something has broken the basic principle of how the
> swap cache works, we should catch and fix that.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
Sounds sensible to me.
With the "!error" code not deleted
Reviewed-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers
2025-08-27 17:44 ` Chris Li
2025-08-27 23:46 ` Baoquan He
2025-09-02 6:01 ` Barry Song
@ 2025-09-03 9:28 ` David Hildenbrand
2 siblings, 0 replies; 90+ messages in thread
From: David Hildenbrand @ 2025-09-03 9:28 UTC (permalink / raw)
To: Chris Li, Baoquan He
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Barry Song, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 27.08.25 19:44, Chris Li wrote:
> On Tue, Aug 26, 2025 at 8:47 PM Baoquan He <bhe@redhat.com> wrote:
>>
>> On 08/23/25 at 03:20am, Kairui Song wrote:
>> ......
>>> diff --git a/mm/swap.h b/mm/swap.h
>>> index 223b40f2d37e..7b3efaa51624 100644
>>> --- a/mm/swap.h
>>> +++ b/mm/swap.h
>>> @@ -15,6 +15,8 @@ extern int page_cluster;
>>> #define swap_entry_order(order) 0
>>> #endif
>>>
>>> +extern struct swap_info_struct *swap_info[];
>>> +
>>> /*
>>> * We use this to track usage of a cluster. A cluster is a block of swap disk
>>> * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
>>> @@ -53,9 +55,28 @@ enum swap_cluster_flags {
>>> #include <linux/swapops.h> /* for swp_offset */
>>> #include <linux/blk_types.h> /* for bio_end_io_t */
>>>
>>> +/*
>>> + * Callers of all swp_* helpers here must ensure the entry is valid, and
>>> + * pin the swap device by reference or in other ways.
>>> + */
>>> +static inline struct swap_info_struct *swp_type_info(int type)
>>> +{
>>> + struct swap_info_struct *si;
>>> +
>>> + si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
>>> + VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
>>> + return si;
>>> +}
>>> +
>>> +static inline struct swap_info_struct *swp_info(swp_entry_t entry)
>>> +{
>>> + return swp_type_info(swp_type(entry));
>>> +}
>>
>> swp_type_info() is only used by swp_info() in the whole series, can we
>> open code it in swp_info()?
>
> BTW, off topic here. I really don't like the "_info" suffix. Anything
> you can put into a C struct by definition is some kind of information.
I guess we use "info" when we just have to have some metadata and we
don't really find a better description / abstraction.
So I don't completely hate the "_info" suffix here.
> Same to the _struct. Anything defined by a struct is a struct. Don't
> need to say that.
Yeah, at some point people thought it would be a good idea to do that
(mm_struct, vm_area_struct).
> The "struct swap_info_struct" gets two of the unnecessary words. It
> should be something like "struct swap_file" or "struct swap_device".
> Renaming it is too invasive to the code base and it will mess up the
> git annotation history.
I wouldn't be scared about doing something like that that actually
improves the code.
You'd likely have to find an abstraction for "swap_file" and
"swap_device", and maybe that was the challenge back then.
swap_info_struct has a comment above it "in-memory structure used to
track swap areas". So naturally I would just have called this "swap_area".
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
` (2 preceding siblings ...)
2025-09-02 9:55 ` Barry Song
@ 2025-09-03 11:41 ` David Hildenbrand
2025-09-03 12:54 ` Kairui Song
3 siblings, 1 reply; 90+ messages in thread
From: David Hildenbrand @ 2025-09-03 11:41 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li, Barry Song,
Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang, Ying Huang,
Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On 22.08.25 21:20, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> Introduce basic swap table infrastructures, which are now just a
> fixed-sized flat array inside each swap cluster, with access wrappers.
>
> Each cluster contains a swap table of 512 entries. Each table entry is
> an opaque atomic long. It can be one of three types: a shadow type (XA_VALUE),
> a folio type (pointer), or NULL.
>
> In this first step, it only supports storing a folio or shadow, and it
> is a drop-in replacement for the current swap cache. Convert all swap
> cache users to use the new sets of APIs. Chris Li has been suggesting
> using a new infrastructure for swap cache for better performance, and
> that idea combined well with the swap table as the new backing
> structure. Now the lock contention range is reduced to 2M clusters,
> which is much smaller than the 64M address_space. And we can also drop
> the multiple address_space design.
>
> All the internal work is done with swap_cache_get_* helpers. Swap
> cache lookup is still lock-less like before, and the helper's contexts
> are same with original swap cache helpers. They still require a pin
> on the swap device to prevent the backing data from being freed.
>
> Swap cache updates are now protected by the swap cluster lock
> instead of the Xarray lock. This is mostly handled internally, but new
> __swap_cache_* helpers require the caller to lock the cluster. So, a
> few new cluster access and locking helpers are also introduced.
>
> A fully cluster-based unified swap table can be implemented on top
> of this to take care of all count tracking and synchronization work,
> with dynamic allocation. It should reduce the memory usage while
> making the performance even better.
>
> Co-developed-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
[...]
> @@ -4504,7 +4504,7 @@ static void filemap_cachestat(struct address_space *mapping,
> * invalidation, so there might not be
> * a shadow in the swapcache (yet).
> */
> - shadow = get_shadow_from_swap_cache(swp);
> + shadow = swap_cache_get_shadow(swp);
> if (!shadow)
> goto resched;
> }
This looks like a cleanup that can be performed separately upfront to
make this patch smaller.
Same applies to delete_from_swap_cache->swap_cache_del_folio
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2a47cd3bb649..209580d395a1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3721,7 +3721,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> /* Prevent deferred_split_scan() touching ->_refcount */
> spin_lock(&ds_queue->split_queue_lock);
> if (folio_ref_freeze(folio, 1 + extra_pins)) {
> - struct address_space *swap_cache = NULL;
> + struct swap_cluster_info *swp_ci = NULL;
I'm wondering if we could also perform this change upfront, so we can ...
> struct lruvec *lruvec;
> int expected_refs;
>
> @@ -3765,8 +3765,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> goto fail;
> }
>
> - swap_cache = swap_address_space(folio->swap);
> - xa_lock(&swap_cache->i_pages);
> + swp_ci = swap_cluster_lock_by_folio(folio);
... perform these cleanups outside of the main patch. Just a thought,
because this patch is rather big and touches quite a lot of code (hard to
review).
--
Cheers
David / dhildenb
^ permalink raw reply [flat|nested] 90+ messages in thread
* Re: [PATCH 8/9] mm, swap: implement dynamic allocation of swap table
2025-09-02 23:31 ` Barry Song
2025-09-03 2:13 ` Kairui Song
@ 2025-09-03 12:35 ` Chris Li
1 sibling, 0 replies; 90+ messages in thread
From: Chris Li @ 2025-09-03 12:35 UTC (permalink / raw)
To: Barry Song
Cc: Kairui Song, linux-mm, Andrew Morton, Matthew Wilcox,
Hugh Dickins, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, David Hildenbrand, Yosry Ahmed,
Lorenzo Stoakes, Zi Yan, linux-kernel
On Tue, Sep 2, 2025 at 4:31 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Sep 3, 2025 at 1:17 AM Chris Li <chrisl@kernel.org> wrote:
> >
> > On Tue, Sep 2, 2025 at 4:15 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > On Sat, Aug 23, 2025 at 3:21 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > Now swap table is cluster based, which means free clusters can free its
> > > > table since no one should modify it.
> > > >
> > > > There could be speculative readers, like swap cache look up, protect
> > > > them by making them RCU safe. All swap table should be filled with null
> > > > entries before free, so such readers will either see a NULL pointer or
> > > > a null filled table being lazy freed.
> > > >
> > > > On allocation, allocate the table when a cluster is used by any order.
> > > >
> > >
> > > Might be a silly question.
> > >
> > > Just curious—what happens if the allocation fails? Does the swap-out
> > > operation also fail? We sometimes encounter strange issues when memory is
> > > very limited, especially if the reclamation path itself needs to allocate
> > > memory.
> > >
> > > Assume a case where we want to swap out a folio using clusterN. We then
> > > attempt to swap out the following folios with the same clusterN. But if
> > > the allocation of the swap_table keeps failing, what will happen?
> >
> > I think this is the same behavior as the XArray failing to allocate a
> > node when there is no memory. The swap allocator will fail to isolate
> > this cluster and get a NULL ci pointer as the return value. The swap
> > allocator will then try the other cluster lists, e.g. non_full,
> > fragment, etc.
>
> What I’m actually concerned about is that we keep iterating on this
> cluster. If we try others, that sounds good.
No, isolating the current cluster removes it from the head of the list
and eventually puts it back at the tail of the appropriate list, so we
will not keep iterating on the same cluster. Otherwise, trying to
allocate a high-order swap entry would also dead-loop on the first
cluster whenever it failed to allocate swap entries.
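To make the rotation concrete, here is a minimal sketch of the behavior
described above. It is not the actual allocator code:
cluster_alloc_swap_table() is a hypothetical name for the table
allocation, locking is omitted, and it assumes ci->list links the
cluster on its current list:

#include <linux/list.h>
#include <linux/swap.h>

static struct swap_cluster_info *
isolate_cluster_for_alloc(struct list_head *cluster_list)
{
	struct swap_cluster_info *ci;

	ci = list_first_entry_or_null(cluster_list,
				      struct swap_cluster_info, list);
	if (!ci)
		return NULL;

	/* Detach from the head so a retry never spins on this cluster. */
	list_del_init(&ci->list);

	/* Hypothetical helper: returns true once the table is allocated. */
	if (!cluster_alloc_swap_table(ci)) {
		/*
		 * No memory for the table: park the cluster at the tail and
		 * let the caller fall back to the next list (non_full,
		 * fragment, ...).
		 */
		list_add_tail(&ci->list, cluster_list);
		return NULL;
	}

	/* The cluster is ready; swap entries can be allocated from it. */
	return ci;
}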
>
> > If all of them fail, folio_alloc_swap() will return -ENOMEM, which
> > will propagate back to the swap-out attempt and then to the shrink
> > folio list, which will put this page back on the LRU.
> >
> > The shrink folio list either frees enough memory (happy path) or is
> > unable to free enough memory, in which case it will cause an OOM kill.
> >
> > I believe the XArray would previously also return -ENOMEM when
> > inserting a pointer and failing to allocate a node to hold that
> > pointer. It has the same error propagation path. We did not change that.
>
> Yes, I agree there was an -ENOMEM, but the difference is that we
> are allocating much larger now :-)
Even that is not 100% true. The XArray uses a kmem_cache. Most of the
time a node is allocated from a page already cached by the kmem_cache,
without hitting the system page allocator. When the kmem_cache runs out
of its current cached page, it will allocate from the system via the
page allocator, at least a page at a time.
So from the page allocator's point of view, the swap table allocation is
not bigger either.
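As a back-of-the-envelope check (a sketch assuming the common 64-bit,
4K page, 2M cluster configuration from the cover letter; the names below
are illustrative, not the in-tree macros), the per-cluster table is
exactly one page:

#include <linux/atomic.h>
#include <linux/build_bug.h>
#include <linux/mm.h>		/* PAGE_SIZE */

/* 2M cluster / 4K page = 512 slots, one atomic_long_t table entry each. */
#define SLOTS_PER_CLUSTER	512

static_assert(SLOTS_PER_CLUSTER * sizeof(atomic_long_t) == PAGE_SIZE,
	      "one cluster's swap table fits exactly in a single 4K page");

That single-page case is also what patch 9/9 takes advantage of.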
> One option is to organize every 4 or 8 swap slots into a group for
> allocating or freeing the swap table. This way, we avoid the worst
> case where a single unfreed slot consumes a whole swap table, and
> the allocation size also becomes smaller. However, it’s unclear
> whether the memory savings justify the added complexity and effort.
Keep in mind that the XArray has this fragmentation issue as well.
When a 64-pointer node is freed, it returns to the kmem_cache as free
area in the cached page. Only when every object in that page is free
can the page return to the page allocator. The difference is that the
unused area sitting in the swap table can be reused immediately, while
an unused XArray node sits in the kmem_cache and needs an extra
kmem_cache_alloc before it can be used in the XArray again.
There is also a subtle difference in that all XArrays share the same
kmem_cache pool; there is no dedicated kmem_cache pool for swap. A
swap node might be mixed with other XArray nodes, making it even
harder to release the underlying page. The swap table uses a page
directly, so it does not have this issue. If you have a swing of batch
jobs causing a lot of swap, those swap entries will be freed when the
jobs are done and the swap table can return those pages. But the
XArray might not be able to release as many pages because of the mixed
usage; it depends on what other XArray nodes were allocated during the
swap usage.
I guess that is too much detail.
>
> Anyway, I’m glad to see the current swap_table moving towards merge
> and look forward to running it on various devices. This should help
> us see if it causes any real issues.
Agree.
Chris
* Re: [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API
2025-09-03 11:41 ` David Hildenbrand
@ 2025-09-03 12:54 ` Kairui Song
0 siblings, 0 replies; 90+ messages in thread
From: Kairui Song @ 2025-09-03 12:54 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, Andrew Morton, Matthew Wilcox, Hugh Dickins, Chris Li,
Barry Song, Baoquan He, Nhat Pham, Kemeng Shi, Baolin Wang,
Ying Huang, Johannes Weiner, Yosry Ahmed, Lorenzo Stoakes, Zi Yan,
linux-kernel
On Wed, Sep 3, 2025 at 7:44 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 22.08.25 21:20, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Introduce the basic swap table infrastructure, which for now is just a
> > fixed-size flat array inside each swap cluster, with access wrappers.
> >
> > Each cluster contains a swap table of 512 entries. Each table entry is
> > an opaque atomic long. It can be one of 3 types: a shadow type (XA_VALUE),
> > a folio type (pointer), or NULL.
> >
> > In this first step, it only supports storing a folio or shadow, and it
> > is a drop-in replacement for the current swap cache. Convert all swap
> > cache users to use the new sets of APIs. Chris Li has been suggesting
> > using a new infrastructure for swap cache for better performance, and
> > that idea combined well with the swap table as the new backing
> > structure. Now the lock contention range is reduced to 2M clusters,
> > which is much smaller than the 64M address_space. And we can also drop
> > the multiple address_space design.
> >
> > All the internal work is done with the swap_cache_get_* helpers. Swap
> > cache lookup is still lock-less as before, and the helpers' context
> > requirements are the same as the original swap cache helpers': they
> > still require a pin on the swap device to prevent the backing data
> > from being freed.
> >
> > Swap cache updates are now protected by the swap cluster lock
> > instead of the Xarray lock. This is mostly handled internally, but new
> > __swap_cache_* helpers require the caller to lock the cluster. So, a
> > few new cluster access and locking helpers are also introduced.
> >
> > A fully cluster-based unified swap table can be implemented on top
> > of this to take care of all count tracking and synchronization work,
> > with dynamic allocation. It should reduce the memory usage while
> > making the performance even better.
> >
> > Co-developed-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
>
> [...]
>
> > @@ -4504,7 +4504,7 @@ static void filemap_cachestat(struct address_space *mapping,
> > * invalidation, so there might not be
> > * a shadow in the swapcache (yet).
> > */
> > - shadow = get_shadow_from_swap_cache(swp);
> > + shadow = swap_cache_get_shadow(swp);
> > if (!shadow)
> > goto resched;
> > }
>
> This looks like a cleanup that can be performed separately upfront to
> make this patch smaller.
>
> Same applies to delete_from_swap_cache->swap_cache_del_folio
I can have a patch to rename and add kernel doc / comments in swap.h
for a few helpers like this one. That will make this patch a bit
smaller.
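For example, something along these lines for the lookup helper above;
just a sketch of the shape and wording, not the final doc:

/**
 * swap_cache_get_shadow - look up the shadow of a swapped-out page
 * @entry: swap entry to look up
 *
 * Returns the workingset shadow value stored for @entry, or NULL if the
 * slot currently caches a folio or holds nothing.
 *
 * Context: the caller must hold a reference on the swap device of @entry
 * so the backing swap table cannot be freed under it.
 */
void *swap_cache_get_shadow(swp_entry_t entry);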
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 2a47cd3bb649..209580d395a1 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3721,7 +3721,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> > /* Prevent deferred_split_scan() touching ->_refcount */
> > spin_lock(&ds_queue->split_queue_lock);
> > if (folio_ref_freeze(folio, 1 + extra_pins)) {
> > - struct address_space *swap_cache = NULL;
> > + struct swap_cluster_info *swp_ci = NULL;
>
> I'm wondering if we could also perform this change upfront, so we can ...
This one doesn't seem very doable on its own, since the cluster idea
wasn't used outside of swap before this patch.
>
> > struct lruvec *lruvec;
> > int expected_refs;
> >
> > @@ -3765,8 +3765,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> > goto fail;
> > }
> >
> > - swap_cache = swap_address_space(folio->swap);
> > - xa_lock(&swap_cache->i_pages);
> > + swp_ci = swap_cluster_lock_by_folio(folio);
>
> ... perform these cleanups outside of the main patch. Just a thought.
>
>
> Because this patch is rather big and touches quite some code (hard to
> review)
Thanks for the review!
>
> --
> Cheers
>
> David / dhildenb
>
>
end of thread, other threads: [~2025-09-03 12:55 UTC | newest]
Thread overview: 90+ messages
-- links below jump to the message on this page --
2025-08-22 19:20 [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Kairui Song
2025-08-22 19:20 ` [PATCH 1/9] mm, swap: use unified helper for swap cache look up Kairui Song
2025-08-27 2:47 ` Chris Li
2025-08-27 3:50 ` Chris Li
2025-08-27 13:45 ` Kairui Song
2025-08-27 3:52 ` Baoquan He
2025-08-27 13:46 ` Kairui Song
2025-08-28 3:20 ` Baolin Wang
2025-09-01 23:50 ` Barry Song
2025-09-02 6:12 ` Kairui Song
2025-09-02 6:52 ` Chris Li
2025-09-02 10:06 ` David Hildenbrand
2025-09-02 12:32 ` Chris Li
2025-09-02 13:18 ` David Hildenbrand
2025-09-02 16:38 ` Kairui Song
2025-09-02 10:10 ` David Hildenbrand
2025-09-02 17:13 ` Kairui Song
2025-09-03 8:00 ` David Hildenbrand
2025-08-22 19:20 ` [PATCH 2/9] mm, swap: always lock and check the swap cache folio before use Kairui Song
2025-08-27 6:13 ` Chris Li
2025-08-27 13:44 ` Kairui Song
2025-08-30 1:42 ` Chris Li
2025-08-27 7:03 ` Chris Li
2025-08-27 14:35 ` Kairui Song
2025-08-28 3:41 ` Baolin Wang
2025-08-28 18:05 ` Kairui Song
2025-08-30 1:53 ` Chris Li
2025-08-30 15:15 ` Kairui Song
2025-08-30 17:17 ` Chris Li
2025-09-01 18:17 ` Kairui Song
2025-09-01 21:10 ` Chris Li
2025-09-02 5:40 ` Barry Song
2025-09-02 10:18 ` David Hildenbrand
2025-09-02 10:21 ` David Hildenbrand
2025-09-02 12:46 ` Chris Li
2025-09-02 13:27 ` Kairui Song
2025-08-22 19:20 ` [PATCH 3/9] mm, swap: rename and move some swap cluster definition and helpers Kairui Song
2025-08-30 2:31 ` Chris Li
2025-09-02 5:53 ` Barry Song
2025-09-02 10:20 ` David Hildenbrand
2025-09-02 12:50 ` Chris Li
2025-08-22 19:20 ` [PATCH 4/9] mm, swap: tidy up swap device and cluster info helpers Kairui Song
2025-08-27 3:47 ` Baoquan He
2025-08-27 17:44 ` Chris Li
2025-08-27 23:46 ` Baoquan He
2025-08-30 2:38 ` Chris Li
2025-09-02 6:01 ` Barry Song
2025-09-03 9:28 ` David Hildenbrand
2025-09-02 6:02 ` Barry Song
2025-09-02 13:33 ` David Hildenbrand
2025-09-02 15:03 ` Kairui Song
2025-09-03 8:11 ` David Hildenbrand
2025-08-22 19:20 ` [PATCH 5/9] mm/shmem, swap: remove redundant error handling for replacing folio Kairui Song
2025-08-25 3:02 ` Baolin Wang
2025-08-25 9:45 ` Kairui Song
2025-08-30 2:41 ` Chris Li
2025-09-03 8:25 ` David Hildenbrand
2025-08-22 19:20 ` [PATCH 6/9] mm, swap: use the swap table for the swap cache and switch API Kairui Song
2025-08-30 1:54 ` Baoquan He
2025-08-30 3:40 ` Chris Li
2025-08-30 3:34 ` Chris Li
2025-08-30 16:52 ` Kairui Song
2025-08-31 1:00 ` Chris Li
2025-09-02 11:51 ` Kairui Song
2025-09-02 9:55 ` Barry Song
2025-09-02 11:58 ` Kairui Song
2025-09-02 23:44 ` Barry Song
2025-09-03 2:12 ` Kairui Song
2025-09-03 2:31 ` Barry Song
2025-09-03 11:41 ` David Hildenbrand
2025-09-03 12:54 ` Kairui Song
2025-08-22 19:20 ` [PATCH 7/9] mm, swap: remove contention workaround for swap cache Kairui Song
2025-08-30 4:07 ` Chris Li
2025-08-30 15:24 ` Kairui Song
2025-08-31 15:54 ` Kairui Song
2025-08-31 20:06 ` Chris Li
2025-08-31 20:04 ` Chris Li
2025-09-02 10:06 ` Barry Song
2025-08-22 19:20 ` [PATCH 8/9] mm, swap: implement dynamic allocation of swap table Kairui Song
2025-08-30 4:17 ` Chris Li
2025-09-02 11:15 ` Barry Song
2025-09-02 13:17 ` Chris Li
2025-09-02 16:57 ` Kairui Song
2025-09-02 23:31 ` Barry Song
2025-09-03 2:13 ` Kairui Song
2025-09-03 12:35 ` Chris Li
2025-08-22 19:20 ` [PATCH 9/9] mm, swap: use a single page for swap table when the size fits Kairui Song
2025-08-30 4:23 ` Chris Li
2025-08-26 22:00 ` [PATCH 0/9] mm, swap: introduce swap table as swap cache (phase I) Chris Li
2025-08-30 5:44 ` Chris Li