linux-mm.kvack.org archive mirror
* [PATCH v3 00/13] mm, swap: rework of swap allocator locks
@ 2024-12-30 17:46 Kairui Song
  2024-12-30 17:46 ` [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation Kairui Song
                   ` (12 more replies)
  0 siblings, 13 replies; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This series greatly improves swap performance by reworking
the locking design and simplifying many code paths. Tests showed
up to a 400% vm-scalability improvement with pmem as SWAP, and
up to a 37% reduction of kernel compile real time with ZRAM as SWAP
(up to a 60% improvement in system time).

This is part of the new swap allocator discussed during
the "Swap Abstraction" discussion at LSF/MM 2024, and
"mTHP and swap allocator" discussion at LPC 2024.

This is a follow-up to the previous swap cluster allocator series:
https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
It also enables further optimizations which will come later.

The previous series introduced a fully cluster based allocator; this
series completely gets rid of the old allocator and makes the new
allocator avoid touching si->lock unless needed. This brings a huge
performance gain and gets rid of the slot cache on the freeing path.

Currently, swap locking is mainly composed of two locks: the cluster
lock (ci->lock) and the device lock (si->lock). The device lock is
widely used to protect many things, making it the main bottleneck
for SWAP.

The cluster lock is much more fine-grained, so it is best to use
ci->lock instead of si->lock as much as possible.

`perf lock` shows this issue clearly. Doing a Linux kernel build
using tmpfs and ZRAM with limited memory (make -j64 with a 1G memcg and
4k pages), the output of "perf lock contention -ab sleep 3" shows:

  contended   total wait     max wait     avg wait         type   caller
     34948     53.63 s       7.11 ms      1.53 ms     spinlock   free_swap_and_cache_nr+0x350
     16569     40.05 s       6.45 ms      2.42 ms     spinlock   get_swap_pages+0x231
     11191     28.41 s       7.03 ms      2.54 ms     spinlock   swapcache_free_entries+0x59
      4147     22.78 s     122.66 ms      5.49 ms     spinlock   page_vma_mapped_walk+0x6f3
      4595      7.17 s       6.79 ms      1.56 ms     spinlock   swapcache_free_entries+0x59
    406027      2.74 s       2.59 ms      6.74 us     spinlock   list_lru_add+0x39
  ...snip...

The top 5 callers are all users of si->lock; their total wait time sums
to several minutes within the 3-second time window.

Following the new allocator design, many operations don't need to touch
si->lock at all. We only need to take si->lock when doing operations
across multiple clusters (changing the cluster lists). So ideally the
allocator should always take ci->lock first, then take si->lock only if
needed. But due to historical reasons, ci->lock is used inside the
si->lock critical section, causing lock inversion if we simply try to
acquire si->lock after acquiring ci->lock.
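
As a rough sketch of the intended ordering (illustrative only; the
condition and the list manipulation below are simplified stand-ins for
the real allocator code, not the actual mm/swapfile.c changes):

  ci = lock_cluster(si, offset);          /* fine-grained, per-cluster */
  /* ... allocate or free slots in this cluster under ci->lock ... */
  if (need_to_adjust_cluster_lists) {     /* made-up placeholder condition */
          spin_lock(&si->lock);           /* only for cross-cluster list changes */
          list_move_tail(&ci->list, &si->frag_clusters[order]);
          spin_unlock(&si->lock);
  }
  unlock_cluster(ci);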

This series audits all si->lock usage, cleans up legacy code, and
eliminates si->lock usage as much as possible by introducing new designs
based on the new cluster allocator.

The old HDD allocation code is removed, and the cluster allocator is
adapted with small changes for HDD usage; testing looks OK.

This also removes the slot cache from the freeing path. Performance is
even better without it now, and this enables further cleanups and
optimizations as discussed before:

https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/

After this series, lock contention on si->lock is nearly unobservable
with `perf lock` in the same test as above:

  contended   total wait     max wait     avg wait         type   caller
  ... snip ...
         91    204.62 us      4.51 us      2.25 us     spinlock   cluster_move+0x2e
  ... snip ...
         47    125.62 us      4.47 us      2.67 us     spinlock   cluster_move+0x2e
  ... snip ...
         23     63.15 us      3.95 us      2.74 us     spinlock   cluster_move+0x2e
  ... snip ...
         17     41.26 us      4.58 us      2.43 us     spinlock   cluster_isolate_lock+0x1d
  ... snip ...

`cluster_move` and `cluster_isolate_lock` (two newly introduced helpers)
are basically the only users of si->lock now; the performance gain is
huge, and LOC is reduced.

Test Results:

vm-scalability
==============
Running `usemem --init-time -O -y -x -R -31 1G` from vm-scalability
in a 12G memory cgroup using simulated pmem as SWAP backend (32G pmem,
32 CPUs).

4K folios are used by default; 64k mTHP and sequential access (without
-R) results are also provided. 6 test runs for each case, total
throughput:

Test             Before (KB/s) (stdev)  After (KB/s) (stdev)   Delta
---------------------------------------------------------------------------
Random (4K):     69937.11 (16449.77)    369816.17  (24476.68)  +428.78%
Random (64k):    123442.83 (13207.51)   216379.00  (25024.83)  +75.28%
Sequential (4K): 6313909.83 (148856.12) 6419860.66 (183563.38) +1.7%

Sequential access causes lower stress on the allocator, so the gain is
limited, but with random access (which is much closer to real workloads)
the performance gain is huge.

Build kernel with defconfig on tmpfs with ZRAM
==============================================
The results below show a test matrix using different memory cgroup
limits and job numbers, scaled up progressively for an intuitive
comparison. Done on a 48c96t system.

6 test runs for each case. It can be seen clearly that the higher the
concurrent job number, the higher the performance gain, but even -j6
shows a slight improvement.

   make -j<NR>     |   System Time (seconds)  |   Total Time (seconds)
 (NR / Mem / ZRAM) | (Before / After / Delta) | (Before / After / Delta)
 With 4k pages only:
  6 / 192M / 3G    |    1533 /  1522 / -0.7%  |    1420 /  1414 / -0.3%
 12 / 256M / 4G    |    2275 /  2226 / -2.2%  |     758 /   742 / -2.1%
 24 / 384M / 5G    |    3596 /  3154 / -12.3% |     476 /   422 / -11.3%
 48 / 768M / 7G    |    8159 /  3605 / -55.8% |     330 /   221 / -33.0%
 96 / 1.5G / 10G   |   18541 /  6462 / -65.1% |     283 /   180 / -36.4%
 With 64k mTHP:
 24 / 512M / 5G    |    3585 /  3469 /  -3.2% |     293 /   290 / -0.1%
 48 /   1G / 7G    |    8173 /  3607 / -55.9% |     251 /   158 / -37.0%
 96 /   2G / 10G   |   16305 /  7791 / -52.2% |     226 /   144 / -36.3%

Fragmentation is reduced too:
With make -j96 / 1152M memcg / 64K mTHP:
(avg of 4 test runs)
Before:
hugepages-64kB/stats/swpout: 1696184
hugepages-64kB/stats/swpout_fallback: 414318
After: (-63.2% mTHP swapout failure)
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330

There is up to a 65.1% improvement in sys time for the kernel build
test, and a lower fragmentation rate.

Build kernel with tinyconfig on tmpfs with HDD as swap:
=======================================================

This test is similar to the one above, but the HDD test is very noisy
and slow and the deviation is huge, so tinyconfig is used instead,
taking the median result of 3 test runs, which looks OK:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Single thread SWAP:
===================

Sequential SWAP should also be slightly faster as a lot of unnecessary
parts were removed. Tested using a micro benchmark that swaps out/in 4G
of zero-filled memory using ZRAM, 10 test runs:

Swapout Before (avg. 3359304):
3353796 3358551 3371305 3356043 3367524 3355303 3355924 3354513 3360776

Swapin Before (avg. 1928698):
1920283 1927183 1934105 1921373 1926562 1938261 1927726 1928636 1934155

Swapout After (avg. 3347511, -0.4%):
3337863 3347948 3355235 3339081 3333134 3353006 3354917 3346055 3360359

Swapin After (avg. 1922290, -0.3%):
1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

The gain is within the noise level, but the results seem slightly better.

V2: https://lore.kernel.org/linux-mm/202412250259.S5ew5ZrN-lkp@intel.com/T/
Updates since V2:
- Use atomic_long_try_cmpxchg instead of atomic_long_cmpxchg
  [Uros Bizjak]
- Fix bot build error after previous rebase.

V1: https://lore.kernel.org/linux-mm/20241022192451.38138-1-ryncsn@gmail.com/
Updates since V1:
- Retested some tests after rebasing on top of the latest mm-unstable;
  the new Cgroup lock removal increased the performance gain of this
  series too. Some results are basically the same as before, so they
  are unchanged:
  https://lore.kernel.org/linux-mm/20241218114633.85196-1-ryncsn@gmail.com/
- Rework the off-list bit handling, make it easier to review and more
  robust, also reduce LOC [Chris Li].
- Code style improvements and minor code optimizations. [Chris Li].
- Fixing a potential swapoff race issue due to missing SWP_WRITEOK check
  [Huang Ying].
- Added vm-scalability test with pmem [Huang Ying].

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>

Kairui Song (13):
  mm, swap: minor clean up for swap entry allocation
  mm, swap: fold swap_info_get_cont in the only caller
  mm, swap: remove old allocation path for HDD
  mm, swap: use cluster lock for HDD
  mm, swap: clean up device availability check
  mm, swap: clean up plist removal and adding
  mm, swap: hold a reference during scan and cleanup flag usage
  mm, swap: use an enum to define all cluster flags and wrap flags
    changes
  mm, swap: reduce contention on device lock
  mm, swap: simplify percpu cluster updating
  mm, swap: introduce a helper for retrieving cluster from offset
  mm, swap: use a global swap cluster for non-rotation devices
  mm, swap_slots: remove slot cache for freeing path

 fs/btrfs/inode.c           |    1 -
 fs/f2fs/data.c             |    1 -
 fs/iomap/swapfile.c        |    1 -
 include/linux/swap.h       |   34 +-
 include/linux/swap_slots.h |    3 -
 mm/page_io.c               |    1 -
 mm/swap_slots.c            |   78 +--
 mm/swapfile.c              | 1250 ++++++++++++++++--------------------
 8 files changed, 595 insertions(+), 774 deletions(-)

-- 
2.47.1




* [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-09  4:04   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 02/13] mm, swap: fold swap_info_get_cont in the only caller Kairui Song
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Direct reclaim can skip the whole folio after reclaiming a set of
folio-based slots. Also simplify the allocation code and reduce
indentation.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 59 +++++++++++++++++++++++++--------------------------
 1 file changed, 29 insertions(+), 30 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b0a9071cfe1d..f8002f110104 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -604,23 +604,28 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  unsigned long start, unsigned long end)
 {
 	unsigned char *map = si->swap_map;
-	unsigned long offset;
+	unsigned long offset = start;
+	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
 	spin_unlock(&si->lock);
 
-	for (offset = start; offset < end; offset++) {
+	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
-			continue;
+			offset++;
+			break;
 		case SWAP_HAS_CACHE:
-			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
-				continue;
-			goto out;
+			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
+			if (nr_reclaim > 0)
+				offset += nr_reclaim;
+			else
+				goto out;
+			break;
 		default:
 			goto out;
 		}
-	}
+	} while (offset < end);
 out:
 	spin_lock(&si->lock);
 	spin_lock(&ci->lock);
@@ -838,35 +843,30 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 							 &found, order, usage);
 			frags++;
 			if (found)
-				break;
+				goto done;
 		}
 
-		if (!found) {
+		/*
+		 * Nonfull clusters are moved to frag tail if we reached
+		 * here, count them too, don't over scan the frag list.
+		 */
+		while (frags < si->frag_cluster_nr[order]) {
+			ci = list_first_entry(&si->frag_clusters[order],
+					      struct swap_cluster_info, list);
 			/*
-			 * Nonfull clusters are moved to frag tail if we reached
-			 * here, count them too, don't over scan the frag list.
+			 * Rotate the frag list to iterate, they were all failing
+			 * high order allocation or moved here due to per-CPU usage,
+			 * this help keeping usable cluster ahead.
 			 */
-			while (frags < si->frag_cluster_nr[order]) {
-				ci = list_first_entry(&si->frag_clusters[order],
-						      struct swap_cluster_info, list);
-				/*
-				 * Rotate the frag list to iterate, they were all failing
-				 * high order allocation or moved here due to per-CPU usage,
-				 * this help keeping usable cluster ahead.
-				 */
-				list_move_tail(&ci->list, &si->frag_clusters[order]);
-				offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-								 &found, order, usage);
-				frags++;
-				if (found)
-					break;
-			}
+			list_move_tail(&ci->list, &si->frag_clusters[order]);
+			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+							 &found, order, usage);
+			frags++;
+			if (found)
+				goto done;
 		}
 	}
 
-	if (found)
-		goto done;
-
 	if (!list_empty(&si->discard_clusters)) {
 		/*
 		 * we don't have free cluster but have some clusters in
@@ -904,7 +904,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 				goto done;
 		}
 	}
-
 done:
 	cluster->next[order] = offset;
 	return found;
-- 
2.47.1




* [PATCH v3 02/13] mm, swap: fold swap_info_get_cont in the only caller
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
  2024-12-30 17:46 ` [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-09  4:05   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 03/13] mm, swap: remove old allocation path for HDD Kairui Song
                   ` (10 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The name of the function is confusing, and the code is much easier to
follow after folding. Also rename the confusingly named "p" to the more
meaningful "si".

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 39 +++++++++++++++------------------------
 1 file changed, 15 insertions(+), 24 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index f8002f110104..574059158627 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1375,22 +1375,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
 	return NULL;
 }
 
-static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
-					struct swap_info_struct *q)
-{
-	struct swap_info_struct *p;
-
-	p = _swap_info_get(entry);
-
-	if (p != q) {
-		if (q != NULL)
-			spin_unlock(&q->lock);
-		if (p != NULL)
-			spin_lock(&p->lock);
-	}
-	return p;
-}
-
 static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
 					      unsigned long offset,
 					      unsigned char usage)
@@ -1687,14 +1671,14 @@ static int swp_entry_cmp(const void *ent1, const void *ent2)
 
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
-	struct swap_info_struct *p, *prev;
+	struct swap_info_struct *si, *prev;
 	int i;
 
 	if (n <= 0)
 		return;
 
 	prev = NULL;
-	p = NULL;
+	si = NULL;
 
 	/*
 	 * Sort swap entries by swap device, so each lock is only taken once.
@@ -1704,13 +1688,20 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 	if (nr_swapfiles > 1)
 		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
-		p = swap_info_get_cont(entries[i], prev);
-		if (p)
-			swap_entry_range_free(p, entries[i], 1);
-		prev = p;
+		si = _swap_info_get(entries[i]);
+
+		if (si != prev) {
+			if (prev != NULL)
+				spin_unlock(&prev->lock);
+			if (si != NULL)
+				spin_lock(&si->lock);
+		}
+		if (si)
+			swap_entry_range_free(si, entries[i], 1);
+		prev = si;
 	}
-	if (p)
-		spin_unlock(&p->lock);
+	if (si)
+		spin_unlock(&si->lock);
 }
 
 int __swap_count(swp_entry_t entry)
-- 
2.47.1




* [PATCH v3 03/13] mm, swap: remove old allocation path for HDD
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
  2024-12-30 17:46 ` [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation Kairui Song
  2024-12-30 17:46 ` [PATCH v3 02/13] mm, swap: fold swap_info_get_cont in the only caller Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-09  4:06   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 04/13] mm, swap: use cluster lock " Kairui Song
                   ` (9 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

We are currently using different swap allocation algorithms for HDD and
non-HDD devices. This leads to the existence of a different set of
locks, and the code paths are heavily bloated, causing difficulties for
further optimization and maintenance.

This commit removes all HDD swap allocation and related dead code,
and uses the cluster allocation algorithm instead.

The performance may drop temporarily, but this should be negligible:
the main advantage of the legacy HDD allocation algorithm is that it
tends to use continuous slots, but the swap device gets fragmented
quickly anyway, and attempts to use continuous slots will fail easily.

This commit also enables mTHP swap on HDD, which is expected to be
beneficial, and following commits will adapt and optimize the cluster
allocator for HDD.

Suggested-by: Chris Li <chrisl@kernel.org>
Suggested-by: "Huang, Ying" <ying.huang@linux.alibaba.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   3 -
 mm/swapfile.c        | 235 ++-----------------------------------------
 2 files changed, 9 insertions(+), 229 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 187715eec3cb..0c681aa5cb98 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,9 +310,6 @@ struct swap_info_struct {
 	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
-	unsigned int cluster_next;	/* likely index for next allocation */
-	unsigned int cluster_nr;	/* countdown to next cluster search */
-	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 574059158627..fca58d43b836 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1001,49 +1001,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
 }
 
-static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
-{
-	unsigned long prev;
-
-	if (!(si->flags & SWP_SOLIDSTATE)) {
-		si->cluster_next = next;
-		return;
-	}
-
-	prev = this_cpu_read(*si->cluster_next_cpu);
-	/*
-	 * Cross the swap address space size aligned trunk, choose
-	 * another trunk randomly to avoid lock contention on swap
-	 * address space if possible.
-	 */
-	if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
-	    (next >> SWAP_ADDRESS_SPACE_SHIFT)) {
-		/* No free swap slots available */
-		if (si->highest_bit <= si->lowest_bit)
-			return;
-		next = get_random_u32_inclusive(si->lowest_bit, si->highest_bit);
-		next = ALIGN_DOWN(next, SWAP_ADDRESS_SPACE_PAGES);
-		next = max_t(unsigned int, next, si->lowest_bit);
-	}
-	this_cpu_write(*si->cluster_next_cpu, next);
-}
-
-static bool swap_offset_available_and_locked(struct swap_info_struct *si,
-					     unsigned long offset)
-{
-	if (data_race(!si->swap_map[offset])) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	if (vm_swap_full() && READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
-		spin_lock(&si->lock);
-		return true;
-	}
-
-	return false;
-}
-
 static int cluster_alloc_swap(struct swap_info_struct *si,
 			     unsigned char usage, int nr,
 			     swp_entry_t slots[], int order)
@@ -1071,13 +1028,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			       unsigned char usage, int nr,
 			       swp_entry_t slots[], int order)
 {
-	unsigned long offset;
-	unsigned long scan_base;
-	unsigned long last_in_cluster = 0;
-	int latency_ration = LATENCY_LIMIT;
 	unsigned int nr_pages = 1 << order;
-	int n_ret = 0;
-	bool scanned_many = false;
 
 	/*
 	 * We try to cluster swap pages by allocating them sequentially
@@ -1089,7 +1040,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	 * But we do now try to find an empty cluster.  -Andrea
 	 * And we let swap pages go all over an SSD partition.  Hugh
 	 */
-
 	if (order > 0) {
 		/*
 		 * Should not even be attempting large allocations when huge
@@ -1109,158 +1059,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 			return 0;
 	}
 
-	if (si->cluster_info)
-		return cluster_alloc_swap(si, usage, nr, slots, order);
-
-	si->flags += SWP_SCANNING;
-
-	/* For HDD, sequential access is more important. */
-	scan_base = si->cluster_next;
-	offset = scan_base;
-
-	if (unlikely(!si->cluster_nr--)) {
-		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
-			si->cluster_nr = SWAPFILE_CLUSTER - 1;
-			goto checks;
-		}
-
-		spin_unlock(&si->lock);
-
-		/*
-		 * If seek is expensive, start searching for new cluster from
-		 * start of partition, to minimize the span of allocated swap.
-		 */
-		scan_base = offset = si->lowest_bit;
-		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
-
-		/* Locate the first empty (unaligned) cluster */
-		for (; last_in_cluster <= READ_ONCE(si->highest_bit); offset++) {
-			if (si->swap_map[offset])
-				last_in_cluster = offset + SWAPFILE_CLUSTER;
-			else if (offset == last_in_cluster) {
-				spin_lock(&si->lock);
-				offset -= SWAPFILE_CLUSTER - 1;
-				si->cluster_next = offset;
-				si->cluster_nr = SWAPFILE_CLUSTER - 1;
-				goto checks;
-			}
-			if (unlikely(--latency_ration < 0)) {
-				cond_resched();
-				latency_ration = LATENCY_LIMIT;
-			}
-		}
-
-		offset = scan_base;
-		spin_lock(&si->lock);
-		si->cluster_nr = SWAPFILE_CLUSTER - 1;
-	}
-
-checks:
-	if (!(si->flags & SWP_WRITEOK))
-		goto no_page;
-	if (!si->highest_bit)
-		goto no_page;
-	if (offset > si->highest_bit)
-		scan_base = offset = si->lowest_bit;
-
-	/* reuse swap entry of cache-only swap if not busy. */
-	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
-		int swap_was_freed;
-		spin_unlock(&si->lock);
-		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
-		spin_lock(&si->lock);
-		/* entry was freed successfully, try to use this again */
-		if (swap_was_freed > 0)
-			goto checks;
-		goto scan; /* check next one */
-	}
-
-	if (si->swap_map[offset]) {
-		if (!n_ret)
-			goto scan;
-		else
-			goto done;
-	}
-	memset(si->swap_map + offset, usage, nr_pages);
-
-	swap_range_alloc(si, offset, nr_pages);
-	slots[n_ret++] = swp_entry(si->type, offset);
-
-	/* got enough slots or reach max slots? */
-	if ((n_ret == nr) || (offset >= si->highest_bit))
-		goto done;
-
-	/* search for next available slot */
-
-	/* time to take a break? */
-	if (unlikely(--latency_ration < 0)) {
-		if (n_ret)
-			goto done;
-		spin_unlock(&si->lock);
-		cond_resched();
-		spin_lock(&si->lock);
-		latency_ration = LATENCY_LIMIT;
-	}
-
-	if (si->cluster_nr && !si->swap_map[++offset]) {
-		/* non-ssd case, still more slots in cluster? */
-		--si->cluster_nr;
-		goto checks;
-	}
-
-	/*
-	 * Even if there's no free clusters available (fragmented),
-	 * try to scan a little more quickly with lock held unless we
-	 * have scanned too many slots already.
-	 */
-	if (!scanned_many) {
-		unsigned long scan_limit;
-
-		if (offset < scan_base)
-			scan_limit = scan_base;
-		else
-			scan_limit = si->highest_bit;
-		for (; offset <= scan_limit && --latency_ration > 0;
-		     offset++) {
-			if (!si->swap_map[offset])
-				goto checks;
-		}
-	}
-
-done:
-	if (order == 0)
-		set_cluster_next(si, offset + 1);
-	si->flags -= SWP_SCANNING;
-	return n_ret;
-
-scan:
-	VM_WARN_ON(order > 0);
-	spin_unlock(&si->lock);
-	while (++offset <= READ_ONCE(si->highest_bit)) {
-		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
-			latency_ration = LATENCY_LIMIT;
-			scanned_many = true;
-		}
-		if (swap_offset_available_and_locked(si, offset))
-			goto checks;
-	}
-	offset = si->lowest_bit;
-	while (offset < scan_base) {
-		if (unlikely(--latency_ration < 0)) {
-			cond_resched();
-			latency_ration = LATENCY_LIMIT;
-			scanned_many = true;
-		}
-		if (swap_offset_available_and_locked(si, offset))
-			goto checks;
-		offset++;
-	}
-	spin_lock(&si->lock);
-
-no_page:
-	si->flags -= SWP_SCANNING;
-	return n_ret;
+	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
 
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
@@ -2871,8 +2670,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
-	free_percpu(p->cluster_next_cpu);
-	p->cluster_next_cpu = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
 	kvfree(cluster_info);
@@ -3184,8 +2981,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 	}
 
 	si->lowest_bit  = 1;
-	si->cluster_next = 1;
-	si->cluster_nr = 0;
 
 	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
@@ -3271,7 +3066,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 						unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
 	struct swap_cluster_info *cluster_info;
 	unsigned long i, j, k, idx;
 	int cpu, err = -ENOMEM;
@@ -3283,15 +3077,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->cluster_next_cpu = alloc_percpu(unsigned int);
-	if (!si->cluster_next_cpu)
-		goto err_free;
-
-	/* Random start position to help with wear leveling */
-	for_each_possible_cpu(cpu)
-		per_cpu(*si->cluster_next_cpu, cpu) =
-		get_random_u32_inclusive(1, si->highest_bit);
-
 	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
 	if (!si->percpu_cluster)
 		goto err_free;
@@ -3333,7 +3118,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	 * sharing same address space.
 	 */
 	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
-		j = (k + col) % SWAP_CLUSTER_COLS;
+		j = k % SWAP_CLUSTER_COLS;
 		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
 			struct swap_cluster_info *ci;
 			idx = i * SWAP_CLUSTER_COLS + j;
@@ -3483,18 +3268,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	if (si->bdev && bdev_nonrot(si->bdev)) {
 		si->flags |= SWP_SOLIDSTATE;
-
-		cluster_info = setup_clusters(si, swap_header, maxpages);
-		if (IS_ERR(cluster_info)) {
-			error = PTR_ERR(cluster_info);
-			cluster_info = NULL;
-			goto bad_swap_unlock_inode;
-		}
 	} else {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
 
+	cluster_info = setup_clusters(si, swap_header, maxpages);
+	if (IS_ERR(cluster_info)) {
+		error = PTR_ERR(cluster_info);
+		cluster_info = NULL;
+		goto bad_swap_unlock_inode;
+	}
+
 	if ((swap_flags & SWAP_FLAG_DISCARD) &&
 	    si->bdev && bdev_max_discard_sectors(si->bdev)) {
 		/*
@@ -3575,8 +3360,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
-	free_percpu(si->cluster_next_cpu);
-	si->cluster_next_cpu = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.1




* [PATCH v3 04/13] mm, swap: use cluster lock for HDD
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (2 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 03/13] mm, swap: remove old allocation path for HDD Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-09  4:07   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 05/13] mm, swap: clean up device availability check Kairui Song
                   ` (8 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The cluster lock (ci->lock) was introduced to reduce contention for
certain operations. Using the cluster lock for HDD is not helpful as
HDDs have poor performance, so locking isn't the bottleneck. But having
a different set of locks for HDD / non-HDD prevents further rework of
the device lock (si->lock).

This commit just changes all lock_cluster_or_swap_info calls to
lock_cluster, which is a safe and straightforward conversion since
cluster info is always allocated now, and also removes all cluster_info
related checks.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 107 ++++++++++++++++----------------------------------
 1 file changed, 34 insertions(+), 73 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index fca58d43b836..d0e5b9fa0c48 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,10 +58,9 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
-static struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset);
-static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					struct swap_cluster_info *ci);
+static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
+					      unsigned long offset);
+static void unlock_cluster(struct swap_cluster_info *ci);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -222,9 +221,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	 * swap_map is HAS_CACHE only, which means the slots have no page table
 	 * reference or pending writeback, and can't be allocated to others.
 	 */
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	need_reclaim = swap_is_has_cache(si, offset, nr_pages);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!need_reclaim)
 		goto out_unlock;
 
@@ -404,45 +403,15 @@ static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si
 {
 	struct swap_cluster_info *ci;
 
-	ci = si->cluster_info;
-	if (ci) {
-		ci += offset / SWAPFILE_CLUSTER;
-		spin_lock(&ci->lock);
-	}
-	return ci;
-}
-
-static inline void unlock_cluster(struct swap_cluster_info *ci)
-{
-	if (ci)
-		spin_unlock(&ci->lock);
-}
-
-/*
- * Determine the locking method in use for this device.  Return
- * swap_cluster_info if SSD-style cluster-based locking is in place.
- */
-static inline struct swap_cluster_info *lock_cluster_or_swap_info(
-		struct swap_info_struct *si, unsigned long offset)
-{
-	struct swap_cluster_info *ci;
-
-	/* Try to use fine-grained SSD-style locking if available: */
-	ci = lock_cluster(si, offset);
-	/* Otherwise, fall back to traditional, coarse locking: */
-	if (!ci)
-		spin_lock(&si->lock);
+	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	spin_lock(&ci->lock);
 
 	return ci;
 }
 
-static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
-					       struct swap_cluster_info *ci)
+static inline void unlock_cluster(struct swap_cluster_info *ci)
 {
-	if (ci)
-		unlock_cluster(ci);
-	else
-		spin_unlock(&si->lock);
+	spin_unlock(&ci->lock);
 }
 
 /* Add a cluster to discard list and schedule it to do discard */
@@ -558,9 +527,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
 	struct swap_cluster_info *ci;
 
-	if (!cluster_info)
-		return;
-
 	ci = cluster_info + idx;
 	ci->count++;
 
@@ -576,9 +542,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 static void dec_cluster_info_page(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci, int nr_pages)
 {
-	if (!si->cluster_info)
-		return;
-
 	VM_BUG_ON(ci->count < nr_pages);
 	VM_BUG_ON(cluster_is_free(ci));
 	lockdep_assert_held(&si->lock);
@@ -1007,8 +970,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;
 
-	VM_BUG_ON(!si->cluster_info);
-
 	si->flags += SWP_SCANNING;
 
 	while (n_ret < nr) {
@@ -1052,10 +1013,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 		}
 
 		/*
-		 * Swapfile is not block device or not using clusters so unable
+		 * Swapfile is not block device so unable
 		 * to allocate large entries.
 		 */
-		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
+		if (!(si->flags & SWP_BLKDEV))
 			return 0;
 	}
 
@@ -1295,9 +1256,9 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si,
 	unsigned long offset = swp_offset(entry);
 	unsigned char usage;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	usage = __swap_entry_free_locked(si, offset, 1);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	if (!usage)
 		free_swap_slot(entry);
 
@@ -1320,14 +1281,14 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
 		goto fallback;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (!swap_is_last_map(si, offset, nr, &has_cache)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		goto fallback;
 	}
 	for (i = 0; i < nr; i++)
 		WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 
 	if (!has_cache) {
 		for (i = 0; i < nr; i++)
@@ -1383,7 +1344,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
 	int i, nr;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	while (nr_pages) {
 		nr = min(BITS_PER_LONG, nr_pages);
 		for (i = 0; i < nr; i++) {
@@ -1391,18 +1352,18 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 				bitmap_set(to_free, i, 1);
 		}
 		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			for_each_set_bit(i, to_free, BITS_PER_LONG)
 				free_swap_slot(swp_entry(si->type, offset + i));
 			if (nr == nr_pages)
 				return;
 			bitmap_clear(to_free, 0, BITS_PER_LONG);
-			ci = lock_cluster_or_swap_info(si, offset);
+			ci = lock_cluster(si, offset);
 		}
 		offset += nr;
 		nr_pages -= nr;
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 /*
@@ -1441,9 +1402,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	if (!si)
 		return;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	if (size > 1 && swap_is_has_cache(si, offset, size)) {
-		unlock_cluster_or_swap_info(si, ci);
+		unlock_cluster(ci);
 		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, size);
 		spin_unlock(&si->lock);
@@ -1451,14 +1412,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	}
 	for (int i = 0; i < size; i++, entry.val++) {
 		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
-			unlock_cluster_or_swap_info(si, ci);
+			unlock_cluster(ci);
 			free_swap_slot(entry);
 			if (i == size - 1)
 				return;
-			lock_cluster_or_swap_info(si, offset);
+			lock_cluster(si, offset);
 		}
 	}
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 }
 
 static int swp_entry_cmp(const void *ent1, const void *ent2)
@@ -1522,9 +1483,9 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 	struct swap_cluster_info *ci;
 	int count;
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 	count = swap_count(si->swap_map[offset]);
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1547,7 +1508,7 @@ int swp_swapcount(swp_entry_t entry)
 
 	offset = swp_offset(entry);
 
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	count = swap_count(si->swap_map[offset]);
 	if (!(count & COUNT_CONTINUED))
@@ -1570,7 +1531,7 @@ int swp_swapcount(swp_entry_t entry)
 		n *= (SWAP_CONT_MAX + 1);
 	} while (tmp_count & COUNT_CONTINUED);
 out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return count;
 }
 
@@ -1585,8 +1546,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 	int i;
 	bool ret = false;
 
-	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || nr_pages == 1) {
+	ci = lock_cluster(si, offset);
+	if (nr_pages == 1) {
 		if (swap_count(map[roffset]))
 			ret = true;
 		goto unlock_out;
@@ -1598,7 +1559,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
 		}
 	}
 unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return ret;
 }
 
@@ -3428,7 +3389,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
 	VM_WARN_ON(usage == 1 && nr > 1);
-	ci = lock_cluster_or_swap_info(si, offset);
+	ci = lock_cluster(si, offset);
 
 	err = 0;
 	for (i = 0; i < nr; i++) {
@@ -3483,7 +3444,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	}
 
 unlock_out:
-	unlock_cluster_or_swap_info(si, ci);
+	unlock_cluster(ci);
 	return err;
 }
 
-- 
2.47.1




* [PATCH v3 05/13] mm, swap: clean up device availability check
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (3 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 04/13] mm, swap: use cluster lock " Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-09  4:08   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 06/13] mm, swap: clean up plist removal and adding Kairui Song
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Remove highest_bit and lowest_bit. After the HDD allocation path
has been removed, the only purpose of these two fields is to determine
whether the device is full or not, which can instead be determined
by checking the inuse_pages.
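
For illustration, the fullness check after this patch boils down to
comparing two counters (sketch only, not part of this patch;
swap_dev_full() is a made-up helper name):

  static inline bool swap_dev_full(struct swap_info_struct *si)
  {
          /* All usable slots are currently in use. */
          return READ_ONCE(si->inuse_pages) == si->pages;
  }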

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 fs/btrfs/inode.c     |  1 -
 fs/f2fs/data.c       |  1 -
 fs/iomap/swapfile.c  |  1 -
 include/linux/swap.h |  2 --
 mm/page_io.c         |  1 -
 mm/swapfile.c        | 38 ++++++++------------------------------
 6 files changed, 8 insertions(+), 36 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 488edca8333a..a1ba78afab2c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -10044,7 +10044,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
 	*span = bsi.highest_ppage - bsi.lowest_ppage + 1;
 	sis->max = bsi.nr_pages;
 	sis->pages = bsi.nr_pages - 1;
-	sis->highest_bit = bsi.nr_pages - 1;
 	return bsi.nr_extents;
 }
 #else
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index a2478c2afb3a..a9eddd782dbc 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -4043,7 +4043,6 @@ static int check_swap_activate(struct swap_info_struct *sis,
 		cur_lblock = 1;	/* force Empty message */
 	sis->max = cur_lblock;
 	sis->pages = cur_lblock - 1;
-	sis->highest_bit = cur_lblock - 1;
 out:
 	if (not_aligned)
 		f2fs_warn(sbi, "Swapfile (%u) is not align to section: 1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(%lu * N)",
diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
index 5fc0ac36dee3..b90d0eda9e51 100644
--- a/fs/iomap/swapfile.c
+++ b/fs/iomap/swapfile.c
@@ -189,7 +189,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
 	*pagespan = 1 + isi.highest_ppage - isi.lowest_ppage;
 	sis->max = isi.nr_pages;
 	sis->pages = isi.nr_pages - 1;
-	sis->highest_bit = isi.nr_pages - 1;
 	return isi.nr_extents;
 }
 EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0c681aa5cb98..0c222017b5c6 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -306,8 +306,6 @@ struct swap_info_struct {
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
 	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
-	unsigned int lowest_bit;	/* index of first free in swap_map */
-	unsigned int highest_bit;	/* index of last free in swap_map */
 	unsigned int pages;		/* total of usable pages of swap */
 	unsigned int inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
diff --git a/mm/page_io.c b/mm/page_io.c
index 4b4ea8e49cf6..9b983de351f9 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -163,7 +163,6 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
 		page_no = 1;	/* force Empty message */
 	sis->max = page_no;
 	sis->pages = page_no - 1;
-	sis->highest_bit = page_no - 1;
 out:
 	return ret;
 bad_bmap:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d0e5b9fa0c48..7963a0c646a4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -55,7 +55,7 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 static void free_swap_count_continuations(struct swap_info_struct *);
 static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
 				  unsigned int nr_pages);
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
 static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -650,7 +650,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	}
 
 	memset(si->swap_map + start, usage, nr_pages);
-	swap_range_alloc(si, start, nr_pages);
+	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
 	if (ci->count == SWAPFILE_CLUSTER) {
@@ -888,19 +888,11 @@ static void del_from_avail_list(struct swap_info_struct *si)
 	spin_unlock(&swap_avail_lock);
 }
 
-static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries)
 {
-	unsigned int end = offset + nr_entries - 1;
-
-	if (offset == si->lowest_bit)
-		si->lowest_bit += nr_entries;
-	if (end == si->highest_bit)
-		WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries);
 	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
 	if (si->inuse_pages == si->pages) {
-		si->lowest_bit = si->max;
-		si->highest_bit = 0;
 		del_from_avail_list(si);
 
 		if (si->cluster_info && vm_swap_full())
@@ -933,15 +925,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	for (i = 0; i < nr_entries; i++)
 		clear_bit(offset + i, si->zeromap);
 
-	if (offset < si->lowest_bit)
-		si->lowest_bit = offset;
-	if (end > si->highest_bit) {
-		bool was_full = !si->highest_bit;
-
-		WRITE_ONCE(si->highest_bit, end);
-		if (was_full && (si->flags & SWP_WRITEOK))
-			add_to_avail_list(si);
-	}
+	if (si->inuse_pages == si->pages)
+		add_to_avail_list(si);
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
 			si->bdev->bd_disk->fops->swap_slot_free_notify;
@@ -1051,15 +1036,12 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
+		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_lists[node])) {
 				spin_unlock(&si->lock);
 				goto nextsi;
 			}
-			WARN(!si->highest_bit,
-			     "swap_info %d in list but !highest_bit\n",
-			     si->type);
 			WARN(!(si->flags & SWP_WRITEOK),
 			     "swap_info %d in list but !SWP_WRITEOK\n",
 			     si->type);
@@ -2441,8 +2423,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	 */
 	plist_add(&si->list, &swap_active_head);
 
-	/* add to available list iff swap device is not full */
-	if (si->highest_bit)
+	/* add to available list if swap device is not full */
+	if (si->inuse_pages < si->pages)
 		add_to_avail_list(si);
 }
 
@@ -2606,7 +2588,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	drain_mmlist();
 
 	/* wait for anyone still in scan_swap_map_slots */
-	p->highest_bit = 0;		/* cuts scans short */
 	while (p->flags >= SWP_SCANNING) {
 		spin_unlock(&p->lock);
 		spin_unlock(&swap_lock);
@@ -2941,8 +2922,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		return 0;
 	}
 
-	si->lowest_bit  = 1;
-
 	maxpages = swapfile_maximum_size;
 	last_page = swap_header->info.last_page;
 	if (!last_page) {
@@ -2959,7 +2938,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
 		if ((unsigned int)maxpages == 0)
 			maxpages = UINT_MAX;
 	}
-	si->highest_bit = maxpages - 1;
 
 	if (!maxpages)
 		return 0;
-- 
2.47.1




* [PATCH v3 06/13] mm, swap: clean up plist removal and adding
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (4 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 05/13] mm, swap: clean up device availability check Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-02  8:59   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage Kairui Song
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

When the swap device is full (inuse_pages == pages), it should be
removed from the available plist used for allocation. If any slot is
freed, the swap device should be added back to the plist. Additionally,
during swapon or swapoff, the swap device is forcefully added or
removed.

Currently, the condition (inuse_pages == pages) is checked after
every counter update, and the device is then removed or added
accordingly. This is serialized by si->lock.

This commit decouples it from the protection of si->lock and reworks
plist removal and adding, making it possible to get rid of the hard
dependency on si->lock in the allocation path in later commits.

To achieve this, simply using another lock is not an optimal approach,
as the overhead is observable for a hot counter, and may cause complex
locking issues. Thus, this commit manages to make it a lock-free
atomic operation, by embedding the plist state into the second highest
bit of the atomic counter.

Simply making the counter atomic will not work: if the update
and the plist status check are not performed atomically, we may miss an
addition or removal. With the embedded info we can update the counter
and check the plist status with a single atomic operation, and avoid
any extra overhead:

If the counter is full (inuse_pages == pages) and the off-list bit
is unset, we attempt to remove it from the plist. If the counter is
not full (inuse_pages != pages) and the off-list bit is set, we
attempt to add it to the plist. Removing, adding and the bit update
are serialized with a lock, which is a cold path. Ordinary counter
updates remain lock-free.
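
To illustrate the idea, below is a simplified sketch (not the patch
itself; the OFFLIST_BIT definition and the take_device_off_plist() /
put_device_back_on_plist() helpers are made up for this example):

  #define OFFLIST_BIT (1UL << (BITS_PER_LONG - 2))

  /* Add to the usage counter and check list state in one atomic op. */
  static bool usage_add(atomic_long_t *inuse, long nr, long pages)
  {
          long val = atomic_long_add_return_relaxed(nr, inuse);

          /* Just reached full and the off-list bit is clear. */
          if (val == pages) {
                  take_device_off_plist();        /* made-up helper, cold path */
                  return true;
          }
          return false;
  }

  static void usage_sub(atomic_long_t *inuse, long nr)
  {
          long val = atomic_long_sub_return_relaxed(nr, inuse);

          /* Slots were freed while marked off-list: add it back. */
          if (val & OFFLIST_BIT)
                  put_device_back_on_plist();     /* made-up helper, cold path */
  }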

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   2 +-
 mm/swapfile.c        | 188 +++++++++++++++++++++++++++++++------------
 2 files changed, 139 insertions(+), 51 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0c222017b5c6..e1eeea6307cd 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ struct swap_info_struct {
 					/* list of cluster that are fragmented or contented */
 	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
-	unsigned int inuse_pages;	/* number of those currently in use */
+	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7963a0c646a4..e6e58cfb5178 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -128,6 +128,26 @@ static inline unsigned char swap_count(unsigned char ent)
 	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
 }
 
+/*
+ * Use the second highest bit of inuse_pages counter as the indicator
+ * of if one swap device is on the available plist, so the atomic can
+ * still be updated arithmetic while having special data embedded.
+ *
+ * inuse_pages counter is the only thing indicating if a device should
+ * be on avail_lists or not (except swapon / swapoff). By embedding the
+ * on-list bit in the atomic counter, updates no longer need any lock
+ * to check the list status.
+ *
+ * This bit will be set if the device is not on the plist and not
+ * usable, will be cleared if the device is on the plist.
+ */
+#define SWAP_USAGE_OFFLIST_BIT (1UL << (BITS_PER_TYPE(atomic_t) - 2))
+#define SWAP_USAGE_COUNTER_MASK (~SWAP_USAGE_OFFLIST_BIT)
+static long swap_usage_in_pages(struct swap_info_struct *si)
+{
+	return atomic_long_read(&si->inuse_pages) & SWAP_USAGE_COUNTER_MASK;
+}
+
 /* Reclaim the swap entry anyway if possible */
 #define TTRS_ANYWAY		0x1
 /*
@@ -717,7 +737,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	int nr_reclaim;
 
 	if (force)
-		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
+		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
 	while (!list_empty(&si->full_clusters)) {
 		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
@@ -872,42 +892,128 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	return found;
 }
 
-static void __del_from_avail_list(struct swap_info_struct *si)
+/* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
+static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
 {
 	int nid;
+	unsigned long pages;
+
+	spin_lock(&swap_avail_lock);
+
+	if (swapoff) {
+		/*
+		 * Forcefully remove it. Clear the SWP_WRITEOK flags for
+		 * swapoff here so it's synchronized by both si->lock and
+		 * swap_avail_lock, to ensure the result can be seen by
+		 * add_to_avail_list.
+		 */
+		lockdep_assert_held(&si->lock);
+		si->flags &= ~SWP_WRITEOK;
+		atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
+	} else {
+		/*
+		 * If not called by swapoff, take it off-list only if it's
+		 * full and SWAP_USAGE_OFFLIST_BIT is not set (strictly
+		 * si->inuse_pages == pages), any concurrent slot freeing,
+		 * or device already removed from plist by someone else
+		 * will make this return false.
+		 */
+		pages = si->pages;
+		if (!atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
+					     pages | SWAP_USAGE_OFFLIST_BIT))
+			goto skip;
+	}
 
-	assert_spin_locked(&si->lock);
 	for_each_node(nid)
 		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
+
+skip:
+	spin_unlock(&swap_avail_lock);
 }
 
-static void del_from_avail_list(struct swap_info_struct *si)
+/* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
+static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 {
+	int nid;
+	long val;
+	unsigned long pages;
+
 	spin_lock(&swap_avail_lock);
-	__del_from_avail_list(si);
+
+	/* Corresponding to SWP_WRITEOK clearing in del_from_avail_list */
+	if (swapon) {
+		lockdep_assert_held(&si->lock);
+		si->flags |= SWP_WRITEOK;
+	} else {
+		if (!(READ_ONCE(si->flags) & SWP_WRITEOK))
+			goto skip;
+	}
+
+	if (!(atomic_long_read(&si->inuse_pages) & SWAP_USAGE_OFFLIST_BIT))
+		goto skip;
+
+	val = atomic_long_fetch_and_relaxed(~SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
+
+	/*
+	 * When device is full and device is on the plist, only one updater will
+	 * see (inuse_pages == si->pages) and will call del_from_avail_list. If
+	 * that updater happen to be here, just skip adding.
+	 */
+	pages = si->pages;
+	if (val == pages) {
+		/* Just like the cmpxchg in del_from_avail_list */
+		if (atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
+					    pages | SWAP_USAGE_OFFLIST_BIT))
+			goto skip;
+	}
+
+	for_each_node(nid)
+		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
+
+skip:
 	spin_unlock(&swap_avail_lock);
 }
 
-static void swap_range_alloc(struct swap_info_struct *si,
-			     unsigned int nr_entries)
+/*
+ * swap_usage_add / swap_usage_sub of each slot are serialized by ci->lock
+ * within each cluster, so the total contribution to the global counter should
+ * always be positive and cannot exceed the total number of usable slots.
+ */
+static bool swap_usage_add(struct swap_info_struct *si, unsigned int nr_entries)
 {
-	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
-	if (si->inuse_pages == si->pages) {
-		del_from_avail_list(si);
+	long val = atomic_long_add_return_relaxed(nr_entries, &si->inuse_pages);
 
-		if (si->cluster_info && vm_swap_full())
-			schedule_work(&si->reclaim_work);
+	/*
+	 * If device is full, and SWAP_USAGE_OFFLIST_BIT is not set,
+	 * remove it from the plist.
+	 */
+	if (unlikely(val == si->pages)) {
+		del_from_avail_list(si, false);
+		return true;
 	}
+
+	return false;
 }
 
-static void add_to_avail_list(struct swap_info_struct *si)
+static void swap_usage_sub(struct swap_info_struct *si, unsigned int nr_entries)
 {
-	int nid;
+	long val = atomic_long_sub_return_relaxed(nr_entries, &si->inuse_pages);
 
-	spin_lock(&swap_avail_lock);
-	for_each_node(nid)
-		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
-	spin_unlock(&swap_avail_lock);
+	/*
+	 * If device is not full, and SWAP_USAGE_OFFLIST_BIT is set,
+	 * remove it from the plist.
+	 */
+	if (unlikely(val & SWAP_USAGE_OFFLIST_BIT))
+		add_to_avail_list(si, false);
+}
+
+static void swap_range_alloc(struct swap_info_struct *si,
+			     unsigned int nr_entries)
+{
+	if (swap_usage_add(si, nr_entries)) {
+		if (si->cluster_info && vm_swap_full())
+			schedule_work(&si->reclaim_work);
+	}
 }
 
 static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
@@ -925,8 +1031,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	for (i = 0; i < nr_entries; i++)
 		clear_bit(offset + i, si->zeromap);
 
-	if (si->inuse_pages == si->pages)
-		add_to_avail_list(si);
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
 			si->bdev->bd_disk->fops->swap_slot_free_notify;
@@ -946,7 +1050,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	 */
 	smp_wmb();
 	atomic_long_add(nr_entries, &nr_swap_pages);
-	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
+	swap_usage_sub(si, nr_entries);
 }
 
 static int cluster_alloc_swap(struct swap_info_struct *si,
@@ -1036,19 +1140,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
-		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
-			spin_lock(&swap_avail_lock);
-			if (plist_node_empty(&si->avail_lists[node])) {
-				spin_unlock(&si->lock);
-				goto nextsi;
-			}
-			WARN(!(si->flags & SWP_WRITEOK),
-			     "swap_info %d in list but !SWP_WRITEOK\n",
-			     si->type);
-			__del_from_avail_list(si);
-			spin_unlock(&si->lock);
-			goto nextsi;
-		}
 		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 					    n_goal, swp_entries, order);
 		spin_unlock(&si->lock);
@@ -1057,7 +1148,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		cond_resched();
 
 		spin_lock(&swap_avail_lock);
-nextsi:
 		/*
 		 * if we got here, it's likely that si was almost full before,
 		 * and since scan_swap_map_slots() can drop the si->lock,
@@ -1789,7 +1879,7 @@ unsigned int count_swap_pages(int type, int free)
 		if (sis->flags & SWP_WRITEOK) {
 			n = sis->pages;
 			if (free)
-				n -= sis->inuse_pages;
+				n -= swap_usage_in_pages(sis);
 		}
 		spin_unlock(&sis->lock);
 	}
@@ -2124,7 +2214,7 @@ static int try_to_unuse(unsigned int type)
 	swp_entry_t entry;
 	unsigned int i;
 
-	if (!READ_ONCE(si->inuse_pages))
+	if (!swap_usage_in_pages(si))
 		goto success;
 
 retry:
@@ -2137,7 +2227,7 @@ static int try_to_unuse(unsigned int type)
 
 	spin_lock(&mmlist_lock);
 	p = &init_mm.mmlist;
-	while (READ_ONCE(si->inuse_pages) &&
+	while (swap_usage_in_pages(si) &&
 	       !signal_pending(current) &&
 	       (p = p->next) != &init_mm.mmlist) {
 
@@ -2165,7 +2255,7 @@ static int try_to_unuse(unsigned int type)
 	mmput(prev_mm);
 
 	i = 0;
-	while (READ_ONCE(si->inuse_pages) &&
+	while (swap_usage_in_pages(si) &&
 	       !signal_pending(current) &&
 	       (i = find_next_to_unuse(si, i)) != 0) {
 
@@ -2200,7 +2290,7 @@ static int try_to_unuse(unsigned int type)
 	 * folio_alloc_swap(), temporarily hiding that swap.  It's easy
 	 * and robust (though cpu-intensive) just to keep retrying.
 	 */
-	if (READ_ONCE(si->inuse_pages)) {
+	if (swap_usage_in_pages(si)) {
 		if (!signal_pending(current))
 			goto retry;
 		return -EINTR;
@@ -2209,7 +2299,7 @@ static int try_to_unuse(unsigned int type)
 success:
 	/*
 	 * Make sure that further cleanups after try_to_unuse() returns happen
-	 * after swap_range_free() reduces si->inuse_pages to 0.
+	 * after swap_range_free() reduces inuse_pages to 0.
 	 */
 	smp_mb();
 	return 0;
@@ -2227,7 +2317,7 @@ static void drain_mmlist(void)
 	unsigned int type;
 
 	for (type = 0; type < nr_swapfiles; type++)
-		if (swap_info[type]->inuse_pages)
+		if (swap_usage_in_pages(swap_info[type]))
 			return;
 	spin_lock(&mmlist_lock);
 	list_for_each_safe(p, next, &init_mm.mmlist)
@@ -2406,7 +2496,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
 
 static void _enable_swap_info(struct swap_info_struct *si)
 {
-	si->flags |= SWP_WRITEOK;
 	atomic_long_add(si->pages, &nr_swap_pages);
 	total_swap_pages += si->pages;
 
@@ -2423,9 +2512,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
 	 */
 	plist_add(&si->list, &swap_active_head);
 
-	/* add to available list if swap device is not full */
-	if (si->inuse_pages < si->pages)
-		add_to_avail_list(si);
+	/* Add back to available list */
+	add_to_avail_list(si, true);
 }
 
 static void enable_swap_info(struct swap_info_struct *si, int prio,
@@ -2523,7 +2611,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 		goto out_dput;
 	}
 	spin_lock(&p->lock);
-	del_from_avail_list(p);
+	del_from_avail_list(p, true);
 	if (p->prio < 0) {
 		struct swap_info_struct *si = p;
 		int nid;
@@ -2541,7 +2629,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	plist_del(&p->list, &swap_active_head);
 	atomic_long_sub(p->pages, &nr_swap_pages);
 	total_swap_pages -= p->pages;
-	p->flags &= ~SWP_WRITEOK;
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
@@ -2721,7 +2808,7 @@ static int swap_show(struct seq_file *swap, void *v)
 	}
 
 	bytes = K(si->pages);
-	inuse = K(READ_ONCE(si->inuse_pages));
+	inuse = K(swap_usage_in_pages(si));
 
 	file = si->swap_file;
 	len = seq_file_path(swap, file, " \t\n\\");
@@ -2838,6 +2925,7 @@ static struct swap_info_struct *alloc_swap_info(void)
 	}
 	spin_lock_init(&p->lock);
 	spin_lock_init(&p->cont_lock);
+	atomic_long_set(&p->inuse_pages, SWAP_USAGE_OFFLIST_BIT);
 	init_completion(&p->comp);
 
 	return p;
@@ -3335,7 +3423,7 @@ void si_swapinfo(struct sysinfo *val)
 		struct swap_info_struct *si = swap_info[type];
 
 		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
-			nr_to_be_unused += READ_ONCE(si->inuse_pages);
+			nr_to_be_unused += swap_usage_in_pages(si);
 	}
 	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
 	val->totalswap = total_swap_pages + nr_to_be_unused;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (5 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 06/13] mm, swap: clean up plist removal and adding Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-04  5:46   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The flag SWP_SCANNING was used as an indicator of whether a device
is being scanned for allocation, and to prevent swapoff. Combined
with SWP_WRITEOK, the two flags work as a set of barriers for a
clean swapoff:

1. Swapoff clears SWP_WRITEOK; allocation requests will see
   ~SWP_WRITEOK and abort, as this is serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
   allocations will stop, preventing UAF.
4. Now swapoff can free everything safely.

This makes the allocation path have a hard dependency on si->lock:
allocations always have to acquire si->lock first to set
SWP_SCANNING and check SWP_WRITEOK.

This commit removes this flag and instead uses the existing per-CPU
refcount to prevent UAF in step 3. The refcount serves this purpose
well without depending on si->lock and scales very well too: just
hold a reference during the whole scan and allocation process, and
swapoff will kill and wait for the counter.

To prevent any allocation from happening after step 1, so that the
unuse in step 2 can ensure all slots are free, swapoff acquires the
ci->lock of each cluster one by one, making sure all allocations
see ~SWP_WRITEOK and abort.

This way, these dependencies on si->lock are gone. It is worth
noting that we can't kill the refcount as the first step of swapoff,
because the unuse process has to acquire the refcount.
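
A rough sketch of the resulting swapoff-side ordering, condensed from
the hunks below; swapoff_barrier() is a made-up name, and locking of
swap_lock/p->lock, the slot cache, and error handling are all omitted:

static void swapoff_barrier(struct swap_info_struct *p)
{
	/* Step 1: stop new allocations, clears SWP_WRITEOK. */
	del_from_avail_list(p, true);

	/* Cycle every ci->lock so in-flight allocations see ~SWP_WRITEOK. */
	wait_for_allocation(p);

	/* Step 2: unuse all allocated entries (unuse takes si->users refs). */
	try_to_unuse(p->type);

	/*
	 * Step 3: only now can the refcount be killed; scanners pin it
	 * via get_swap_device_info(), so waiting here prevents UAF.
	 */
	percpu_ref_kill(&p->users);
	wait_for_completion(&p->comp);
}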

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 90 ++++++++++++++++++++++++++++----------------
 2 files changed, 57 insertions(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e1eeea6307cd..02120f1005d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -219,7 +219,6 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 					/* add others here before... */
-	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
 
 #define SWAP_CLUSTER_MAX 32UL
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e6e58cfb5178..99fd0b0d84a2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 {
 	unsigned int nr_pages = 1 << order;
 
+	lockdep_assert_held(&ci->lock);
+
 	if (!(si->flags & SWP_WRITEOK))
 		return false;
 
@@ -1059,8 +1061,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;
 
-	si->flags += SWP_SCANNING;
-
 	while (n_ret < nr) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
 
@@ -1069,8 +1069,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
-	si->flags -= SWP_SCANNING;
-
 	return n_ret;
 }
 
@@ -1112,6 +1110,22 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return cluster_alloc_swap(si, usage, nr, slots, order);
 }
 
+static bool get_swap_device_info(struct swap_info_struct *si)
+{
+	if (!percpu_ref_tryget_live(&si->users))
+		return false;
+	/*
+	 * Guarantee the si->users are checked before accessing other
+	 * fields of swap_info_struct, and si->flags (SWP_WRITEOK) is
+	 * up to date.
+	 *
+	 * Paired with the spin_unlock() after setup_swap_info() in
+	 * enable_swap_info(), and smp_wmb() in swapoff.
+	 */
+	smp_rmb();
+	return true;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1139,13 +1153,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		/* requeue si to after same-priority siblings */
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
-		spin_lock(&si->lock);
-		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					    n_goal, swp_entries, order);
-		spin_unlock(&si->lock);
-		if (n_ret || size > 1)
-			goto check_out;
-		cond_resched();
+		if (get_swap_device_info(si)) {
+			spin_lock(&si->lock);
+			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+					n_goal, swp_entries, order);
+			spin_unlock(&si->lock);
+			put_swap_device(si);
+			if (n_ret || size > 1)
+				goto check_out;
+			cond_resched();
+		}
 
 		spin_lock(&swap_avail_lock);
 		/*
@@ -1296,16 +1313,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	si = swp_swap_info(entry);
 	if (!si)
 		goto bad_nofile;
-	if (!percpu_ref_tryget_live(&si->users))
+	if (!get_swap_device_info(si))
 		goto out;
-	/*
-	 * Guarantee the si->users are checked before accessing other
-	 * fields of swap_info_struct.
-	 *
-	 * Paired with the spin_unlock() after setup_swap_info() in
-	 * enable_swap_info().
-	 */
-	smp_rmb();
 	offset = swp_offset(entry);
 	if (offset >= si->max)
 		goto put_out;
@@ -1785,10 +1794,13 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-		atomic_long_dec(&nr_swap_pages);
-	spin_unlock(&si->lock);
+	if (get_swap_device_info(si)) {
+		spin_lock(&si->lock);
+		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+			atomic_long_dec(&nr_swap_pages);
+		spin_unlock(&si->lock);
+		put_swap_device(si);
+	}
 fail:
 	return entry;
 }
@@ -2562,6 +2574,25 @@ bool has_usable_swap(void)
 	return ret;
 }
 
+/*
+ * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
+ * sees the updated flags, so there will be no more allocations.
+ */
+static void wait_for_allocation(struct swap_info_struct *si)
+{
+	unsigned long offset;
+	unsigned long end = ALIGN(si->max, SWAPFILE_CLUSTER);
+	struct swap_cluster_info *ci;
+
+	BUG_ON(si->flags & SWP_WRITEOK);
+
+	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
+		ci = lock_cluster(si, offset);
+		unlock_cluster(ci);
+	}
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
@@ -2632,6 +2663,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);
 
+	wait_for_allocation(p);
+
 	disable_swap_slots_cache_lock();
 
 	set_current_oom_origin();
@@ -2674,15 +2707,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	drain_mmlist();
 
-	/* wait for anyone still in scan_swap_map_slots */
-	while (p->flags >= SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
 	p->max = 0;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (6 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-06  8:43   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 09/13] mm, swap: reduce contention on device lock Kairui Song
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, we are only using flags to indicate which list the cluster
is on. Using one bit for each list type is wasteful: as the number of
list types grows, we will consume too many bits. Additionally, the
current mixed usage of '&' and '==' is a bit confusing.

Make it clean by using an enum to define all possible cluster
statuses; only an off-list cluster will have the NONE (0) flag. Use
a wrapper to annotate and sanitize all flag settings and list
movements.
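
For illustration, a fragment in the style of the hunks below: list
membership is now tested with '==' against a single enum value, and
every list movement goes through the new wrapper, which keeps
ci->flags and the list in sync:

	/* e.g. a partially freed cluster should sit on the nonfull list */
	if (ci->flags != CLUSTER_FLAG_NONFULL)
		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
			     CLUSTER_FLAG_NONFULL);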

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h | 17 +++++++---
 mm/swapfile.c        | 75 +++++++++++++++++++++++---------------------
 2 files changed, 52 insertions(+), 40 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 02120f1005d5..339d7f0192ff 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -257,10 +257,19 @@ struct swap_cluster_info {
 	u8 order;
 	struct list_head list;
 };
-#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
-#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */
-#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */
+
+/* All on-list cluster must have a non-zero flag. */
+enum swap_cluster_flags {
+	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
+	CLUSTER_FLAG_FREE,
+	CLUSTER_FLAG_NONFULL,
+	CLUSTER_FLAG_FRAG,
+	/* Clusters with flags above are allocatable */
+	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
+	CLUSTER_FLAG_FULL,
+	CLUSTER_FLAG_DISCARD,
+	CLUSTER_FLAG_MAX,
+};
 
 /*
  * The first page in the swap file is the swap header, which is always marked
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 99fd0b0d84a2..7795a3d27273 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -403,7 +403,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags & CLUSTER_FLAG_FREE;
+	return info->flags == CLUSTER_FLAG_FREE;
 }
 
 static inline unsigned int cluster_index(struct swap_info_struct *si,
@@ -434,6 +434,27 @@ static inline void unlock_cluster(struct swap_cluster_info *ci)
 	spin_unlock(&ci->lock);
 }
 
+static void cluster_move(struct swap_info_struct *si,
+			 struct swap_cluster_info *ci, struct list_head *list,
+			 enum swap_cluster_flags new_flags)
+{
+	VM_WARN_ON(ci->flags == new_flags);
+	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
+
+	if (ci->flags == CLUSTER_FLAG_NONE) {
+		list_add_tail(&ci->list, list);
+	} else {
+		if (ci->flags == CLUSTER_FLAG_FRAG) {
+			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
+			si->frag_cluster_nr[ci->order]--;
+		}
+		list_move_tail(&ci->list, list);
+	}
+	ci->flags = new_flags;
+	if (new_flags == CLUSTER_FLAG_FRAG)
+		si->frag_cluster_nr[ci->order]++;
+}
+
 /* Add a cluster to discard list and schedule it to do discard */
 static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 		struct swap_cluster_info *ci)
@@ -447,10 +468,8 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 	 */
 	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
-
-	VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-	list_move_tail(&ci->list, &si->discard_clusters);
-	ci->flags = 0;
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
+	cluster_move(si, ci, &si->discard_clusters, CLUSTER_FLAG_DISCARD);
 	schedule_work(&si->discard_work);
 }
 
@@ -458,12 +477,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
 {
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
-
-	if (ci->flags)
-		list_move_tail(&ci->list, &si->free_clusters);
-	else
-		list_add_tail(&ci->list, &si->free_clusters);
-	ci->flags = CLUSTER_FLAG_FREE;
+	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
 
@@ -479,6 +493,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
 	while (!list_empty(&si->discard_clusters)) {
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
 		list_del(&ci->list);
+		/* Must clear flag when taking a cluster off-list */
+		ci->flags = CLUSTER_FLAG_NONE;
 		idx = cluster_index(si, ci);
 		spin_unlock(&si->lock);
 
@@ -519,9 +535,6 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
 
-	if (ci->flags & CLUSTER_FLAG_FRAG)
-		si->frag_cluster_nr[ci->order]--;
-
 	/*
 	 * If the swap is discardable, prepare discard the cluster
 	 * instead of free it immediately. The cluster will be freed
@@ -573,13 +586,9 @@ static void dec_cluster_info_page(struct swap_info_struct *si,
 		return;
 	}
 
-	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
-		VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
-		if (ci->flags & CLUSTER_FLAG_FRAG)
-			si->frag_cluster_nr[ci->order]--;
-		list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]);
-		ci->flags = CLUSTER_FLAG_NONFULL;
-	}
+	if (ci->flags != CLUSTER_FLAG_NONFULL)
+		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
+			     CLUSTER_FLAG_NONFULL);
 }
 
 static bool cluster_reclaim_range(struct swap_info_struct *si,
@@ -663,11 +672,13 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	if (!(si->flags & SWP_WRITEOK))
 		return false;
 
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_NONE);
+	VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE);
+
 	if (cluster_is_free(ci)) {
-		if (nr_pages < SWAPFILE_CLUSTER) {
-			list_move_tail(&ci->list, &si->nonfull_clusters[order]);
-			ci->flags = CLUSTER_FLAG_NONFULL;
-		}
+		if (nr_pages < SWAPFILE_CLUSTER)
+			cluster_move(si, ci, &si->nonfull_clusters[order],
+				     CLUSTER_FLAG_NONFULL);
 		ci->order = order;
 	}
 
@@ -675,14 +686,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
-	if (ci->count == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!(ci->flags &
-			  (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
-		if (ci->flags & CLUSTER_FLAG_FRAG)
-			si->frag_cluster_nr[ci->order]--;
-		list_move_tail(&ci->list, &si->full_clusters);
-		ci->flags = CLUSTER_FLAG_FULL;
-	}
+	if (ci->count == SWAPFILE_CLUSTER)
+		cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
 
 	return true;
 }
@@ -821,9 +826,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		while (!list_empty(&si->nonfull_clusters[order])) {
 			ci = list_first_entry(&si->nonfull_clusters[order],
 					      struct swap_cluster_info, list);
-			list_move_tail(&ci->list, &si->frag_clusters[order]);
-			ci->flags = CLUSTER_FLAG_FRAG;
-			si->frag_cluster_nr[order]++;
+			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
 			frags++;
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 09/13] mm, swap: reduce contention on device lock
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (7 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-06 10:12   ` Baoquan He
  2025-01-08 11:09   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 10/13] mm, swap: simplify percpu cluster updating Kairui Song
                   ` (3 subsequent siblings)
  12 siblings, 2 replies; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Currently, swap locking is mainly composed of two locks: the cluster
lock (ci->lock) and the device lock (si->lock).

The cluster lock is much more fine-grained, so it is best to use
ci->lock instead of si->lock as much as possible.

We have cleaned up other hard dependencies on si->lock. Following the
new cluster allocator design, most operations don't need to touch
si->lock at all. In practice, we only need to take si->lock when
moving clusters between lists.

To achieve this, this commit reworks the locking pattern of all
si->lock and ci->lock users, eliminates all usage of ci->lock inside
si->lock, and introduces a new design to avoid touching si->lock
unless needed.

For minimal contention and easier understanding of the system, two
ideas are introduced with the corresponding helpers: isolation and
relocation.

- Clusters will be `isolated` from the list when iterating the list
  to search for an allocatable cluster.

  This ensures other CPUs won't easily walk into the same cluster,
  and it releases si->lock after acquiring ci->lock, making this the
  only place that handles the inversion of the two locks, which
  avoids contention.

  Iterating the cluster list almost always moves the cluster
  (free -> nonfull, nonfull -> frag, frag -> frag tail), but it
  doesn't know where the cluster should be moved to until scanning
  is done. So keeping the cluster off-list is a good option with
  low overhead.

  The off-list time window of a cluster is also minimal. In the worst
  case, one CPU will return the cluster only after scanning all 512
  entries on it; previously, other users would have been busy-waiting
  on a spinlock for that same duration.
This is done with the new helper `cluster_isolate_lock`.

- Clusters will be `relocated` after allocation or freeing, according
  to their usage count and status.

  Allocations no longer hold si->lock, and may drop ci->lock for
  reclaim, so the cluster could be moved anywhere while no lock is
  held. Besides, isolation clears all flags when it takes the
  cluster off the list (the flags must be in sync with the list
  status, so cluster users don't need to touch si->lock to check the
  list status). So the cluster has to be relocated to the right list
  according to its usage after allocation or freeing.

  Relocation is optional: if the cluster flags indicate it's already
  on the right list, it will skip touching the list or si->lock.

This is done with relocate_cluster after allocation or with
[partial_]free_cluster after freeing.
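
A rough sketch of the resulting "isolate - scan/allocate - relocate"
loop, assuming the helpers added below; scan_nonfull_list() is a
made-up name and error handling is omitted:

static unsigned int scan_nonfull_list(struct swap_info_struct *si,
				      int order, unsigned char usage)
{
	struct swap_cluster_info *ci;
	unsigned int found = 0;

	/* Isolate: take ci off-list under si->lock, return with ci->lock held. */
	while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
		/* Scan/allocate; relocate_cluster() and unlock happen inside. */
		alloc_swap_scan_cluster(si, cluster_offset(si, ci),
					&found, order, usage);
		if (found)
			break;
	}
	return found;
}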

This handles the usage of all kinds of clusters in a clean way.

Scanning and allocation by iterating the cluster list are handled by
"isolate - <scan / allocate> - relocate".

Scanning and allocation of per-CPU clusters only involve
"<scan / allocate> - relocate", as the CPU already knows which
cluster to lock and use.

Freeing will only involve "relocate".

Each CPU will keep using its per-CPU cluster until all 512 entries in
it are consumed. Likewise, in the best case freeing only triggers a
cluster movement once per 512 freed entries, so si->lock is rarely
touched.
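
And a sketch of the per-CPU fast path described above, again assuming
the helpers below; the function name is hypothetical and the
next-offset bookkeeping is simplified:

static unsigned int percpu_fast_path(struct swap_info_struct *si,
				     int order, unsigned char usage)
{
	struct swap_cluster_info *ci;
	unsigned int offset, found = 0;

	local_lock(&si->percpu_cluster->lock);
	offset = __this_cpu_read(si->percpu_cluster->next[order]);
	if (offset) {
		ci = lock_cluster(si, offset);
		/* The cluster may have been reused by another order meanwhile. */
		if (cluster_is_usable(ci, order))
			alloc_swap_scan_cluster(si, offset, &found, order, usage);
		else
			unlock_cluster(ci);
	}
	local_unlock(&si->percpu_cluster->lock);
	return found;
}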

Testing by building the Linux kernel with defconfig showed a huge
improvement:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, on Intel 8255C:
Before:
Sys time: 73578.30, Real time: 864.05
After: (-50.7% sys time, -44.8% real time)
Sys time: 36227.49, Real time: 476.66

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, on Intel 8255C:
(avg of 4 test runs)
Before:
Sys time: 74044.85, Real time: 846.51
hugepages-64kB/stats/swpout: 1735216
hugepages-64kB/stats/swpout_fallback: 430333

After: (-40.4% sys time, -37.1% real time)
Sys time: 44160.56, Real time: 532.07
hugepages-64kB/stats/swpout: 1786288
hugepages-64kB/stats/swpout_fallback: 243384

time make -j32 / 512M memcg, 4K pages, 5G ZRAM, on AMD 7K62:
Before:
Sys time: 8098.21, Real time: 401.3
After: (-22.6% sys time, -12.8% real time)
Sys time: 6265.02, Real time: 349.83

The allocation success rate also improved slightly as we sanitized
cluster usage with the newly defined helpers; previously, dropping
si->lock or ci->lock during a scan would cause cluster order
shuffling.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |   3 +-
 mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
 2 files changed, 246 insertions(+), 192 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 339d7f0192ff..c4ff31cb6bde 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -291,6 +291,7 @@ enum swap_cluster_flags {
  * throughput.
  */
 struct percpu_cluster {
+	local_lock_t lock; /* Protect the percpu_cluster above */
 	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
 };
 
@@ -313,7 +314,7 @@ struct swap_info_struct {
 					/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
-	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
+	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 7795a3d27273..dadd4fead689 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	folio_ref_sub(folio, nr_pages);
 	folio_set_dirty(folio);
 
-	spin_lock(&si->lock);
 	/* Only sinple page folio can be backed by zswap */
 	if (nr_pages == 1)
 		zswap_invalidate(entry);
 	swap_entry_range_free(si, entry, nr_pages);
-	spin_unlock(&si->lock);
 	ret = nr_pages;
 out_unlock:
 	folio_unlock(folio);
@@ -403,7 +401,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
 
 static inline bool cluster_is_free(struct swap_cluster_info *info)
 {
-	return info->flags == CLUSTER_FLAG_FREE;
+	return info->count == 0;
+}
+
+static inline bool cluster_is_discard(struct swap_cluster_info *info)
+{
+	return info->flags == CLUSTER_FLAG_DISCARD;
+}
+
+static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
+{
+	if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
+		return false;
+	if (!order)
+		return true;
+	return cluster_is_free(ci) || order == ci->order;
 }
 
 static inline unsigned int cluster_index(struct swap_info_struct *si,
@@ -440,19 +452,20 @@ static void cluster_move(struct swap_info_struct *si,
 {
 	VM_WARN_ON(ci->flags == new_flags);
 	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
+	lockdep_assert_held(&ci->lock);
 
-	if (ci->flags == CLUSTER_FLAG_NONE) {
+	spin_lock(&si->lock);
+	if (ci->flags == CLUSTER_FLAG_NONE)
 		list_add_tail(&ci->list, list);
-	} else {
-		if (ci->flags == CLUSTER_FLAG_FRAG) {
-			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
-			si->frag_cluster_nr[ci->order]--;
-		}
+	else
 		list_move_tail(&ci->list, list);
-	}
+	spin_unlock(&si->lock);
+
+	if (ci->flags == CLUSTER_FLAG_FRAG)
+		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
+	else if (new_flags == CLUSTER_FLAG_FRAG)
+		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
-	if (new_flags == CLUSTER_FLAG_FRAG)
-		si->frag_cluster_nr[ci->order]++;
 }
 
 /* Add a cluster to discard list and schedule it to do discard */
@@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
 
 static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
-	lockdep_assert_held(&si->lock);
 	lockdep_assert_held(&ci->lock);
 	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
 	ci->order = 0;
 }
 
+/*
+ * Isolate and lock the first cluster that is not contended on a list,
+ * and clear its flag before taking it off-list. The cluster flag must
+ * be in sync with the list status, so cluster updaters can always know
+ * the cluster list status without touching the si lock.
+ *
+ * Note it's possible that all clusters on a list are contended, so
+ * this may return NULL for a non-empty list.
+ */
+static struct swap_cluster_info *cluster_isolate_lock(
+		struct swap_info_struct *si, struct list_head *list)
+{
+	struct swap_cluster_info *ci, *ret = NULL;
+
+	spin_lock(&si->lock);
+
+	if (unlikely(!(si->flags & SWP_WRITEOK)))
+		goto out;
+
+	list_for_each_entry(ci, list, list) {
+		if (!spin_trylock(&ci->lock))
+			continue;
+
+		/* We may only isolate and clear flags of following lists */
+		VM_BUG_ON(!ci->flags);
+		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
+			  ci->flags != CLUSTER_FLAG_FULL);
+
+		list_del(&ci->list);
+		ci->flags = CLUSTER_FLAG_NONE;
+		ret = ci;
+		break;
+	}
+out:
+	spin_unlock(&si->lock);
+
+	return ret;
+}
+
 /*
  * Doing discard actually. After a cluster discard is finished, the cluster
- * will be added to free cluster list. caller should hold si->lock.
-*/
-static void swap_do_scheduled_discard(struct swap_info_struct *si)
+ * will be added to the free cluster list. Discard clusters are a bit
+ * special as they don't participate in allocation or reclaim, so clusters
+ * marked as CLUSTER_FLAG_DISCARD must remain off-list or on the discard list.
+ */
+static bool swap_do_scheduled_discard(struct swap_info_struct *si)
 {
 	struct swap_cluster_info *ci;
+	bool ret = false;
 	unsigned int idx;
 
+	spin_lock(&si->lock);
 	while (!list_empty(&si->discard_clusters)) {
 		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+		/*
+		 * Delete the cluster from list but don't clear its flags until
+		 * discard is done, so isolation and relocation will skip it.
+		 */
 		list_del(&ci->list);
-		/* Must clear flag when taking a cluster off-list */
-		ci->flags = CLUSTER_FLAG_NONE;
 		idx = cluster_index(si, ci);
 		spin_unlock(&si->lock);
-
 		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
 				SWAPFILE_CLUSTER);
 
-		spin_lock(&si->lock);
 		spin_lock(&ci->lock);
-		__free_cluster(si, ci);
+		/*
+		 * Discard is done, clear its flags as it's now off-list,
+		 * then return the cluster to allocation list.
+		 */
+		ci->flags = CLUSTER_FLAG_NONE;
 		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
 				0, SWAPFILE_CLUSTER);
+		__free_cluster(si, ci);
 		spin_unlock(&ci->lock);
+		ret = true;
+		spin_lock(&si->lock);
 	}
+	spin_unlock(&si->lock);
+	return ret;
 }
 
 static void swap_discard_work(struct work_struct *work)
@@ -516,9 +580,7 @@ static void swap_discard_work(struct work_struct *work)
 
 	si = container_of(work, struct swap_info_struct, discard_work);
 
-	spin_lock(&si->lock);
 	swap_do_scheduled_discard(si);
-	spin_unlock(&si->lock);
 }
 
 static void swap_users_ref_free(struct percpu_ref *ref)
@@ -529,10 +591,14 @@ static void swap_users_ref_free(struct percpu_ref *ref)
 	complete(&si->comp);
 }
 
+/*
+ * Must be called after freeing if ci->count == 0, moves the cluster to free
+ * or discard list.
+ */
 static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
 {
 	VM_BUG_ON(ci->count != 0);
-	lockdep_assert_held(&si->lock);
+	VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
 	lockdep_assert_held(&ci->lock);
 
 	/*
@@ -549,6 +615,48 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 	__free_cluster(si, ci);
 }
 
+/*
+ * Must be called after freeing if ci->count != 0, moves the cluster to
+ * nonfull list.
+ */
+static void partial_free_cluster(struct swap_info_struct *si,
+				 struct swap_cluster_info *ci)
+{
+	VM_BUG_ON(!ci->count || ci->count == SWAPFILE_CLUSTER);
+	lockdep_assert_held(&ci->lock);
+
+	if (ci->flags != CLUSTER_FLAG_NONFULL)
+		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
+			     CLUSTER_FLAG_NONFULL);
+}
+
+/*
+ * Must be called after allocation, moves the cluster to full or frag list.
+ * Note: allocation doesn't acquire si lock, and may drop the ci lock for
+ * reclaim, so the cluster could be anywhere when called.
+ */
+static void relocate_cluster(struct swap_info_struct *si,
+			     struct swap_cluster_info *ci)
+{
+	lockdep_assert_held(&ci->lock);
+
+	/* Discard cluster must remain off-list or on discard list */
+	if (cluster_is_discard(ci))
+		return;
+
+	if (!ci->count) {
+		free_cluster(si, ci);
+	} else if (ci->count != SWAPFILE_CLUSTER) {
+		if (ci->flags != CLUSTER_FLAG_FRAG)
+			cluster_move(si, ci, &si->frag_clusters[ci->order],
+				     CLUSTER_FLAG_FRAG);
+	} else {
+		if (ci->flags != CLUSTER_FLAG_FULL)
+			cluster_move(si, ci, &si->full_clusters,
+				     CLUSTER_FLAG_FULL);
+	}
+}
+
 /*
  * The cluster corresponding to page_nr will be used. The cluster will not be
  * added to free cluster list and its usage counter will be increased by 1.
@@ -567,30 +675,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
 	VM_BUG_ON(ci->flags);
 }
 
-/*
- * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
- * which means no page in the cluster is in use, we can optionally discard
- * the cluster and add it to free cluster list.
- */
-static void dec_cluster_info_page(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci, int nr_pages)
-{
-	VM_BUG_ON(ci->count < nr_pages);
-	VM_BUG_ON(cluster_is_free(ci));
-	lockdep_assert_held(&si->lock);
-	lockdep_assert_held(&ci->lock);
-	ci->count -= nr_pages;
-
-	if (!ci->count) {
-		free_cluster(si, ci);
-		return;
-	}
-
-	if (ci->flags != CLUSTER_FLAG_NONFULL)
-		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
-			     CLUSTER_FLAG_NONFULL);
-}
-
 static bool cluster_reclaim_range(struct swap_info_struct *si,
 				  struct swap_cluster_info *ci,
 				  unsigned long start, unsigned long end)
@@ -600,8 +684,6 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 	int nr_reclaim;
 
 	spin_unlock(&ci->lock);
-	spin_unlock(&si->lock);
-
 	do {
 		switch (READ_ONCE(map[offset])) {
 		case 0:
@@ -619,9 +701,7 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 		}
 	} while (offset < end);
 out:
-	spin_lock(&si->lock);
 	spin_lock(&ci->lock);
-
 	/*
 	 * Recheck the range no matter reclaim succeeded or not, the slot
 	 * could have been be freed while we are not holding the lock.
@@ -635,11 +715,11 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 
 static bool cluster_scan_range(struct swap_info_struct *si,
 			       struct swap_cluster_info *ci,
-			       unsigned long start, unsigned int nr_pages)
+			       unsigned long start, unsigned int nr_pages,
+			       bool *need_reclaim)
 {
 	unsigned long offset, end = start + nr_pages;
 	unsigned char *map = si->swap_map;
-	bool need_reclaim = false;
 
 	for (offset = start; offset < end; offset++) {
 		switch (READ_ONCE(map[offset])) {
@@ -648,16 +728,13 @@ static bool cluster_scan_range(struct swap_info_struct *si,
 		case SWAP_HAS_CACHE:
 			if (!vm_swap_full())
 				return false;
-			need_reclaim = true;
+			*need_reclaim = true;
 			continue;
 		default:
 			return false;
 		}
 	}
 
-	if (need_reclaim)
-		return cluster_reclaim_range(si, ci, start, end);
-
 	return true;
 }
 
@@ -672,23 +749,13 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	if (!(si->flags & SWP_WRITEOK))
 		return false;
 
-	VM_BUG_ON(ci->flags == CLUSTER_FLAG_NONE);
-	VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE);
-
-	if (cluster_is_free(ci)) {
-		if (nr_pages < SWAPFILE_CLUSTER)
-			cluster_move(si, ci, &si->nonfull_clusters[order],
-				     CLUSTER_FLAG_NONFULL);
+	if (cluster_is_free(ci))
 		ci->order = order;
-	}
 
 	memset(si->swap_map + start, usage, nr_pages);
 	swap_range_alloc(si, nr_pages);
 	ci->count += nr_pages;
 
-	if (ci->count == SWAPFILE_CLUSTER)
-		cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
-
 	return true;
 }
 
@@ -699,37 +766,55 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 	unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
+	bool need_reclaim, ret;
 	struct swap_cluster_info *ci;
 
-	if (end < nr_pages)
-		return SWAP_NEXT_INVALID;
-	end -= nr_pages;
+	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	lockdep_assert_held(&ci->lock);
 
-	ci = lock_cluster(si, offset);
-	if (ci->count + nr_pages > SWAPFILE_CLUSTER) {
+	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) {
 		offset = SWAP_NEXT_INVALID;
-		goto done;
+		goto out;
 	}
 
-	while (offset <= end) {
-		if (cluster_scan_range(si, ci, offset, nr_pages)) {
-			if (!cluster_alloc_range(si, ci, offset, usage, order)) {
-				offset = SWAP_NEXT_INVALID;
-				goto done;
-			}
-			*foundp = offset;
-			if (ci->count == SWAPFILE_CLUSTER) {
+	for (end -= nr_pages; offset <= end; offset += nr_pages) {
+		need_reclaim = false;
+		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
+			continue;
+		if (need_reclaim) {
+			ret = cluster_reclaim_range(si, ci, start, end);
+			/*
+			 * Reclaim drops ci->lock and cluster could be used
+			 * by another order. Not checking flag as off-list
+			 * cluster has no flag set, and change of list
+			 * won't cause fragmentation.
+			 */
+			if (!cluster_is_usable(ci, order)) {
 				offset = SWAP_NEXT_INVALID;
-				goto done;
+				goto out;
 			}
-			offset += nr_pages;
-			break;
+			if (cluster_is_free(ci))
+				offset = start;
+			/* Reclaim failed but cluster is usable, try next */
+			if (!ret)
+				continue;
+		}
+		if (!cluster_alloc_range(si, ci, offset, usage, order)) {
+			offset = SWAP_NEXT_INVALID;
+			goto out;
+		}
+		*foundp = offset;
+		if (ci->count == SWAPFILE_CLUSTER) {
+			offset = SWAP_NEXT_INVALID;
+			goto out;
 		}
 		offset += nr_pages;
+		break;
 	}
 	if (offset > end)
 		offset = SWAP_NEXT_INVALID;
-done:
+out:
+	relocate_cluster(si, ci);
 	unlock_cluster(ci);
 	return offset;
 }
@@ -746,18 +831,17 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 	if (force)
 		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
 
-	while (!list_empty(&si->full_clusters)) {
-		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
-		list_move_tail(&ci->list, &si->full_clusters);
+	while ((ci = cluster_isolate_lock(si, &si->full_clusters))) {
 		offset = cluster_offset(si, ci);
 		end = min(si->max, offset + SWAPFILE_CLUSTER);
 		to_scan--;
 
-		spin_unlock(&si->lock);
 		while (offset < end) {
 			if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY | TTRS_DIRECT);
+				spin_lock(&ci->lock);
 				if (nr_reclaim) {
 					offset += abs(nr_reclaim);
 					continue;
@@ -765,8 +849,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 			}
 			offset++;
 		}
-		spin_lock(&si->lock);
 
+		unlock_cluster(ci);
 		if (to_scan <= 0)
 			break;
 	}
@@ -778,9 +862,7 @@ static void swap_reclaim_work(struct work_struct *work)
 
 	si = container_of(work, struct swap_info_struct, reclaim_work);
 
-	spin_lock(&si->lock);
 	swap_reclaim_full_clusters(si, true);
-	spin_unlock(&si->lock);
 }
 
 /*
@@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
 static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
 					      unsigned char usage)
 {
-	struct percpu_cluster *cluster;
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-new_cluster:
-	lockdep_assert_held(&si->lock);
-	cluster = this_cpu_ptr(si->percpu_cluster);
-	offset = cluster->next[order];
+	/* Fast path using per CPU cluster */
+	local_lock(&si->percpu_cluster->lock);
+	offset = __this_cpu_read(si->percpu_cluster->next[order]);
 	if (offset) {
-		offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
+		ci = lock_cluster(si, offset);
+		/* Cluster could have been used by another order */
+		if (cluster_is_usable(ci, order)) {
+			if (cluster_is_free(ci))
+				offset = cluster_offset(si, ci);
+			offset = alloc_swap_scan_cluster(si, offset, &found,
+							 order, usage);
+		} else {
+			unlock_cluster(ci);
+		}
 		if (found)
 			goto done;
 	}
 
-	if (!list_empty(&si->free_clusters)) {
-		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
-		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
-		/*
-		 * Either we didn't touch the cluster due to swapoff,
-		 * or the allocation must success.
-		 */
-		VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
-		goto done;
+new_cluster:
+	ci = cluster_isolate_lock(si, &si->free_clusters);
+	if (ci) {
+		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+						 &found, order, usage);
+		if (found)
+			goto done;
 	}
 
 	/* Try reclaim from full clusters if free clusters list is drained */
@@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0;
+		unsigned int frags = 0, frags_existing;
 
-		while (!list_empty(&si->nonfull_clusters[order])) {
-			ci = list_first_entry(&si->nonfull_clusters[order],
-					      struct swap_cluster_info, list);
-			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
+		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
-			frags++;
+			/*
+			 * Scanning an isolated cluster always takes it off the
+			 * nonfull list, whether or not the allocation succeeds.
+			 */
 			if (found)
 				goto done;
+			frags++;
 		}
 
-		/*
-		 * Nonfull clusters are moved to frag tail if we reached
-		 * here, count them too, don't over scan the frag list.
-		 */
-		while (frags < si->frag_cluster_nr[order]) {
-			ci = list_first_entry(&si->frag_clusters[order],
-					      struct swap_cluster_info, list);
+		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
+		while (frags < frags_existing &&
+		       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
+			atomic_long_dec(&si->frag_cluster_nr[order]);
 			/*
-			 * Rotate the frag list to iterate, they were all failing
-			 * high order allocation or moved here due to per-CPU usage,
-			 * this help keeping usable cluster ahead.
+			 * Rotate the frag list to iterate, they were all
+			 * failing high order allocation or moved here due to
+			 * per-CPU usage, but they could contain newly released
+			 * reclaimable (eg. lazy-freed swap cache) slots.
 			 */
-			list_move_tail(&ci->list, &si->frag_clusters[order]);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
 							 &found, order, usage);
-			frags++;
 			if (found)
 				goto done;
+			frags++;
 		}
 	}
 
-	if (!list_empty(&si->discard_clusters)) {
-		/*
-		 * we don't have free cluster but have some clusters in
-		 * discarding, do discard now and reclaim them, then
-		 * reread cluster_next_cpu since we dropped si->lock
-		 */
-		swap_do_scheduled_discard(si);
+	/*
+	 * We don't have free cluster but have some clusters in
+	 * discarding, do discard now and reclaim them, then
+	 * reread cluster_next_cpu since we dropped si->lock
+	 */
+	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
 		goto new_cluster;
-	}
 
 	if (order)
 		goto done;
@@ -874,26 +957,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		 * Clusters here have at least one usable slots and can't fail order 0
 		 * allocation, but reclaim may drop si->lock and race with another user.
 		 */
-		while (!list_empty(&si->frag_clusters[o])) {
-			ci = list_first_entry(&si->frag_clusters[o],
-					      struct swap_cluster_info, list);
+		while ((ci = cluster_isolate_lock(si, &si->frag_clusters[o]))) {
+			atomic_long_dec(&si->frag_cluster_nr[o]);
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, 0, usage);
+							 &found, order, usage);
 			if (found)
 				goto done;
 		}
 
-		while (!list_empty(&si->nonfull_clusters[o])) {
-			ci = list_first_entry(&si->nonfull_clusters[o],
-					      struct swap_cluster_info, list);
+		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[o]))) {
 			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, 0, usage);
+							 &found, order, usage);
 			if (found)
 				goto done;
 		}
 	}
 done:
-	cluster->next[order] = offset;
+	__this_cpu_write(si->percpu_cluster->next[order], offset);
+	local_unlock(&si->percpu_cluster->lock);
+
 	return found;
 }
 
@@ -1157,14 +1239,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		if (get_swap_device_info(si)) {
-			spin_lock(&si->lock);
 			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
 					n_goal, swp_entries, order);
-			spin_unlock(&si->lock);
 			put_swap_device(si);
 			if (n_ret || size > 1)
 				goto check_out;
-			cond_resched();
 		}
 
 		spin_lock(&swap_avail_lock);
@@ -1377,9 +1456,7 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	if (!has_cache) {
 		for (i = 0; i < nr; i++)
 			zswap_invalidate(swp_entry(si->type, offset + i));
-		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, nr);
-		spin_unlock(&si->lock);
 	}
 	return has_cache;
 
@@ -1408,16 +1485,27 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 	unsigned char *map_end = map + nr_pages;
 	struct swap_cluster_info *ci;
 
+	/* It should never free entries across different clusters */
+	VM_BUG_ON((offset / SWAPFILE_CLUSTER) != ((offset + nr_pages - 1) / SWAPFILE_CLUSTER));
+
 	ci = lock_cluster(si, offset);
+	VM_BUG_ON(cluster_is_free(ci));
+	VM_BUG_ON(ci->count < nr_pages);
+
+	ci->count -= nr_pages;
 	do {
 		VM_BUG_ON(*map != SWAP_HAS_CACHE);
 		*map = 0;
 	} while (++map < map_end);
-	dec_cluster_info_page(si, ci, nr_pages);
-	unlock_cluster(ci);
 
 	mem_cgroup_uncharge_swap(entry, nr_pages);
 	swap_range_free(si, offset, nr_pages);
+
+	if (!ci->count)
+		free_cluster(si, ci);
+	else
+		partial_free_cluster(si, ci);
+	unlock_cluster(ci);
 }
 
 static void cluster_swap_free_nr(struct swap_info_struct *si,
@@ -1489,9 +1577,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	ci = lock_cluster(si, offset);
 	if (size > 1 && swap_is_has_cache(si, offset, size)) {
 		unlock_cluster(ci);
-		spin_lock(&si->lock);
 		swap_entry_range_free(si, entry, size);
-		spin_unlock(&si->lock);
 		return;
 	}
 	for (int i = 0; i < size; i++, entry.val++) {
@@ -1506,46 +1592,19 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	unlock_cluster(ci);
 }
 
-static int swp_entry_cmp(const void *ent1, const void *ent2)
-{
-	const swp_entry_t *e1 = ent1, *e2 = ent2;
-
-	return (int)swp_type(*e1) - (int)swp_type(*e2);
-}
-
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
-	struct swap_info_struct *si, *prev;
 	int i;
+	struct swap_info_struct *si = NULL;
 
 	if (n <= 0)
 		return;
 
-	prev = NULL;
-	si = NULL;
-
-	/*
-	 * Sort swap entries by swap device, so each lock is only taken once.
-	 * nr_swapfiles isn't absolutely correct, but the overhead of sort() is
-	 * so low that it isn't necessary to optimize further.
-	 */
-	if (nr_swapfiles > 1)
-		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
 	for (i = 0; i < n; ++i) {
 		si = _swap_info_get(entries[i]);
-
-		if (si != prev) {
-			if (prev != NULL)
-				spin_unlock(&prev->lock);
-			if (si != NULL)
-				spin_lock(&si->lock);
-		}
 		if (si)
 			swap_entry_range_free(si, entries[i], 1);
-		prev = si;
 	}
-	if (si)
-		spin_unlock(&si->lock);
 }
 
 int __swap_count(swp_entry_t entry)
@@ -1797,13 +1856,8 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	if (get_swap_device_info(si)) {
-		spin_lock(&si->lock);
-		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-			atomic_long_dec(&nr_swap_pages);
-		spin_unlock(&si->lock);
-		put_swap_device(si);
-	}
+	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+		atomic_long_dec(&nr_swap_pages);
 fail:
 	return entry;
 }
@@ -3141,6 +3195,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
 			cluster->next[i] = SWAP_NEXT_INVALID;
+		local_lock_init(&cluster->lock);
 	}
 
 	/*
@@ -3164,7 +3219,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		si->frag_cluster_nr[i] = 0;
+		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
@@ -3646,7 +3701,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 		 */
 		goto outer;
 	}
-	spin_lock(&si->lock);
 
 	offset = swp_offset(entry);
 
@@ -3711,7 +3765,6 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
 	spin_unlock(&si->cont_lock);
 out:
 	unlock_cluster(ci);
-	spin_unlock(&si->lock);
 	put_swap_device(si);
 outer:
 	if (page)
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 10/13] mm, swap: simplify percpu cluster updating
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (8 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 09/13] mm, swap: reduce contention on device lock Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2025-01-09  2:07   ` Baoquan He
  2024-12-30 17:46 ` [PATCH v3 11/13] mm, swap: introduce a helper for retrieving cluster from offset Kairui Song
                   ` (2 subsequent siblings)
  12 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Instead of using a return argument, we can simply store the next
cluster offset in its fixed percpu location, which reduces the stack
usage and simplifies the function (a condensed before/after of the
call pattern is shown after the measurements below):

Object size:
./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271)
Function                                     old     new   delta
get_swap_pages                              2847    2733    -114
alloc_swap_scan_cluster                      894     737    -157
Total: Before=30833, After=30562, chg -0.88%

Stack usage:
Before:
swapfile.c:1190:5:get_swap_pages       240    static

After:
swapfile.c:1185:5:get_swap_pages       216    static
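
For reference, the call pattern before and after this change, taken
from the hunks below with surrounding details trimmed:

	/*
	 * Before: the found slot came back via a pointer, and the caller
	 * stored the returned next offset into next[order] afterwards.
	 */
	offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);

	/*
	 * After: the helper writes next[order] itself and simply returns
	 * the found slot (or SWAP_ENTRY_INVALID).
	 */
	found = alloc_swap_scan_cluster(si, ci, offset, order, usage);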

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  4 +--
 mm/swapfile.c        | 66 +++++++++++++++++++-------------------------
 2 files changed, 31 insertions(+), 39 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index c4ff31cb6bde..4c1d2e69689f 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -275,9 +275,9 @@ enum swap_cluster_flags {
  * The first page in the swap file is the swap header, which is always marked
  * bad to prevent it from being allocated as an entry. This also prevents the
  * cluster to which it belongs being marked free. Therefore 0 is safe to use as
- * a sentinel to indicate next is not valid in percpu_cluster.
+ * a sentinel to indicate an entry is not valid.
  */
-#define SWAP_NEXT_INVALID	0
+#define SWAP_ENTRY_INVALID	0
 
 #ifdef CONFIG_THP_SWAP
 #define SWAP_NR_ORDERS		(PMD_ORDER + 1)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dadd4fead689..60a650ba88fd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -759,23 +759,23 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 	return true;
 }
 
-static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
-					    unsigned int *foundp, unsigned int order,
+/* Try to use a new cluster for the current CPU and allocate from it. */
+static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
+					    struct swap_cluster_info *ci,
+					    unsigned long offset,
+					    unsigned int order,
 					    unsigned char usage)
 {
-	unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
+	unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
+	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
 	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
 	unsigned int nr_pages = 1 << order;
 	bool need_reclaim, ret;
-	struct swap_cluster_info *ci;
 
-	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
 	lockdep_assert_held(&ci->lock);
 
-	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) {
-		offset = SWAP_NEXT_INVALID;
+	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
 		goto out;
-	}
 
 	for (end -= nr_pages; offset <= end; offset += nr_pages) {
 		need_reclaim = false;
@@ -789,34 +789,27 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
 			 * cluster has no flag set, and change of list
 			 * won't cause fragmentation.
 			 */
-			if (!cluster_is_usable(ci, order)) {
-				offset = SWAP_NEXT_INVALID;
+			if (!cluster_is_usable(ci, order))
 				goto out;
-			}
 			if (cluster_is_free(ci))
 				offset = start;
 			/* Reclaim failed but cluster is usable, try next */
 			if (!ret)
 				continue;
 		}
-		if (!cluster_alloc_range(si, ci, offset, usage, order)) {
-			offset = SWAP_NEXT_INVALID;
-			goto out;
-		}
-		*foundp = offset;
-		if (ci->count == SWAPFILE_CLUSTER) {
-			offset = SWAP_NEXT_INVALID;
-			goto out;
-		}
+		if (!cluster_alloc_range(si, ci, offset, usage, order))
+			break;
+		found = offset;
 		offset += nr_pages;
+		if (ci->count < SWAPFILE_CLUSTER && offset <= end)
+			next = offset;
 		break;
 	}
-	if (offset > end)
-		offset = SWAP_NEXT_INVALID;
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	return offset;
+	__this_cpu_write(si->percpu_cluster->next[order], next);
+	return found;
 }
 
 /* Return true if reclaimed a whole cluster */
@@ -885,8 +878,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		if (cluster_is_usable(ci, order)) {
 			if (cluster_is_free(ci))
 				offset = cluster_offset(si, ci);
-			offset = alloc_swap_scan_cluster(si, offset, &found,
-							 order, usage);
+			found = alloc_swap_scan_cluster(si, ci, offset,
+							order, usage);
 		} else {
 			unlock_cluster(ci);
 		}
@@ -897,8 +890,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 new_cluster:
 	ci = cluster_isolate_lock(si, &si->free_clusters);
 	if (ci) {
-		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-						 &found, order, usage);
+		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+						order, usage);
 		if (found)
 			goto done;
 	}
@@ -911,8 +904,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		unsigned int frags = 0, frags_existing;
 
 		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
 			/*
 			 * With `fragmenting` set to true, it will surely take
 			 * the cluster off nonfull list
@@ -932,8 +925,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			 * per-CPU usage, but they could contain newly released
 			 * reclaimable (eg. lazy-freed swap cache) slots.
 			 */
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
 			if (found)
 				goto done;
 			frags++;
@@ -959,21 +952,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		 */
 		while ((ci = cluster_isolate_lock(si, &si->frag_clusters[o]))) {
 			atomic_long_dec(&si->frag_cluster_nr[o]);
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							0, usage);
 			if (found)
 				goto done;
 		}
 
 		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[o]))) {
-			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
-							 &found, order, usage);
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							0, usage);
 			if (found)
 				goto done;
 		}
 	}
 done:
-	__this_cpu_write(si->percpu_cluster->next[order], offset);
 	local_unlock(&si->percpu_cluster->lock);
 
 	return found;
@@ -3194,7 +3186,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 
 		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cluster->next[i] = SWAP_NEXT_INVALID;
+			cluster->next[i] = SWAP_ENTRY_INVALID;
 		local_lock_init(&cluster->lock);
 	}
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 11/13] mm, swap: introduce a helper for retrieving cluster from offset
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (9 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 10/13] mm, swap: simplify percpu cluster updating Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2024-12-30 17:46 ` [PATCH v3 12/13] mm, swap: use a global swap cluster for non-rotation devices Kairui Song
  2024-12-30 17:46 ` [PATCH v3 13/13] mm, swap_slots: remove slot cache for freeing path Kairui Song
  12 siblings, 0 replies; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

It's a common operation to retrieve the cluster info from offset,
introduce a helper for this.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 60a650ba88fd..a3d1239d944b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -424,6 +424,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
 	return ci - si->cluster_info;
 }
 
+static inline struct swap_cluster_info *offset_to_cluster(struct swap_info_struct *si,
+							  unsigned long offset)
+{
+	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+}
+
 static inline unsigned int cluster_offset(struct swap_info_struct *si,
 					  struct swap_cluster_info *ci)
 {
@@ -435,7 +441,7 @@ static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si
 {
 	struct swap_cluster_info *ci;
 
-	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
+	ci = offset_to_cluster(si, offset);
 	spin_lock(&ci->lock);
 
 	return ci;
@@ -1477,10 +1483,10 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 	unsigned char *map_end = map + nr_pages;
 	struct swap_cluster_info *ci;
 
-	/* It should never free entries across different clusters */
-	VM_BUG_ON((offset / SWAPFILE_CLUSTER) != ((offset + nr_pages - 1) / SWAPFILE_CLUSTER));
-
 	ci = lock_cluster(si, offset);
+
+	/* It should never free entries across different clusters */
+	VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
 	VM_BUG_ON(cluster_is_free(ci));
 	VM_BUG_ON(ci->count < nr_pages);
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 12/13] mm, swap: use a global swap cluster for non-rotation devices
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (10 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 11/13] mm, swap: introduce a helper for retrieving cluster from offset Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  2024-12-30 17:46 ` [PATCH v3 13/13] mm, swap_slots: remove slot cache for freeing path Kairui Song
  12 siblings, 0 replies; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Non-rotational devices (SSD / ZRAM) can tolerate fragmentation, so the
goal of the SWAP allocator is to avoid contention for clusters. It uses
a per-CPU cluster design, and each CPU will use a different cluster as
much as possible.

However, HDDs are very sensitive to fragmentation, while contention is
trivial in comparison. Therefore, we use one global cluster instead. This ensures
that each order will be written to the same cluster as much as possible,
which helps make the I/O more continuous.
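
Conceptually, the fast path of the allocator now just branches on the
device type when picking the cluster "next" hint (a condensed sketch of
the hunks below, not the literal code):

	if (si->flags & SWP_SOLIDSTATE) {
		/* SSD / ZRAM: per-CPU cluster, avoids contention */
		local_lock(&si->percpu_cluster->lock);
		offset = __this_cpu_read(si->percpu_cluster->next[order]);
	} else {
		/* HDD: one shared cluster, keeps writes sequential */
		spin_lock(&si->global_cluster_lock);
		offset = si->global_cluster->next[order];
	}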

This ensures that the performance of the cluster allocator is as good as
that of the old allocator. Test results after this commit, compared to
before this series:

Tested using 'make -j32' with tinyconfig, a 1G memcg limit, and HDD swap:

Before this series:
114.44user 29.11system 39:42.90elapsed 6%CPU (0avgtext+0avgdata 157284maxresident)k
2901232inputs+0outputs (238877major+4227640minor)pagefaults

After this commit:
113.90user 23.81system 38:11.77elapsed 6%CPU (0avgtext+0avgdata 157260maxresident)k
2548728inputs+0outputs (235471major+4238110minor)pagefaults

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  2 ++
 mm/swapfile.c        | 51 ++++++++++++++++++++++++++++++++------------
 2 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4c1d2e69689f..b13b72645db3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -318,6 +318,8 @@ struct swap_info_struct {
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
+	struct percpu_cluster *global_cluster; /* Use one global cluster for rotating device */
+	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
 	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
 	struct block_device *bdev;	/* swap device or bdev of swap file */
 	struct file *swap_file;		/* seldom referenced */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a3d1239d944b..e57e5453a25b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -814,7 +814,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 out:
 	relocate_cluster(si, ci);
 	unlock_cluster(ci);
-	__this_cpu_write(si->percpu_cluster->next[order], next);
+	if (si->flags & SWP_SOLIDSTATE)
+		__this_cpu_write(si->percpu_cluster->next[order], next);
+	else
+		si->global_cluster->next[order] = next;
 	return found;
 }
 
@@ -875,9 +878,16 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	struct swap_cluster_info *ci;
 	unsigned int offset, found = 0;
 
-	/* Fast path using per CPU cluster */
-	local_lock(&si->percpu_cluster->lock);
-	offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	if (si->flags & SWP_SOLIDSTATE) {
+		/* Fast path using per CPU cluster */
+		local_lock(&si->percpu_cluster->lock);
+		offset = __this_cpu_read(si->percpu_cluster->next[order]);
+	} else {
+		/* Serialize HDD SWAP allocation for each device. */
+		spin_lock(&si->global_cluster_lock);
+		offset = si->global_cluster->next[order];
+	}
+
 	if (offset) {
 		ci = lock_cluster(si, offset);
 		/* Cluster could have been used by another order */
@@ -972,8 +982,10 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		}
 	}
 done:
-	local_unlock(&si->percpu_cluster->lock);
-
+	if (si->flags & SWP_SOLIDSTATE)
+		local_unlock(&si->percpu_cluster->lock);
+	else
+		spin_unlock(&si->global_cluster_lock);
 	return found;
 }
 
@@ -2778,6 +2790,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	mutex_unlock(&swapon_mutex);
 	free_percpu(p->percpu_cluster);
 	p->percpu_cluster = NULL;
+	kfree(p->global_cluster);
+	p->global_cluster = NULL;
 	vfree(swap_map);
 	kvfree(zeromap);
 	kvfree(cluster_info);
@@ -3183,17 +3197,24 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < nr_clusters; i++)
 		spin_lock_init(&cluster_info[i].lock);
 
-	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
-	if (!si->percpu_cluster)
-		goto err_free;
+	if (si->flags & SWP_SOLIDSTATE) {
+		si->percpu_cluster = alloc_percpu(struct percpu_cluster);
+		if (!si->percpu_cluster)
+			goto err_free;
 
-	for_each_possible_cpu(cpu) {
-		struct percpu_cluster *cluster;
+		for_each_possible_cpu(cpu) {
+			struct percpu_cluster *cluster;
 
-		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			cluster = per_cpu_ptr(si->percpu_cluster, cpu);
+			for (i = 0; i < SWAP_NR_ORDERS; i++)
+				cluster->next[i] = SWAP_ENTRY_INVALID;
+			local_lock_init(&cluster->lock);
+		}
+	} else {
+		si->global_cluster = kmalloc(sizeof(*si->global_cluster), GFP_KERNEL);
 		for (i = 0; i < SWAP_NR_ORDERS; i++)
-			cluster->next[i] = SWAP_ENTRY_INVALID;
-		local_lock_init(&cluster->lock);
+			si->global_cluster->next[i] = SWAP_ENTRY_INVALID;
+		spin_lock_init(&si->global_cluster_lock);
 	}
 
 	/*
@@ -3467,6 +3488,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 bad_swap:
 	free_percpu(si->percpu_cluster);
 	si->percpu_cluster = NULL;
+	kfree(si->global_cluster);
+	si->global_cluster = NULL;
 	inode = NULL;
 	destroy_swap_extents(si);
 	swap_cgroup_swapoff(si->type);
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 13/13] mm, swap_slots: remove slot cache for freeing path
  2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
                   ` (11 preceding siblings ...)
  2024-12-30 17:46 ` [PATCH v3 12/13] mm, swap: use a global swap cluster for non-rotation devices Kairui Song
@ 2024-12-30 17:46 ` Kairui Song
  12 siblings, 0 replies; 35+ messages in thread
From: Kairui Song @ 2024-12-30 17:46 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
	Yosry Ahmed, Huang, Ying, Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

The slot cache for the freeing path exists mostly to reduce the overhead
of si->lock. As we have basically eliminated the si->lock usage in the
freeing path, the cache can be removed.

This helps simplify the code and avoids swap entries being held in the
cache after freeing. The delayed freeing of entries has been causing
trouble for further zswap optimizations [1], and in theory it also causes
more fragmentation and extra overhead.
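
With the cache gone, freeing a range of slots boils down to doing the
work directly under the cluster lock (a condensed sketch of the new
cluster_swap_free_nr() below, not the literal code):

	ci = lock_cluster(si, offset);
	do {
		if (!__swap_entry_free_locked(si, offset, usage))
			swap_entry_range_free(si, ci,
					      swp_entry(si->type, offset), 1);
	} while (++offset < end);
	unlock_cluster(ci);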

Tests with kernel building showed that both performance and fragmentation
are better without the cache:

time make -j96 / 768M memcg, 4K pages, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 36047.78, Real time: 472.43
After: (-7.6% sys time, -7.3% real time)
Sys time: 33314.76, Real time: 437.67

time make -j96 / 1152M memcg, 64K mTHP, 10G ZRAM, avg of 4 test runs:
Before:
Sys time: 46859.04, Real time: 562.63
hugepages-64kB/stats/swpout: 1783392
hugepages-64kB/stats/swpout_fallback: 240875
After: (-23.3% sys time, -21.3% real time)
Sys time: 35958.87, Real time: 442.69
hugepages-64kB/stats/swpout: 1866267
hugepages-64kB/stats/swpout_fallback: 158330

Sequential SWAP should also be slightly faster; tests didn't show a
measurable difference, but at least there is no regression:

Swapin 4G zero page on ZRAM (time in us):
Before (avg. 1923756)
1912391 1927023 1927957 1916527 1918263 1914284 1934753 1940813 1921791
After (avg. 1922290):
1919101 1925743 1916810 1917007 1923930 1935152 1917403 1923549 1921913

Link: https://lore.kernel.org/all/CAMgjq7ACohT_uerSz8E_994ZZCv709Zor+43hdmesW_59W1BWw@mail.gmail.com/ [1]
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap_slots.h |  3 --
 mm/swap_slots.c            | 78 +++++----------------------------
 mm/swapfile.c              | 89 +++++++++++++++-----------------------
 3 files changed, 44 insertions(+), 126 deletions(-)

diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..840aec3523b2 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -16,15 +16,12 @@ struct swap_slots_cache {
 	swp_entry_t	*slots;
 	int		nr;
 	int		cur;
-	spinlock_t	free_lock;  /* protects slots_ret, n_ret */
-	swp_entry_t	*slots_ret;
 	int		n_ret;
 };
 
 void disable_swap_slots_cache_lock(void);
 void reenable_swap_slots_cache_unlock(void);
 void enable_swap_slots_cache(void);
-void free_swap_slot(swp_entry_t entry);
 
 extern bool swap_slot_cache_enabled;
 
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 13ab3b771409..9c7c171df7ba 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -43,17 +43,15 @@ static DEFINE_MUTEX(swap_slots_cache_mutex);
 /* Serialize swap slots cache enable/disable operations */
 static DEFINE_MUTEX(swap_slots_cache_enable_mutex);
 
-static void __drain_swap_slots_cache(unsigned int type);
+static void __drain_swap_slots_cache(void);
 
 #define use_swap_slot_cache (swap_slot_cache_active && swap_slot_cache_enabled)
-#define SLOTS_CACHE 0x1
-#define SLOTS_CACHE_RET 0x2
 
 static void deactivate_swap_slots_cache(void)
 {
 	mutex_lock(&swap_slots_cache_mutex);
 	swap_slot_cache_active = false;
-	__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+	__drain_swap_slots_cache();
 	mutex_unlock(&swap_slots_cache_mutex);
 }
 
@@ -72,7 +70,7 @@ void disable_swap_slots_cache_lock(void)
 	if (swap_slot_cache_initialized) {
 		/* serialize with cpu hotplug operations */
 		cpus_read_lock();
-		__drain_swap_slots_cache(SLOTS_CACHE|SLOTS_CACHE_RET);
+		__drain_swap_slots_cache();
 		cpus_read_unlock();
 	}
 }
@@ -113,7 +111,7 @@ static bool check_cache_active(void)
 static int alloc_swap_slot_cache(unsigned int cpu)
 {
 	struct swap_slots_cache *cache;
-	swp_entry_t *slots, *slots_ret;
+	swp_entry_t *slots;
 
 	/*
 	 * Do allocation outside swap_slots_cache_mutex
@@ -125,28 +123,19 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 	if (!slots)
 		return -ENOMEM;
 
-	slots_ret = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t),
-			     GFP_KERNEL);
-	if (!slots_ret) {
-		kvfree(slots);
-		return -ENOMEM;
-	}
-
 	mutex_lock(&swap_slots_cache_mutex);
 	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots || cache->slots_ret) {
+	if (cache->slots) {
 		/* cache already allocated */
 		mutex_unlock(&swap_slots_cache_mutex);
 
 		kvfree(slots);
-		kvfree(slots_ret);
 
 		return 0;
 	}
 
 	if (!cache->lock_initialized) {
 		mutex_init(&cache->alloc_lock);
-		spin_lock_init(&cache->free_lock);
 		cache->lock_initialized = true;
 	}
 	cache->nr = 0;
@@ -160,19 +149,16 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 	 */
 	mb();
 	cache->slots = slots;
-	cache->slots_ret = slots_ret;
 	mutex_unlock(&swap_slots_cache_mutex);
 	return 0;
 }
 
-static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
-				  bool free_slots)
+static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots)
 {
 	struct swap_slots_cache *cache;
-	swp_entry_t *slots = NULL;
 
 	cache = &per_cpu(swp_slots, cpu);
-	if ((type & SLOTS_CACHE) && cache->slots) {
+	if (cache->slots) {
 		mutex_lock(&cache->alloc_lock);
 		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
 		cache->cur = 0;
@@ -183,20 +169,9 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
 		}
 		mutex_unlock(&cache->alloc_lock);
 	}
-	if ((type & SLOTS_CACHE_RET) && cache->slots_ret) {
-		spin_lock_irq(&cache->free_lock);
-		swapcache_free_entries(cache->slots_ret, cache->n_ret);
-		cache->n_ret = 0;
-		if (free_slots && cache->slots_ret) {
-			slots = cache->slots_ret;
-			cache->slots_ret = NULL;
-		}
-		spin_unlock_irq(&cache->free_lock);
-		kvfree(slots);
-	}
 }
 
-static void __drain_swap_slots_cache(unsigned int type)
+static void __drain_swap_slots_cache(void)
 {
 	unsigned int cpu;
 
@@ -224,13 +199,13 @@ static void __drain_swap_slots_cache(unsigned int type)
 	 * There are no slots on such cpu that need to be drained.
 	 */
 	for_each_online_cpu(cpu)
-		drain_slots_cache_cpu(cpu, type, false);
+		drain_slots_cache_cpu(cpu, false);
 }
 
 static int free_slot_cache(unsigned int cpu)
 {
 	mutex_lock(&swap_slots_cache_mutex);
-	drain_slots_cache_cpu(cpu, SLOTS_CACHE | SLOTS_CACHE_RET, true);
+	drain_slots_cache_cpu(cpu, true);
 	mutex_unlock(&swap_slots_cache_mutex);
 	return 0;
 }
@@ -269,39 +244,6 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	return cache->nr;
 }
 
-void free_swap_slot(swp_entry_t entry)
-{
-	struct swap_slots_cache *cache;
-
-	/* Large folio swap slot is not covered. */
-	zswap_invalidate(entry);
-
-	cache = raw_cpu_ptr(&swp_slots);
-	if (likely(use_swap_slot_cache && cache->slots_ret)) {
-		spin_lock_irq(&cache->free_lock);
-		/* Swap slots cache may be deactivated before acquiring lock */
-		if (!use_swap_slot_cache || !cache->slots_ret) {
-			spin_unlock_irq(&cache->free_lock);
-			goto direct_free;
-		}
-		if (cache->n_ret >= SWAP_SLOTS_CACHE_SIZE) {
-			/*
-			 * Return slots to global pool.
-			 * The current swap_map value is SWAP_HAS_CACHE.
-			 * Set it to 0 to indicate it is available for
-			 * allocation in global pool
-			 */
-			swapcache_free_entries(cache->slots_ret, cache->n_ret);
-			cache->n_ret = 0;
-		}
-		cache->slots_ret[cache->n_ret++] = entry;
-		spin_unlock_irq(&cache->free_lock);
-	} else {
-direct_free:
-		swapcache_free_entries(&entry, 1);
-	}
-}
-
 swp_entry_t folio_alloc_swap(struct folio *folio)
 {
 	swp_entry_t entry;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e57e5453a25b..d623f5b6dc4c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,14 +53,15 @@
 static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
 static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
-				  unsigned int nr_pages);
+static void swap_entry_range_free(struct swap_info_struct *si,
+				  struct swap_cluster_info *ci,
+				  swp_entry_t entry, unsigned int nr_pages);
 static void swap_range_alloc(struct swap_info_struct *si,
 			     unsigned int nr_entries);
 static bool folio_swapcache_freeable(struct folio *folio);
 static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
 					      unsigned long offset);
-static void unlock_cluster(struct swap_cluster_info *ci);
+static inline void unlock_cluster(struct swap_cluster_info *ci);
 
 static DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -261,10 +262,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
 	folio_ref_sub(folio, nr_pages);
 	folio_set_dirty(folio);
 
-	/* Only sinple page folio can be backed by zswap */
-	if (nr_pages == 1)
-		zswap_invalidate(entry);
-	swap_entry_range_free(si, entry, nr_pages);
+	ci = lock_cluster(si, offset);
+	swap_entry_range_free(si, ci, entry, nr_pages);
+	unlock_cluster(ci);
 	ret = nr_pages;
 out_unlock:
 	folio_unlock(folio);
@@ -1125,8 +1125,10 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 	 * Use atomic clear_bit operations only on zeromap instead of non-atomic
 	 * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes.
 	 */
-	for (i = 0; i < nr_entries; i++)
+	for (i = 0; i < nr_entries; i++) {
 		clear_bit(offset + i, si->zeromap);
+		zswap_invalidate(swp_entry(si->type, offset + i));
+	}
 
 	if (si->flags & SWP_BLKDEV)
 		swap_slot_free_notify =
@@ -1431,9 +1433,9 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si,
 
 	ci = lock_cluster(si, offset);
 	usage = __swap_entry_free_locked(si, offset, 1);
-	unlock_cluster(ci);
 	if (!usage)
-		free_swap_slot(entry);
+		swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1);
+	unlock_cluster(ci);
 
 	return usage;
 }
@@ -1461,13 +1463,10 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	}
 	for (i = 0; i < nr; i++)
 		WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
+	if (!has_cache)
+		swap_entry_range_free(si, ci, entry, nr);
 	unlock_cluster(ci);
 
-	if (!has_cache) {
-		for (i = 0; i < nr; i++)
-			zswap_invalidate(swp_entry(si->type, offset + i));
-		swap_entry_range_free(si, entry, nr);
-	}
 	return has_cache;
 
 fallback:
@@ -1487,15 +1486,13 @@ static bool __swap_entries_free(struct swap_info_struct *si,
  * Drop the last HAS_CACHE flag of swap entries, caller have to
  * ensure all entries belong to the same cgroup.
  */
-static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
-				  unsigned int nr_pages)
+static void swap_entry_range_free(struct swap_info_struct *si,
+				  struct swap_cluster_info *ci,
+				  swp_entry_t entry, unsigned int nr_pages)
 {
 	unsigned long offset = swp_offset(entry);
 	unsigned char *map = si->swap_map + offset;
 	unsigned char *map_end = map + nr_pages;
-	struct swap_cluster_info *ci;
-
-	ci = lock_cluster(si, offset);
 
 	/* It should never free entries across different clusters */
 	VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1));
@@ -1515,7 +1512,6 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
 		free_cluster(si, ci);
 	else
 		partial_free_cluster(si, ci);
-	unlock_cluster(ci);
 }
 
 static void cluster_swap_free_nr(struct swap_info_struct *si,
@@ -1523,28 +1519,13 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
 		unsigned char usage)
 {
 	struct swap_cluster_info *ci;
-	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
-	int i, nr;
+	unsigned long end = offset + nr_pages;
 
 	ci = lock_cluster(si, offset);
-	while (nr_pages) {
-		nr = min(BITS_PER_LONG, nr_pages);
-		for (i = 0; i < nr; i++) {
-			if (!__swap_entry_free_locked(si, offset + i, usage))
-				bitmap_set(to_free, i, 1);
-		}
-		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
-			unlock_cluster(ci);
-			for_each_set_bit(i, to_free, BITS_PER_LONG)
-				free_swap_slot(swp_entry(si->type, offset + i));
-			if (nr == nr_pages)
-				return;
-			bitmap_clear(to_free, 0, BITS_PER_LONG);
-			ci = lock_cluster(si, offset);
-		}
-		offset += nr;
-		nr_pages -= nr;
-	}
+	do {
+		if (!__swap_entry_free_locked(si, offset, usage))
+			swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1);
+	} while (++offset < end);
 	unlock_cluster(ci);
 }
 
@@ -1585,18 +1566,12 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 		return;
 
 	ci = lock_cluster(si, offset);
-	if (size > 1 && swap_is_has_cache(si, offset, size)) {
-		unlock_cluster(ci);
-		swap_entry_range_free(si, entry, size);
-		return;
-	}
-	for (int i = 0; i < size; i++, entry.val++) {
-		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
-			unlock_cluster(ci);
-			free_swap_slot(entry);
-			if (i == size - 1)
-				return;
-			lock_cluster(si, offset);
+	if (swap_is_has_cache(si, offset, size))
+		swap_entry_range_free(si, ci, entry, size);
+	else {
+		for (int i = 0; i < size; i++, entry.val++) {
+			if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE))
+				swap_entry_range_free(si, ci, entry, 1);
 		}
 	}
 	unlock_cluster(ci);
@@ -1605,6 +1580,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 void swapcache_free_entries(swp_entry_t *entries, int n)
 {
 	int i;
+	struct swap_cluster_info *ci;
 	struct swap_info_struct *si = NULL;
 
 	if (n <= 0)
@@ -1612,8 +1588,11 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
 
 	for (i = 0; i < n; ++i) {
 		si = _swap_info_get(entries[i]);
-		if (si)
-			swap_entry_range_free(si, entries[i], 1);
+		if (si) {
+			ci = lock_cluster(si, swp_offset(entries[i]));
+			swap_entry_range_free(si, ci, entries[i], 1);
+			unlock_cluster(ci);
+		}
 	}
 }
 
-- 
2.47.1



^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 06/13] mm, swap: clean up plist removal and adding
  2024-12-30 17:46 ` [PATCH v3 06/13] mm, swap: clean up plist removal and adding Kairui Song
@ 2025-01-02  8:59   ` Baoquan He
  2025-01-03  8:07     ` Kairui Song
  0 siblings, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-01-02  8:59 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

Hi Kairui,

On 12/31/24 at 01:46am, Kairui Song wrote:
......snip...
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 7963a0c646a4..e6e58cfb5178 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -128,6 +128,26 @@ static inline unsigned char swap_count(unsigned char ent)
>  	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
>  }

I am reading swap code, while at it, I am going through this patchset
too. Have some nitpick, please see below inline comments.

>  
> +/*
> + * Use the second highest bit of inuse_pages counter as the indicator
> + * of if one swap device is on the available plist, so the atomic can
      ~~ redundant?
> + * still be updated arithmetic while having special data embedded.
                       ~~~~~~~~~~ typo, arithmetically?
> + *
> + * inuse_pages counter is the only thing indicating if a device should
> + * be on avail_lists or not (except swapon / swapoff). By embedding the
> + * on-list bit in the atomic counter, updates no longer need any lock
      ~~~ off-list?
> + * to check the list status.
> + *
> + * This bit will be set if the device is not on the plist and not
> + * usable, will be cleared if the device is on the plist.
> + */
> +#define SWAP_USAGE_OFFLIST_BIT (1UL << (BITS_PER_TYPE(atomic_t) - 2))
> +#define SWAP_USAGE_COUNTER_MASK (~SWAP_USAGE_OFFLIST_BIT)
> +static long swap_usage_in_pages(struct swap_info_struct *si)
> +{
> +	return atomic_long_read(&si->inuse_pages) & SWAP_USAGE_COUNTER_MASK;
> +}
> +
>  /* Reclaim the swap entry anyway if possible */
>  #define TTRS_ANYWAY		0x1
>  /*
> @@ -717,7 +737,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
>  	int nr_reclaim;
>  
>  	if (force)
> -		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
> +		to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
>  
>  	while (!list_empty(&si->full_clusters)) {
>  		ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
> @@ -872,42 +892,128 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  	return found;
>  }
>  
> -static void __del_from_avail_list(struct swap_info_struct *si)
> +/* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
       Seems it just says the opposite. The off-list bit is set in
       this function. 
> +static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
>  {
>  	int nid;
> +	unsigned long pages;
> +
> +	spin_lock(&swap_avail_lock);
> +
> +	if (swapoff) {
> +		/*
> +		 * Forcefully remove it. Clear the SWP_WRITEOK flags for
> +		 * swapoff here so it's synchronized by both si->lock and
> +		 * swap_avail_lock, to ensure the result can be seen by
> +		 * add_to_avail_list.
> +		 */
> +		lockdep_assert_held(&si->lock);
> +		si->flags &= ~SWP_WRITEOK;
> +		atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
> +	} else {
> +		/*
> +		 * If not called by swapoff, take it off-list only if it's
> +		 * full and SWAP_USAGE_OFFLIST_BIT is not set (strictly
> +		 * si->inuse_pages == pages), any concurrent slot freeing,
> +		 * or device already removed from plist by someone else
> +		 * will make this return false.
> +		 */
> +		pages = si->pages;
> +		if (!atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
> +					     pages | SWAP_USAGE_OFFLIST_BIT))
> +			goto skip;
> +	}
>  
> -	assert_spin_locked(&si->lock);
>  	for_each_node(nid)
>  		plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
> +
> +skip:
> +	spin_unlock(&swap_avail_lock);
>  }
>  
> -static void del_from_avail_list(struct swap_info_struct *si)
> +/* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */

  Ditto.

> +static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
>  {
> +	int nid;
> +	long val;
> +	unsigned long pages;
> +
>  	spin_lock(&swap_avail_lock);
> -	__del_from_avail_list(si);
> +
> +	/* Corresponding to SWP_WRITEOK clearing in del_from_avail_list */
> +	if (swapon) {
> +		lockdep_assert_held(&si->lock);
> +		si->flags |= SWP_WRITEOK;
> +	} else {
> +		if (!(READ_ONCE(si->flags) & SWP_WRITEOK))
> +			goto skip;
> +	}
> +
> +	if (!(atomic_long_read(&si->inuse_pages) & SWAP_USAGE_OFFLIST_BIT))
> +		goto skip;
> +
> +	val = atomic_long_fetch_and_relaxed(~SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
> +
> +	/*
> +	 * When device is full and device is on the plist, only one updater will
> +	 * see (inuse_pages == si->pages) and will call del_from_avail_list. If
> +	 * that updater happen to be here, just skip adding.
> +	 */
> +	pages = si->pages;
> +	if (val == pages) {
> +		/* Just like the cmpxchg in del_from_avail_list */
> +		if (atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
> +					    pages | SWAP_USAGE_OFFLIST_BIT))
> +			goto skip;
> +	}
> +
> +	for_each_node(nid)
> +		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> +
> +skip:
>  	spin_unlock(&swap_avail_lock);
>  }
>  
> -static void swap_range_alloc(struct swap_info_struct *si,
> -			     unsigned int nr_entries)
> +/*
> + * swap_usage_add / swap_usage_sub of each slot are serialized by ci->lock

Not sure if swap_inuse_add()/swap_inuse_sub() or swap_inuse_cnt_add/sub()
is better, because it mixes with the usage of si->swap_map[offset].
Anyway, not strong opinion.

> + * within each cluster, so the total contribution to the global counter should
> + * always be positive and cannot exceed the total number of usable slots.
> + */
> +static bool swap_usage_add(struct swap_info_struct *si, unsigned int nr_entries)
>  {
> -	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
> -	if (si->inuse_pages == si->pages) {
> -		del_from_avail_list(si);
> +	long val = atomic_long_add_return_relaxed(nr_entries, &si->inuse_pages);
>  
> -		if (si->cluster_info && vm_swap_full())
> -			schedule_work(&si->reclaim_work);
> +	/*
> +	 * If device is full, and SWAP_USAGE_OFFLIST_BIT is not set,
> +	 * remove it from the plist.
> +	 */
> +	if (unlikely(val == si->pages)) {
> +		del_from_avail_list(si, false);
> +		return true;
>  	}
> +
> +	return false;
>  }
>  
> -static void add_to_avail_list(struct swap_info_struct *si)
> +static void swap_usage_sub(struct swap_info_struct *si, unsigned int nr_entries)
>  {
> -	int nid;
> +	long val = atomic_long_sub_return_relaxed(nr_entries, &si->inuse_pages);
>  
> -	spin_lock(&swap_avail_lock);
> -	for_each_node(nid)
> -		plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> -	spin_unlock(&swap_avail_lock);
> +	/*
> +	 * If device is not full, and SWAP_USAGE_OFFLIST_BIT is set,
> +	 * remove it from the plist.
> +	 */
> +	if (unlikely(val & SWAP_USAGE_OFFLIST_BIT))
> +		add_to_avail_list(si, false);
> +}
> +
> +static void swap_range_alloc(struct swap_info_struct *si,
> +			     unsigned int nr_entries)
> +{
> +	if (swap_usage_add(si, nr_entries)) {
> +		if (si->cluster_info && vm_swap_full())

We may not need check si->cluster_info here since it always exists now.
                    
> +			schedule_work(&si->reclaim_work);
> +	}
>  }
>  
>  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> @@ -925,8 +1031,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>  	for (i = 0; i < nr_entries; i++)
>  		clear_bit(offset + i, si->zeromap);
>  
> -	if (si->inuse_pages == si->pages)
> -		add_to_avail_list(si);
>  	if (si->flags & SWP_BLKDEV)
>  		swap_slot_free_notify =
>  			si->bdev->bd_disk->fops->swap_slot_free_notify;
> @@ -946,7 +1050,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>  	 */
>  	smp_wmb();
>  	atomic_long_add(nr_entries, &nr_swap_pages);
> -	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
> +	swap_usage_sub(si, nr_entries);
>  }
>  
>  static int cluster_alloc_swap(struct swap_info_struct *si,
> @@ -1036,19 +1140,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  		spin_unlock(&swap_avail_lock);
>  		spin_lock(&si->lock);
> -		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
> -			spin_lock(&swap_avail_lock);
> -			if (plist_node_empty(&si->avail_lists[node])) {
> -				spin_unlock(&si->lock);
> -				goto nextsi;
> -			}
> -			WARN(!(si->flags & SWP_WRITEOK),
> -			     "swap_info %d in list but !SWP_WRITEOK\n",
> -			     si->type);
> -			__del_from_avail_list(si);
> -			spin_unlock(&si->lock);
> -			goto nextsi;
> -		}
>  		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
>  					    n_goal, swp_entries, order);
>  		spin_unlock(&si->lock);
> @@ -1057,7 +1148,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  		cond_resched();
>  
>  		spin_lock(&swap_avail_lock);
> -nextsi:
>  		/*
>  		 * if we got here, it's likely that si was almost full before,
>  		 * and since scan_swap_map_slots() can drop the si->lock,
> @@ -1789,7 +1879,7 @@ unsigned int count_swap_pages(int type, int free)
>  		if (sis->flags & SWP_WRITEOK) {
>  			n = sis->pages;
>  			if (free)
> -				n -= sis->inuse_pages;
> +				n -= swap_usage_in_pages(sis);
>  		}
>  		spin_unlock(&sis->lock);
>  	}
> @@ -2124,7 +2214,7 @@ static int try_to_unuse(unsigned int type)
>  	swp_entry_t entry;
>  	unsigned int i;
>  
> -	if (!READ_ONCE(si->inuse_pages))
> +	if (!swap_usage_in_pages(si))
>  		goto success;
>  
>  retry:
> @@ -2137,7 +2227,7 @@ static int try_to_unuse(unsigned int type)
>  
>  	spin_lock(&mmlist_lock);
>  	p = &init_mm.mmlist;
> -	while (READ_ONCE(si->inuse_pages) &&
> +	while (swap_usage_in_pages(si) &&
>  	       !signal_pending(current) &&
>  	       (p = p->next) != &init_mm.mmlist) {
>  
> @@ -2165,7 +2255,7 @@ static int try_to_unuse(unsigned int type)
>  	mmput(prev_mm);
>  
>  	i = 0;
> -	while (READ_ONCE(si->inuse_pages) &&
> +	while (swap_usage_in_pages(si) &&
>  	       !signal_pending(current) &&
>  	       (i = find_next_to_unuse(si, i)) != 0) {
>  
> @@ -2200,7 +2290,7 @@ static int try_to_unuse(unsigned int type)
>  	 * folio_alloc_swap(), temporarily hiding that swap.  It's easy
>  	 * and robust (though cpu-intensive) just to keep retrying.
>  	 */
> -	if (READ_ONCE(si->inuse_pages)) {
> +	if (swap_usage_in_pages(si)) {
>  		if (!signal_pending(current))
>  			goto retry;
>  		return -EINTR;
> @@ -2209,7 +2299,7 @@ static int try_to_unuse(unsigned int type)
>  success:
>  	/*
>  	 * Make sure that further cleanups after try_to_unuse() returns happen
> -	 * after swap_range_free() reduces si->inuse_pages to 0.
> +	 * after swap_range_free() reduces inuse_pages to 0.

Here, I personally think the original si->inuse_pages may be better.

>  	 */
>  	smp_mb();
>  	return 0;
> @@ -2227,7 +2317,7 @@ static void drain_mmlist(void)
>  	unsigned int type;
>  
>  	for (type = 0; type < nr_swapfiles; type++)
> -		if (swap_info[type]->inuse_pages)
> +		if (swap_usage_in_pages(swap_info[type]))
>  			return;
>  	spin_lock(&mmlist_lock);
>  	list_for_each_safe(p, next, &init_mm.mmlist)
> @@ -2406,7 +2496,6 @@ static void setup_swap_info(struct swap_info_struct *si, int prio,
>  
>  static void _enable_swap_info(struct swap_info_struct *si)
>  {
> -	si->flags |= SWP_WRITEOK;
>  	atomic_long_add(si->pages, &nr_swap_pages);
>  	total_swap_pages += si->pages;
>  
> @@ -2423,9 +2512,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
>  	 */
>  	plist_add(&si->list, &swap_active_head);
>  
> -	/* add to available list if swap device is not full */
> -	if (si->inuse_pages < si->pages)
> -		add_to_avail_list(si);
> +	/* Add back to available list */
> +	add_to_avail_list(si, true);
>  }
>  
>  static void enable_swap_info(struct swap_info_struct *si, int prio,
> @@ -2523,7 +2611,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  		goto out_dput;
>  	}
>  	spin_lock(&p->lock);
> -	del_from_avail_list(p);
> +	del_from_avail_list(p, true);
>  	if (p->prio < 0) {
>  		struct swap_info_struct *si = p;
>  		int nid;
> @@ -2541,7 +2629,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	plist_del(&p->list, &swap_active_head);
>  	atomic_long_sub(p->pages, &nr_swap_pages);
>  	total_swap_pages -= p->pages;
> -	p->flags &= ~SWP_WRITEOK;
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  
> @@ -2721,7 +2808,7 @@ static int swap_show(struct seq_file *swap, void *v)
>  	}
>  
>  	bytes = K(si->pages);
> -	inuse = K(READ_ONCE(si->inuse_pages));
> +	inuse = K(swap_usage_in_pages(si));
>  
>  	file = si->swap_file;
>  	len = seq_file_path(swap, file, " \t\n\\");
> @@ -2838,6 +2925,7 @@ static struct swap_info_struct *alloc_swap_info(void)
>  	}
>  	spin_lock_init(&p->lock);
>  	spin_lock_init(&p->cont_lock);
> +	atomic_long_set(&p->inuse_pages, SWAP_USAGE_OFFLIST_BIT);
>  	init_completion(&p->comp);
>  
>  	return p;
> @@ -3335,7 +3423,7 @@ void si_swapinfo(struct sysinfo *val)
>  		struct swap_info_struct *si = swap_info[type];
>  
>  		if ((si->flags & SWP_USED) && !(si->flags & SWP_WRITEOK))
> -			nr_to_be_unused += READ_ONCE(si->inuse_pages);
> +			nr_to_be_unused += swap_usage_in_pages(si);
>  	}
>  	val->freeswap = atomic_long_read(&nr_swap_pages) + nr_to_be_unused;
>  	val->totalswap = total_swap_pages + nr_to_be_unused;
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 06/13] mm, swap: clean up plist removal and adding
  2025-01-02  8:59   ` Baoquan He
@ 2025-01-03  8:07     ` Kairui Song
  0 siblings, 0 replies; 35+ messages in thread
From: Kairui Song @ 2025-01-03  8:07 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On Thu, Jan 2, 2025 at 4:59 PM Baoquan He <bhe@redhat.com> wrote:
>
> Hi Kairui,
>
> On 12/31/24 at 01:46am, Kairui Song wrote:
> ......snip...
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 7963a0c646a4..e6e58cfb5178 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -128,6 +128,26 @@ static inline unsigned char swap_count(unsigned char ent)
> >       return ent & ~SWAP_HAS_CACHE;   /* may include COUNT_CONTINUED flag */
> >  }
>
> I am reading swap code, while at it, I am going through this patchset
> too. Have some nitpick, please see below inline comments.

Thanks!

> >
> > +/*
> > + * Use the second highest bit of inuse_pages counter as the indicator
> > + * of if one swap device is on the available plist, so the atomic can
>       ~~ redundant?
> > + * still be updated arithmetic while having special data embedded.
>                        ~~~~~~~~~~ typo, arithmetically?
> > + *
> > + * inuse_pages counter is the only thing indicating if a device should
> > + * be on avail_lists or not (except swapon / swapoff). By embedding the
> > + * on-list bit in the atomic counter, updates no longer need any lock
>       ~~~ off-list?

Ah, right, some typos, will fix these.

> > + * to check the list status.
> > + *
> > + * This bit will be set if the device is not on the plist and not
> > + * usable, will be cleared if the device is on the plist.
> > + */
> > +#define SWAP_USAGE_OFFLIST_BIT (1UL << (BITS_PER_TYPE(atomic_t) - 2))
> > +#define SWAP_USAGE_COUNTER_MASK (~SWAP_USAGE_OFFLIST_BIT)
> > +static long swap_usage_in_pages(struct swap_info_struct *si)
> > +{
> > +     return atomic_long_read(&si->inuse_pages) & SWAP_USAGE_COUNTER_MASK;
> > +}
> > +
> >  /* Reclaim the swap entry anyway if possible */
> >  #define TTRS_ANYWAY          0x1
> >  /*
> > @@ -717,7 +737,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
> >       int nr_reclaim;
> >
> >       if (force)
> > -             to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
> > +             to_scan = swap_usage_in_pages(si) / SWAPFILE_CLUSTER;
> >
> >       while (!list_empty(&si->full_clusters)) {
> >               ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
> > @@ -872,42 +892,128 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >       return found;
> >  }
> >
> > -static void __del_from_avail_list(struct swap_info_struct *si)
> > +/* SWAP_USAGE_OFFLIST_BIT can only be cleared by this helper. */
>        Seems it just says the opposite. The off-list bit is set in
>        this function.

Right, the comments are opposite... will fix them.

> > +static void del_from_avail_list(struct swap_info_struct *si, bool swapoff)
> >  {
> >       int nid;
> > +     unsigned long pages;
> > +
> > +     spin_lock(&swap_avail_lock);
> > +
> > +     if (swapoff) {
> > +             /*
> > +              * Forcefully remove it. Clear the SWP_WRITEOK flags for
> > +              * swapoff here so it's synchronized by both si->lock and
> > +              * swap_avail_lock, to ensure the result can be seen by
> > +              * add_to_avail_list.
> > +              */
> > +             lockdep_assert_held(&si->lock);
> > +             si->flags &= ~SWP_WRITEOK;
> > +             atomic_long_or(SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
> > +     } else {
> > +             /*
> > +              * If not called by swapoff, take it off-list only if it's
> > +              * full and SWAP_USAGE_OFFLIST_BIT is not set (strictly
> > +              * si->inuse_pages == pages), any concurrent slot freeing,
> > +              * or device already removed from plist by someone else
> > +              * will make this return false.
> > +              */
> > +             pages = si->pages;
> > +             if (!atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
> > +                                          pages | SWAP_USAGE_OFFLIST_BIT))
> > +                     goto skip;
> > +     }
> >
> > -     assert_spin_locked(&si->lock);
> >       for_each_node(nid)
> >               plist_del(&si->avail_lists[nid], &swap_avail_heads[nid]);
> > +
> > +skip:
> > +     spin_unlock(&swap_avail_lock);
> >  }
> >
> > -static void del_from_avail_list(struct swap_info_struct *si)
> > +/* SWAP_USAGE_OFFLIST_BIT can only be set by this helper. */
>
>   Ditto.
>
> > +static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
> >  {
> > +     int nid;
> > +     long val;
> > +     unsigned long pages;
> > +
> >       spin_lock(&swap_avail_lock);
> > -     __del_from_avail_list(si);
> > +
> > +     /* Corresponding to SWP_WRITEOK clearing in del_from_avail_list */
> > +     if (swapon) {
> > +             lockdep_assert_held(&si->lock);
> > +             si->flags |= SWP_WRITEOK;
> > +     } else {
> > +             if (!(READ_ONCE(si->flags) & SWP_WRITEOK))
> > +                     goto skip;
> > +     }
> > +
> > +     if (!(atomic_long_read(&si->inuse_pages) & SWAP_USAGE_OFFLIST_BIT))
> > +             goto skip;
> > +
> > +     val = atomic_long_fetch_and_relaxed(~SWAP_USAGE_OFFLIST_BIT, &si->inuse_pages);
> > +
> > +     /*
> > +      * When device is full and device is on the plist, only one updater will
> > +      * see (inuse_pages == si->pages) and will call del_from_avail_list. If
> > +      * that updater happen to be here, just skip adding.
> > +      */
> > +     pages = si->pages;
> > +     if (val == pages) {
> > +             /* Just like the cmpxchg in del_from_avail_list */
> > +             if (atomic_long_try_cmpxchg(&si->inuse_pages, &pages,
> > +                                         pages | SWAP_USAGE_OFFLIST_BIT))
> > +                     goto skip;
> > +     }
> > +
> > +     for_each_node(nid)
> > +             plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> > +
> > +skip:
> >       spin_unlock(&swap_avail_lock);
> >  }
> >
> > -static void swap_range_alloc(struct swap_info_struct *si,
> > -                          unsigned int nr_entries)
> > +/*
> > + * swap_usage_add / swap_usage_sub of each slot are serialized by ci->lock
>
> Not sure if swap_inuse_add()/swap_inuse_sub() or swap_inuse_cnt_add/sub()
> is better, because it mixes with the usage of si->swap_map[offset].
> Anyway, not strong opinion.
>
> > + * within each cluster, so the total contribution to the global counter should
> > + * always be positive and cannot exceed the total number of usable slots.
> > + */
> > +static bool swap_usage_add(struct swap_info_struct *si, unsigned int nr_entries)
> >  {
> > -     WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
> > -     if (si->inuse_pages == si->pages) {
> > -             del_from_avail_list(si);
> > +     long val = atomic_long_add_return_relaxed(nr_entries, &si->inuse_pages);
> >
> > -             if (si->cluster_info && vm_swap_full())
> > -                     schedule_work(&si->reclaim_work);
> > +     /*
> > +      * If device is full, and SWAP_USAGE_OFFLIST_BIT is not set,
> > +      * remove it from the plist.
> > +      */
> > +     if (unlikely(val == si->pages)) {
> > +             del_from_avail_list(si, false);
> > +             return true;
> >       }
> > +
> > +     return false;
> >  }
> >
> > -static void add_to_avail_list(struct swap_info_struct *si)
> > +static void swap_usage_sub(struct swap_info_struct *si, unsigned int nr_entries)
> >  {
> > -     int nid;
> > +     long val = atomic_long_sub_return_relaxed(nr_entries, &si->inuse_pages);
> >
> > -     spin_lock(&swap_avail_lock);
> > -     for_each_node(nid)
> > -             plist_add(&si->avail_lists[nid], &swap_avail_heads[nid]);
> > -     spin_unlock(&swap_avail_lock);
> > +     /*
> > +      * If device is not full, and SWAP_USAGE_OFFLIST_BIT is set,
> > +      * remove it from the plist.
> > +      */
> > +     if (unlikely(val & SWAP_USAGE_OFFLIST_BIT))
> > +             add_to_avail_list(si, false);
> > +}
> > +
> > +static void swap_range_alloc(struct swap_info_struct *si,
> > +                          unsigned int nr_entries)
> > +{
> > +     if (swap_usage_add(si, nr_entries)) {
> > +             if (si->cluster_info && vm_swap_full())
>
> We may not need check si->cluster_info here since it always exists now.

Good catch, it can indeed be dropped as an optimization. A previous
patch in this series is supposed to drop them all; I think I forgot
this one.

>
> > +                     schedule_work(&si->reclaim_work);
> > +     }
> >  }
> >
> >  static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> > @@ -925,8 +1031,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> >       for (i = 0; i < nr_entries; i++)
> >               clear_bit(offset + i, si->zeromap);
> >
> > -     if (si->inuse_pages == si->pages)
> > -             add_to_avail_list(si);
> >       if (si->flags & SWP_BLKDEV)
> >               swap_slot_free_notify =
> >                       si->bdev->bd_disk->fops->swap_slot_free_notify;
> > @@ -946,7 +1050,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
> >        */
> >       smp_wmb();
> >       atomic_long_add(nr_entries, &nr_swap_pages);
> > -     WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
> > +     swap_usage_sub(si, nr_entries);
> >  }
> >
> >  static int cluster_alloc_swap(struct swap_info_struct *si,
> > @@ -1036,19 +1140,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> >               plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
> >               spin_unlock(&swap_avail_lock);
> >               spin_lock(&si->lock);
> > -             if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
> > -                     spin_lock(&swap_avail_lock);
> > -                     if (plist_node_empty(&si->avail_lists[node])) {
> > -                             spin_unlock(&si->lock);
> > -                             goto nextsi;
> > -                     }
> > -                     WARN(!(si->flags & SWP_WRITEOK),
> > -                          "swap_info %d in list but !SWP_WRITEOK\n",
> > -                          si->type);
> > -                     __del_from_avail_list(si);
> > -                     spin_unlock(&si->lock);
> > -                     goto nextsi;
> > -             }
> >               n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> >                                           n_goal, swp_entries, order);
> >               spin_unlock(&si->lock);
> > @@ -1057,7 +1148,6 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> >               cond_resched();
> >
> >               spin_lock(&swap_avail_lock);
> > -nextsi:
> >               /*
> >                * if we got here, it's likely that si was almost full before,
> >                * and since scan_swap_map_slots() can drop the si->lock,
> > @@ -1789,7 +1879,7 @@ unsigned int count_swap_pages(int type, int free)
> >               if (sis->flags & SWP_WRITEOK) {
> >                       n = sis->pages;
> >                       if (free)
> > -                             n -= sis->inuse_pages;
> > +                             n -= swap_usage_in_pages(sis);
> >               }
> >               spin_unlock(&sis->lock);
> >       }
> > @@ -2124,7 +2214,7 @@ static int try_to_unuse(unsigned int type)
> >       swp_entry_t entry;
> >       unsigned int i;
> >
> > -     if (!READ_ONCE(si->inuse_pages))
> > +     if (!swap_usage_in_pages(si))
> >               goto success;
> >
> >  retry:
> > @@ -2137,7 +2227,7 @@ static int try_to_unuse(unsigned int type)
> >
> >       spin_lock(&mmlist_lock);
> >       p = &init_mm.mmlist;
> > -     while (READ_ONCE(si->inuse_pages) &&
> > +     while (swap_usage_in_pages(si) &&
> >              !signal_pending(current) &&
> >              (p = p->next) != &init_mm.mmlist) {
> >
> > @@ -2165,7 +2255,7 @@ static int try_to_unuse(unsigned int type)
> >       mmput(prev_mm);
> >
> >       i = 0;
> > -     while (READ_ONCE(si->inuse_pages) &&
> > +     while (swap_usage_in_pages(si) &&
> >              !signal_pending(current) &&
> >              (i = find_next_to_unuse(si, i)) != 0) {
> >
> > @@ -2200,7 +2290,7 @@ static int try_to_unuse(unsigned int type)
> >        * folio_alloc_swap(), temporarily hiding that swap.  It's easy
> >        * and robust (though cpu-intensive) just to keep retrying.
> >        */
> > -     if (READ_ONCE(si->inuse_pages)) {
> > +     if (swap_usage_in_pages(si)) {
> >               if (!signal_pending(current))
> >                       goto retry;
> >               return -EINTR;
> > @@ -2209,7 +2299,7 @@ static int try_to_unuse(unsigned int type)
> >  success:
> >       /*
> >        * Make sure that further cleanups after try_to_unuse() returns happen
> > -      * after swap_range_free() reduces si->inuse_pages to 0.
> > +      * after swap_range_free() reduces inuse_pages to 0.
>
> Here, I personally think the original si->inuse_pages may be better.

I updated this comment to keep people from misusing it directly.
Anyway, it's a trivial comment, so it can be kept unchanged.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
  2024-12-30 17:46 ` [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage Kairui Song
@ 2025-01-04  5:46   ` Baoquan He
  2025-01-13  5:34     ` Kairui Song
  0 siblings, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-01-04  5:46 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The flag SWP_SCANNING was used as an indicator of whether a device
> is being scanned for allocation, and prevents swapoff. Combined with
> SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
> 
> 1. Swapoff clears SWP_WRITEOK, allocation requests will see
>    ~SWP_WRITEOK and abort as it's serialized by si->lock.
> 2. Swapoff unuses all allocated entries.
> 3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
>    allocations will stop, preventing UAF.
> 4. Now swapoff can free everything safely.
> 
> This will make the allocation path have a hard dependency on
> si->lock. Allocation always have to acquire si->lock first for
> setting SWP_SCANNING and checking SWP_WRITEOK.
> 
> This commit removes this flag, and just uses the existing per-CPU
> refcount instead to prevent UAF in step 3, which serves well for
> such usage without dependency on si->lock, and scales very well too.
> Just hold a reference during the whole scan and allocation process.
> Swapoff will kill and wait for the counter.
> 
> And for preventing any allocation from happening after step 1 so the
> unuse in step 2 can ensure all slots are free, swapoff will acquire
> the ci->lock of each cluster one by one to ensure all allocations
> see ~SWP_WRITEOK and abort.

Changing to use si->users is great, though I am wondering why we need to
acquire each ci->lock now. After step 1, we have cleared SWP_WRITEOK and
taken the si off the swap_avail_heads list. No matter what, we just need
to wait for p->comp's completion and continue, so why bother looping to
acquire the ci->lock of each cluster?

> 
> This way these dependences on si->lock are gone. And worth noting we
> can't kill the refcount as the first step for swapoff as the unuse
> process have to acquire the refcount.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h |  1 -
>  mm/swapfile.c        | 90 ++++++++++++++++++++++++++++----------------
>  2 files changed, 57 insertions(+), 34 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e1eeea6307cd..02120f1005d5 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -219,7 +219,6 @@ enum {
>  	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
>  					/* add others here before... */
> -	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index e6e58cfb5178..99fd0b0d84a2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  {
>  	unsigned int nr_pages = 1 << order;
>  
> +	lockdep_assert_held(&ci->lock);
> +
>  	if (!(si->flags & SWP_WRITEOK))
>  		return false;
>  
> @@ -1059,8 +1061,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  {
>  	int n_ret = 0;
>  
> -	si->flags += SWP_SCANNING;
> -
>  	while (n_ret < nr) {
>  		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
>  
> @@ -1069,8 +1069,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>  
> -	si->flags -= SWP_SCANNING;
> -
>  	return n_ret;
>  }
>  
> @@ -1112,6 +1110,22 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return cluster_alloc_swap(si, usage, nr, slots, order);
>  }
>  
> +static bool get_swap_device_info(struct swap_info_struct *si)
> +{
> +	if (!percpu_ref_tryget_live(&si->users))
> +		return false;
> +	/*
> +	 * Guarantee the si->users are checked before accessing other
> +	 * fields of swap_info_struct, and si->flags (SWP_WRITEOK) is
> +	 * up to dated.
> +	 *
> +	 * Paired with the spin_unlock() after setup_swap_info() in
> +	 * enable_swap_info(), and smp_wmb() in swapoff.
> +	 */
> +	smp_rmb();
> +	return true;
> +}
> +
>  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  {
>  	int order = swap_entry_order(entry_order);
> @@ -1139,13 +1153,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  		/* requeue si to after same-priority siblings */
>  		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  		spin_unlock(&swap_avail_lock);
> -		spin_lock(&si->lock);
> -		n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -					    n_goal, swp_entries, order);
> -		spin_unlock(&si->lock);
> -		if (n_ret || size > 1)
> -			goto check_out;
> -		cond_resched();
> +		if (get_swap_device_info(si)) {
> +			spin_lock(&si->lock);
> +			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> +					n_goal, swp_entries, order);
> +			spin_unlock(&si->lock);
> +			put_swap_device(si);
> +			if (n_ret || size > 1)
> +				goto check_out;
> +			cond_resched();
> +		}
>  
>  		spin_lock(&swap_avail_lock);
>  		/*
> @@ -1296,16 +1313,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  	si = swp_swap_info(entry);
>  	if (!si)
>  		goto bad_nofile;
> -	if (!percpu_ref_tryget_live(&si->users))
> +	if (!get_swap_device_info(si))
>  		goto out;
> -	/*
> -	 * Guarantee the si->users are checked before accessing other
> -	 * fields of swap_info_struct.
> -	 *
> -	 * Paired with the spin_unlock() after setup_swap_info() in
> -	 * enable_swap_info().
> -	 */
> -	smp_rmb();
>  	offset = swp_offset(entry);
>  	if (offset >= si->max)
>  		goto put_out;
> @@ -1785,10 +1794,13 @@ swp_entry_t get_swap_page_of_type(int type)
>  		goto fail;
>  
>  	/* This is called for allocating swap entry, not cache */
> -	spin_lock(&si->lock);
> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
> -		atomic_long_dec(&nr_swap_pages);
> -	spin_unlock(&si->lock);
> +	if (get_swap_device_info(si)) {
> +		spin_lock(&si->lock);
> +		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
> +			atomic_long_dec(&nr_swap_pages);
> +		spin_unlock(&si->lock);
> +		put_swap_device(si);
> +	}
>  fail:
>  	return entry;
>  }
> @@ -2562,6 +2574,25 @@ bool has_usable_swap(void)
>  	return ret;
>  }
>  
> +/*
> + * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
> + * see the updated flags, so there will be no more allocations.
> + */
> +static void wait_for_allocation(struct swap_info_struct *si)
> +{
> +	unsigned long offset;
> +	unsigned long end = ALIGN(si->max, SWAPFILE_CLUSTER);
> +	struct swap_cluster_info *ci;
> +
> +	BUG_ON(si->flags & SWP_WRITEOK);
> +
> +	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
> +		ci = lock_cluster(si, offset);
> +		unlock_cluster(ci);
> +		offset += SWAPFILE_CLUSTER;
> +	}
> +}
> +
>  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  {
>  	struct swap_info_struct *p = NULL;
> @@ -2632,6 +2663,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  
> +	wait_for_allocation(p);
> +
>  	disable_swap_slots_cache_lock();
>  
>  	set_current_oom_origin();
> @@ -2674,15 +2707,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	spin_lock(&p->lock);
>  	drain_mmlist();
>  
> -	/* wait for anyone still in scan_swap_map_slots */
> -	while (p->flags >= SWP_SCANNING) {
> -		spin_unlock(&p->lock);
> -		spin_unlock(&swap_lock);
> -		schedule_timeout_uninterruptible(1);
> -		spin_lock(&swap_lock);
> -		spin_lock(&p->lock);
> -	}
> -
>  	swap_file = p->swap_file;
>  	p->swap_file = NULL;
>  	p->max = 0;
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes
  2024-12-30 17:46 ` [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
@ 2025-01-06  8:43   ` Baoquan He
  2025-01-13  5:49     ` Kairui Song
  0 siblings, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-01-06  8:43 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Currently, we are only using flags to indicate which list the cluster
> is on. Using one bit for each list type might be a waste: as the number
> of list types grows, we will consume too many bits. Additionally, the
> current mixed usage of '&' and '==' is a bit confusing.

I think this kind of conversion can only happen when the flag is
exclusive on each cluster. Then we can set and use
'ci->flags == CLUSTER_FLAG_XXX' to check it.
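
To illustrate the point with a minimal sketch (purely illustrative; the
helper names below are made up and not part of the patch): overlapping
bit flags need a mask test, while an exclusive enum value can be
compared directly.

    /* bitmask style: a cluster may carry several flags at once */
    static inline bool cluster_on_frag_list_bitmask(struct swap_cluster_info *ci)
    {
            return ci->flags & CLUSTER_FLAG_FRAG;
    }

    /* exclusive enum style: a cluster is in exactly one state */
    static inline bool cluster_on_frag_list_enum(struct swap_cluster_info *ci)
    {
            return ci->flags == CLUSTER_FLAG_FRAG;
    }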

> 
> Make it clean by using an enum to define all possible cluster
> statuses. Only an off-list cluster will have the NONE (0) flag.
> And use a wrapper to annotate and sanitize all flag settings
> and list movements.
> 
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h | 17 +++++++---
>  mm/swapfile.c        | 75 +++++++++++++++++++++++---------------------
>  2 files changed, 52 insertions(+), 40 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 02120f1005d5..339d7f0192ff 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -257,10 +257,19 @@ struct swap_cluster_info {
>  	u8 order;
>  	struct list_head list;
>  };
> -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> -#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> -#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */
> -#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */
> +
> +/* All on-list cluster must have a non-zero flag. */
> +enum swap_cluster_flags {
> +	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
> +	CLUSTER_FLAG_FREE,
> +	CLUSTER_FLAG_NONFULL,
> +	CLUSTER_FLAG_FRAG,
> +	/* Clusters with flags above are allocatable */
> +	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
> +	CLUSTER_FLAG_FULL,
> +	CLUSTER_FLAG_DISCARD,
> +	CLUSTER_FLAG_MAX,
> +};
>  
>  /*
>   * The first page in the swap file is the swap header, which is always marked
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 99fd0b0d84a2..7795a3d27273 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -403,7 +403,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  
>  static inline bool cluster_is_free(struct swap_cluster_info *info)
>  {
> -	return info->flags & CLUSTER_FLAG_FREE;
> +	return info->flags == CLUSTER_FLAG_FREE;
>  }
>  
>  static inline unsigned int cluster_index(struct swap_info_struct *si,
> @@ -434,6 +434,27 @@ static inline void unlock_cluster(struct swap_cluster_info *ci)
>  	spin_unlock(&ci->lock);
>  }
>  
> +static void cluster_move(struct swap_info_struct *si,
               ~~~~~~~~~~~~
Maybe rename it to move_cluster(), which has the same naming style as
lock_cluster()/unlock_cluster()? This is the naming we usually use when
a function is an action acting on objects.

Other than this, this patch looks great to me.

> +			 struct swap_cluster_info *ci, struct list_head *list,
> +			 enum swap_cluster_flags new_flags)
> +{
> +	VM_WARN_ON(ci->flags == new_flags);
> +	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> +
> +	if (ci->flags == CLUSTER_FLAG_NONE) {
> +		list_add_tail(&ci->list, list);
> +	} else {
> +		if (ci->flags == CLUSTER_FLAG_FRAG) {
> +			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> +			si->frag_cluster_nr[ci->order]--;
> +		}
> +		list_move_tail(&ci->list, list);
> +	}
> +	ci->flags = new_flags;
> +	if (new_flags == CLUSTER_FLAG_FRAG)
> +		si->frag_cluster_nr[ci->order]++;
> +}
> +
>  /* Add a cluster to discard list and schedule it to do discard */
>  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  		struct swap_cluster_info *ci)
> @@ -447,10 +468,8 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  	 */
>  	memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  			SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> -
> -	VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
> -	list_move_tail(&ci->list, &si->discard_clusters);
> -	ci->flags = 0;
> +	VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
> +	cluster_move(si, ci, &si->discard_clusters, CLUSTER_FLAG_DISCARD);
>  	schedule_work(&si->discard_work);
>  }
>  
> @@ -458,12 +477,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
>  {
>  	lockdep_assert_held(&si->lock);
>  	lockdep_assert_held(&ci->lock);
> -
> -	if (ci->flags)
> -		list_move_tail(&ci->list, &si->free_clusters);
> -	else
> -		list_add_tail(&ci->list, &si->free_clusters);
> -	ci->flags = CLUSTER_FLAG_FREE;
> +	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
>  	ci->order = 0;
>  }
>  
> @@ -479,6 +493,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
>  	while (!list_empty(&si->discard_clusters)) {
>  		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
>  		list_del(&ci->list);
> +		/* Must clear flag when taking a cluster off-list */
> +		ci->flags = CLUSTER_FLAG_NONE;
>  		idx = cluster_index(si, ci);
>  		spin_unlock(&si->lock);
>  
> @@ -519,9 +535,6 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
>  	lockdep_assert_held(&si->lock);
>  	lockdep_assert_held(&ci->lock);
>  
> -	if (ci->flags & CLUSTER_FLAG_FRAG)
> -		si->frag_cluster_nr[ci->order]--;
> -
>  	/*
>  	 * If the swap is discardable, prepare discard the cluster
>  	 * instead of free it immediately. The cluster will be freed
> @@ -573,13 +586,9 @@ static void dec_cluster_info_page(struct swap_info_struct *si,
>  		return;
>  	}
>  
> -	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> -		VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
> -		if (ci->flags & CLUSTER_FLAG_FRAG)
> -			si->frag_cluster_nr[ci->order]--;
> -		list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]);
> -		ci->flags = CLUSTER_FLAG_NONFULL;
> -	}
> +	if (ci->flags != CLUSTER_FLAG_NONFULL)
> +		cluster_move(si, ci, &si->nonfull_clusters[ci->order],
> +			     CLUSTER_FLAG_NONFULL);
>  }
>  
>  static bool cluster_reclaim_range(struct swap_info_struct *si,
> @@ -663,11 +672,13 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  	if (!(si->flags & SWP_WRITEOK))
>  		return false;
>  
> +	VM_BUG_ON(ci->flags == CLUSTER_FLAG_NONE);
> +	VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE);
> +
>  	if (cluster_is_free(ci)) {
> -		if (nr_pages < SWAPFILE_CLUSTER) {
> -			list_move_tail(&ci->list, &si->nonfull_clusters[order]);
> -			ci->flags = CLUSTER_FLAG_NONFULL;
> -		}
> +		if (nr_pages < SWAPFILE_CLUSTER)
> +			cluster_move(si, ci, &si->nonfull_clusters[order],
> +				     CLUSTER_FLAG_NONFULL);
>  		ci->order = order;
>  	}
>  
> @@ -675,14 +686,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  	swap_range_alloc(si, nr_pages);
>  	ci->count += nr_pages;
>  
> -	if (ci->count == SWAPFILE_CLUSTER) {
> -		VM_BUG_ON(!(ci->flags &
> -			  (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
> -		if (ci->flags & CLUSTER_FLAG_FRAG)
> -			si->frag_cluster_nr[ci->order]--;
> -		list_move_tail(&ci->list, &si->full_clusters);
> -		ci->flags = CLUSTER_FLAG_FULL;
> -	}
> +	if (ci->count == SWAPFILE_CLUSTER)
> +		cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
>  
>  	return true;
>  }
> @@ -821,9 +826,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  		while (!list_empty(&si->nonfull_clusters[order])) {
>  			ci = list_first_entry(&si->nonfull_clusters[order],
>  					      struct swap_cluster_info, list);
> -			list_move_tail(&ci->list, &si->frag_clusters[order]);
> -			ci->flags = CLUSTER_FLAG_FRAG;
> -			si->frag_cluster_nr[order]++;
> +			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
>  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
>  							 &found, order, usage);
>  			frags++;
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
  2024-12-30 17:46 ` [PATCH v3 09/13] mm, swap: reduce contention on device lock Kairui Song
@ 2025-01-06 10:12   ` Baoquan He
  2025-01-08 11:09   ` Baoquan He
  1 sibling, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-01-06 10:12 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
......snip..
> +
> +/*
> + * Must be called after allocation, moves the cluster to full or frag list.
> + * Note: allocation doesn't acquire si lock, and may drop the ci lock for
> + * reclaim, so the cluster could be any where when called.
> + */
> +static void relocate_cluster(struct swap_info_struct *si,
> +			     struct swap_cluster_info *ci)
> +{
> +	lockdep_assert_held(&ci->lock);
> +
> +	/* Discard cluster must remain off-list or on discard list */
> +	if (cluster_is_discard(ci))
> +		return;
> +
> +	if (!ci->count) {
> +		free_cluster(si, ci);

relocate_cluster() is only called in alloc_swap_scan_cluster(), so there
seems to be no chance of hitting the 'ci->count == 0' case when
allocating. Am I missing anything here?


> +	} else if (ci->count != SWAPFILE_CLUSTER) {
> +		if (ci->flags != CLUSTER_FLAG_FRAG)
> +			cluster_move(si, ci, &si->frag_clusters[ci->order],
> +				     CLUSTER_FLAG_FRAG);
> +	} else {
> +		if (ci->flags != CLUSTER_FLAG_FULL)
> +			cluster_move(si, ci, &si->full_clusters,
> +				     CLUSTER_FLAG_FULL);
> +	}
> +}
> +
>  /*
>   * The cluster corresponding to page_nr will be used. The cluster will not be
>   * added to free cluster list and its usage counter will be increased by 1.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
  2024-12-30 17:46 ` [PATCH v3 09/13] mm, swap: reduce contention on device lock Kairui Song
  2025-01-06 10:12   ` Baoquan He
@ 2025-01-08 11:09   ` Baoquan He
  2025-01-09  2:15     ` Kairui Song
  1 sibling, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-01-08 11:09 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
......snip.....
> ---
>  include/linux/swap.h |   3 +-
>  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
>  2 files changed, 246 insertions(+), 192 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 339d7f0192ff..c4ff31cb6bde 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -291,6 +291,7 @@ enum swap_cluster_flags {
>   * throughput.
>   */
>  struct percpu_cluster {
> +	local_lock_t lock; /* Protect the percpu_cluster above */
>  	unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
>  };
>  
> @@ -313,7 +314,7 @@ struct swap_info_struct {
>  					/* list of cluster that contains at least one free slot */
>  	struct list_head frag_clusters[SWAP_NR_ORDERS];
>  					/* list of cluster that are fragmented or contented */
> -	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> +	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
>  	unsigned int pages;		/* total of usable pages of swap */
>  	atomic_long_t inuse_pages;	/* number of those currently in use */
>  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 7795a3d27273..dadd4fead689 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  	folio_ref_sub(folio, nr_pages);
>  	folio_set_dirty(folio);
>  
> -	spin_lock(&si->lock);
>  	/* Only sinple page folio can be backed by zswap */
>  	if (nr_pages == 1)
>  		zswap_invalidate(entry);
>  	swap_entry_range_free(si, entry, nr_pages);
> -	spin_unlock(&si->lock);
>  	ret = nr_pages;
>  out_unlock:
>  	folio_unlock(folio);
> @@ -403,7 +401,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
>  
>  static inline bool cluster_is_free(struct swap_cluster_info *info)
>  {
> -	return info->flags == CLUSTER_FLAG_FREE;
> +	return info->count == 0;

This is a little confusing. Maybe we should add a new helper and call it
cluster_is_empty(), because discarded clusters can also pass the check
here.
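
Something like the below minimal sketch is what I have in mind (the name
and the exact split are only a suggestion, not taken from the patch):

    /* true for any cluster with no allocated slots, including ones being discarded */
    static inline bool cluster_is_empty(struct swap_cluster_info *info)
    {
            return info->count == 0;
    }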

> +}
> +
> +static inline bool cluster_is_discard(struct swap_cluster_info *info)
> +{
> +	return info->flags == CLUSTER_FLAG_DISCARD;
> +}
> +
> +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
> +{
> +	if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
> +		return false;
> +	if (!order)
> +		return true;
> +	return cluster_is_free(ci) || order == ci->order;
>  }
>  
>  static inline unsigned int cluster_index(struct swap_info_struct *si,
> @@ -440,19 +452,20 @@ static void cluster_move(struct swap_info_struct *si,
>  {
>  	VM_WARN_ON(ci->flags == new_flags);
>  	BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> +	lockdep_assert_held(&ci->lock);
>  
> -	if (ci->flags == CLUSTER_FLAG_NONE) {
> +	spin_lock(&si->lock);
> +	if (ci->flags == CLUSTER_FLAG_NONE)
>  		list_add_tail(&ci->list, list);
> -	} else {
> -		if (ci->flags == CLUSTER_FLAG_FRAG) {
> -			VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> -			si->frag_cluster_nr[ci->order]--;
> -		}
> +	else
>  		list_move_tail(&ci->list, list);
> -	}
> +	spin_unlock(&si->lock);
> +
> +	if (ci->flags == CLUSTER_FLAG_FRAG)
> +		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> +	else if (new_flags == CLUSTER_FLAG_FRAG)
> +		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
>  	ci->flags = new_flags;
> -	if (new_flags == CLUSTER_FLAG_FRAG)
> -		si->frag_cluster_nr[ci->order]++;
>  }
>  
>  /* Add a cluster to discard list and schedule it to do discard */
> @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>  
>  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>  {
> -	lockdep_assert_held(&si->lock);
>  	lockdep_assert_held(&ci->lock);
>  	cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
>  	ci->order = 0;
>  }
>  
> +/*
> + * Isolate and lock the first cluster that is not contented on a list,
> + * clean its flag before taken off-list. Cluster flag must be in sync
> + * with list status, so cluster updaters can always know the cluster
> + * list status without touching si lock.
> + *
> + * Note it's possible that all clusters on a list are contented so
> + * this returns NULL for an non-empty list.
> + */
> +static struct swap_cluster_info *cluster_isolate_lock(
> +		struct swap_info_struct *si, struct list_head *list)
> +{
> +	struct swap_cluster_info *ci, *ret = NULL;
> +
> +	spin_lock(&si->lock);
> +
> +	if (unlikely(!(si->flags & SWP_WRITEOK)))
> +		goto out;
> +
> +	list_for_each_entry(ci, list, list) {
> +		if (!spin_trylock(&ci->lock))
> +			continue;
> +
> +		/* We may only isolate and clear flags of following lists */
> +		VM_BUG_ON(!ci->flags);
> +		VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> +			  ci->flags != CLUSTER_FLAG_FULL);
> +
> +		list_del(&ci->list);
> +		ci->flags = CLUSTER_FLAG_NONE;
> +		ret = ci;
> +		break;
> +	}
> +out:
> +	spin_unlock(&si->lock);
> +
> +	return ret;
> +}
> +
>  /*
>   * Doing discard actually. After a cluster discard is finished, the cluster
> - * will be added to free cluster list. caller should hold si->lock.
> -*/
> -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> + * will be added to free cluster list. Discard cluster is a bit special as
> + * they don't participate in allocation or reclaim, so clusters marked as
> + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> + */
> +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
>  {
>  	struct swap_cluster_info *ci;
> +	bool ret = false;
>  	unsigned int idx;
>  
> +	spin_lock(&si->lock);
>  	while (!list_empty(&si->discard_clusters)) {
>  		ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> +		/*
> +		 * Delete the cluster from list but don't clear its flags until
> +		 * discard is done, so isolation and relocation will skip it.
> +		 */
>  		list_del(&ci->list);

I don't understand the above comment. ci has been taken off the list,
while allocation needs to isolate from a usable list. Even if we clear
ci->flags now, how come isolation and relocation would touch it? I may
be missing something here.

> -		/* Must clear flag when taking a cluster off-list */
> -		ci->flags = CLUSTER_FLAG_NONE;
>  		idx = cluster_index(si, ci);
>  		spin_unlock(&si->lock);
> -
>  		discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
>  				SWAPFILE_CLUSTER);
>  
> -		spin_lock(&si->lock);
>  		spin_lock(&ci->lock);
> -		__free_cluster(si, ci);
> +		/*
> +		 * Discard is done, clear its flags as it's now off-list,
> +		 * then return the cluster to allocation list.
> +		 */
> +		ci->flags = CLUSTER_FLAG_NONE;
>  		memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>  				0, SWAPFILE_CLUSTER);
> +		__free_cluster(si, ci);
>  		spin_unlock(&ci->lock);
> +		ret = true;
> +		spin_lock(&si->lock);
>  	}
> +	spin_unlock(&si->lock);
> +	return ret;
>  }
>  
>  static void swap_discard_work(struct work_struct *work)
......snip....
> @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
>  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
>  					      unsigned char usage)
>  {
> -	struct percpu_cluster *cluster;
>  	struct swap_cluster_info *ci;
>  	unsigned int offset, found = 0;
>  
> -new_cluster:
> -	lockdep_assert_held(&si->lock);
> -	cluster = this_cpu_ptr(si->percpu_cluster);
> -	offset = cluster->next[order];
> +	/* Fast path using per CPU cluster */
> +	local_lock(&si->percpu_cluster->lock);
> +	offset = __this_cpu_read(si->percpu_cluster->next[order]);
>  	if (offset) {
> -		offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> +		ci = lock_cluster(si, offset);
> +		/* Cluster could have been used by another order */
> +		if (cluster_is_usable(ci, order)) {
> +			if (cluster_is_free(ci))
> +				offset = cluster_offset(si, ci);
> +			offset = alloc_swap_scan_cluster(si, offset, &found,
> +							 order, usage);
> +		} else {
> +			unlock_cluster(ci);
> +		}
>  		if (found)
>  			goto done;
>  	}
>  
> -	if (!list_empty(&si->free_clusters)) {
> -		ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> -		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> -		/*
> -		 * Either we didn't touch the cluster due to swapoff,
> -		 * or the allocation must success.
> -		 */
> -		VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> -		goto done;
> +new_cluster:
> +	ci = cluster_isolate_lock(si, &si->free_clusters);
> +	if (ci) {
> +		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> +						 &found, order, usage);
> +		if (found)
> +			goto done;
>  	}
>  
>  	/* Try reclaim from full clusters if free clusters list is drained */
> @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  		swap_reclaim_full_clusters(si, false);
>  
>  	if (order < PMD_ORDER) {
> -		unsigned int frags = 0;
> +		unsigned int frags = 0, frags_existing;
>  
> -		while (!list_empty(&si->nonfull_clusters[order])) {
> -			ci = list_first_entry(&si->nonfull_clusters[order],
> -					      struct swap_cluster_info, list);
> -			cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> +		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
>  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
>  							 &found, order, usage);
> -			frags++;
> +			/*
> +			 * With `fragmenting` set to true, it will surely take
                                 ~~~~~~~~~~~
                         wondering what 'fragmenting' means here.


> +			 * the cluster off nonfull list
> +			 */
>  			if (found)
>  				goto done;
> +			frags++;
>  		}
>  
> -		/*
> -		 * Nonfull clusters are moved to frag tail if we reached
> -		 * here, count them too, don't over scan the frag list.
> -		 */
> -		while (frags < si->frag_cluster_nr[order]) {
> -			ci = list_first_entry(&si->frag_clusters[order],
> -					      struct swap_cluster_info, list);
> +		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> +		while (frags < frags_existing &&
> +		       (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> +			atomic_long_dec(&si->frag_cluster_nr[order]);
>  			/*
> -			 * Rotate the frag list to iterate, they were all failing
> -			 * high order allocation or moved here due to per-CPU usage,
> -			 * this help keeping usable cluster ahead.
> +			 * Rotate the frag list to iterate, they were all
> +			 * failing high order allocation or moved here due to
> +			 * per-CPU usage, but they could contain newly released
> +			 * reclaimable (eg. lazy-freed swap cache) slots.
>  			 */
> -			list_move_tail(&ci->list, &si->frag_clusters[order]);
>  			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
>  							 &found, order, usage);
> -			frags++;
>  			if (found)
>  				goto done;
> +			frags++;
>  		}
>  	}
>  
> -	if (!list_empty(&si->discard_clusters)) {
> -		/*
> -		 * we don't have free cluster but have some clusters in
> -		 * discarding, do discard now and reclaim them, then
> -		 * reread cluster_next_cpu since we dropped si->lock
> -		 */
> -		swap_do_scheduled_discard(si);
> +	/*
> +	 * We don't have free cluster but have some clusters in
> +	 * discarding, do discard now and reclaim them, then
> +	 * reread cluster_next_cpu since we dropped si->lock
> +	 */
> +	if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
>  		goto new_cluster;
> -	}
>  
>  	if (order)
>  		goto done;
.....



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 10/13] mm, swap: simplify percpu cluster updating
  2024-12-30 17:46 ` [PATCH v3 10/13] mm, swap: simplify percpu cluster updating Kairui Song
@ 2025-01-09  2:07   ` Baoquan He
  0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-01-09  2:07 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Instead of using a return argument, we can simply store the next
> cluster offset in the fixed percpu location, which reduces the stack
> usage and simplifies the function:
> 
> Object size:
> ./scripts/bloat-o-meter mm/swapfile.o mm/swapfile.o.new
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-271 (-271)
> Function                                     old     new   delta
> get_swap_pages                              2847    2733    -114
> alloc_swap_scan_cluster                      894     737    -157
> Total: Before=30833, After=30562, chg -0.88%
> 
> Stack usage:
> Before:
> swapfile.c:1190:5:get_swap_pages       240    static
> 
> After:
> swapfile.c:1185:5:get_swap_pages       216    static
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h |  4 +--
>  mm/swapfile.c        | 66 +++++++++++++++++++-------------------------
>  2 files changed, 31 insertions(+), 39 deletions(-)
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c4ff31cb6bde..4c1d2e69689f 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -275,9 +275,9 @@ enum swap_cluster_flags {
>   * The first page in the swap file is the swap header, which is always marked
>   * bad to prevent it from being allocated as an entry. This also prevents the
>   * cluster to which it belongs being marked free. Therefore 0 is safe to use as
> - * a sentinel to indicate next is not valid in percpu_cluster.
> + * a sentinel to indicate an entry is not valid.
>   */
> -#define SWAP_NEXT_INVALID	0
> +#define SWAP_ENTRY_INVALID	0
>  
>  #ifdef CONFIG_THP_SWAP
>  #define SWAP_NR_ORDERS		(PMD_ORDER + 1)
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index dadd4fead689..60a650ba88fd 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -759,23 +759,23 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  	return true;
>  }
>  
> -static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
> -					    unsigned int *foundp, unsigned int order,
> +/* Try use a new cluster for current CPU and allocate from it. */
> +static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> +					    struct swap_cluster_info *ci,
> +					    unsigned long offset,
> +					    unsigned int order,
>  					    unsigned char usage)
>  {
> -	unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
> +	unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> +	unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
>  	unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
>  	unsigned int nr_pages = 1 << order;
>  	bool need_reclaim, ret;
> -	struct swap_cluster_info *ci;
>  
> -	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
>  	lockdep_assert_held(&ci->lock);
>  
> -	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER) {
> -		offset = SWAP_NEXT_INVALID;
> +	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
>  		goto out;
> -	}
>  
>  	for (end -= nr_pages; offset <= end; offset += nr_pages) {
>  		need_reclaim = false;
> @@ -789,34 +789,27 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
>  			 * cluster has no flag set, and change of list
>  			 * won't cause fragmentation.
>  			 */
> -			if (!cluster_is_usable(ci, order)) {
> -				offset = SWAP_NEXT_INVALID;
> +			if (!cluster_is_usable(ci, order))
>  				goto out;
> -			}
>  			if (cluster_is_free(ci))
>  				offset = start;
>  			/* Reclaim failed but cluster is usable, try next */
>  			if (!ret)
>  				continue;
>  		}
> -		if (!cluster_alloc_range(si, ci, offset, usage, order)) {
> -			offset = SWAP_NEXT_INVALID;
> -			goto out;
> -		}
> -		*foundp = offset;
> -		if (ci->count == SWAPFILE_CLUSTER) {
> -			offset = SWAP_NEXT_INVALID;
> -			goto out;
> -		}
> +		if (!cluster_alloc_range(si, ci, offset, usage, order))
> +			break;
> +		found = offset;
>  		offset += nr_pages;
> +		if (ci->count < SWAPFILE_CLUSTER && offset <= end)
> +			next = offset;
>  		break;
>  	}
> -	if (offset > end)
> -		offset = SWAP_NEXT_INVALID;
>  out:
>  	relocate_cluster(si, ci);
>  	unlock_cluster(ci);
> -	return offset;
> +	__this_cpu_write(si->percpu_cluster->next[order], next);
> +	return found;
>  }
>  
>  /* Return true if reclaimed a whole cluster */
> @@ -885,8 +878,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  		if (cluster_is_usable(ci, order)) {
>  			if (cluster_is_free(ci))
>  				offset = cluster_offset(si, ci);
> -			offset = alloc_swap_scan_cluster(si, offset, &found,
> -							 order, usage);
> +			found = alloc_swap_scan_cluster(si, ci, offset,
> +							order, usage);
>  		} else {
>  			unlock_cluster(ci);
>  		}
> @@ -897,8 +890,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  new_cluster:
>  	ci = cluster_isolate_lock(si, &si->free_clusters);
>  	if (ci) {
> -		offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> -						 &found, order, usage);
> +		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +						order, usage);
>  		if (found)
>  			goto done;
>  	}
> @@ -911,8 +904,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  		unsigned int frags = 0, frags_existing;
>  
>  		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> -			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> -							 &found, order, usage);
> +			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +							order, usage);
>  			/*
>  			 * With `fragmenting` set to true, it will surely take
>  			 * the cluster off nonfull list
> @@ -932,8 +925,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  			 * per-CPU usage, but they could contain newly released
>  			 * reclaimable (eg. lazy-freed swap cache) slots.
>  			 */
> -			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> -							 &found, order, usage);
> +			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +							order, usage);
>  			if (found)
>  				goto done;
>  			frags++;
> @@ -959,21 +952,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  		 */
>  		while ((ci = cluster_isolate_lock(si, &si->frag_clusters[o]))) {
>  			atomic_long_dec(&si->frag_cluster_nr[o]);
> -			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> -							 &found, order, usage);
> +			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +							0, usage);
>  			if (found)
>  				goto done;
>  		}
>  
>  		while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[o]))) {
> -			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> -							 &found, order, usage);
> +			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +							0, usage);
>  			if (found)
>  				goto done;
>  		}
>  	}
>  done:
> -	__this_cpu_write(si->percpu_cluster->next[order], offset);
>  	local_unlock(&si->percpu_cluster->lock);

Do you think we still need to hold si->percpu_cluster->lock until the
end of the cluster_alloc_swap_entry() invocation? If so, we may need to
hold the lock during the whole period of going through
percpu_cluster->next, free_clusters, nonfull_clusters and frag_clusters
until we get one available slot, even though we keep updating
si->percpu_cluster->next[order]. I can't see the point of changing it
like this.

>  
>  	return found;
> @@ -3194,7 +3186,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>  
>  		cluster = per_cpu_ptr(si->percpu_cluster, cpu);
>  		for (i = 0; i < SWAP_NR_ORDERS; i++)
> -			cluster->next[i] = SWAP_NEXT_INVALID;
> +			cluster->next[i] = SWAP_ENTRY_INVALID;
>  		local_lock_init(&cluster->lock);
>  	}
>  
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
  2025-01-08 11:09   ` Baoquan He
@ 2025-01-09  2:15     ` Kairui Song
  2025-01-10 11:23       ` Baoquan He
  0 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2025-01-09  2:15 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
>

Thanks for the very detailed review!

> On 12/31/24 at 01:46am, Kairui Song wrote:
> ......snip.....
> > ---
> >  include/linux/swap.h |   3 +-
> >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> >  2 files changed, 246 insertions(+), 192 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 339d7f0192ff..c4ff31cb6bde 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> >   * throughput.
> >   */
> >  struct percpu_cluster {
> > +     local_lock_t lock; /* Protect the percpu_cluster above */
> >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> >  };
> >
> > @@ -313,7 +314,7 @@ struct swap_info_struct {
> >                                       /* list of cluster that contains at least one free slot */
> >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> >                                       /* list of cluster that are fragmented or contented */
> > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> >       unsigned int pages;             /* total of usable pages of swap */
> >       atomic_long_t inuse_pages;      /* number of those currently in use */
> >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 7795a3d27273..dadd4fead689 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -261,12 +261,10 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> >       folio_ref_sub(folio, nr_pages);
> >       folio_set_dirty(folio);
> >
> > -     spin_lock(&si->lock);
> >       /* Only sinple page folio can be backed by zswap */
> >       if (nr_pages == 1)
> >               zswap_invalidate(entry);
> >       swap_entry_range_free(si, entry, nr_pages);
> > -     spin_unlock(&si->lock);
> >       ret = nr_pages;
> >  out_unlock:
> >       folio_unlock(folio);
> > @@ -403,7 +401,21 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> >
> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
> >  {
> > -     return info->flags == CLUSTER_FLAG_FREE;
> > +     return info->count == 0;
>
> This is a little confusing. Maybe we should add a new helper and call it
> cluster_is_empty(), because discarded clusters can also pass the check
> here.

Good idea, agree on this, this new name is better.

>
> > +}
> > +
> > +static inline bool cluster_is_discard(struct swap_cluster_info *info)
> > +{
> > +     return info->flags == CLUSTER_FLAG_DISCARD;
> > +}
> > +
> > +static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
> > +{
> > +     if (unlikely(ci->flags > CLUSTER_FLAG_USABLE))
> > +             return false;
> > +     if (!order)
> > +             return true;
> > +     return cluster_is_free(ci) || order == ci->order;
> >  }
> >
> >  static inline unsigned int cluster_index(struct swap_info_struct *si,
> > @@ -440,19 +452,20 @@ static void cluster_move(struct swap_info_struct *si,
> >  {
> >       VM_WARN_ON(ci->flags == new_flags);
> >       BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> > +     lockdep_assert_held(&ci->lock);
> >
> > -     if (ci->flags == CLUSTER_FLAG_NONE) {
> > +     spin_lock(&si->lock);
> > +     if (ci->flags == CLUSTER_FLAG_NONE)
> >               list_add_tail(&ci->list, list);
> > -     } else {
> > -             if (ci->flags == CLUSTER_FLAG_FRAG) {
> > -                     VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> > -                     si->frag_cluster_nr[ci->order]--;
> > -             }
> > +     else
> >               list_move_tail(&ci->list, list);
> > -     }
> > +     spin_unlock(&si->lock);
> > +
> > +     if (ci->flags == CLUSTER_FLAG_FRAG)
> > +             atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> > +     else if (new_flags == CLUSTER_FLAG_FRAG)
> > +             atomic_long_inc(&si->frag_cluster_nr[ci->order]);
> >       ci->flags = new_flags;
> > -     if (new_flags == CLUSTER_FLAG_FRAG)
> > -             si->frag_cluster_nr[ci->order]++;
> >  }
> >
> >  /* Add a cluster to discard list and schedule it to do discard */
> > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >
> >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >  {
> > -     lockdep_assert_held(&si->lock);
> >       lockdep_assert_held(&ci->lock);
> >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> >       ci->order = 0;
> >  }
> >
> > +/*
> > + * Isolate and lock the first cluster that is not contented on a list,
> > + * clean its flag before taken off-list. Cluster flag must be in sync
> > + * with list status, so cluster updaters can always know the cluster
> > + * list status without touching si lock.
> > + *
> > + * Note it's possible that all clusters on a list are contented so
> > + * this returns NULL for an non-empty list.
> > + */
> > +static struct swap_cluster_info *cluster_isolate_lock(
> > +             struct swap_info_struct *si, struct list_head *list)
> > +{
> > +     struct swap_cluster_info *ci, *ret = NULL;
> > +
> > +     spin_lock(&si->lock);
> > +
> > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > +             goto out;
> > +
> > +     list_for_each_entry(ci, list, list) {
> > +             if (!spin_trylock(&ci->lock))
> > +                     continue;
> > +
> > +             /* We may only isolate and clear flags of following lists */
> > +             VM_BUG_ON(!ci->flags);
> > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > +                       ci->flags != CLUSTER_FLAG_FULL);
> > +
> > +             list_del(&ci->list);
> > +             ci->flags = CLUSTER_FLAG_NONE;
> > +             ret = ci;
> > +             break;
> > +     }
> > +out:
> > +     spin_unlock(&si->lock);
> > +
> > +     return ret;
> > +}
> > +
> >  /*
> >   * Doing discard actually. After a cluster discard is finished, the cluster
> > - * will be added to free cluster list. caller should hold si->lock.
> > -*/
> > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > + * will be added to free cluster list. Discard cluster is a bit special as
> > + * they don't participate in allocation or reclaim, so clusters marked as
> > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > + */
> > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> >  {
> >       struct swap_cluster_info *ci;
> > +     bool ret = false;
> >       unsigned int idx;
> >
> > +     spin_lock(&si->lock);
> >       while (!list_empty(&si->discard_clusters)) {
> >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > +             /*
> > +              * Delete the cluster from list but don't clear its flags until
> > +              * discard is done, so isolation and relocation will skip it.
> > +              */
> >               list_del(&ci->list);
>
> I don't understand the above comment. ci has been taken off the list,
> while allocation needs to isolate from a usable list. Even if we clear
> ci->flags now, how come isolation and relocation would touch it? I may
> be missing something here.

There are many cases; one possible and common situation is that the
per-CPU cluster (si->percpu_cluster of another CPU) is still pointing
to it.

Also, this commit removed the si lock protection from allocation, and
the allocation path may also drop the ci lock to call reclaim, which
means a cluster could be used or freed by anyone before the allocator
reacquires the ci lock. In that case, the allocator could see a discard
cluster.

So we don't clear the discard flag, in case anyone misuses it.

I can add more inline comments on this; there are already some related
comments above the function relocate_cluster(), and I could add some
more referencing that.
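
As a rough sketch of that point (based on the cluster_is_usable() helper
added earlier in this series, simplified rather than copied verbatim):
any path that reaches a cluster without isolating it from a list has to
recheck its state after taking ci->lock, and a cluster still flagged
CLUSTER_FLAG_DISCARD simply fails that check:

    ci = lock_cluster(si, offset);       /* e.g. offset from a stale percpu_cluster->next */
    if (!cluster_is_usable(ci, order)) {
            /* DISCARD (and FULL) sort above CLUSTER_FLAG_USABLE, so they are rejected */
            unlock_cluster(ci);          /* leave it to swap_do_scheduled_discard() */
    } else {
            /* the cluster is still usable, go on scanning it for allocation */
    }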

>
> > -             /* Must clear flag when taking a cluster off-list */
> > -             ci->flags = CLUSTER_FLAG_NONE;
> >               idx = cluster_index(si, ci);
> >               spin_unlock(&si->lock);
> > -
> >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> >                               SWAPFILE_CLUSTER);
> >
> > -             spin_lock(&si->lock);
> >               spin_lock(&ci->lock);
> > -             __free_cluster(si, ci);
> > +             /*
> > +              * Discard is done, clear its flags as it's now off-list,
> > +              * then return the cluster to allocation list.
> > +              */
> > +             ci->flags = CLUSTER_FLAG_NONE;
> >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >                               0, SWAPFILE_CLUSTER);
> > +             __free_cluster(si, ci);
> >               spin_unlock(&ci->lock);
> > +             ret = true;
> > +             spin_lock(&si->lock);
> >       }
> > +     spin_unlock(&si->lock);
> > +     return ret;
> >  }
> >
> >  static void swap_discard_work(struct work_struct *work)
> ......snip....
> > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> >                                             unsigned char usage)
> >  {
> > -     struct percpu_cluster *cluster;
> >       struct swap_cluster_info *ci;
> >       unsigned int offset, found = 0;
> >
> > -new_cluster:
> > -     lockdep_assert_held(&si->lock);
> > -     cluster = this_cpu_ptr(si->percpu_cluster);
> > -     offset = cluster->next[order];
> > +     /* Fast path using per CPU cluster */
> > +     local_lock(&si->percpu_cluster->lock);
> > +     offset = __this_cpu_read(si->percpu_cluster->next[order]);
> >       if (offset) {
> > -             offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > +             ci = lock_cluster(si, offset);
> > +             /* Cluster could have been used by another order */
> > +             if (cluster_is_usable(ci, order)) {
> > +                     if (cluster_is_free(ci))
> > +                             offset = cluster_offset(si, ci);
> > +                     offset = alloc_swap_scan_cluster(si, offset, &found,
> > +                                                      order, usage);
> > +             } else {
> > +                     unlock_cluster(ci);
> > +             }
> >               if (found)
> >                       goto done;
> >       }
> >
> > -     if (!list_empty(&si->free_clusters)) {
> > -             ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > -             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > -             /*
> > -              * Either we didn't touch the cluster due to swapoff,
> > -              * or the allocation must success.
> > -              */
> > -             VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > -             goto done;
> > +new_cluster:
> > +     ci = cluster_isolate_lock(si, &si->free_clusters);
> > +     if (ci) {
> > +             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > +                                              &found, order, usage);
> > +             if (found)
> > +                     goto done;
> >       }
> >
> >       /* Try reclaim from full clusters if free clusters list is drained */
> > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >               swap_reclaim_full_clusters(si, false);
> >
> >       if (order < PMD_ORDER) {
> > -             unsigned int frags = 0;
> > +             unsigned int frags = 0, frags_existing;
> >
> > -             while (!list_empty(&si->nonfull_clusters[order])) {
> > -                     ci = list_first_entry(&si->nonfull_clusters[order],
> > -                                           struct swap_cluster_info, list);
> > -                     cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > +             while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> >                                                        &found, order, usage);
> > -                     frags++;
> > +                     /*
> > +                      * With `fragmenting` set to true, it will surely take
>                                  ~~~~~~~~~~~
>                          wondering what 'fragmenting' means here.

This comment is a bit out of context indeed; it is actually trying to
say that the alloc_swap_scan_cluster() call above should move the
cluster to the tail. I'll update the comment.



>
> > +                      * the cluster off nonfull list
> > +                      */
> >                       if (found)
> >                               goto done;
> > +                     frags++;
> >               }
> >
> > -             /*
> > -              * Nonfull clusters are moved to frag tail if we reached
> > -              * here, count them too, don't over scan the frag list.
> > -              */
> > -             while (frags < si->frag_cluster_nr[order]) {
> > -                     ci = list_first_entry(&si->frag_clusters[order],
> > -                                           struct swap_cluster_info, list);
> > +             frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > +             while (frags < frags_existing &&
> > +                    (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > +                     atomic_long_dec(&si->frag_cluster_nr[order]);
> >                       /*
> > -                      * Rotate the frag list to iterate, they were all failing
> > -                      * high order allocation or moved here due to per-CPU usage,
> > -                      * this help keeping usable cluster ahead.
> > +                      * Rotate the frag list to iterate, they were all
> > +                      * failing high order allocation or moved here due to
> > +                      * per-CPU usage, but they could contain newly released
> > +                      * reclaimable (eg. lazy-freed swap cache) slots.
> >                        */
> > -                     list_move_tail(&ci->list, &si->frag_clusters[order]);
> >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> >                                                        &found, order, usage);
> > -                     frags++;
> >                       if (found)
> >                               goto done;
> > +                     frags++;
> >               }
> >       }
> >
> > -     if (!list_empty(&si->discard_clusters)) {
> > -             /*
> > -              * we don't have free cluster but have some clusters in
> > -              * discarding, do discard now and reclaim them, then
> > -              * reread cluster_next_cpu since we dropped si->lock
> > -              */
> > -             swap_do_scheduled_discard(si);
> > +     /*
> > +      * We don't have free cluster but have some clusters in
> > +      * discarding, do discard now and reclaim them, then
> > +      * reread cluster_next_cpu since we dropped si->lock
> > +      */
> > +     if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> >               goto new_cluster;
> > -     }
> >
> >       if (order)
> >               goto done;
> .....
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation
  2024-12-30 17:46 ` [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation Kairui Song
@ 2025-01-09  4:04   ` Baoquan He
  0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-01-09  4:04 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Direct reclaim can skip the whole folio after reclaiming a set of
> folio-based slots. Also simplify the code for allocation and reduce
> indentation.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 59 +++++++++++++++++++++++++--------------------------
>  1 file changed, 29 insertions(+), 30 deletions(-)

This actually could be split into two patches. Anyway,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b0a9071cfe1d..f8002f110104 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -604,23 +604,28 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
>  				  unsigned long start, unsigned long end)
>  {
>  	unsigned char *map = si->swap_map;
> -	unsigned long offset;
> +	unsigned long offset = start;
> +	int nr_reclaim;
>  
>  	spin_unlock(&ci->lock);
>  	spin_unlock(&si->lock);
>  
> -	for (offset = start; offset < end; offset++) {
> +	do {
>  		switch (READ_ONCE(map[offset])) {
>  		case 0:
> -			continue;
> +			offset++;
> +			break;
>  		case SWAP_HAS_CACHE:
> -			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
> -				continue;
> -			goto out;
> +			nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
> +			if (nr_reclaim > 0)
> +				offset += nr_reclaim;
> +			else
> +				goto out;
> +			break;
>  		default:
>  			goto out;
>  		}
> -	}
> +	} while (offset < end);
>  out:
>  	spin_lock(&si->lock);
>  	spin_lock(&ci->lock);
> @@ -838,35 +843,30 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  							 &found, order, usage);
>  			frags++;
>  			if (found)
> -				break;
> +				goto done;
>  		}
>  
> -		if (!found) {
> +		/*
> +		 * Nonfull clusters are moved to frag tail if we reached
> +		 * here, count them too, don't over scan the frag list.
> +		 */
> +		while (frags < si->frag_cluster_nr[order]) {
> +			ci = list_first_entry(&si->frag_clusters[order],
> +					      struct swap_cluster_info, list);
>  			/*
> -			 * Nonfull clusters are moved to frag tail if we reached
> -			 * here, count them too, don't over scan the frag list.
> +			 * Rotate the frag list to iterate, they were all failing
> +			 * high order allocation or moved here due to per-CPU usage,
> +			 * this help keeping usable cluster ahead.
>  			 */
> -			while (frags < si->frag_cluster_nr[order]) {
> -				ci = list_first_entry(&si->frag_clusters[order],
> -						      struct swap_cluster_info, list);
> -				/*
> -				 * Rotate the frag list to iterate, they were all failing
> -				 * high order allocation or moved here due to per-CPU usage,
> -				 * this help keeping usable cluster ahead.
> -				 */
> -				list_move_tail(&ci->list, &si->frag_clusters[order]);
> -				offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> -								 &found, order, usage);
> -				frags++;
> -				if (found)
> -					break;
> -			}
> +			list_move_tail(&ci->list, &si->frag_clusters[order]);
> +			offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> +							 &found, order, usage);
> +			frags++;
> +			if (found)
> +				goto done;
>  		}
>  	}
>  
> -	if (found)
> -		goto done;
> -
>  	if (!list_empty(&si->discard_clusters)) {
>  		/*
>  		 * we don't have free cluster but have some clusters in
> @@ -904,7 +904,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>  				goto done;
>  		}
>  	}
> -
>  done:
>  	cluster->next[order] = offset;
>  	return found;
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 02/13] mm, swap: fold swap_info_get_cont in the only caller
  2024-12-30 17:46 ` [PATCH v3 02/13] mm, swap: fold swap_info_get_cont in the only caller Kairui Song
@ 2025-01-09  4:05   ` Baoquan He
  0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-01-09  4:05 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The name of the function is confusing, and the code is much easier to
> follow after folding. Also rename the confusing name "p" to the more
> meaningful "si".
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 39 +++++++++++++++------------------------
>  1 file changed, 15 insertions(+), 24 deletions(-)

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f8002f110104..574059158627 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1375,22 +1375,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
>  	return NULL;
>  }
>  
> -static struct swap_info_struct *swap_info_get_cont(swp_entry_t entry,
> -					struct swap_info_struct *q)
> -{
> -	struct swap_info_struct *p;
> -
> -	p = _swap_info_get(entry);
> -
> -	if (p != q) {
> -		if (q != NULL)
> -			spin_unlock(&q->lock);
> -		if (p != NULL)
> -			spin_lock(&p->lock);
> -	}
> -	return p;
> -}
> -
>  static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
>  					      unsigned long offset,
>  					      unsigned char usage)
> @@ -1687,14 +1671,14 @@ static int swp_entry_cmp(const void *ent1, const void *ent2)
>  
>  void swapcache_free_entries(swp_entry_t *entries, int n)
>  {
> -	struct swap_info_struct *p, *prev;
> +	struct swap_info_struct *si, *prev;
>  	int i;
>  
>  	if (n <= 0)
>  		return;
>  
>  	prev = NULL;
> -	p = NULL;
> +	si = NULL;
>  
>  	/*
>  	 * Sort swap entries by swap device, so each lock is only taken once.
> @@ -1704,13 +1688,20 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
>  	if (nr_swapfiles > 1)
>  		sort(entries, n, sizeof(entries[0]), swp_entry_cmp, NULL);
>  	for (i = 0; i < n; ++i) {
> -		p = swap_info_get_cont(entries[i], prev);
> -		if (p)
> -			swap_entry_range_free(p, entries[i], 1);
> -		prev = p;
> +		si = _swap_info_get(entries[i]);
> +
> +		if (si != prev) {
> +			if (prev != NULL)
> +				spin_unlock(&prev->lock);
> +			if (si != NULL)
> +				spin_lock(&si->lock);
> +		}
> +		if (si)
> +			swap_entry_range_free(si, entries[i], 1);
> +		prev = si;
>  	}
> -	if (p)
> -		spin_unlock(&p->lock);
> +	if (si)
> +		spin_unlock(&si->lock);
>  }
>  
>  int __swap_count(swp_entry_t entry)
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 03/13] mm, swap: remove old allocation path for HDD
  2024-12-30 17:46 ` [PATCH v3 03/13] mm, swap: remove old allocation path for HDD Kairui Song
@ 2025-01-09  4:06   ` Baoquan He
  0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-01-09  4:06 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> We are currently using different swap allocation algorithms for HDD and
> non-HDD devices. This leads to the existence of two different sets of
> locks, and the code path is heavily bloated, causing difficulties for
> further optimization and maintenance.
> 
> This commit removes all HDD swap allocation and related dead code,
> and uses the cluster allocation algorithm instead.
> 
> The performance may drop temporarily, but this should be negligible:
> the main advantage of the legacy HDD allocation algorithm is that it
> tends to use contiguous slots, but the swap device gets fragmented
> quickly anyway, and the attempt to use contiguous slots will fail easily.
> 
> This commit also enables mTHP swap on HDD, which is expected to be
> beneficial, and following commits will adapt and optimize the cluster
> allocator for HDD.
> 
> Suggested-by: Chris Li <chrisl@kernel.org>
> Suggested-by: "Huang, Ying" <ying.huang@linux.alibaba.com>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h |   3 -
>  mm/swapfile.c        | 235 ++-----------------------------------------
>  2 files changed, 9 insertions(+), 229 deletions(-)

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 187715eec3cb..0c681aa5cb98 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -310,9 +310,6 @@ struct swap_info_struct {
>  	unsigned int highest_bit;	/* index of last free in swap_map */
>  	unsigned int pages;		/* total of usable pages of swap */
>  	unsigned int inuse_pages;	/* number of those currently in use */
> -	unsigned int cluster_next;	/* likely index for next allocation */
> -	unsigned int cluster_nr;	/* countdown to next cluster search */
> -	unsigned int __percpu *cluster_next_cpu; /*percpu index for next allocation */
>  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
>  	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
>  	struct block_device *bdev;	/* swap device or bdev of swap file */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 574059158627..fca58d43b836 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1001,49 +1001,6 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>  	WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>  }
>  
> -static void set_cluster_next(struct swap_info_struct *si, unsigned long next)
> -{
> -	unsigned long prev;
> -
> -	if (!(si->flags & SWP_SOLIDSTATE)) {
> -		si->cluster_next = next;
> -		return;
> -	}
> -
> -	prev = this_cpu_read(*si->cluster_next_cpu);
> -	/*
> -	 * Cross the swap address space size aligned trunk, choose
> -	 * another trunk randomly to avoid lock contention on swap
> -	 * address space if possible.
> -	 */
> -	if ((prev >> SWAP_ADDRESS_SPACE_SHIFT) !=
> -	    (next >> SWAP_ADDRESS_SPACE_SHIFT)) {
> -		/* No free swap slots available */
> -		if (si->highest_bit <= si->lowest_bit)
> -			return;
> -		next = get_random_u32_inclusive(si->lowest_bit, si->highest_bit);
> -		next = ALIGN_DOWN(next, SWAP_ADDRESS_SPACE_PAGES);
> -		next = max_t(unsigned int, next, si->lowest_bit);
> -	}
> -	this_cpu_write(*si->cluster_next_cpu, next);
> -}
> -
> -static bool swap_offset_available_and_locked(struct swap_info_struct *si,
> -					     unsigned long offset)
> -{
> -	if (data_race(!si->swap_map[offset])) {
> -		spin_lock(&si->lock);
> -		return true;
> -	}
> -
> -	if (vm_swap_full() && READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
> -		spin_lock(&si->lock);
> -		return true;
> -	}
> -
> -	return false;
> -}
> -
>  static int cluster_alloc_swap(struct swap_info_struct *si,
>  			     unsigned char usage, int nr,
>  			     swp_entry_t slots[], int order)
> @@ -1071,13 +1028,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  			       unsigned char usage, int nr,
>  			       swp_entry_t slots[], int order)
>  {
> -	unsigned long offset;
> -	unsigned long scan_base;
> -	unsigned long last_in_cluster = 0;
> -	int latency_ration = LATENCY_LIMIT;
>  	unsigned int nr_pages = 1 << order;
> -	int n_ret = 0;
> -	bool scanned_many = false;
>  
>  	/*
>  	 * We try to cluster swap pages by allocating them sequentially
> @@ -1089,7 +1040,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	 * But we do now try to find an empty cluster.  -Andrea
>  	 * And we let swap pages go all over an SSD partition.  Hugh
>  	 */
> -
>  	if (order > 0) {
>  		/*
>  		 * Should not even be attempting large allocations when huge
> @@ -1109,158 +1059,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  			return 0;
>  	}
>  
> -	if (si->cluster_info)
> -		return cluster_alloc_swap(si, usage, nr, slots, order);
> -
> -	si->flags += SWP_SCANNING;
> -
> -	/* For HDD, sequential access is more important. */
> -	scan_base = si->cluster_next;
> -	offset = scan_base;
> -
> -	if (unlikely(!si->cluster_nr--)) {
> -		if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
> -			si->cluster_nr = SWAPFILE_CLUSTER - 1;
> -			goto checks;
> -		}
> -
> -		spin_unlock(&si->lock);
> -
> -		/*
> -		 * If seek is expensive, start searching for new cluster from
> -		 * start of partition, to minimize the span of allocated swap.
> -		 */
> -		scan_base = offset = si->lowest_bit;
> -		last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
> -
> -		/* Locate the first empty (unaligned) cluster */
> -		for (; last_in_cluster <= READ_ONCE(si->highest_bit); offset++) {
> -			if (si->swap_map[offset])
> -				last_in_cluster = offset + SWAPFILE_CLUSTER;
> -			else if (offset == last_in_cluster) {
> -				spin_lock(&si->lock);
> -				offset -= SWAPFILE_CLUSTER - 1;
> -				si->cluster_next = offset;
> -				si->cluster_nr = SWAPFILE_CLUSTER - 1;
> -				goto checks;
> -			}
> -			if (unlikely(--latency_ration < 0)) {
> -				cond_resched();
> -				latency_ration = LATENCY_LIMIT;
> -			}
> -		}
> -
> -		offset = scan_base;
> -		spin_lock(&si->lock);
> -		si->cluster_nr = SWAPFILE_CLUSTER - 1;
> -	}
> -
> -checks:
> -	if (!(si->flags & SWP_WRITEOK))
> -		goto no_page;
> -	if (!si->highest_bit)
> -		goto no_page;
> -	if (offset > si->highest_bit)
> -		scan_base = offset = si->lowest_bit;
> -
> -	/* reuse swap entry of cache-only swap if not busy. */
> -	if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> -		int swap_was_freed;
> -		spin_unlock(&si->lock);
> -		swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
> -		spin_lock(&si->lock);
> -		/* entry was freed successfully, try to use this again */
> -		if (swap_was_freed > 0)
> -			goto checks;
> -		goto scan; /* check next one */
> -	}
> -
> -	if (si->swap_map[offset]) {
> -		if (!n_ret)
> -			goto scan;
> -		else
> -			goto done;
> -	}
> -	memset(si->swap_map + offset, usage, nr_pages);
> -
> -	swap_range_alloc(si, offset, nr_pages);
> -	slots[n_ret++] = swp_entry(si->type, offset);
> -
> -	/* got enough slots or reach max slots? */
> -	if ((n_ret == nr) || (offset >= si->highest_bit))
> -		goto done;
> -
> -	/* search for next available slot */
> -
> -	/* time to take a break? */
> -	if (unlikely(--latency_ration < 0)) {
> -		if (n_ret)
> -			goto done;
> -		spin_unlock(&si->lock);
> -		cond_resched();
> -		spin_lock(&si->lock);
> -		latency_ration = LATENCY_LIMIT;
> -	}
> -
> -	if (si->cluster_nr && !si->swap_map[++offset]) {
> -		/* non-ssd case, still more slots in cluster? */
> -		--si->cluster_nr;
> -		goto checks;
> -	}
> -
> -	/*
> -	 * Even if there's no free clusters available (fragmented),
> -	 * try to scan a little more quickly with lock held unless we
> -	 * have scanned too many slots already.
> -	 */
> -	if (!scanned_many) {
> -		unsigned long scan_limit;
> -
> -		if (offset < scan_base)
> -			scan_limit = scan_base;
> -		else
> -			scan_limit = si->highest_bit;
> -		for (; offset <= scan_limit && --latency_ration > 0;
> -		     offset++) {
> -			if (!si->swap_map[offset])
> -				goto checks;
> -		}
> -	}
> -
> -done:
> -	if (order == 0)
> -		set_cluster_next(si, offset + 1);
> -	si->flags -= SWP_SCANNING;
> -	return n_ret;
> -
> -scan:
> -	VM_WARN_ON(order > 0);
> -	spin_unlock(&si->lock);
> -	while (++offset <= READ_ONCE(si->highest_bit)) {
> -		if (unlikely(--latency_ration < 0)) {
> -			cond_resched();
> -			latency_ration = LATENCY_LIMIT;
> -			scanned_many = true;
> -		}
> -		if (swap_offset_available_and_locked(si, offset))
> -			goto checks;
> -	}
> -	offset = si->lowest_bit;
> -	while (offset < scan_base) {
> -		if (unlikely(--latency_ration < 0)) {
> -			cond_resched();
> -			latency_ration = LATENCY_LIMIT;
> -			scanned_many = true;
> -		}
> -		if (swap_offset_available_and_locked(si, offset))
> -			goto checks;
> -		offset++;
> -	}
> -	spin_lock(&si->lock);
> -
> -no_page:
> -	si->flags -= SWP_SCANNING;
> -	return n_ret;
> +	return cluster_alloc_swap(si, usage, nr, slots, order);
>  }
>  
>  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> @@ -2871,8 +2670,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	mutex_unlock(&swapon_mutex);
>  	free_percpu(p->percpu_cluster);
>  	p->percpu_cluster = NULL;
> -	free_percpu(p->cluster_next_cpu);
> -	p->cluster_next_cpu = NULL;
>  	vfree(swap_map);
>  	kvfree(zeromap);
>  	kvfree(cluster_info);
> @@ -3184,8 +2981,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>  	}
>  
>  	si->lowest_bit  = 1;
> -	si->cluster_next = 1;
> -	si->cluster_nr = 0;
>  
>  	maxpages = swapfile_maximum_size;
>  	last_page = swap_header->info.last_page;
> @@ -3271,7 +3066,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>  						unsigned long maxpages)
>  {
>  	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
> -	unsigned long col = si->cluster_next / SWAPFILE_CLUSTER % SWAP_CLUSTER_COLS;
>  	struct swap_cluster_info *cluster_info;
>  	unsigned long i, j, k, idx;
>  	int cpu, err = -ENOMEM;
> @@ -3283,15 +3077,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>  	for (i = 0; i < nr_clusters; i++)
>  		spin_lock_init(&cluster_info[i].lock);
>  
> -	si->cluster_next_cpu = alloc_percpu(unsigned int);
> -	if (!si->cluster_next_cpu)
> -		goto err_free;
> -
> -	/* Random start position to help with wear leveling */
> -	for_each_possible_cpu(cpu)
> -		per_cpu(*si->cluster_next_cpu, cpu) =
> -		get_random_u32_inclusive(1, si->highest_bit);
> -
>  	si->percpu_cluster = alloc_percpu(struct percpu_cluster);
>  	if (!si->percpu_cluster)
>  		goto err_free;
> @@ -3333,7 +3118,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>  	 * sharing same address space.
>  	 */
>  	for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> -		j = (k + col) % SWAP_CLUSTER_COLS;
> +		j = k % SWAP_CLUSTER_COLS;
>  		for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
>  			struct swap_cluster_info *ci;
>  			idx = i * SWAP_CLUSTER_COLS + j;
> @@ -3483,18 +3268,18 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  
>  	if (si->bdev && bdev_nonrot(si->bdev)) {
>  		si->flags |= SWP_SOLIDSTATE;
> -
> -		cluster_info = setup_clusters(si, swap_header, maxpages);
> -		if (IS_ERR(cluster_info)) {
> -			error = PTR_ERR(cluster_info);
> -			cluster_info = NULL;
> -			goto bad_swap_unlock_inode;
> -		}
>  	} else {
>  		atomic_inc(&nr_rotate_swap);
>  		inced_nr_rotate_swap = true;
>  	}
>  
> +	cluster_info = setup_clusters(si, swap_header, maxpages);
> +	if (IS_ERR(cluster_info)) {
> +		error = PTR_ERR(cluster_info);
> +		cluster_info = NULL;
> +		goto bad_swap_unlock_inode;
> +	}
> +
>  	if ((swap_flags & SWAP_FLAG_DISCARD) &&
>  	    si->bdev && bdev_max_discard_sectors(si->bdev)) {
>  		/*
> @@ -3575,8 +3360,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
>  bad_swap:
>  	free_percpu(si->percpu_cluster);
>  	si->percpu_cluster = NULL;
> -	free_percpu(si->cluster_next_cpu);
> -	si->cluster_next_cpu = NULL;
>  	inode = NULL;
>  	destroy_swap_extents(si);
>  	swap_cgroup_swapoff(si->type);
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 04/13] mm, swap: use cluster lock for HDD
  2024-12-30 17:46 ` [PATCH v3 04/13] mm, swap: use cluster lock " Kairui Song
@ 2025-01-09  4:07   ` Baoquan He
  0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-01-09  4:07 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Cluster lock (ci->lock) was introduce to reduce contention for certain
                              ~~~~~~~~~ typo, introduced.

> operations. Using the cluster lock for HDD is not helpful as HDDs have
> poor performance, so locking isn't the bottleneck. But having a
> different set of locks for HDD / non-HDD prevents further rework of the
> device lock (si->lock).
> 
> This commit just changed all lock_cluster_or_swap_info to lock_cluster,
> which is a safe and straightforward conversion since cluster info is
> always allocated now. It also removed all cluster_info related checks.
> 
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 107 ++++++++++++++++----------------------------------
>  1 file changed, 34 insertions(+), 73 deletions(-)

Other than the nit in the patch log, LGTM,

Reviewed-by: Baoquan He <bhe@redhat.com>

> 
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index fca58d43b836..d0e5b9fa0c48 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -58,10 +58,9 @@ static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry
>  static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
>  			     unsigned int nr_entries);
>  static bool folio_swapcache_freeable(struct folio *folio);
> -static struct swap_cluster_info *lock_cluster_or_swap_info(
> -		struct swap_info_struct *si, unsigned long offset);
> -static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> -					struct swap_cluster_info *ci);
> +static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> +					      unsigned long offset);
> +static void unlock_cluster(struct swap_cluster_info *ci);
>  
>  static DEFINE_SPINLOCK(swap_lock);
>  static unsigned int nr_swapfiles;
> @@ -222,9 +221,9 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  	 * swap_map is HAS_CACHE only, which means the slots have no page table
>  	 * reference or pending writeback, and can't be allocated to others.
>  	 */
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  	need_reclaim = swap_is_has_cache(si, offset, nr_pages);
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  	if (!need_reclaim)
>  		goto out_unlock;
>  
> @@ -404,45 +403,15 @@ static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si
>  {
>  	struct swap_cluster_info *ci;
>  
> -	ci = si->cluster_info;
> -	if (ci) {
> -		ci += offset / SWAPFILE_CLUSTER;
> -		spin_lock(&ci->lock);
> -	}
> -	return ci;
> -}
> -
> -static inline void unlock_cluster(struct swap_cluster_info *ci)
> -{
> -	if (ci)
> -		spin_unlock(&ci->lock);
> -}
> -
> -/*
> - * Determine the locking method in use for this device.  Return
> - * swap_cluster_info if SSD-style cluster-based locking is in place.
> - */
> -static inline struct swap_cluster_info *lock_cluster_or_swap_info(
> -		struct swap_info_struct *si, unsigned long offset)
> -{
> -	struct swap_cluster_info *ci;
> -
> -	/* Try to use fine-grained SSD-style locking if available: */
> -	ci = lock_cluster(si, offset);
> -	/* Otherwise, fall back to traditional, coarse locking: */
> -	if (!ci)
> -		spin_lock(&si->lock);
> +	ci = &si->cluster_info[offset / SWAPFILE_CLUSTER];
> +	spin_lock(&ci->lock);
>  
>  	return ci;
>  }
>  
> -static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> -					       struct swap_cluster_info *ci)
> +static inline void unlock_cluster(struct swap_cluster_info *ci)
>  {
> -	if (ci)
> -		unlock_cluster(ci);
> -	else
> -		spin_unlock(&si->lock);
> +	spin_unlock(&ci->lock);
>  }
>  
>  /* Add a cluster to discard list and schedule it to do discard */
> @@ -558,9 +527,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
>  	unsigned long idx = page_nr / SWAPFILE_CLUSTER;
>  	struct swap_cluster_info *ci;
>  
> -	if (!cluster_info)
> -		return;
> -
>  	ci = cluster_info + idx;
>  	ci->count++;
>  
> @@ -576,9 +542,6 @@ static void inc_cluster_info_page(struct swap_info_struct *si,
>  static void dec_cluster_info_page(struct swap_info_struct *si,
>  				  struct swap_cluster_info *ci, int nr_pages)
>  {
> -	if (!si->cluster_info)
> -		return;
> -
>  	VM_BUG_ON(ci->count < nr_pages);
>  	VM_BUG_ON(cluster_is_free(ci));
>  	lockdep_assert_held(&si->lock);
> @@ -1007,8 +970,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  {
>  	int n_ret = 0;
>  
> -	VM_BUG_ON(!si->cluster_info);
> -
>  	si->flags += SWP_SCANNING;
>  
>  	while (n_ret < nr) {
> @@ -1052,10 +1013,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  		}
>  
>  		/*
> -		 * Swapfile is not block device or not using clusters so unable
> +		 * Swapfile is not block device so unable
>  		 * to allocate large entries.
>  		 */
> -		if (!(si->flags & SWP_BLKDEV) || !si->cluster_info)
> +		if (!(si->flags & SWP_BLKDEV))
>  			return 0;
>  	}
>  
> @@ -1295,9 +1256,9 @@ static unsigned char __swap_entry_free(struct swap_info_struct *si,
>  	unsigned long offset = swp_offset(entry);
>  	unsigned char usage;
>  
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  	usage = __swap_entry_free_locked(si, offset, 1);
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  	if (!usage)
>  		free_swap_slot(entry);
>  
> @@ -1320,14 +1281,14 @@ static bool __swap_entries_free(struct swap_info_struct *si,
>  	if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
>  		goto fallback;
>  
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  	if (!swap_is_last_map(si, offset, nr, &has_cache)) {
> -		unlock_cluster_or_swap_info(si, ci);
> +		unlock_cluster(ci);
>  		goto fallback;
>  	}
>  	for (i = 0; i < nr; i++)
>  		WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  
>  	if (!has_cache) {
>  		for (i = 0; i < nr; i++)
> @@ -1383,7 +1344,7 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
>  	DECLARE_BITMAP(to_free, BITS_PER_LONG) = { 0 };
>  	int i, nr;
>  
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  	while (nr_pages) {
>  		nr = min(BITS_PER_LONG, nr_pages);
>  		for (i = 0; i < nr; i++) {
> @@ -1391,18 +1352,18 @@ static void cluster_swap_free_nr(struct swap_info_struct *si,
>  				bitmap_set(to_free, i, 1);
>  		}
>  		if (!bitmap_empty(to_free, BITS_PER_LONG)) {
> -			unlock_cluster_or_swap_info(si, ci);
> +			unlock_cluster(ci);
>  			for_each_set_bit(i, to_free, BITS_PER_LONG)
>  				free_swap_slot(swp_entry(si->type, offset + i));
>  			if (nr == nr_pages)
>  				return;
>  			bitmap_clear(to_free, 0, BITS_PER_LONG);
> -			ci = lock_cluster_or_swap_info(si, offset);
> +			ci = lock_cluster(si, offset);
>  		}
>  		offset += nr;
>  		nr_pages -= nr;
>  	}
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  }
>  
>  /*
> @@ -1441,9 +1402,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>  	if (!si)
>  		return;
>  
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  	if (size > 1 && swap_is_has_cache(si, offset, size)) {
> -		unlock_cluster_or_swap_info(si, ci);
> +		unlock_cluster(ci);
>  		spin_lock(&si->lock);
>  		swap_entry_range_free(si, entry, size);
>  		spin_unlock(&si->lock);
> @@ -1451,14 +1412,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
>  	}
>  	for (int i = 0; i < size; i++, entry.val++) {
>  		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> -			unlock_cluster_or_swap_info(si, ci);
> +			unlock_cluster(ci);
>  			free_swap_slot(entry);
>  			if (i == size - 1)
>  				return;
> -			lock_cluster_or_swap_info(si, offset);
> +			lock_cluster(si, offset);
>  		}
>  	}
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  }
>  
>  static int swp_entry_cmp(const void *ent1, const void *ent2)
> @@ -1522,9 +1483,9 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
>  	struct swap_cluster_info *ci;
>  	int count;
>  
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  	count = swap_count(si->swap_map[offset]);
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  	return count;
>  }
>  
> @@ -1547,7 +1508,7 @@ int swp_swapcount(swp_entry_t entry)
>  
>  	offset = swp_offset(entry);
>  
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  
>  	count = swap_count(si->swap_map[offset]);
>  	if (!(count & COUNT_CONTINUED))
> @@ -1570,7 +1531,7 @@ int swp_swapcount(swp_entry_t entry)
>  		n *= (SWAP_CONT_MAX + 1);
>  	} while (tmp_count & COUNT_CONTINUED);
>  out:
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  	return count;
>  }
>  
> @@ -1585,8 +1546,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
>  	int i;
>  	bool ret = false;
>  
> -	ci = lock_cluster_or_swap_info(si, offset);
> -	if (!ci || nr_pages == 1) {
> +	ci = lock_cluster(si, offset);
> +	if (nr_pages == 1) {
>  		if (swap_count(map[roffset]))
>  			ret = true;
>  		goto unlock_out;
> @@ -1598,7 +1559,7 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
>  		}
>  	}
>  unlock_out:
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  	return ret;
>  }
>  
> @@ -3428,7 +3389,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>  	offset = swp_offset(entry);
>  	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
>  	VM_WARN_ON(usage == 1 && nr > 1);
> -	ci = lock_cluster_or_swap_info(si, offset);
> +	ci = lock_cluster(si, offset);
>  
>  	err = 0;
>  	for (i = 0; i < nr; i++) {
> @@ -3483,7 +3444,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
>  	}
>  
>  unlock_out:
> -	unlock_cluster_or_swap_info(si, ci);
> +	unlock_cluster(ci);
>  	return err;
>  }
>  
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 05/13] mm, swap: clean up device availability check
  2024-12-30 17:46 ` [PATCH v3 05/13] mm, swap: clean up device availability check Kairui Song
@ 2025-01-09  4:08   ` Baoquan He
  0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-01-09  4:08 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> Remove highest_bit and lowest_bit. After the HDD allocation path
> has been removed, the only purpose of these two fields is to determine
> whether the device is full or not, which can instead be determined
> by checking the inuse_pages.
> 
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  fs/btrfs/inode.c     |  1 -
>  fs/f2fs/data.c       |  1 -
>  fs/iomap/swapfile.c  |  1 -
>  include/linux/swap.h |  2 --
>  mm/page_io.c         |  1 -
>  mm/swapfile.c        | 38 ++++++++------------------------------
>  6 files changed, 8 insertions(+), 36 deletions(-)

Reviewed-by: Baoquan He <bhe@redhat.com>
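
As a rough illustration of the check described in the commit message
(a sketch only; the helper name swap_dev_full() below is made up for
this note, while the fields and the comparison are the ones used in the
diff that follows):

	/* Sketch: "device is full" now means every usable page is in use. */
	static inline bool swap_dev_full(struct swap_info_struct *si)
	{
		/* callers in this patch do this comparison under si->lock */
		return si->inuse_pages == si->pages;
	}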

> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 488edca8333a..a1ba78afab2c 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -10044,7 +10044,6 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file,
>  	*span = bsi.highest_ppage - bsi.lowest_ppage + 1;
>  	sis->max = bsi.nr_pages;
>  	sis->pages = bsi.nr_pages - 1;
> -	sis->highest_bit = bsi.nr_pages - 1;
>  	return bsi.nr_extents;
>  }
>  #else
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index a2478c2afb3a..a9eddd782dbc 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -4043,7 +4043,6 @@ static int check_swap_activate(struct swap_info_struct *sis,
>  		cur_lblock = 1;	/* force Empty message */
>  	sis->max = cur_lblock;
>  	sis->pages = cur_lblock - 1;
> -	sis->highest_bit = cur_lblock - 1;
>  out:
>  	if (not_aligned)
>  		f2fs_warn(sbi, "Swapfile (%u) is not align to section: 1) creat(), 2) ioctl(F2FS_IOC_SET_PIN_FILE), 3) fallocate(%lu * N)",
> diff --git a/fs/iomap/swapfile.c b/fs/iomap/swapfile.c
> index 5fc0ac36dee3..b90d0eda9e51 100644
> --- a/fs/iomap/swapfile.c
> +++ b/fs/iomap/swapfile.c
> @@ -189,7 +189,6 @@ int iomap_swapfile_activate(struct swap_info_struct *sis,
>  	*pagespan = 1 + isi.highest_ppage - isi.lowest_ppage;
>  	sis->max = isi.nr_pages;
>  	sis->pages = isi.nr_pages - 1;
> -	sis->highest_bit = isi.nr_pages - 1;
>  	return isi.nr_extents;
>  }
>  EXPORT_SYMBOL_GPL(iomap_swapfile_activate);
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 0c681aa5cb98..0c222017b5c6 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -306,8 +306,6 @@ struct swap_info_struct {
>  	struct list_head frag_clusters[SWAP_NR_ORDERS];
>  					/* list of cluster that are fragmented or contented */
>  	unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> -	unsigned int lowest_bit;	/* index of first free in swap_map */
> -	unsigned int highest_bit;	/* index of last free in swap_map */
>  	unsigned int pages;		/* total of usable pages of swap */
>  	unsigned int inuse_pages;	/* number of those currently in use */
>  	struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 4b4ea8e49cf6..9b983de351f9 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -163,7 +163,6 @@ int generic_swapfile_activate(struct swap_info_struct *sis,
>  		page_no = 1;	/* force Empty message */
>  	sis->max = page_no;
>  	sis->pages = page_no - 1;
> -	sis->highest_bit = page_no - 1;
>  out:
>  	return ret;
>  bad_bmap:
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index d0e5b9fa0c48..7963a0c646a4 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -55,7 +55,7 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
>  static void free_swap_count_continuations(struct swap_info_struct *);
>  static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
>  				  unsigned int nr_pages);
> -static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> +static void swap_range_alloc(struct swap_info_struct *si,
>  			     unsigned int nr_entries);
>  static bool folio_swapcache_freeable(struct folio *folio);
>  static struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> @@ -650,7 +650,7 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  	}
>  
>  	memset(si->swap_map + start, usage, nr_pages);
> -	swap_range_alloc(si, start, nr_pages);
> +	swap_range_alloc(si, nr_pages);
>  	ci->count += nr_pages;
>  
>  	if (ci->count == SWAPFILE_CLUSTER) {
> @@ -888,19 +888,11 @@ static void del_from_avail_list(struct swap_info_struct *si)
>  	spin_unlock(&swap_avail_lock);
>  }
>  
> -static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> +static void swap_range_alloc(struct swap_info_struct *si,
>  			     unsigned int nr_entries)
>  {
> -	unsigned int end = offset + nr_entries - 1;
> -
> -	if (offset == si->lowest_bit)
> -		si->lowest_bit += nr_entries;
> -	if (end == si->highest_bit)
> -		WRITE_ONCE(si->highest_bit, si->highest_bit - nr_entries);
>  	WRITE_ONCE(si->inuse_pages, si->inuse_pages + nr_entries);
>  	if (si->inuse_pages == si->pages) {
> -		si->lowest_bit = si->max;
> -		si->highest_bit = 0;
>  		del_from_avail_list(si);
>  
>  		if (si->cluster_info && vm_swap_full())
> @@ -933,15 +925,8 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
>  	for (i = 0; i < nr_entries; i++)
>  		clear_bit(offset + i, si->zeromap);
>  
> -	if (offset < si->lowest_bit)
> -		si->lowest_bit = offset;
> -	if (end > si->highest_bit) {
> -		bool was_full = !si->highest_bit;
> -
> -		WRITE_ONCE(si->highest_bit, end);
> -		if (was_full && (si->flags & SWP_WRITEOK))
> -			add_to_avail_list(si);
> -	}
> +	if (si->inuse_pages == si->pages)
> +		add_to_avail_list(si);
>  	if (si->flags & SWP_BLKDEV)
>  		swap_slot_free_notify =
>  			si->bdev->bd_disk->fops->swap_slot_free_notify;
> @@ -1051,15 +1036,12 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  		spin_unlock(&swap_avail_lock);
>  		spin_lock(&si->lock);
> -		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
> +		if ((si->inuse_pages == si->pages) || !(si->flags & SWP_WRITEOK)) {
>  			spin_lock(&swap_avail_lock);
>  			if (plist_node_empty(&si->avail_lists[node])) {
>  				spin_unlock(&si->lock);
>  				goto nextsi;
>  			}
> -			WARN(!si->highest_bit,
> -			     "swap_info %d in list but !highest_bit\n",
> -			     si->type);
>  			WARN(!(si->flags & SWP_WRITEOK),
>  			     "swap_info %d in list but !SWP_WRITEOK\n",
>  			     si->type);
> @@ -2441,8 +2423,8 @@ static void _enable_swap_info(struct swap_info_struct *si)
>  	 */
>  	plist_add(&si->list, &swap_active_head);
>  
> -	/* add to available list iff swap device is not full */
> -	if (si->highest_bit)
> +	/* add to available list if swap device is not full */
> +	if (si->inuse_pages < si->pages)
>  		add_to_avail_list(si);
>  }
>  
> @@ -2606,7 +2588,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	drain_mmlist();
>  
>  	/* wait for anyone still in scan_swap_map_slots */
> -	p->highest_bit = 0;		/* cuts scans short */
>  	while (p->flags >= SWP_SCANNING) {
>  		spin_unlock(&p->lock);
>  		spin_unlock(&swap_lock);
> @@ -2941,8 +2922,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>  		return 0;
>  	}
>  
> -	si->lowest_bit  = 1;
> -
>  	maxpages = swapfile_maximum_size;
>  	last_page = swap_header->info.last_page;
>  	if (!last_page) {
> @@ -2959,7 +2938,6 @@ static unsigned long read_swap_header(struct swap_info_struct *si,
>  		if ((unsigned int)maxpages == 0)
>  			maxpages = UINT_MAX;
>  	}
> -	si->highest_bit = maxpages - 1;
>  
>  	if (!maxpages)
>  		return 0;
> -- 
> 2.47.1
> 
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
  2025-01-09  2:15     ` Kairui Song
@ 2025-01-10 11:23       ` Baoquan He
  2025-01-13  6:33         ` Kairui Song
  0 siblings, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-01-10 11:23 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 01/09/25 at 10:15am, Kairui Song wrote:
> On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> >
> 
> Thanks for the very detailed review!
> 
> > On 12/31/24 at 01:46am, Kairui Song wrote:
> > ......snip.....
> > > ---
> > >  include/linux/swap.h |   3 +-
> > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > >   * throughput.
> > >   */
> > >  struct percpu_cluster {
> > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > >  };
> > >
> > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > >                                       /* list of cluster that contains at least one free slot */
> > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > >                                       /* list of cluster that are fragmented or contented */
> > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > >       unsigned int pages;             /* total of usable pages of swap */
> > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 7795a3d27273..dadd4fead689 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
...snip...
> > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > >
> > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > >  {
> > > -     lockdep_assert_held(&si->lock);
> > >       lockdep_assert_held(&ci->lock);
> > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > >       ci->order = 0;
> > >  }
> > >
> > > +/*
> > > + * Isolate and lock the first cluster that is not contented on a list,
> > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > + * with list status, so cluster updaters can always know the cluster
> > > + * list status without touching si lock.
> > > + *
> > > + * Note it's possible that all clusters on a list are contented so
> > > + * this returns NULL for an non-empty list.
> > > + */
> > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > +             struct swap_info_struct *si, struct list_head *list)
> > > +{
> > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > +
> > > +     spin_lock(&si->lock);
> > > +
> > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > +             goto out;
> > > +
> > > +     list_for_each_entry(ci, list, list) {
> > > +             if (!spin_trylock(&ci->lock))
> > > +                     continue;
> > > +
> > > +             /* We may only isolate and clear flags of following lists */
> > > +             VM_BUG_ON(!ci->flags);
> > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > +
> > > +             list_del(&ci->list);
> > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > +             ret = ci;
> > > +             break;
> > > +     }
> > > +out:
> > > +     spin_unlock(&si->lock);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > >  /*
> > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > - * will be added to free cluster list. caller should hold si->lock.
> > > -*/
> > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > + */
> > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > >  {
> > >       struct swap_cluster_info *ci;
> > > +     bool ret = false;
> > >       unsigned int idx;
> > >
> > > +     spin_lock(&si->lock);
> > >       while (!list_empty(&si->discard_clusters)) {
> > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > +             /*
> > > +              * Delete the cluster from list but don't clear its flags until
> > > +              * discard is done, so isolation and relocation will skip it.
> > > +              */
> > >               list_del(&ci->list);
> >
> > I don't understand the above comment. ci has been taken off the list,
> > while allocation needs to isolate a cluster from a usable list. Even
> > though we clear ci->flags now, how could isolation and relocation
> > touch it? I may be missing something here.
> 
> There are many cases. One possible and common situation is that the
> percpu cluster (si->percpu_cluster of another CPU) is still pointing
> to it.
> 
> Also, this commit removed the protection of the si lock on allocation,
> and the allocation path may also drop the ci lock to call reclaim,
> which means one cluster could be used or freed by anyone before the
> allocator reacquires the ci lock. In that case, the allocator could
> see a discard cluster.
> 
> So we don't clear the discard flag, in case anyone misuses it.
> 
> I can add more inline comments on this; there are already some related
> comments above the function relocate_cluster, and I could add more
> referencing that.
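
As a rough sketch of the first situation described above (an
illustration only, condensed from the fast-path hunk quoted earlier in
this thread, not the exact final code), the per-CPU hint can go stale
while another CPU frees the cluster, and the flag that is deliberately
left set is what lets the allocator back off once it gets ci->lock:

	/*
	 * CPU A                              CPU B
	 * -----                              -----
	 * percpu next[order] points at X
	 *                                    frees the last entries of X,
	 *                                    free_cluster() moves X to the
	 *                                    discard list, flag = DISCARD
	 * ci = lock_cluster(si, offset);
	 * the DISCARD flag was kept, so
	 * cluster_is_usable() fails and the
	 * allocator falls back to the lists
	 * via cluster_isolate_lock():
	 */
	ci = lock_cluster(si, offset);
	if (cluster_is_usable(ci, order)) {
		offset = alloc_swap_scan_cluster(si, offset, &found,
						 order, usage);
	} else {
		unlock_cluster(ci);	/* stale per-CPU hint, try the lists */
	}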

Thanks for your great explanation. I understand that si->percpu_cluster
could point to a discarded ci, and that a ci taken from the non-full or
frag lists could later become discarded if it is freed on another CPU
during the cluster_reclaim_range() invocation. What I haven't understood
is how isolation could see a discarded ci in cluster_isolate_lock().
Could you give an example of how that can happen?

Surely, I understand that keeping the discarded flag is necessary so
that checks like cluster_is_usable() return the expected value.

And by the way, I haven't understood when the 'if (!ci->count)' case
could happen in relocate_cluster(), since we have filtered away
discarded ci with the 'if (cluster_is_discard(ci))' check. I asked in
another thread; could you help explain it?

static void relocate_cluster(struct swap_info_struct *si,
                             struct swap_cluster_info *ci)
{               
        lockdep_assert_held(&ci->lock); 
                
        /* Discard cluster must remain off-list or on discard list */
        if (cluster_is_discard(ci))
                return;
                
        if (!ci->count) {
                free_cluster(si, ci);
...
}
> 
> >
> > > -             /* Must clear flag when taking a cluster off-list */
> > > -             ci->flags = CLUSTER_FLAG_NONE;
> > >               idx = cluster_index(si, ci);
> > >               spin_unlock(&si->lock);
> > > -
> > >               discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > >                               SWAPFILE_CLUSTER);
> > >
> > > -             spin_lock(&si->lock);
> > >               spin_lock(&ci->lock);
> > > -             __free_cluster(si, ci);
> > > +             /*
> > > +              * Discard is done, clear its flags as it's now off-list,
> > > +              * then return the cluster to allocation list.
> > > +              */
> > > +             ci->flags = CLUSTER_FLAG_NONE;
> > >               memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > >                               0, SWAPFILE_CLUSTER);
> > > +             __free_cluster(si, ci);
> > >               spin_unlock(&ci->lock);
> > > +             ret = true;
> > > +             spin_lock(&si->lock);
> > >       }
> > > +     spin_unlock(&si->lock);
> > > +     return ret;
> > >  }
> > >
> > >  static void swap_discard_work(struct work_struct *work)
> > ......snip....
> > > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> > >  static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > >                                             unsigned char usage)
> > >  {
> > > -     struct percpu_cluster *cluster;
> > >       struct swap_cluster_info *ci;
> > >       unsigned int offset, found = 0;
> > >
> > > -new_cluster:
> > > -     lockdep_assert_held(&si->lock);
> > > -     cluster = this_cpu_ptr(si->percpu_cluster);
> > > -     offset = cluster->next[order];
> > > +     /* Fast path using per CPU cluster */
> > > +     local_lock(&si->percpu_cluster->lock);
> > > +     offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > >       if (offset) {
> > > -             offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > > +             ci = lock_cluster(si, offset);
> > > +             /* Cluster could have been used by another order */
> > > +             if (cluster_is_usable(ci, order)) {
> > > +                     if (cluster_is_free(ci))
> > > +                             offset = cluster_offset(si, ci);
> > > +                     offset = alloc_swap_scan_cluster(si, offset, &found,
> > > +                                                      order, usage);
> > > +             } else {
> > > +                     unlock_cluster(ci);
> > > +             }
> > >               if (found)
> > >                       goto done;
> > >       }
> > >
> > > -     if (!list_empty(&si->free_clusters)) {
> > > -             ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > > -             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > > -             /*
> > > -              * Either we didn't touch the cluster due to swapoff,
> > > -              * or the allocation must success.
> > > -              */
> > > -             VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > > -             goto done;
> > > +new_cluster:
> > > +     ci = cluster_isolate_lock(si, &si->free_clusters);
> > > +     if (ci) {
> > > +             offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > +                                              &found, order, usage);
> > > +             if (found)
> > > +                     goto done;
> > >       }
> > >
> > >       /* Try reclaim from full clusters if free clusters list is drained */
> > > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > >               swap_reclaim_full_clusters(si, false);
> > >
> > >       if (order < PMD_ORDER) {
> > > -             unsigned int frags = 0;
> > > +             unsigned int frags = 0, frags_existing;
> > >
> > > -             while (!list_empty(&si->nonfull_clusters[order])) {
> > > -                     ci = list_first_entry(&si->nonfull_clusters[order],
> > > -                                           struct swap_cluster_info, list);
> > > -                     cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > > +             while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> > >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >                                                        &found, order, usage);
> > > -                     frags++;
> > > +                     /*
> > > +                      * With `fragmenting` set to true, it will surely take
> >                                  ~~~~~~~~~~~
> >                          wondering what 'fragmenting' means here.
> 
> This comment is a bit out of context indeed; it is actually trying to
> say that the alloc_swap_scan_cluster call above should move the cluster
> to the tail. I'll update the comment.
> 
> 
> 
> >
> > > +                      * the cluster off nonfull list
> > > +                      */
> > >                       if (found)
> > >                               goto done;
> > > +                     frags++;
> > >               }
> > >
> > > -             /*
> > > -              * Nonfull clusters are moved to frag tail if we reached
> > > -              * here, count them too, don't over scan the frag list.
> > > -              */
> > > -             while (frags < si->frag_cluster_nr[order]) {
> > > -                     ci = list_first_entry(&si->frag_clusters[order],
> > > -                                           struct swap_cluster_info, list);
> > > +             frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > > +             while (frags < frags_existing &&
> > > +                    (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > > +                     atomic_long_dec(&si->frag_cluster_nr[order]);
> > >                       /*
> > > -                      * Rotate the frag list to iterate, they were all failing
> > > -                      * high order allocation or moved here due to per-CPU usage,
> > > -                      * this help keeping usable cluster ahead.
> > > +                      * Rotate the frag list to iterate, they were all
> > > +                      * failing high order allocation or moved here due to
> > > +                      * per-CPU usage, but they could contain newly released
> > > +                      * reclaimable (eg. lazy-freed swap cache) slots.
> > >                        */
> > > -                     list_move_tail(&ci->list, &si->frag_clusters[order]);
> > >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > >                                                        &found, order, usage);
> > > -                     frags++;
> > >                       if (found)
> > >                               goto done;
> > > +                     frags++;
> > >               }
> > >       }
> > >
> > > -     if (!list_empty(&si->discard_clusters)) {
> > > -             /*
> > > -              * we don't have free cluster but have some clusters in
> > > -              * discarding, do discard now and reclaim them, then
> > > -              * reread cluster_next_cpu since we dropped si->lock
> > > -              */
> > > -             swap_do_scheduled_discard(si);
> > > +     /*
> > > +      * We don't have free cluster but have some clusters in
> > > +      * discarding, do discard now and reclaim them, then
> > > +      * reread cluster_next_cpu since we dropped si->lock
> > > +      */
> > > +     if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > >               goto new_cluster;
> > > -     }
> > >
> > >       if (order)
> > >               goto done;
> > .....
> >
> >
> 



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
  2025-01-04  5:46   ` Baoquan He
@ 2025-01-13  5:34     ` Kairui Song
  2025-01-20  2:39       ` Baoquan He
  0 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2025-01-13  5:34 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On Sat, Jan 4, 2025 at 1:46 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 12/31/24 at 01:46am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > The flag SWP_SCANNING was used as an indicator of whether a device
> > is being scanned for allocation, and prevents swapoff. Combined with
> > SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
> >
> > 1. Swapoff clears SWP_WRITEOK, allocation requests will see
> >    ~SWP_WRITEOK and abort as it's serialized by si->lock.
> > 2. Swapoff unuses all allocated entries.
> > 3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
> >    allocations will stop, preventing UAF.
> > 4. Now swapoff can free everything safely.
> >
> > This will make the allocation path have a hard dependency on
> > si->lock. Allocations always have to acquire si->lock first to set
> > SWP_SCANNING and check SWP_WRITEOK.
> >
> > This commit removes this flag, and just uses the existing per-CPU
> > refcount instead to prevent UAF in step 3, which serves well for
> > such usage without dependency on si->lock, and scales very well too.
> > Just hold a reference during the whole scan and allocation process.
> > Swapoff will kill and wait for the counter.
> >
> > And for preventing any allocation from happening after step 1 so the
> > unuse in step 2 can ensure all slots are free, swapoff will acquire
> > the ci->lock of each cluster one by one to ensure all allocations
> > see ~SWP_WRITEOK and abort.
>
> Changing to use si->users is great, but I am wondering why we need to
> acquire each ci->lock now. After step 1, we have cleared SWP_WRITEOK
> and taken the si off the swap_avail_heads list. No matter what, we just
> need to wait for p->comp's completion and continue, so why bother
> looping to acquire each ci->lock?
>

Hi Baoquan,

Waiting for p->comp's completion must be done after unuse is called
(unuse needs to take the si->users refcount, so it can't be dead yet),
but unuse must be called only after no one can allocate any new entry.
That is guaranteed by the loop acquiring each ci->lock.
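
A rough sketch of the ordering being discussed (illustration only; the
loop shape is made up here and the real series may structure it
differently, but the helpers and checks are the ones visible in the
patches in this thread):

	/* after SWP_WRITEOK is cleared under si->lock ... */
	for (offset = 0; offset < si->max; offset += SWAPFILE_CLUSTER) {
		ci = lock_cluster(si, offset);	/* wait out in-flight allocs */
		unlock_cluster(ci);
	}
	/*
	 * Every allocator has now either finished or will see ~SWP_WRITEOK
	 * (cluster_alloc_range() checks it while ci->lock is held and
	 * aborts), so try_to_unuse() can run; only after that does swapoff
	 * kill si->users and wait for p->comp.
	 */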


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes
  2025-01-06  8:43   ` Baoquan He
@ 2025-01-13  5:49     ` Kairui Song
  0 siblings, 0 replies; 35+ messages in thread
From: Kairui Song @ 2025-01-13  5:49 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On Mon, Jan 6, 2025 at 4:43 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 12/31/24 at 01:46am, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently, we are only using flags to indicate which list the cluster
> > is on. Using one bit for each list type might be a waste: as the list
> > types grow, we will consume too many bits. Additionally, the current
> > mixed usage of '&' and '==' is a bit confusing.
>
> I think this kind of conversion can only happen when the type is
> exclusive on each cluster. Then we can set it and use
> 'ci->flags == CLUSTER_FLAG_XXX' to check it.

Hi Baoquan,

Not sure what you mean. The flags are exclusive, and after this
commit, we are always using "ci->flags == CLUSTER_FLAG_XXX" to check
the flag.
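
A small before/after illustration of what "exclusive" buys here, taken
in spirit from the hunks in this patch (not new code): with bit-flags
the tests combine with '&' and nothing prevents two bits being set at
once, while with the enum a cluster is in exactly one state, so '=='
(with CLUSTER_FLAG_NONE meaning off-list) describes it completely:

	/* before: power-of-two bits, tested with '&' */
	if (ci->flags & CLUSTER_FLAG_FRAG)
		si->frag_cluster_nr[ci->order]--;

	/* after: exclusive enum states, tested with '==' */
	if (ci->flags == CLUSTER_FLAG_FRAG)
		si->frag_cluster_nr[ci->order]--;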

>
> >
> > Make it clean by using an enum to define all possible cluster
> > statuses. Only an off-list cluster will have the NONE (0) flag.
> > And use a wrapper to annotate and sanitize all flag settings
> > and list movements.
> >
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  include/linux/swap.h | 17 +++++++---
> >  mm/swapfile.c        | 75 +++++++++++++++++++++++---------------------
> >  2 files changed, 52 insertions(+), 40 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 02120f1005d5..339d7f0192ff 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -257,10 +257,19 @@ struct swap_cluster_info {
> >       u8 order;
> >       struct list_head list;
> >  };
> > -#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > -#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> > -#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */
> > -#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */
> > +
> > +/* All on-list cluster must have a non-zero flag. */
> > +enum swap_cluster_flags {
> > +     CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
> > +     CLUSTER_FLAG_FREE,
> > +     CLUSTER_FLAG_NONFULL,
> > +     CLUSTER_FLAG_FRAG,
> > +     /* Clusters with flags above are allocatable */
> > +     CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
> > +     CLUSTER_FLAG_FULL,
> > +     CLUSTER_FLAG_DISCARD,
> > +     CLUSTER_FLAG_MAX,
> > +};
> >
> >  /*
> >   * The first page in the swap file is the swap header, which is always marked
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 99fd0b0d84a2..7795a3d27273 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -403,7 +403,7 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> >
> >  static inline bool cluster_is_free(struct swap_cluster_info *info)
> >  {
> > -     return info->flags & CLUSTER_FLAG_FREE;
> > +     return info->flags == CLUSTER_FLAG_FREE;
> >  }
> >
> >  static inline unsigned int cluster_index(struct swap_info_struct *si,
> > @@ -434,6 +434,27 @@ static inline void unlock_cluster(struct swap_cluster_info *ci)
> >       spin_unlock(&ci->lock);
> >  }
> >
> > +static void cluster_move(struct swap_info_struct *si,
>                ~~~~~~~~~~~~
> Maybe rename it to move_cluster(), which has the same naming style as
> lock_cluster()/unlock_cluster()? This is the naming we usually use when
> a function performs an action on objects.

Good idea.


>
> Other than this, this patch looks great to me.
>
> > +                      struct swap_cluster_info *ci, struct list_head *list,
> > +                      enum swap_cluster_flags new_flags)
> > +{
> > +     VM_WARN_ON(ci->flags == new_flags);
> > +     BUILD_BUG_ON(1 << sizeof(ci->flags) * BITS_PER_BYTE < CLUSTER_FLAG_MAX);
> > +
> > +     if (ci->flags == CLUSTER_FLAG_NONE) {
> > +             list_add_tail(&ci->list, list);
> > +     } else {
> > +             if (ci->flags == CLUSTER_FLAG_FRAG) {
> > +                     VM_WARN_ON(!si->frag_cluster_nr[ci->order]);
> > +                     si->frag_cluster_nr[ci->order]--;
> > +             }
> > +             list_move_tail(&ci->list, list);
> > +     }
> > +     ci->flags = new_flags;
> > +     if (new_flags == CLUSTER_FLAG_FRAG)
> > +             si->frag_cluster_nr[ci->order]++;
> > +}
> > +
> >  /* Add a cluster to discard list and schedule it to do discard */
> >  static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >               struct swap_cluster_info *ci)
> > @@ -447,10 +468,8 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >        */
> >       memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >                       SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> > -
> > -     VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
> > -     list_move_tail(&ci->list, &si->discard_clusters);
> > -     ci->flags = 0;
> > +     VM_BUG_ON(ci->flags == CLUSTER_FLAG_FREE);
> > +     cluster_move(si, ci, &si->discard_clusters, CLUSTER_FLAG_DISCARD);
> >       schedule_work(&si->discard_work);
> >  }
> >
> > @@ -458,12 +477,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> >  {
> >       lockdep_assert_held(&si->lock);
> >       lockdep_assert_held(&ci->lock);
> > -
> > -     if (ci->flags)
> > -             list_move_tail(&ci->list, &si->free_clusters);
> > -     else
> > -             list_add_tail(&ci->list, &si->free_clusters);
> > -     ci->flags = CLUSTER_FLAG_FREE;
> > +     cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> >       ci->order = 0;
> >  }
> >
> > @@ -479,6 +493,8 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> >       while (!list_empty(&si->discard_clusters)) {
> >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> >               list_del(&ci->list);
> > +             /* Must clear flag when taking a cluster off-list */
> > +             ci->flags = CLUSTER_FLAG_NONE;
> >               idx = cluster_index(si, ci);
> >               spin_unlock(&si->lock);
> >
> > @@ -519,9 +535,6 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
> >       lockdep_assert_held(&si->lock);
> >       lockdep_assert_held(&ci->lock);
> >
> > -     if (ci->flags & CLUSTER_FLAG_FRAG)
> > -             si->frag_cluster_nr[ci->order]--;
> > -
> >       /*
> >        * If the swap is discardable, prepare discard the cluster
> >        * instead of free it immediately. The cluster will be freed
> > @@ -573,13 +586,9 @@ static void dec_cluster_info_page(struct swap_info_struct *si,
> >               return;
> >       }
> >
> > -     if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> > -             VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
> > -             if (ci->flags & CLUSTER_FLAG_FRAG)
> > -                     si->frag_cluster_nr[ci->order]--;
> > -             list_move_tail(&ci->list, &si->nonfull_clusters[ci->order]);
> > -             ci->flags = CLUSTER_FLAG_NONFULL;
> > -     }
> > +     if (ci->flags != CLUSTER_FLAG_NONFULL)
> > +             cluster_move(si, ci, &si->nonfull_clusters[ci->order],
> > +                          CLUSTER_FLAG_NONFULL);
> >  }
> >
> >  static bool cluster_reclaim_range(struct swap_info_struct *si,
> > @@ -663,11 +672,13 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> >       if (!(si->flags & SWP_WRITEOK))
> >               return false;
> >
> > +     VM_BUG_ON(ci->flags == CLUSTER_FLAG_NONE);
> > +     VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE);
> > +
> >       if (cluster_is_free(ci)) {
> > -             if (nr_pages < SWAPFILE_CLUSTER) {
> > -                     list_move_tail(&ci->list, &si->nonfull_clusters[order]);
> > -                     ci->flags = CLUSTER_FLAG_NONFULL;
> > -             }
> > +             if (nr_pages < SWAPFILE_CLUSTER)
> > +                     cluster_move(si, ci, &si->nonfull_clusters[order],
> > +                                  CLUSTER_FLAG_NONFULL);
> >               ci->order = order;
> >       }
> >
> > @@ -675,14 +686,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> >       swap_range_alloc(si, nr_pages);
> >       ci->count += nr_pages;
> >
> > -     if (ci->count == SWAPFILE_CLUSTER) {
> > -             VM_BUG_ON(!(ci->flags &
> > -                       (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
> > -             if (ci->flags & CLUSTER_FLAG_FRAG)
> > -                     si->frag_cluster_nr[ci->order]--;
> > -             list_move_tail(&ci->list, &si->full_clusters);
> > -             ci->flags = CLUSTER_FLAG_FULL;
> > -     }
> > +     if (ci->count == SWAPFILE_CLUSTER)
> > +             cluster_move(si, ci, &si->full_clusters, CLUSTER_FLAG_FULL);
> >
> >       return true;
> >  }
> > @@ -821,9 +826,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >               while (!list_empty(&si->nonfull_clusters[order])) {
> >                       ci = list_first_entry(&si->nonfull_clusters[order],
> >                                             struct swap_cluster_info, list);
> > -                     list_move_tail(&ci->list, &si->frag_clusters[order]);
> > -                     ci->flags = CLUSTER_FLAG_FRAG;
> > -                     si->frag_cluster_nr[order]++;
> > +                     cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> >                       offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> >                                                        &found, order, usage);
> >                       frags++;
> > --
> > 2.47.1
> >
> >
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
  2025-01-10 11:23       ` Baoquan He
@ 2025-01-13  6:33         ` Kairui Song
  2025-01-13  8:07           ` Kairui Song
  0 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2025-01-13  6:33 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On Fri, Jan 10, 2025 at 7:24 PM Baoquan He <bhe@redhat.com> wrote:
>
> On 01/09/25 at 10:15am, Kairui Song wrote:
> > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> > >
> >
> > Thanks for the very detailed review!
> >
> > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > ......snip.....
> > > > ---
> > > >  include/linux/swap.h |   3 +-
> > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > >
> > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > --- a/include/linux/swap.h
> > > > +++ b/include/linux/swap.h
> > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > >   * throughput.
> > > >   */
> > > >  struct percpu_cluster {
> > > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > >  };
> > > >
> > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > >                                       /* list of cluster that contains at least one free slot */
> > > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > >                                       /* list of cluster that are fragmented or contented */
> > > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > >       unsigned int pages;             /* total of usable pages of swap */
> > > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > index 7795a3d27273..dadd4fead689 100644
> > > > --- a/mm/swapfile.c
> > > > +++ b/mm/swapfile.c
> ...snip...
> > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > >
> > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > >  {
> > > > -     lockdep_assert_held(&si->lock);
> > > >       lockdep_assert_held(&ci->lock);
> > > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > >       ci->order = 0;
> > > >  }
> > > >
> > > > +/*
> > > > + * Isolate and lock the first cluster that is not contented on a list,
> > > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > > + * with list status, so cluster updaters can always know the cluster
> > > > + * list status without touching si lock.
> > > > + *
> > > > + * Note it's possible that all clusters on a list are contented so
> > > > + * this returns NULL for an non-empty list.
> > > > + */
> > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > +             struct swap_info_struct *si, struct list_head *list)
> > > > +{
> > > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > > +
> > > > +     spin_lock(&si->lock);
> > > > +
> > > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > +             goto out;
> > > > +
> > > > +     list_for_each_entry(ci, list, list) {
> > > > +             if (!spin_trylock(&ci->lock))
> > > > +                     continue;
> > > > +
> > > > +             /* We may only isolate and clear flags of following lists */
> > > > +             VM_BUG_ON(!ci->flags);
> > > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > > +
> > > > +             list_del(&ci->list);
> > > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > > +             ret = ci;
> > > > +             break;
> > > > +     }
> > > > +out:
> > > > +     spin_unlock(&si->lock);
> > > > +
> > > > +     return ret;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > -*/
> > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > > + */
> > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > >  {
> > > >       struct swap_cluster_info *ci;
> > > > +     bool ret = false;
> > > >       unsigned int idx;
> > > >
> > > > +     spin_lock(&si->lock);
> > > >       while (!list_empty(&si->discard_clusters)) {
> > > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > +             /*
> > > > +              * Delete the cluster from list but don't clear its flags until
> > > > +              * discard is done, so isolation and relocation will skip it.
> > > > +              */
> > > >               list_del(&ci->list);
> > >
> > > I don't understand the above comment. ci has been taken off the list,
> > > while allocation needs to isolate from a usable list. Even though we clear
> > > ci->flags now, how come isolation and relocation will touch it? I may
> > > be missing something here.
> >
> > There are many cases. One possible and common situation is that the
> > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > to it.
> >
> > Also, this commit removed the protection of the si lock on allocation, and
> > the allocation path may also drop the ci lock to call reclaim, which means a
> > cluster could be used or freed by anyone before the allocator reacquires
> > the ci lock again. In that case, the allocator could see a discard
> > cluster.
> >
> > So we don't clear the discard flag, in case anyone misuses it.
> >
> > I can add more inline comments on this; there are already some related
> > comments above the function relocate_cluster, and I could add some more
> > referencing that.

Hi Baoquan,

>
> Thanks for your great explanation. I understand that si->percpu_cluster
> could point to a discarded ci, and that a ci could be taken from the non-full
> or frag lists but later become discarded if that ci is freed on another cpu
> during a cluster_reclaim_range() invocation. I haven't figured out how
> isolation could see a discarded ci in cluster_isolate_lock(). Could you help
> give an example of how that happens?

cluster_isolate_lock shouldn't see a discard cluster, and there is a
VM_BUG_ON for that.

>
> Surely, I understand keeping the discarded flag is very necessary so
> that checks like cluster_is_usable() will return the expected value.
>
> And by the way, I haven't figured out when the 'if (!ci->count)' case could
> happen in relocate_cluster(), since we have filtered away discarded ci
> with the 'if (cluster_is_discard(ci))' check. I asked in another
> thread, could you help explain it?

Many swap devices don't need discard, so the cluster could be freed
directly. And actually the ci->count check in relocate_cluster is not
necessarily related to that.

The caller of relocate_cluster may fail an allocation (e.g. a race with
swapoff), and that could end up calling relocate_cluster with an empty
cluster; such a cluster should go to the free list (swapoff might fail
too).

The swapoff case is extremely rare, but let's just be more robust here:
covering the free cluster case has almost no overhead but saves a lot of
effort. I can add some comments on this.
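To make that concrete, here is a tiny userspace sketch of the routing
being described. This is not the patch code: the toy_* names are made
up for illustration, and the full/nonfull split only follows my reading
of the earlier hunks. The point is just that an empty cluster, even one
left behind by a failed allocation, is simply routed to the free list.

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define SWAPFILE_CLUSTER 512

enum toy_list { LIST_NONE, LIST_FREE, LIST_FULL, LIST_NONFULL };

struct toy_cluster {
	unsigned int count;	/* allocated slots in this cluster */
	bool discard;		/* pending discard, must stay off these lists */
};

/* Mirrors the decision tree described for relocate_cluster() */
static enum toy_list toy_relocate(const struct toy_cluster *ci)
{
	if (ci->discard)
		return LIST_NONE;	/* remain off-list or on the discard list */
	if (!ci->count)
		return LIST_FREE;	/* e.g. allocation raced with swapoff */
	if (ci->count == SWAPFILE_CLUSTER)
		return LIST_FULL;
	return LIST_NONFULL;		/* partially used */
}

int main(void)
{
	struct toy_cluster raced = { .count = 0, .discard = false };

	/* a cluster whose allocation failed ends up on the free list */
	assert(toy_relocate(&raced) == LIST_FREE);
	printf("empty cluster goes to the free list\n");
	return 0;
}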

>
> static void relocate_cluster(struct swap_info_struct *si,
>                              struct swap_cluster_info *ci)
> {
>         lockdep_assert_held(&ci->lock);
>
>         /* Discard cluster must remain off-list or on discard list */
>         if (cluster_is_discard(ci))
>                 return;
>
>         if (!ci->count) {
>                 free_cluster(si, ci);
> ...
> }


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
  2025-01-13  6:33         ` Kairui Song
@ 2025-01-13  8:07           ` Kairui Song
  0 siblings, 0 replies; 35+ messages in thread
From: Kairui Song @ 2025-01-13  8:07 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On Mon, Jan 13, 2025 at 2:33 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Fri, Jan 10, 2025 at 7:24 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 01/09/25 at 10:15am, Kairui Song wrote:
> > > On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> > > >
> > >
> > > Thanks for the very detailed review!
> > >
> > > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > > ......snip.....
> > > > > ---
> > > > >  include/linux/swap.h |   3 +-
> > > > >  mm/swapfile.c        | 435 ++++++++++++++++++++++++-------------------
> > > > >  2 files changed, 246 insertions(+), 192 deletions(-)
> > > > >
> > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > > > --- a/include/linux/swap.h
> > > > > +++ b/include/linux/swap.h
> > > > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > > >   * throughput.
> > > > >   */
> > > > >  struct percpu_cluster {
> > > > > +     local_lock_t lock; /* Protect the percpu_cluster above */
> > > > >       unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > > >  };
> > > > >
> > > > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > > >                                       /* list of cluster that contains at least one free slot */
> > > > >       struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > > >                                       /* list of cluster that are fragmented or contented */
> > > > > -     unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > > > +     atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > > >       unsigned int pages;             /* total of usable pages of swap */
> > > > >       atomic_long_t inuse_pages;      /* number of those currently in use */
> > > > >       struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > > > index 7795a3d27273..dadd4fead689 100644
> > > > > --- a/mm/swapfile.c
> > > > > +++ b/mm/swapfile.c
> > ...snip...
> > > > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > > > >
> > > > >  static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > > >  {
> > > > > -     lockdep_assert_held(&si->lock);
> > > > >       lockdep_assert_held(&ci->lock);
> > > > >       cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > > >       ci->order = 0;
> > > > >  }
> > > > >
> > > > > +/*
> > > > > + * Isolate and lock the first cluster that is not contented on a list,
> > > > > + * clean its flag before taken off-list. Cluster flag must be in sync
> > > > > + * with list status, so cluster updaters can always know the cluster
> > > > > + * list status without touching si lock.
> > > > > + *
> > > > > + * Note it's possible that all clusters on a list are contented so
> > > > > + * this returns NULL for an non-empty list.
> > > > > + */
> > > > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > > > +             struct swap_info_struct *si, struct list_head *list)
> > > > > +{
> > > > > +     struct swap_cluster_info *ci, *ret = NULL;
> > > > > +
> > > > > +     spin_lock(&si->lock);
> > > > > +
> > > > > +     if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > > > +             goto out;
> > > > > +
> > > > > +     list_for_each_entry(ci, list, list) {
> > > > > +             if (!spin_trylock(&ci->lock))
> > > > > +                     continue;
> > > > > +
> > > > > +             /* We may only isolate and clear flags of following lists */
> > > > > +             VM_BUG_ON(!ci->flags);
> > > > > +             VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > > > +                       ci->flags != CLUSTER_FLAG_FULL);
> > > > > +
> > > > > +             list_del(&ci->list);
> > > > > +             ci->flags = CLUSTER_FLAG_NONE;
> > > > > +             ret = ci;
> > > > > +             break;
> > > > > +     }
> > > > > +out:
> > > > > +     spin_unlock(&si->lock);
> > > > > +
> > > > > +     return ret;
> > > > > +}
> > > > > +
> > > > >  /*
> > > > >   * Doing discard actually. After a cluster discard is finished, the cluster
> > > > > - * will be added to free cluster list. caller should hold si->lock.
> > > > > -*/
> > > > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > > + * will be added to free cluster list. Discard cluster is a bit special as
> > > > > + * they don't participate in allocation or reclaim, so clusters marked as
> > > > > + * CLUSTER_FLAG_DISCARD must remain off-list or on discard list.
> > > > > + */
> > > > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > > >  {
> > > > >       struct swap_cluster_info *ci;
> > > > > +     bool ret = false;
> > > > >       unsigned int idx;
> > > > >
> > > > > +     spin_lock(&si->lock);
> > > > >       while (!list_empty(&si->discard_clusters)) {
> > > > >               ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > > > +             /*
> > > > > +              * Delete the cluster from list but don't clear its flags until
> > > > > +              * discard is done, so isolation and relocation will skip it.
> > > > > +              */
> > > > >               list_del(&ci->list);
> > > >
> > > > I don't understand the above comment. ci has been taken off the list,
> > > > while allocation needs to isolate from a usable list. Even though we clear
> > > > ci->flags now, how come isolation and relocation will touch it? I may
> > > > be missing something here.
> > >
> > > There are many cases. One possible and common situation is that the
> > > percpu cluster (si->percpu_cluster of another CPU) is still pointing
> > > to it.
> > >
> > > Also, this commit removed the protection of the si lock on allocation, and
> > > the allocation path may also drop the ci lock to call reclaim, which means a
> > > cluster could be used or freed by anyone before the allocator reacquires
> > > the ci lock again. In that case, the allocator could see a discard
> > > cluster.
> > >
> > > So we don't clear the discard flag, in case anyone misuses it.
> > >
> > > I can add more inline comments on this; there are already some related
> > > comments above the function relocate_cluster, and I could add some more
> > > referencing that.
>
> Hi Baoquan,
>
> >
> > Thanks for your great explanation. I understand that si->percpu_cluster
> > could point to a discarded ci, and that a ci could be taken from the non-full
> > or frag lists but later become discarded if that ci is freed on another cpu
> > during a cluster_reclaim_range() invocation. I haven't figured out how
> > isolation could see a discarded ci in cluster_isolate_lock(). Could you help
> > give an example of how that happens?
>
> cluster_isolate_lock shouldn't see a discard cluster, and there is a
> VM_BUG_ON for that.

Oh, now I realize what you mean: the comment in
swap_do_scheduled_discard mentions that cluster_isolate_lock may see a
discard cluster. That is not true; it was added in an early version of
this series and I forgot to update the comment. I'll just drop that.

>
> >
> > Surely, I understand keeping the discarded flag is very necessary so
> > that checks like cluster_is_usable() will return the expected value.
> >
> > And by the way, I haven't figured out when the 'if (!ci->count)' case could
> > happen in relocate_cluster(), since we have filtered away discarded ci
> > with the 'if (cluster_is_discard(ci))' check. I asked in another
> > thread, could you help explain it?
>
> Many swap devices don't need discard, so the cluster could be freed
> directly. And actually the ci->count check in relocate_cluster is not
> necessarily related to that.
>
> The caller of relocate_cluster may fail an allocation (e.g. a race with
> swapoff), and that could end up calling relocate_cluster with an empty
> cluster; such a cluster should go to the free list (swapoff might fail
> too).
>
> The swapoff case is extremely rare, but let's just be more robust here:
> covering the free cluster case has almost no overhead but saves a lot of
> effort. I can add some comments on this.
>
> >
> > static void relocate_cluster(struct swap_info_struct *si,
> >                              struct swap_cluster_info *ci)
> > {
> >         lockdep_assert_held(&ci->lock);
> >
> >         /* Discard cluster must remain off-list or on discard list */
> >         if (cluster_is_discard(ci))
> >                 return;
> >
> >         if (!ci->count) {
> >                 free_cluster(si, ci);
> > ...
> > }


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
  2025-01-13  5:34     ` Kairui Song
@ 2025-01-20  2:39       ` Baoquan He
  2025-01-27  9:19         ` Kairui Song
  0 siblings, 1 reply; 35+ messages in thread
From: Baoquan He @ 2025-01-20  2:39 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 01/13/25 at 01:34pm, Kairui Song wrote:
> On Sat, Jan 4, 2025 at 1:46 PM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > The flag SWP_SCANNING was used as an indicator of whether a device
> > > is being scanned for allocation, and prevents swapoff. Combined with
> > > SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
> > >
> > > 1. Swapoff clears SWP_WRITEOK, allocation requests will see
> > >    ~SWP_WRITEOK and abort as it's serialized by si->lock.
> > > 2. Swapoff unuses all allocated entries.
> > > 3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
> > >    allocations will stop, preventing UAF.
> > > 4. Now swapoff can free everything safely.
> > >
> > > This will make the allocation path have a hard dependency on
> > > si->lock. Allocation always have to acquire si->lock first for
> > > setting SWP_SCANNING and checking SWP_WRITEOK.
> > >
> > > This commit removes this flag, and just uses the existing per-CPU
> > > refcount instead to prevent UAF in step 3, which serves well for
> > > such usage without dependency on si->lock, and scales very well too.
> > > Just hold a reference during the whole scan and allocation process.
> > > Swapoff will kill and wait for the counter.
> > >
> > > And for preventing any allocation from happening after step 1 so the
> > > unuse in step 2 can ensure all slots are free, swapoff will acquire
> > > the ci->lock of each cluster one by one to ensure all allocations
> > > see ~SWP_WRITEOK and abort.
> >
> > Changing to use si->users is great, while wondering why we need acquire =
> > each ci->lock now. After setup 1, we have cleared SWP_WRITEOK, and take
> > the si off swap_avail_heads list. No matter what, we just need wait for
> > p->comm's completion and continue, why bothering to loop for the
> > ci->lock acquiring?
> >
> 
> Hi Baoquan,
> 
> Waiting for p->comm's completion must be done after unuse is called
> (unuse will need to take the si->users refcound, so it can't be dead
> yet), but unuse must be called after no one will allocate any new
> entry. That is guaranteed by the loop ci->lock acquiring.

Sorry for the late response, Kairui. I went through the code flow of swap
allocation several times, but still haven't understood why the loop of
ci->lock acquiring is needed here. Once si->flags &= ~SWP_WRITEOK is
executed in del_from_avail_list() when swapping off, even if an allocation
is still ongoing, it will fail in cluster_alloc_range() at the
'if (!(si->flags & SWP_WRITEOK))' check. Then that allocation
request fails and returns, meaning no new swap entry|slot
allocation will be done, and unuse won't be impacted at all. In this
case, why do we care about it?

Please forgive my stupidity, but could you elaborate in which case this
kind of still-ongoing swap allocation can happen while its swap device is
being swapped off? Could you give an example of the concurrent execution
flows?

Thanks
Baoquan



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
  2025-01-20  2:39       ` Baoquan He
@ 2025-01-27  9:19         ` Kairui Song
  2025-02-05  9:18           ` Baoquan He
  0 siblings, 1 reply; 35+ messages in thread
From: Kairui Song @ 2025-01-27  9:19 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On Mon, Jan 20, 2025 at 10:39 AM Baoquan He <bhe@redhat.com> wrote:
>
> On 01/13/25 at 01:34pm, Kairui Song wrote:
> > On Sat, Jan 4, 2025 at 1:46 PM Baoquan He <bhe@redhat.com> wrote:
> > >
> > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > > From: Kairui Song <kasong@tencent.com>
> > > >
> > > > The flag SWP_SCANNING was used as an indicator of whether a device
> > > > is being scanned for allocation, and prevents swapoff. Combined with
> > > > SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
> > > >
> > > > 1. Swapoff clears SWP_WRITEOK, allocation requests will see
> > > >    ~SWP_WRITEOK and abort as it's serialized by si->lock.
> > > > 2. Swapoff unuses all allocated entries.
> > > > 3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
> > > >    allocations will stop, preventing UAF.
> > > > 4. Now swapoff can free everything safely.
> > > >
> > > > This will make the allocation path have a hard dependency on
> > > > si->lock. Allocation always have to acquire si->lock first for
> > > > setting SWP_SCANNING and checking SWP_WRITEOK.
> > > >
> > > > This commit removes this flag, and just uses the existing per-CPU
> > > > refcount instead to prevent UAF in step 3, which serves well for
> > > > such usage without dependency on si->lock, and scales very well too.
> > > > Just hold a reference during the whole scan and allocation process.
> > > > Swapoff will kill and wait for the counter.
> > > >
> > > > And for preventing any allocation from happening after step 1 so the
> > > > unuse in step 2 can ensure all slots are free, swapoff will acquire
> > > > the ci->lock of each cluster one by one to ensure all allocations
> > > > see ~SWP_WRITEOK and abort.
> > >
> > > Changing to use si->users is great, while wondering why we need acquire =
> > > each ci->lock now. After setup 1, we have cleared SWP_WRITEOK, and take
> > > the si off swap_avail_heads list. No matter what, we just need wait for
> > > p->comm's completion and continue, why bothering to loop for the
> > > ci->lock acquiring?
> > >
> >
> > Hi Baoquan,
> >
> > Waiting for p->comm's completion must be done after unuse is called
> > (unuse will need to take the si->users refcound, so it can't be dead
> > yet), but unuse must be called after no one will allocate any new
> > entry. That is guaranteed by the loop ci->lock acquiring.
>
> Sorry for the late response, Kairui. I went through the code flow of swap
> allocation several times, but still haven't understood why the loop of
> ci->lock acquiring is needed here. Once si->flags &= ~SWP_WRITEOK is
> executed in del_from_avail_list() when swapping off, even if an allocation
> is still ongoing, it will fail in cluster_alloc_range() at the
> 'if (!(si->flags & SWP_WRITEOK))' check. Then that allocation

Hi Baoquan,

Thanks for the careful review.

> request fails and returns, meaning no new swap entry|slot
> allocation will be done, and unuse won't be impacted at all. In this
> case, why do we care about it?
>
> Please forgive my stupidity, but could you elaborate in which case this
> kind of still-ongoing swap allocation can happen while its swap device is
> being swapped off? Could you give an example of the concurrent execution
> flows?

There is no barrier or lock between clearing the flag and try_to_unuse,
so nothing guarantees the "if (!(si->flags & SWP_WRITEOK))" check in
cluster_alloc_range will see the updated flag. The loop of ci->lock
acquiring acts like a full memory barrier, ensuring any allocation that
takes a ci->lock after the loop will definitely see the updated flags,
and try_to_unuse will only go on after all allocations have either
stopped or are guaranteed to see the updated flags. In practice this
problem is almost impossible to hit, but it is possible in theory.
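To illustrate the ordering argument, here is a minimal userspace model
(pthreads, made-up toy_* names, not the kernel code): the writer clears
the flag, then takes and releases every per-cluster lock once; any reader
that checks the flag under its cluster lock afterwards is guaranteed to
observe the cleared value, and any reader already inside its critical
section has finished before the writer moves on.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_CLUSTERS 4

static pthread_mutex_t cluster_lock[NR_CLUSTERS] = {
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
	PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER,
};
static atomic_bool writeok = true;	/* stands in for SWP_WRITEOK */

/* allocator side: the flag is only trusted while the cluster lock is held */
static bool toy_cluster_alloc_range(int ci)
{
	bool ok;

	pthread_mutex_lock(&cluster_lock[ci]);
	ok = atomic_load_explicit(&writeok, memory_order_relaxed);
	if (ok) {
		/* ... would allocate slots from this cluster ... */
	}
	pthread_mutex_unlock(&cluster_lock[ci]);
	return ok;
}

/* swapoff side: clear the flag, then cycle every cluster lock once */
static void toy_wait_for_allocation(void)
{
	atomic_store_explicit(&writeok, false, memory_order_relaxed);
	for (int ci = 0; ci < NR_CLUSTERS; ci++) {
		pthread_mutex_lock(&cluster_lock[ci]);	/* pairs with the allocator */
		pthread_mutex_unlock(&cluster_lock[ci]);
	}
	/*
	 * Every allocator has either finished its critical section or will
	 * see writeok == false; only now would the equivalent of
	 * try_to_unuse() be safe to start.
	 */
}

int main(void)
{
	toy_wait_for_allocation();
	printf("alloc after barrier succeeds? %d\n", toy_cluster_alloc_range(0));
	return 0;
}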

>
> Thanks
> Baoquan
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
  2025-01-27  9:19         ` Kairui Song
@ 2025-02-05  9:18           ` Baoquan He
  0 siblings, 0 replies; 35+ messages in thread
From: Baoquan He @ 2025-02-05  9:18 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, Huang, Ying, Nhat Pham,
	Johannes Weiner, Kalesh Singh, linux-kernel

On 01/27/25 at 05:19pm, Kairui Song wrote:
> On Mon, Jan 20, 2025 at 10:39 AM Baoquan He <bhe@redhat.com> wrote:
> >
> > On 01/13/25 at 01:34pm, Kairui Song wrote:
> > > On Sat, Jan 4, 2025 at 1:46 PM Baoquan He <bhe@redhat.com> wrote:
> > > >
> > > > On 12/31/24 at 01:46am, Kairui Song wrote:
> > > > > From: Kairui Song <kasong@tencent.com>
> > > > >
> > > > > The flag SWP_SCANNING was used as an indicator of whether a device
> > > > > is being scanned for allocation, and prevents swapoff. Combined with
> > > > > SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
> > > > >
> > > > > 1. Swapoff clears SWP_WRITEOK, allocation requests will see
> > > > >    ~SWP_WRITEOK and abort as it's serialized by si->lock.
> > > > > 2. Swapoff unuses all allocated entries.
> > > > > 3. Swapoff waits for SWP_SCANNING flag to be cleared, so ongoing
> > > > >    allocations will stop, preventing UAF.
> > > > > 4. Now swapoff can free everything safely.
> > > > >
> > > > > This will make the allocation path have a hard dependency on
> > > > > si->lock. Allocation always have to acquire si->lock first for
> > > > > setting SWP_SCANNING and checking SWP_WRITEOK.
> > > > >
> > > > > This commit removes this flag, and just uses the existing per-CPU
> > > > > refcount instead to prevent UAF in step 3, which serves well for
> > > > > such usage without dependency on si->lock, and scales very well too.
> > > > > Just hold a reference during the whole scan and allocation process.
> > > > > Swapoff will kill and wait for the counter.
> > > > >
> > > > > And for preventing any allocation from happening after step 1 so the
> > > > > unuse in step 2 can ensure all slots are free, swapoff will acquire
> > > > > the ci->lock of each cluster one by one to ensure all allocations
> > > > > see ~SWP_WRITEOK and abort.
> > > >
> > > > Changing to use si->users is great, while wondering why we need acquire =
> > > > each ci->lock now. After setup 1, we have cleared SWP_WRITEOK, and take
> > > > the si off swap_avail_heads list. No matter what, we just need wait for
> > > > p->comm's completion and continue, why bothering to loop for the
> > > > ci->lock acquiring?
> > > >
> > >
> > > Hi Baoquan,
> > >
> > > Waiting for p->comm's completion must be done after unuse is called
> > > (unuse will need to take the si->users refcound, so it can't be dead
> > > yet), but unuse must be called after no one will allocate any new
> > > entry. That is guaranteed by the loop ci->lock acquiring.
> >
> > Sorry for the late response, Kairui. I went through the code flow of swap
> > allocation several times, but still haven't understood why the loop of
> > ci->lock acquiring is needed here. Once si->flags &= ~SWP_WRITEOK is
> > executed in del_from_avail_list() when swapping off, even if an allocation
> > is still ongoing, it will fail in cluster_alloc_range() at the
> > 'if (!(si->flags & SWP_WRITEOK))' check. Then that allocation
> 
> Hi Baoquan,
> 
> Thanks for the careful review.
> 
> > request fails and returns, meaning no new swap entry|slot
> > allocation will be done, and unuse won't be impacted at all. In this
> > case, why do we care about it?
> >
> > Please forgive my stupidity, but could you elaborate in which case this
> > kind of still-ongoing swap allocation can happen while its swap device is
> > being swapped off? Could you give an example of the concurrent execution
> > flows?
> 
> There is no barrier or lock between clearing the flag and try_to_unuse,
> so nothing guarantees the "if (!(si->flags & SWP_WRITEOK))" check in
> cluster_alloc_range will see the updated flag. The loop of ci->lock
> acquiring acts like a full memory barrier, ensuring any allocation that
> takes a ci->lock after the loop will definitely see the updated flags,
> and try_to_unuse will only go on after all allocations have either
> stopped or are guaranteed to see the updated flags. In practice this
> problem is almost impossible to hit, but it is possible in theory.

Got it now. swap_avail_lock is not taken during allocation, and we don't
take it when accessing si->flags in cluster_alloc_range() because that
could bring in new lock contention.

Thanks a lot for the patient explanation.



^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2025-02-05  9:18 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-12-30 17:46 [PATCH v3 00/13] mm, swap: rework of swap allocator locks Kairui Song
2024-12-30 17:46 ` [PATCH v3 01/13] mm, swap: minor clean up for swap entry allocation Kairui Song
2025-01-09  4:04   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 02/13] mm, swap: fold swap_info_get_cont in the only caller Kairui Song
2025-01-09  4:05   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 03/13] mm, swap: remove old allocation path for HDD Kairui Song
2025-01-09  4:06   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 04/13] mm, swap: use cluster lock " Kairui Song
2025-01-09  4:07   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 05/13] mm, swap: clean up device availability check Kairui Song
2025-01-09  4:08   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 06/13] mm, swap: clean up plist removal and adding Kairui Song
2025-01-02  8:59   ` Baoquan He
2025-01-03  8:07     ` Kairui Song
2024-12-30 17:46 ` [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage Kairui Song
2025-01-04  5:46   ` Baoquan He
2025-01-13  5:34     ` Kairui Song
2025-01-20  2:39       ` Baoquan He
2025-01-27  9:19         ` Kairui Song
2025-02-05  9:18           ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 08/13] mm, swap: use an enum to define all cluster flags and wrap flags changes Kairui Song
2025-01-06  8:43   ` Baoquan He
2025-01-13  5:49     ` Kairui Song
2024-12-30 17:46 ` [PATCH v3 09/13] mm, swap: reduce contention on device lock Kairui Song
2025-01-06 10:12   ` Baoquan He
2025-01-08 11:09   ` Baoquan He
2025-01-09  2:15     ` Kairui Song
2025-01-10 11:23       ` Baoquan He
2025-01-13  6:33         ` Kairui Song
2025-01-13  8:07           ` Kairui Song
2024-12-30 17:46 ` [PATCH v3 10/13] mm, swap: simplify percpu cluster updating Kairui Song
2025-01-09  2:07   ` Baoquan He
2024-12-30 17:46 ` [PATCH v3 11/13] mm, swap: introduce a helper for retrieving cluster from offset Kairui Song
2024-12-30 17:46 ` [PATCH v3 12/13] mm, swap: use a global swap cluster for non-rotation devices Kairui Song
2024-12-30 17:46 ` [PATCH v3 13/13] mm, swap_slots: remove slot cache for freeing path Kairui Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).