linux-mm.kvack.org archive mirror
* [PATCH 0/2] mm, swap: improve cluster scan strategy
@ 2025-08-04 17:24 Kairui Song
  2025-08-04 17:24 ` [PATCH 1/2] mm, swap: don't scan every fragment cluster Kairui Song
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Kairui Song @ 2025-08-04 17:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This series improves large allocation performance and reduces
the failure rate. After thorough testing, some parts of the cluster
allocator's design were found to have room for improvement.

For example, a kernel build test with make -j96 and 10G ZRAM, with
64kB mTHP enabled, shows better performance and a lower failure rate:

Before: sys time: 10230.22s  64kB/swpout: 1793044  64kB/swpout_fallback: 17653
After:  sys time: 5538.3s    64kB/swpout: 1813133  64kB/swpout_fallback: 0

System time is cut in half, and the failure rate drops to zero. Larger
allocations in a hybrid workload also showed a major improvement:

512kB swap failure rate:
Before: swpout:11971  swpout_fallback:2218
After:  swpout:14606  swpout_fallback:4

2M swap failure rate:
Before: swpout:12     swpout_fallback:1578
After:  swpout:1253   swpout_fallback:15

Kairui Song (2):
  mm, swap: don't scan every fragment cluster
  mm, swap: prefer nonfull over free clusters

 include/linux/swap.h |  1 -
 mm/swapfile.c        | 68 +++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 33 deletions(-)

-- 
2.50.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] mm, swap: don't scan every fragment cluster
  2025-08-04 17:24 [PATCH 0/2] mm, swap: improve cluster scan strategy Kairui Song
@ 2025-08-04 17:24 ` Kairui Song
  2025-08-05 23:30   ` Chris Li
  2025-08-04 17:24 ` [PATCH 2/2] mm, swap: prefer nonfull over free clusters Kairui Song
  2025-08-05 23:26 ` [PATCH 0/2] mm, swap: improve cluster scan strategy Chris Li
  2 siblings, 1 reply; 10+ messages in thread
From: Kairui Song @ 2025-08-04 17:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Fragment clusters were mostly failing high order allocations already.
The reason we scan them now is that a swap slot may get freed without
releasing the swap cache, so a swap map entry will end up in HAS_CACHE-only
status, and the cluster won't be moved back to the non-full or free
cluster list.

Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap
device usage is low (!vm_swap_full()) since swap will try to lazy free
the swap cache.

It's unlikely to cause any real issue. Fragmentation is only an issue
when the device is getting full, and by that time, swap will already
be releasing the swap cache aggressively. And swap cache reclaim happens
when the allocator scans a cluster too. Scanning one fragment cluster
should be good enough to reclaim these pinned slots.

Besides, only high order allocations require iterating over a
cluster list; order 0 allocations will succeed on the first attempt.
And high order allocation failure isn't a serious problem.

So iterating over the fragment clusters brings little benefit, and it
slows down mTHP allocation by a lot when the fragment cluster list is
long. It's better to drop this fragment cluster iteration design. Only
scanning one fragment cluster is good enough in case any cluster is
stuck in the fragment list; this ensures order 0 allocation never
fails, and large allocations still have an acceptable success rate.
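
For reference, the relevant part of cluster_alloc_swap_entry() after
this patch reduces to roughly the following (reconstructed from the
diff below; locking and the rest of the function are elided):

	if (order < PMD_ORDER) {
		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
							order, usage);
			if (found)
				goto done;
		}

		/* Scan a single fragment cluster instead of the whole list. */
		ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
		if (ci) {
			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
							order, usage);
			if (found)
				goto done;
		}
	}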

Test on a 48c96t system, building the Linux kernel using 10G ZRAM,
make -j48, defconfig with a 768M cgroup memory limit, on top of tmpfs,
4K folios only:

Before: sys time: 4407.28s
After:  sys time: 4425.22s

Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:

Before: sys time: 10230.22s  64kB/swpout: 1793044  64kB/swpout_fallback: 17653
After:  sys time: 5527.90s   64kB/swpout: 1789358  64kB/swpout_fallback: 17813

Change to 8G ZRAM:

Before: sys time: 21929.17s  64kB/swpout: 1634681  64kB/swpout_fallback: 173056
After:  sys time: 6121.01s   64kB/swpout: 1638155  64kB/swpout_fallback: 189562

Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 7368.41s  64kB/swpout:1787599  swpout_fallback: 0
After:  sys time: 7338.27s  64kB/swpout:1783106  swpout_fallback: 0

Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 28139.60s 64kB/swpout:1645421  swpout_fallback: 148408
After:  sys time: 8941.90s  64kB/swpout:1592973  swpout_fallback: 265010

The performance is a lot better, and the large order allocation failure
rate is only very slightly higher or unchanged.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 30 ++++++++----------------------
 2 files changed, 8 insertions(+), 23 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..a060d102e0d1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,7 +310,6 @@ struct swap_info_struct {
 					/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
-	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b4f3cc712580..5fdb3cb2b8b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
 	else
 		list_move_tail(&ci->list, list);
 	spin_unlock(&si->lock);
-
-	if (ci->flags == CLUSTER_FLAG_FRAG)
-		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
-	else if (new_flags == CLUSTER_FLAG_FRAG)
-		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
 }
 
@@ -926,32 +921,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0, frags_existing;
-
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			/* Clusters failed to allocate are moved to frag_clusters */
-			frags++;
 		}
 
-		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
-		while (frags < frags_existing &&
-		       (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
-			atomic_long_dec(&si->frag_cluster_nr[order]);
-			/*
-			 * Rotate the frag list to iterate, they were all
-			 * failing high order allocation or moved here due to
-			 * per-CPU usage, but they could contain newly released
-			 * reclaimable (eg. lazy-freed swap cache) slots.
-			 */
+		/*
+		 * Scanning only one fragment cluster is good enough. Order 0
+		 * allocation will surely succeed, and large allocation
+		 * failure is not critical. Scanning one cluster still
+		 * keeps the list rotated and reclaimed (for HAS_CACHE).
+		 */
+		ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
+		if (ci) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			frags++;
 		}
 	}
 
@@ -972,7 +960,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		 * allocation, but reclaim may drop si->lock and race with another user.
 		 */
 		while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
-			atomic_long_dec(&si->frag_cluster_nr[o]);
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							0, usage);
 			if (found)
@@ -3224,7 +3211,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] mm, swap: prefer nonfull over free clusters
  2025-08-04 17:24 [PATCH 0/2] mm, swap: improve cluster scan strategy Kairui Song
  2025-08-04 17:24 ` [PATCH 1/2] mm, swap: don't scan every fragment cluster Kairui Song
@ 2025-08-04 17:24 ` Kairui Song
  2025-08-05 23:35   ` Chris Li
  2025-08-06  0:03   ` Nhat Pham
  2025-08-05 23:26 ` [PATCH 0/2] mm, swap: improve cluster scan strategy Chris Li
  2 siblings, 2 replies; 10+ messages in thread
From: Kairui Song @ 2025-08-04 17:24 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

We prefer a free cluster over a nonfull cluster whenever a CPU local
cluster is drained, to respect the SSD discard behavior [1]. That is
not the best practice for non-discarding devices, and it causes a
higher fragmentation rate.

So for a non-discarding device, prefer nonfull over free clusters. This
reduces the fragmentation issue by a lot.
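
The resulting scan order can be sketched as follows (reconstructed
from the diff below; try_list() is just shorthand here for the
isolate_lock_cluster() + alloc_swap_scan_cluster() pattern, not a
real helper):

	/* Discarding device: take a free cluster first to spread writes. */
	if (si->flags & SWP_PAGE_DISCARD)
		try_list(&si->free_clusters);
	/* Prefer nonfull clusters of the requested order. */
	if (order < PMD_ORDER)
		try_list(&si->nonfull_clusters[order]);
	/* Non-discarding device: fall back to free clusters only now. */
	if (!(si->flags & SWP_PAGE_DISCARD))
		try_list(&si->free_clusters);
	/* Reclaim full clusters once the lists above are drained. */
	if (vm_swap_full())
		swap_reclaim_full_clusters(si, false);
	/* Finally, scan a single fragment cluster. */
	if (order < PMD_ORDER)
		try_list(&si->frag_clusters[order]);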

Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:

Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071

Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:

Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0

Performance is basically unchanged, and the large allocation failure rate
is lower. Enabling all mTHP sizes showed a more significant result:

Using the same test setup with 10G ZRAM and enabling all mTHP sizes:

128kB swap failure rate:
Before: swpout:449548 swpout_fallback:55894
After:  swpout:497519 swpout_fallback:3204

256kB swap failure rate:
Before: swpout:63938  swpout_fallback:2154
After:  swpout:65698  swpout_fallback:324

512kB swap failure rate:
Before: swpout:11971  swpout_fallback:2218
After:  swpout:14606  swpout_fallback:4

2M swap failure rate:
Before: swpout:12     swpout_fallback:1578
After:  swpout:1253   swpout_fallback:15

The success rate of large allocations is much higher.

Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5fdb3cb2b8b7..4a0cf4fb348d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 new_cluster:
-	ci = isolate_lock_cluster(si, &si->free_clusters);
-	if (ci) {
-		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
-		if (found)
-			goto done;
+	/*
+	 * If the device needs discard, prefer a new cluster over nonfull
+	 * to spread out the writes.
+	 */
+	if (si->flags & SWP_PAGE_DISCARD) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
 	}
 
-	/* Try reclaim from full clusters if free clusters list is drained */
-	if (vm_swap_full())
-		swap_reclaim_full_clusters(si, false);
-
 	if (order < PMD_ORDER) {
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
@@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			if (found)
 				goto done;
 		}
+	}
 
+	if (!(si->flags & SWP_PAGE_DISCARD)) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
+	}
+
+	/* Try reclaiming full clusters if free and nonfull lists are drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
+	if (order < PMD_ORDER) {
 		/*
 		 * Scanning only one fragment cluster is good enough. Order 0
 		 * allocation will surely succeed, and large allocation
-- 
2.50.1



^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/2] mm, swap: improve cluster scan strategy
  2025-08-04 17:24 [PATCH 0/2] mm, swap: improve cluster scan strategy Kairui Song
  2025-08-04 17:24 ` [PATCH 1/2] mm, swap: don't scan every fragment cluster Kairui Song
  2025-08-04 17:24 ` [PATCH 2/2] mm, swap: prefer nonfull over free clusters Kairui Song
@ 2025-08-05 23:26 ` Chris Li
  2 siblings, 0 replies; 10+ messages in thread
From: Chris Li @ 2025-08-05 23:26 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This series improves large allocation performance and reduces
> the failure rate. After thorough testing, some parts of the cluster
> allocator's design were found to have room for improvement.

Nit: If you have a next version of this series, please include a bit
of detail on how you got the improvement, to kick off the discussion.
Right now the cover letter just says "I have some cool changes and here
are the numbers", e.g. limiting the fragment list search to the first
cluster.

>
> For example, a kernel build test with make -j96 and 10G ZRAM, with
> 64kB mTHP enabled, shows better performance and a lower failure rate:
>
> Before: sys time: 10230.22s  64kB/swpout: 1793044  64kB/swpout_fallback: 17653
> After:  sys time: 5538.3s    64kB/swpout: 1813133  64kB/swpout_fallback: 0
>
> System time is cut in half, and the failure rate drops to zero. Larger
> allocations in a hybrid workload also showed a major improvement:

That is a big improvement. Congrats.

>
> 512kB swap failure rate:
> Before: swpout:11971  swpout_fallback:2218
> After:  swpout:14606  swpout_fallback:4
>
> 2M swap failure rate:
> Before: swpout:12     swpout_fallback:1578
> After:  swpout:1253   swpout_fallback:15

The numbers look very good.

Chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] mm, swap: don't scan every fragment cluster
  2025-08-04 17:24 ` [PATCH 1/2] mm, swap: don't scan every fragment cluster Kairui Song
@ 2025-08-05 23:30   ` Chris Li
  2025-08-06  3:02     ` Kairui Song
  0 siblings, 1 reply; 10+ messages in thread
From: Chris Li @ 2025-08-05 23:30 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

Looks good to me, with minor nitpicks on the commit message and comments.

Let me know if you will refresh a version or not.

Nit: I suggest the patch title use positive terms, something along the
lines of:
"Only scan one cluster in fragment list"
"Don't scan" seems to describe what the patch does not do rather than
what the patch does.

On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Fragment clusters were mostly failing high order allocations already.
> The reason we scan them now is that a swap slot may get freed without
> releasing the swap cache, so a swap map entry will end up in HAS_CACHE-only
> status, and the cluster won't be moved back to the non-full or free
> cluster list.
>
> Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap

Nit: Please clarify what "this" here means. I assume it means scanning
fragment lists. From the context it could almost mean "map entry will
end up in HAS_CACHE".


> device usage is low (!vm_swap_full()) since swap will try to lazy free
> the swap cache.
>
> It's unlikely to cause any real issue. Fragmentation is only an issue
> when the device is getting full, and by that time, swap will already
> be releasing the swap cache aggressively. And swap cache reclaim happens
> when the allocator scans a cluster too. Scanning one fragment cluster
> should be good enough to reclaim these pinned slots.
>
> Besides, only high order allocations require iterating over a
> cluster list; order 0 allocations will succeed on the first attempt.
> And high order allocation failure isn't a serious problem.
>
> So iterating over the fragment clusters brings little benefit, and it
> slows down mTHP allocation by a lot when the fragment cluster list is
> long. It's better to drop this fragment cluster iteration design. Only
> scanning one fragment cluster is good enough in case any cluster is
> stuck in the fragment list; this ensures order 0 allocation never
> fails, and large allocations still have an acceptable success rate.
>
> Test on a 48c96t system, building the Linux kernel using 10G ZRAM,
> make -j48, defconfig with a 768M cgroup memory limit, on top of tmpfs,
> 4K folios only:
>
> Before: sys time: 4407.28s
> After:  sys time: 4425.22s
>
> Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
>
> Before: sys time: 10230.22s  64kB/swpout: 1793044  64kB/swpout_fallback: 17653
> After:  sys time: 5527.90s   64kB/swpout: 1789358  64kB/swpout_fallback: 17813
>
> Change to 8G ZRAM:
>
> Before: sys time: 21929.17s  64kB/swpout: 1634681  64kB/swpout_fallback: 173056
> After:  sys time: 6121.01s   64kB/swpout: 1638155  64kB/swpout_fallback: 189562
>
> Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 7368.41s  64kB/swpout:1787599  swpout_fallback: 0
> After:  sys time: 7338.27s  64kB/swpout:1783106  swpout_fallback: 0
>
> Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 28139.60s 64kB/swpout:1645421  swpout_fallback: 148408
> After:  sys time: 8941.90s  64kB/swpout:1592973  swpout_fallback: 265010
>
> The performance is a lot better, and the large order allocation failure
> rate is only very slightly higher or unchanged.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h |  1 -
>  mm/swapfile.c        | 30 ++++++++----------------------
>  2 files changed, 8 insertions(+), 23 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2fe6ed2cc3fd..a060d102e0d1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -310,7 +310,6 @@ struct swap_info_struct {
>                                         /* list of cluster that contains at least one free slot */
>         struct list_head frag_clusters[SWAP_NR_ORDERS];
>                                         /* list of cluster that are fragmented or contented */
> -       atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];

Nit: please add a comment in the commit log on why the frag_cluster_nr
counter is removed.
I feel this change can be split out from the main change of this
patch. The main performance improvement is from only scanning one
fragment cluster rather than the full list, right? Deleting the counter
helps, but by a much smaller amount.

Chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
  2025-08-04 17:24 ` [PATCH 2/2] mm, swap: prefer nonfull over free clusters Kairui Song
@ 2025-08-05 23:35   ` Chris Li
  2025-08-06  0:03   ` Nhat Pham
  1 sibling, 0 replies; 10+ messages in thread
From: Chris Li @ 2025-08-05 23:35 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

On Mon, Aug 4, 2025 at 10:25 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We prefer a free cluster over a nonfull cluster whenever a CPU local
> cluster is drained, to respect the SSD discard behavior [1]. That is
> not the best practice for non-discarding devices, and it causes a
> higher fragmentation rate.

Not only does it cause a higher fragmentation rate. It also means a
limited working set, over a long period of continued swapping, can
write to the whole swap partition. That is bad from the SSD point of
view if the swap page access pattern is random, because with random
access patterns very few clusters ever have all 512 slots free, which
is what it takes to reach the discard. The previously preferred
new-cluster approach works best with batches of short to medium
running jobs, where at the end of a batch there is a time when most
of the swapped working set is released. That can turn nonfull clusters
back into free clusters. For long running jobs with random access of
swap entries, there is very little chance a cluster gets freed for
discard.
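
As a back-of-the-envelope illustration (my numbers, and assuming
independent slot occupancy, which real workloads won't satisfy
exactly): with the device 90% full, each of a cluster's 512 slots is
free with probability ~0.1, so the chance a whole cluster is free is
on the order of 0.1^512 ~= 10^-512, i.e. effectively never under
random access.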

This patch will cause a limited working set to only write to a limited
swap area, which is a good thing from the SSD wear point of view.

Chris

> So for a non-discarding device, prefer nonfull over free clusters. This
> reduces the fragmentation issue by a lot.
>
> Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
>
> Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071
>
> Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
>
> Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0
>
> Performance is basically unchanged, and the large allocation failure rate
> is lower. Enabling all mTHP sizes showed a more significant result:
>
> Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
>
> 128kB swap failure rate:
> Before: swpout:449548 swpout_fallback:55894
> After:  swpout:497519 swpout_fallback:3204
>
> 256kB swap failure rate:
> Before: swpout:63938  swpout_fallback:2154
> After:  swpout:65698  swpout_fallback:324
>
> 512kB swap failure rate:
> Before: swpout:11971  swpout_fallback:2218
> After:  swpout:14606  swpout_fallback:4
>
> 2M swap failure rate:
> Before: swpout:12     swpout_fallback:1578
> After:  swpout:1253   swpout_fallback:15
>
> The success rate of large allocations is much higher.
>
> Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
>  1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5fdb3cb2b8b7..4a0cf4fb348d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         }
>
>  new_cluster:
> -       ci = isolate_lock_cluster(si, &si->free_clusters);
> -       if (ci) {
> -               found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                               order, usage);
> -               if (found)
> -                       goto done;
> +       /*
> +        * If the device needs discard, prefer a new cluster over nonfull
> +        * to spread out the writes.
> +        */
> +       if (si->flags & SWP_PAGE_DISCARD) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
>         }
>
> -       /* Try reclaim from full clusters if free clusters list is drained */
> -       if (vm_swap_full())
> -               swap_reclaim_full_clusters(si, false);
> -
>         if (order < PMD_ORDER) {
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> @@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         if (found)
>                                 goto done;
>                 }
> +       }
>
> +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
> +       }
> +
> +       /* Try reclaiming full clusters if free and nonfull lists are drained */
> +       if (vm_swap_full())
> +               swap_reclaim_full_clusters(si, false);
> +
> +       if (order < PMD_ORDER) {
>                 /*
>                  * Scan only one fragment cluster is good enough. Order 0
>                  * allocation will surely success, and large allocation
> --
> 2.50.1
>
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
  2025-08-04 17:24 ` [PATCH 2/2] mm, swap: prefer nonfull over free clusters Kairui Song
  2025-08-05 23:35   ` Chris Li
@ 2025-08-06  0:03   ` Nhat Pham
  2025-08-06  0:30     ` Chris Li
  2025-08-06  3:38     ` Kairui Song
  1 sibling, 2 replies; 10+ messages in thread
From: Nhat Pham @ 2025-08-06  0:03 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Chris Li, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> We prefer a free cluster over a nonfull cluster whenever a CPU local
> cluster is drained, to respect the SSD discard behavior [1]. That is
> not the best practice for non-discarding devices, and it causes a
> higher fragmentation rate.
>
> So for a non-discarding device, prefer nonfull over free clusters. This
> reduces the fragmentation issue by a lot.
>
> Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
>
> Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071
>
> Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
>
> Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0
>
> Performance is basically unchanged, and the large allocation failure rate
> is lower. Enabling all mTHP sizes showed a more significant result:
>
> Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
>
> 128kB swap failure rate:
> Before: swpout:449548 swpout_fallback:55894
> After:  swpout:497519 swpout_fallback:3204
>
> 256kB swap failure rate:
> Before: swpout:63938  swpout_fallback:2154
> After:  swpout:65698  swpout_fallback:324
>
> 512kB swap failure rate:
> Before: swpout:11971  swpout_fallback:2218
> After:  swpout:14606  swpout_fallback:4
>
> 2M swap failure rate:
> Before: swpout:12     swpout_fallback:1578
> After:  swpout:1253   swpout_fallback:15
>
> The success rate of large allocations is much higher.
>
> Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
> Signed-off-by: Kairui Song <kasong@tencent.com>

Nice! I agree with Chris' analysis too. It's less of a problem for
vswap (because there's no physical/SSD implication over there), but
this patch makes sense in the context of the swapfile allocator.

FWIW:
Reviewed-by: Nhat Pham <nphamcs@gmail.com>

> ---
>  mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
>  1 file changed, 28 insertions(+), 10 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 5fdb3cb2b8b7..4a0cf4fb348d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>         }
>
>  new_cluster:
> -       ci = isolate_lock_cluster(si, &si->free_clusters);
> -       if (ci) {
> -               found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> -                                               order, usage);
> -               if (found)
> -                       goto done;
> +       /*
> +        * If the device needs discard, prefer a new cluster over nonfull
> +        * to spread out the writes.
> +        */
> +       if (si->flags & SWP_PAGE_DISCARD) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
>         }
>
> -       /* Try reclaim from full clusters if free clusters list is drained */
> -       if (vm_swap_full())
> -               swap_reclaim_full_clusters(si, false);
> -
>         if (order < PMD_ORDER) {
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> @@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                         if (found)
>                                 goto done;
>                 }
> +       }
>
> +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> +               ci = isolate_lock_cluster(si, &si->free_clusters);
> +               if (ci) {
> +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> +                                                       order, usage);
> +                       if (found)
> +                               goto done;
> +               }
> +       }

Seems like this pattern is repeated in a couple of places -
isolate_lock_cluster from one of the lists, and if successful, then
try to allocate (alloc_swap_scan_cluster) from it.

Might be refactorable in a future clean-up patch, perhaps along the
lines of the sketch below.
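
A possible shape for such a helper, as a rough sketch (hypothetical
and untested; the name is made up, and the signature just mirrors the
calls visible in the diff above):

	static unsigned long alloc_swap_from_list(struct swap_info_struct *si,
						  struct list_head *list,
						  int order, unsigned char usage)
	{
		struct swap_cluster_info *ci;

		/* Isolate a cluster from the given list, if any, and scan it. */
		ci = isolate_lock_cluster(si, list);
		if (!ci)
			return 0;

		return alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
					       order, usage);
	}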


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
  2025-08-06  0:03   ` Nhat Pham
@ 2025-08-06  0:30     ` Chris Li
  2025-08-06  3:38     ` Kairui Song
  1 sibling, 0 replies; 10+ messages in thread
From: Chris Li @ 2025-08-06  0:30 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kairui Song, linux-mm, Andrew Morton, Kemeng Shi, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

On Tue, Aug 5, 2025 at 5:03 PM Nhat Pham <nphamcs@gmail.com> wrote:
>

> >
> > +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> > +               ci = isolate_lock_cluster(si, &si->free_clusters);
> > +               if (ci) {
> > +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +                                                       order, usage);
> > +                       if (found)
> > +                               goto done;
> > +               }
> > +       }
>
> Seems like this pattern is repeated a couple of places -
> isolate_lock_cluster from one of the lists, and if successful, then
> try to allocate (alloc_swap_scan_cluster) from it.
>
> Might be refactorable in a future clean up patch.
>
Yes, agree. I noticed that as well. Incidentally, I was writing an RFC
patch to clean it up when I saw your email coming in. Another reason
to clean it up is that isolate_lock_cluster() must be paired with
relocate_cluster(); otherwise we have a dangling cluster that is not
on any list. We'd better pair the isolate() and relocate() in the same
function for better visibility.

Chris


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] mm, swap: don't scan every fragment cluster
  2025-08-05 23:30   ` Chris Li
@ 2025-08-06  3:02     ` Kairui Song
  0 siblings, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-08-06  3:02 UTC (permalink / raw)
  To: Chris Li
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

Chris Li <chrisl@kernel.org> 于 2025年8月6日周三 07:30写道:
>
> Looks good to me with minor nit picks on commit messages and comments.
>
> Let me know if you will refresh a version or not.

I'll send a V2 to improve the series. I think no code change is
needed; the changelog can be improved.

> Nit: I suggest the patch title use positive terms, something along the lines:
> "Only scan one cluster in fragment list"
> "Don't scan" seems to describe what the patch does not do rather than
> what the patch does.

Good idea.

>
> On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Fragment clusters were mostly failing high order allocations already.
> > The reason we scan them now is that a swap slot may get freed without
> > releasing the swap cache, so a swap map entry will end up in HAS_CACHE-only
> > status, and the cluster won't be moved back to the non-full or free
> > cluster list.
> >
> > Usually this only happens for !SWP_SYNCHRONOUS_IO devices when the swap
>
> Nit: Please clarify what "this" here means. I assume scanning fragment lists.
> From the context it can almost mean "map entry will end up in HAS_CACHE".

Yes.

>
>
> > device usage is low (!vm_swap_full()) since swap will try to lazy free
> > the swap cache.
> >
> > It's unlikely to cause any real issue. Fragmentation is only an issue
> > when the device is getting full, and by that time, swap will already
> > be releasing the swap cache aggressively. And swap cache reclaim happens
> > when the allocator scans a cluster too. Scanning one fragment cluster
> > should be good enough to reclaim these pinned slots.
> >
> > Besides, only high order allocations require iterating over a
> > cluster list; order 0 allocations will succeed on the first attempt.
> > And high order allocation failure isn't a serious problem.
> >
> > So iterating over the fragment clusters brings little benefit, and it
> > slows down mTHP allocation by a lot when the fragment cluster list is
> > long. It's better to drop this fragment cluster iteration design. Only
> > scanning one fragment cluster is good enough in case any cluster is
> > stuck in the fragment list; this ensures order 0 allocation never
> > fails, and large allocations still have an acceptable success rate.
> >
> > Test on a 48c96t system, building the Linux kernel using 10G ZRAM,
> > make -j48, defconfig with a 768M cgroup memory limit, on top of tmpfs,
> > 4K folios only:
> >
> > Before: sys time: 4407.28s
> > After:  sys time: 4425.22s
> >
> > Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
> >
> > Before: sys time: 10230.22s  64kB/swpout: 1793044  64kB/swpout_fallback: 17653
> > After:  sys time: 5527.90s   64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> >
> > Change to 8G ZRAM:
> >
> > Before: sys time: 21929.17s  64kB/swpout: 1634681  64kB/swpout_fallback: 173056
> > After:  sys time: 6121.01s   64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> >
> > Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 7368.41s  64kB/swpout:1787599  swpout_fallback: 0
> > After:  sys time: 7338.27s  64kB/swpout:1783106  swpout_fallback: 0
> >
> > Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 28139.60s 64kB/swpout:1645421  swpout_fallback: 148408
> > After:  sys time: 8941.90s  64kB/swpout:1592973  swpout_fallback: 265010
> >
> > The performance is a lot better, and the large order allocation failure
> > rate is only very slightly higher or unchanged.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  include/linux/swap.h |  1 -
> >  mm/swapfile.c        | 30 ++++++++----------------------
> >  2 files changed, 8 insertions(+), 23 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index 2fe6ed2cc3fd..a060d102e0d1 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -310,7 +310,6 @@ struct swap_info_struct {
> >                                         /* list of cluster that contains at least one free slot */
> >         struct list_head frag_clusters[SWAP_NR_ORDERS];
> >                                         /* list of cluster that are fragmented or contented */
> > -       atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
>
> Nit: please have some comment in the commit log that why remove the
> frag_cluster_nr counter.
> I feel this change can be split out from the main change of this
> patch. The main performance improvement is from only scanning one
> fragment cluster rather than the full list right? Delete the counter
> helps, but in a much smaller number.

Right, I can split this into two patches. Removing the counter has
basically no measurable performance effect; it's just no longer used
after this change.

>
> Chris
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] mm, swap: prefer nonfull over free clusters
  2025-08-06  0:03   ` Nhat Pham
  2025-08-06  0:30     ` Chris Li
@ 2025-08-06  3:38     ` Kairui Song
  1 sibling, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-08-06  3:38 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Chris Li, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

On Wed, Aug 6, 2025 at 8:06 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Aug 4, 2025 at 10:24 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > We prefer a free cluster over a nonfull cluster whenever a CPU local
> > cluster is drained, to respect the SSD discard behavior [1]. That is
> > not the best practice for non-discarding devices, and it causes a
> > higher fragmentation rate.
> >
> > So for a non-discarding device, prefer nonfull over free clusters. This
> > reduces the fragmentation issue by a lot.
> >
> > Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:
> >
> > Before: sys time: 6121.0s  64kB/swpout: 1638155  64kB/swpout_fallback: 189562
> > After:  sys time: 6145.3s  64kB/swpout: 1761110  64kB/swpout_fallback: 66071
> >
> > Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:
> >
> > Before: sys time 5527.9s  64kB/swpout: 1789358  64kB/swpout_fallback: 17813
> > After:  sys time 5538.3s  64kB/swpout: 1813133  64kB/swpout_fallback: 0
> >
> > Performance is basically unchanged, and the large allocation failure rate
> > is lower. Enabling all mTHP sizes showed a more significant result:
> >
> > Using the same test setup with 10G ZRAM and enabling all mTHP sizes:
> >
> > 128kB swap failure rate:
> > Before: swpout:449548 swpout_fallback:55894
> > After:  swpout:497519 swpout_fallback:3204
> >
> > 256kB swap failure rate:
> > Before: swpout:63938  swpout_fallback:2154
> > After:  swpout:65698  swpout_fallback:324
> >
> > 512kB swap failure rate:
> > Before: swpout:11971  swpout_fallback:2218
> > After:  swpout:14606  swpout_fallback:4
> >
> > 2M swap failure rate:
> > Before: swpout:12     swpout_fallback:1578
> > After:  swpout:1253   swpout_fallback:15
> >
> > The success rate of large allocations is much higher.
> >
> > Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Nice! I agree with Chris' analysis too. It's less of a problem for
> vswap (because there's no physical/SSD implication over there), but
> this patch makes sense in the context of swapfile allocator.
>
> FWIW:
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Thanks!

>
> > ---
> >  mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
> >  1 file changed, 28 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 5fdb3cb2b8b7..4a0cf4fb348d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >         }
> >
> >  new_cluster:
> > -       ci = isolate_lock_cluster(si, &si->free_clusters);
> > -       if (ci) {
> > -               found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > -                                               order, usage);
> > -               if (found)
> > -                       goto done;
> > +       /*
> > +        * If the device needs discard, prefer a new cluster over nonfull
> > +        * to spread out the writes.
> > +        */
> > +       if (si->flags & SWP_PAGE_DISCARD) {
> > +               ci = isolate_lock_cluster(si, &si->free_clusters);
> > +               if (ci) {
> > +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +                                                       order, usage);
> > +                       if (found)
> > +                               goto done;
> > +               }
> >         }
> >
> > -       /* Try reclaim from full clusters if free clusters list is drained */
> > -       if (vm_swap_full())
> > -               swap_reclaim_full_clusters(si, false);
> > -
> >         if (order < PMD_ORDER) {
> >                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > @@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                         if (found)
> >                                 goto done;
> >                 }
> > +       }
> >
> > +       if (!(si->flags & SWP_PAGE_DISCARD)) {
> > +               ci = isolate_lock_cluster(si, &si->free_clusters);
> > +               if (ci) {
> > +                       found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> > +                                                       order, usage);
> > +                       if (found)
> > +                               goto done;
> > +               }
> > +       }
>
> Seems like this pattern is repeated a couple of places -
> isolate_lock_cluster from one of the lists, and if successful, then
> try to allocate (alloc_swap_scan_cluster) from it.

Indeed, I've been thinking about it, but there are some other issues
that need to be cleaned up before this one.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-08-06  3:39 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-04 17:24 [PATCH 0/2] mm, swap: improve cluster scan strategy Kairui Song
2025-08-04 17:24 ` [PATCH 1/2] mm, swap: don't scan every fragment cluster Kairui Song
2025-08-05 23:30   ` Chris Li
2025-08-06  3:02     ` Kairui Song
2025-08-04 17:24 ` [PATCH 2/2] mm, swap: prefer nonfull over free clusters Kairui Song
2025-08-05 23:35   ` Chris Li
2025-08-06  0:03   ` Nhat Pham
2025-08-06  0:30     ` Chris Li
2025-08-06  3:38     ` Kairui Song
2025-08-05 23:26 ` [PATCH 0/2] mm, swap: improve cluster scan strategy Chris Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).