* [PATCH v2 0/3] mm, swap: improve cluster scan strategy
@ 2025-08-06 16:17 Kairui Song
  2025-08-06 16:17 ` [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list Kairui Song
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Kairui Song @ 2025-08-06 16:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

This series improves large allocation performance and reduces the
failure rate. After thorough testing, some aspects of the cluster
allocator's design were found to be improvable.

The allocator spent too much effort scanning the fragment list, which
is not helpful in most setups but causes serious contention on the
list lock (si->lock). Besides, when searching for a new cluster, the
allocator prefers free clusters over nonfull ones for historical
reasons, which causes fragmentation issues.

So make the allocator scan only one fragment cluster for high order
allocations, and prefer nonfull clusters over free ones. This both
improves performance and reduces fragmentation.
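
For a non-discarding device, the resulting per-order cluster scan order
is roughly the following (a simplified summary of the patches below,
not the exact code path):

  1. nonfull_clusters[order] - scan until a slot is found or the list
     is drained (order < PMD_ORDER only)
  2. free_clusters           - take one free cluster
  3. full clusters           - reclaim, only if vm_swap_full()
  4. frag_clusters[order]    - scan a single cluster instead of
     iterating the whole list (order < PMD_ORDER only)

Devices with SWP_PAGE_DISCARD set still try free_clusters first to
spread out writes.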

For example, build kernel test with make -j96 and 10G ZRAM with 64kB
mTHP enabled shows better performance and a lower failure rate:

Before: sys time: 11609.69s  64kB/swpout: 1787051  64kB/swpout_fallback: 20917
After:  sys time: 5587.53s   64kB/swpout: 1811598  64kB/swpout_fallback: 0

System time is cut in half, and the failure rate drops to zero. Larger
allocations in a hybrid workload also showed a major improvement:

512kB swap failure rate:
Before: swpout:11663  swpout_fallback:1767
After:  swpout:14480  swpout_fallback:6

2M swap failure rate:
Before: swpout:24     swpout_fallback:1442
After:  swpout:1329   swpout_fallback:7

Kairui Song (3):
  mm, swap: only scan one cluster in fragment list
  mm, swap: remove fragment clusters counter
  mm, swap: prefer nonfull over free clusters

 include/linux/swap.h |  1 -
 mm/swapfile.c        | 68 +++++++++++++++++++++++---------------------
 2 files changed, 36 insertions(+), 33 deletions(-)

---

V1: https://lore.kernel.org/linux-mm/20250804172439.2331-1-ryncsn@gmail.com/
Changelog:
- Split into 3 patches, no code change [ Chris Li ]
- Rebase and rerun the test to see if removing the fragment cluster counter
  helps to improve the performance; as expected, the gain is marginal.
- Collect Ack/Reviewed-by tags [ Nhat Pham, Chris Li ]

-- 
2.50.1



* [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list
  2025-08-06 16:17 [PATCH v2 0/3] mm, swap: improve cluster scan strategy Kairui Song
@ 2025-08-06 16:17 ` Kairui Song
  2025-08-07  5:32   ` Chris Li
  2025-08-06 16:17 ` [PATCH v2 2/3] mm, swap: remove fragment clusters counter Kairui Song
  2025-08-06 16:17 ` [PATCH v2 3/3] mm, swap: prefer nonfull over free clusters Kairui Song
  2 siblings, 1 reply; 9+ messages in thread
From: Kairui Song @ 2025-08-06 16:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

Fragment clusters were mostly failing high order allocations already.
The reason we scan through the whole list now is that a swap slot may
get freed without releasing the swap cache, so a swap map entry will
end up in HAS_CACHE-only status, and the cluster won't be moved back to
the non-full or free cluster list. This may cause a higher allocation
failure rate.

Usually only !SWP_SYNCHRONOUS_IO devices have a large number of slots
stuck in HAS_CACHE-only status, because when a !SWP_SYNCHRONOUS_IO
device's usage is low (!vm_swap_full()), swap will try to lazily free
the swap cache.

But scanning out the whole fragment list is a bit overkill.
Fragmentation is only an issue for the allocator when the device is
getting full, and by that time, swap will already be releasing the
swap cache aggressively. Only scan one fragment cluster at a time is
good enough to reclaim already pinned slots, and move the cluster back
to nonfull.

Besides, only high order allocations require iterating over the list;
an order 0 allocation will succeed on the first attempt. And a high
order allocation failure isn't a serious problem.

So the benefit of iterating over the fragment clusters is trivial, but
it will slow down large allocations by a lot when the fragment cluster
list is long. It's better to drop this fragment cluster iteration
design.

Test on a 48c96t system: build the Linux kernel using 10G ZRAM, make
-j48, defconfig with a 768M cgroup memory limit, on top of tmpfs, 4K
folios only:

Before: sys time: 4432.56s
After:  sys time: 4430.18s

Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:

Before: sys time: 11609.69s  64kB/swpout: 1787051  64kB/swpout_fallback: 20917
After:  sys time: 5572.85s   64kB/swpout: 1797612  64kB/swpout_fallback: 19254

Change to 8G ZRAM:

Before: sys time: 21524.35s  64kB/swpout: 1687142  64kB/swpout_fallback: 128496
After:  sys time: 6278.45s   64kB/swpout: 1679127  64kB/swpout_fallback: 130942

Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 7393.50s  64kB/swpout:1788246  swpout_fallback: 0
After:  sys time: 7399.88s  64kB/swpout:1784257  swpout_fallback: 0

Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:

Before: sys time: 26292.26s 64kB/swpout:1645236  swpout_fallback: 138945
After:  sys time: 9463.16s  64kB/swpout:1581376  swpout_fallback: 259979

The performance is a lot better for large folios, and the large order
allocation failure rate is only very slightly higher or unchanged, even
for !SWP_SYNCHRONOUS_IO devices under high pressure.

Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/swapfile.c | 23 ++++++++---------------
 1 file changed, 8 insertions(+), 15 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index b4f3cc712580..1f1110e37f68 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -926,32 +926,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		swap_reclaim_full_clusters(si, false);
 
 	if (order < PMD_ORDER) {
-		unsigned int frags = 0, frags_existing;
-
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			/* Clusters failed to allocate are moved to frag_clusters */
-			frags++;
 		}
 
-		frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
-		while (frags < frags_existing &&
-		       (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
-			atomic_long_dec(&si->frag_cluster_nr[order]);
-			/*
-			 * Rotate the frag list to iterate, they were all
-			 * failing high order allocation or moved here due to
-			 * per-CPU usage, but they could contain newly released
-			 * reclaimable (eg. lazy-freed swap cache) slots.
-			 */
+		/*
+		 * Scan only one fragment cluster is good enough. Order 0
+		 * allocation will surely success, and large allocation
+		 * failure is not critical. Scanning one cluster still
+		 * keeps the list rotated and reclaimed (for HAS_CACHE).
+		 */
+		ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
+		if (ci) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							order, usage);
 			if (found)
 				goto done;
-			frags++;
 		}
 	}
 
-- 
2.50.1



* [PATCH v2 2/3] mm, swap: remove fragment clusters counter
  2025-08-06 16:17 [PATCH v2 0/3] mm, swap: improve cluster scan strategy Kairui Song
  2025-08-06 16:17 ` [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list Kairui Song
@ 2025-08-06 16:17 ` Kairui Song
  2025-08-06 16:40   ` Nhat Pham
                     ` (2 more replies)
  2025-08-06 16:17 ` [PATCH v2 3/3] mm, swap: prefer nonfull over free clusters Kairui Song
  2 siblings, 3 replies; 9+ messages in thread
From: Kairui Song @ 2025-08-06 16:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

It was used for calculating the iteration number when the swap allocator
wants to scan the whole fragment list. Now the allocator only scans one
fragment cluster at a time, so no one uses this counter anymore.

Remove it as a cleanup; the performance change is marginal:

Build linux kernel using 10G ZRAM, make -j96, defconfig with 2G cgroup
memory limit, on top of tmpfs, 64kB mTHP enabled:

Before:  sys time: 6278.45s
After:   sys time: 6176.34s

Change to 8G ZRAM:

Before:  sys time: 5572.85s
After:   sys time: 5531.49s

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 include/linux/swap.h | 1 -
 mm/swapfile.c        | 7 -------
 2 files changed, 8 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..a060d102e0d1 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -310,7 +310,6 @@ struct swap_info_struct {
 					/* list of cluster that contains at least one free slot */
 	struct list_head frag_clusters[SWAP_NR_ORDERS];
 					/* list of cluster that are fragmented or contented */
-	atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
 	unsigned int pages;		/* total of usable pages of swap */
 	atomic_long_t inuse_pages;	/* number of those currently in use */
 	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 1f1110e37f68..5fdb3cb2b8b7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
 	else
 		list_move_tail(&ci->list, list);
 	spin_unlock(&si->lock);
-
-	if (ci->flags == CLUSTER_FLAG_FRAG)
-		atomic_long_dec(&si->frag_cluster_nr[ci->order]);
-	else if (new_flags == CLUSTER_FLAG_FRAG)
-		atomic_long_inc(&si->frag_cluster_nr[ci->order]);
 	ci->flags = new_flags;
 }
 
@@ -965,7 +960,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 		 * allocation, but reclaim may drop si->lock and race with another user.
 		 */
 		while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
-			atomic_long_dec(&si->frag_cluster_nr[o]);
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
 							0, usage);
 			if (found)
@@ -3217,7 +3211,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
 	for (i = 0; i < SWAP_NR_ORDERS; i++) {
 		INIT_LIST_HEAD(&si->nonfull_clusters[i]);
 		INIT_LIST_HEAD(&si->frag_clusters[i]);
-		atomic_long_set(&si->frag_cluster_nr[i], 0);
 	}
 
 	/*
-- 
2.50.1



* [PATCH v2 3/3] mm, swap: prefer nonfull over free clusters
  2025-08-06 16:17 [PATCH v2 0/3] mm, swap: improve cluster scan strategy Kairui Song
  2025-08-06 16:17 ` [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list Kairui Song
  2025-08-06 16:17 ` [PATCH v2 2/3] mm, swap: remove fragment clusters counter Kairui Song
@ 2025-08-06 16:17 ` Kairui Song
  2 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2025-08-06 16:17 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Chris Li, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel, Kairui Song

From: Kairui Song <kasong@tencent.com>

We prefer a free cluster over a nonfull cluster whenever a CPU local
cluster is drained, to respect the SSD discard behavior [1]. That is
not the best practice for non-discarding devices, and it causes a
higher fragmentation rate.

So for a non-discarding device, prefer nonfull over free clusters. This
reduces fragmentation by a lot.

Testing with make -j96, defconfig, using 64k mTHP, 8G ZRAM:

Before: sys time: 6176.34s  64kB/swpout: 1659757  64kB/swpout_fallback: 139503
After:  sys time: 6194.11s  64kB/swpout: 1689470  64kB/swpout_fallback: 56147

Testing with make -j96, defconfig, using 64k mTHP, 10G ZRAM:

Before: sys time: 5531.49s  64kB/swpout: 1791142  64kB/swpout_fallback: 17676
After:  sys time: 5587.53s  64kB/swpout: 1811598  64kB/swpout_fallback: 0

Performance is basically unchanged, and the large allocation failure rate
is lower. Enabling all mTHP sizes shows an even more significant
improvement.

Using the same test setup with 10G ZRAM and enabling all mTHP sizes:

128kB swap failure rate:
Before: swpout:451599 swpout_fallback:54525
After:  swpout:502710 swpout_fallback:870

256kB swap failure rate:
Before: swpout:63652  swpout_fallback:2708
After:  swpout:65913  swpout_fallback:20

512kB swap failure rate:
Before: swpout:11663  swpout_fallback:1767
After:  swpout:14480  swpout_fallback:6

2M swap failure rate:
Before: swpout:24     swpout_fallback:1442
After:  swpout:1329   swpout_fallback:7

The success rate of large allocations is much higher.

Link: https://lore.kernel.org/linux-mm/87v8242vng.fsf@yhuang6-desk2.ccr.corp.intel.com/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
 mm/swapfile.c | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5fdb3cb2b8b7..4a0cf4fb348d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -908,18 +908,20 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 	}
 
 new_cluster:
-	ci = isolate_lock_cluster(si, &si->free_clusters);
-	if (ci) {
-		found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
-						order, usage);
-		if (found)
-			goto done;
+	/*
+	 * If the device need discard, prefer new cluster over nonfull
+	 * to spread out the writes.
+	 */
+	if (si->flags & SWP_PAGE_DISCARD) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
 	}
 
-	/* Try reclaim from full clusters if free clusters list is drained */
-	if (vm_swap_full())
-		swap_reclaim_full_clusters(si, false);
-
 	if (order < PMD_ORDER) {
 		while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
 			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
@@ -927,7 +929,23 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
 			if (found)
 				goto done;
 		}
+	}
 
+	if (!(si->flags & SWP_PAGE_DISCARD)) {
+		ci = isolate_lock_cluster(si, &si->free_clusters);
+		if (ci) {
+			found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
+							order, usage);
+			if (found)
+				goto done;
+		}
+	}
+
+	/* Try reclaim full clusters if free and nonfull lists are drained */
+	if (vm_swap_full())
+		swap_reclaim_full_clusters(si, false);
+
+	if (order < PMD_ORDER) {
 		/*
 		 * Scan only one fragment cluster is good enough. Order 0
 		 * allocation will surely success, and large allocation
-- 
2.50.1



* Re: [PATCH v2 2/3] mm, swap: remove fragment clusters counter
  2025-08-06 16:17 ` [PATCH v2 2/3] mm, swap: remove fragment clusters counter Kairui Song
@ 2025-08-06 16:40   ` Nhat Pham
  2025-08-07  5:34   ` Chris Li
  2025-09-02 12:17   ` Chris Li
  2 siblings, 0 replies; 9+ messages in thread
From: Nhat Pham @ 2025-08-06 16:40 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Chris Li, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

On Wed, Aug 6, 2025 at 9:18 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> It was used for calculating the iteration number when the swap allocator
> wants to scan the whole fragment list. Now the allocator only scans one
> fragment cluster at a time, so no one uses this counter anymore.
>
> Remove it as a cleanup; the performance change is marginal:
>
> Build linux kernel using 10G ZRAM, make -j96, defconfig with 2G cgroup
> memory limit, on top of tmpfs, 64kB mTHP enabled:
>
> Before:  sys time: 6278.45s
> After:   sys time: 6176.34s
>
> Change to 8G ZRAM:
>
> Before:  sys time: 5572.85s
> After:   sys time: 5531.49s
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

Reviewed-by: Nhat Pham <nphamcs@gmail.com>


* Re: [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list
  2025-08-06 16:17 ` [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list Kairui Song
@ 2025-08-07  5:32   ` Chris Li
  2025-08-07 18:26     ` Kairui Song
  0 siblings, 1 reply; 9+ messages in thread
From: Chris Li @ 2025-08-07  5:32 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Wed, Aug 6, 2025 at 9:18 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Fragment clusters were mostly failing high order allocation already.
> The reason we scan it through now is that a swap slot may get freed
> without releasing the swap cache, so a swap map entry will end up in
> HAS_CACHE only status, and the cluster won't be moved back to non-full
> or free cluster list. This may cause a higher allocation failure rate.
>
> Usually only !SWP_SYNCHRONOUS_IO devices may have a large number of
> slots stuck in HAS_CACHE only status. Because when a !SWP_SYNCHRONOUS_IO
> device's usage is low (!vm_swap_full()), it will try to lazy free
> the swap cache.
>
> But this fragment list scan out is a bit overkill. Fragmentation
> is only an issue for the allocator when the device is getting full,
> and by that time, swap will be releasing the swap cache aggressively
> already. Only scan one fragment cluster at a time is good enough to

Only *scanning* one fragment cluster...

> reclaim already pinned slots, and move the cluster back to nonfull.
>
> And besides, only high order allocation requires iterating over the
> list, order 0 allocation will succeed on the first attempt. And
> high order allocation failure isn't a serious problem.
>
> So the iteration of fragment clusters is trivial, but it will slow down
> large allocation by a lot when the fragment cluster list is long.
> So it's better to drop this fragment cluster iteration design.

One side note is that we make a trade-off here. We trade a lower
success rate for >4K swap entry allocations on the fragment list for an
overall faster swap entry allocation time, because we stop searching
the fragment list early.

I notice this patch will suffer from fragment list trap behavior.
Clusters go from free -> nonfull -> fragment -> free again (ignoring
the full list for now). Only when a cluster is completely free again is
it moved back to the free list. Otherwise, given a random swap entry
access pattern and the long life cycle of some swap entries, free
clusters are very hard to come by. Once a cluster is on the fragment
list, it is not easy to move it back to a nonfull list; clusters will
eventually gravitate towards the fragment list and get trapped there.
This kind of problem is not easy to expose with the kernel compile
workload, which is a batch job in nature with very few long running
processes. If most of the clusters in the swapfile end up on the
fragment list, this will cause us to give up too early and force the
more expensive swap cache reclaim path more often.

To counter that fragment list trap effect, one idea is to recognize
that not all clusters in the fragment list are equal. We could split
the fragment list into a few buckets by how empty each cluster is,
e.g. >50% vs <50% free. I expect a <50% free cluster to have a very low
success rate for order >0 allocations. Given an order "o", we can write
a formula P(o, count) for the success rate, assuming the used slots are
evenly and randomly distributed with count slots used. A >50% free
cluster will likely have a much higher success rate. We should set a
different search termination threshold for each bucket class. That way
we give a cluster a chance to move up or down between bucket classes.
We should try the high-free bucket before the low-free bucket.

That can combat the fragment list trap behavior.
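
As a rough illustration (my own back-of-the-envelope estimate, not
something from the patch), assume a 512-slot cluster and that the used
slots are independently and uniformly distributed. An aligned 2^o block
is then entirely free with probability (1 - count/512)^(2^o), and
treating the 512/2^o aligned blocks as independent gives roughly:

  P(o, count) ~= 1 - (1 - (1 - count/512)^(2^o))^(512 / 2^o)

A tiny userspace program to evaluate it:

#include <math.h>
#include <stdio.h>

/*
 * Estimated chance that a cluster with `count` of `nslots` slots used
 * still contains at least one fully free, aligned 2^order block,
 * treating the aligned blocks as independent (an approximation).
 */
static double p_success(int order, int count, int nslots)
{
	double p_free = 1.0 - (double)count / nslots;	/* one slot is free */
	double p_block = pow(p_free, 1 << order);	/* whole block is free */
	int nblocks = nslots >> order;

	return 1.0 - pow(1.0 - p_block, nblocks);
}

int main(void)
{
	/* 512 slots per cluster, order 4 == 64kB folios */
	printf("50%% used: %.3f%%\n", 100 * p_success(4, 256, 512));
	printf("25%% used: %.1f%%\n", 100 * p_success(4, 128, 512));
	return 0;
}

With those assumptions a 50% used cluster has only about a 0.05% chance
of serving an order 4 allocation, while a 25% used one is at roughly
28%, which is why bucketing by how free the cluster is should help.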

BTW, we can add some simple bucket statistics to see the distribution
of fragmented clusters. The bucket class threshold could be adjusted
dynamically based on the overall fullness of the swapfile.

Chris

>
> Test on a 48c96t system, build linux kernel using 10G ZRAM, make -j48,
> defconfig with 768M cgroup memory limit, on top of tmpfs, 4K folio
> only:
>
> Before: sys time: 4432.56s
> After:  sys time: 4430.18s
>
> Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
>
> Before: sys time: 11609.69s  64kB/swpout: 1787051  64kB/swpout_fallback: 20917
> After:  sys time: 5572.85s   64kB/swpout: 1797612  64kB/swpout_fallback: 19254
>
> Change to 8G ZRAM:
>
> Before: sys time: 21524.35s  64kB/swpout: 1687142  64kB/swpout_fallback: 128496
> After:  sys time: 6278.45s   64kB/swpout: 1679127  64kB/swpout_fallback: 130942
>
> Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 7393.50s  64kB/swpout:1788246  swpout_fallback: 0
> After:  sys time: 7399.88s  64kB/swpout:1784257  swpout_fallback: 0
>
> Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
>
> Before: sys time: 26292.26s 64kB/swpout:1645236  swpout_fallback: 138945
> After:  sys time: 9463.16s  64kB/swpout:1581376  swpout_fallback: 259979
>
> The performance is a lot better for large folios, and the large order
> allocation failure rate is only very slightly higher or unchanged even
> for !SWP_SYNCHRONOUS_IO devices high pressure.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> Acked-by: Nhat Pham <nphamcs@gmail.com>
> ---
>  mm/swapfile.c | 23 ++++++++---------------
>  1 file changed, 8 insertions(+), 15 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index b4f3cc712580..1f1110e37f68 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -926,32 +926,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                 swap_reclaim_full_clusters(si, false);
>
>         if (order < PMD_ORDER) {
> -               unsigned int frags = 0, frags_existing;
> -
>                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
>                                                         order, usage);
>                         if (found)
>                                 goto done;
> -                       /* Clusters failed to allocate are moved to frag_clusters */
> -                       frags++;
>                 }
>
> -               frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> -               while (frags < frags_existing &&
> -                      (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
> -                       atomic_long_dec(&si->frag_cluster_nr[order]);
> -                       /*
> -                        * Rotate the frag list to iterate, they were all
> -                        * failing high order allocation or moved here due to
> -                        * per-CPU usage, but they could contain newly released
> -                        * reclaimable (eg. lazy-freed swap cache) slots.
> -                        */
> +               /*
> +                * Scan only one fragment cluster is good enough. Order 0
> +                * allocation will surely success, and large allocation
> +                * failure is not critical. Scanning one cluster still
> +                * keeps the list rotated and reclaimed (for HAS_CACHE).
> +                */
> +               ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
> +               if (ci) {
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
>                                                         order, usage);
>                         if (found)
>                                 goto done;
> -                       frags++;
>                 }
>         }
>
> --
> 2.50.1
>
>


* Re: [PATCH v2 2/3] mm, swap: remove fragment clusters counter
  2025-08-06 16:17 ` [PATCH v2 2/3] mm, swap: remove fragment clusters counter Kairui Song
  2025-08-06 16:40   ` Nhat Pham
@ 2025-08-07  5:34   ` Chris Li
  2025-09-02 12:17   ` Chris Li
  2 siblings, 0 replies; 9+ messages in thread
From: Chris Li @ 2025-08-07  5:34 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Wed, Aug 6, 2025 at 9:18 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> It was used for calculating the iteration number when the swap allocator
> wants to scan the whole fragment list. Now the allocator only scans one
> fragment cluster at a time, so no one uses this counter anymore.
>
> Remove it as a cleanup; the performance change is marginal:
>
> Build linux kernel using 10G ZRAM, make -j96, defconfig with 2G cgroup
> memory limit, on top of tmpfs, 64kB mTHP enabled:
>
> Before:  sys time: 6278.45s
> After:   sys time: 6176.34s
>
> Change to 8G ZRAM:
>
> Before:  sys time: 5572.85s
> After:   sys time: 5531.49s
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h | 1 -
>  mm/swapfile.c        | 7 -------
>  2 files changed, 8 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2fe6ed2cc3fd..a060d102e0d1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -310,7 +310,6 @@ struct swap_info_struct {
>                                         /* list of cluster that contains at least one free slot */
>         struct list_head frag_clusters[SWAP_NR_ORDERS];
>                                         /* list of cluster that are fragmented or contented */
> -       atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
>         unsigned int pages;             /* total of usable pages of swap */
>         atomic_long_t inuse_pages;      /* number of those currently in use */
>         struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1f1110e37f68..5fdb3cb2b8b7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
>         else
>                 list_move_tail(&ci->list, list);
>         spin_unlock(&si->lock);
> -
> -       if (ci->flags == CLUSTER_FLAG_FRAG)
> -               atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> -       else if (new_flags == CLUSTER_FLAG_FRAG)
> -               atomic_long_inc(&si->frag_cluster_nr[ci->order]);
>         ci->flags = new_flags;
>  }
>
> @@ -965,7 +960,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                  * allocation, but reclaim may drop si->lock and race with another user.
>                  */
>                 while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
> -                       atomic_long_dec(&si->frag_cluster_nr[o]);
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
>                                                         0, usage);
>                         if (found)
> @@ -3217,7 +3211,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         for (i = 0; i < SWAP_NR_ORDERS; i++) {
>                 INIT_LIST_HEAD(&si->nonfull_clusters[i]);
>                 INIT_LIST_HEAD(&si->frag_clusters[i]);
> -               atomic_long_set(&si->frag_cluster_nr[i], 0);
>         }
>
>         /*
> --
> 2.50.1
>
>


* Re: [PATCH v2 1/3] mm, swap: only scan one cluster in fragment list
  2025-08-07  5:32   ` Chris Li
@ 2025-08-07 18:26     ` Kairui Song
  0 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2025-08-07 18:26 UTC (permalink / raw)
  To: Chris Li, Andrew Morton
  Cc: linux-mm, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Huang, Ying, linux-kernel

On Thu, Aug 7, 2025 at 1:32 PM Chris Li <chrisl@kernel.org> wrote:
>
> Acked-by: Chris Li <chrisl@kernel.org>
>
> Chris
>
> On Wed, Aug 6, 2025 at 9:18 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Fragment clusters were mostly failing high order allocation already.
> > The reason we scan it through now is that a swap slot may get freed
> > without releasing the swap cache, so a swap map entry will end up in
> > HAS_CACHE only status, and the cluster won't be moved back to non-full
> > or free cluster list. This may cause a higher allocation failure rate.
> >
> > Usually only !SWP_SYNCHRONOUS_IO devices may have a large number of
> > slots stuck in HAS_CACHE only status. Because when a !SWP_SYNCHRONOUS_IO
> > device's usage is low (!vm_swap_full()), it will try to lazy free
> > the swap cache.
> >
> > But this fragment list scan out is a bit overkill. Fragmentation
> > is only an issue for the allocator when the device is getting full,
> > and by that time, swap will be releasing the swap cache aggressively
> > already. Only scan one fragment cluster at a time is good enough to
>
> Only *scanning* one fragment cluster...

Thanks.

Hi Andrew, can you help update this word in the commit message?

>
> > reclaim already pinned slots, and move the cluster back to nonfull.
> >
> > And besides, only high order allocation requires iterating over the
> > list, order 0 allocation will succeed on the first attempt. And
> > high order allocation failure isn't a serious problem.
> >
> > So the iteration of fragment clusters is trivial, but it will slow down
> > large allocation by a lot when the fragment cluster list is long.
> > So it's better to drop this fragment cluster iteration design.
>
> One side note is that we make some trade off here. We trade lower
> success rates >4K swap entry allocation on fragment lists with overall
> faster swap entry time, because we stop searching the fragment list
> early.
>
> I notice this patch will suffer from fragment list trap behavior. The
> clusters go from free -> non full -> fragment -> free again (ignore
> the full list for now). Only when the cluster is completely free
> again, it will reset the cluster back to the free list. Otherwise
> given random swap entry access pattern, and long life cycle of some
> swap entry.  Free clusters are very hard to come by. Once it is in the
> fragment list, it is not easy to move back to a non full list. The
> cluster will eventually gravitate towards the fragment list and trap
> there. This kind of problem is not easy to expose by the kernel
> compile work load, which is a batch job in nature, with very few long
> running processes. If most of the clusters in the swapfile are in the
> fragment list. This will cause us to give up too early and force the
> more expensive swap cache reclaim path more often.
>
> To counter that fragmentation list trap effect,  one idea is that not
> all clusters in the fragment list are equal. If we make the fragment
> list into a few buckets by how empty it is. e.g. >50% vs <50%. I
> expect the <50% free cluster has a very low success rate for order >0
> allocation. Given an order "o", we can have a math formula P(o, count)
> of the success rate if slots are even randomly distributed with count
> slots used. The >50% free cluster will likely have a much higher
> success rate.  We should set a different search termination threshold
> for different bucket classes. That way we can give the cluster a
> chance to move up or down the bucket class. We should try the high
> free bucket before the low free bucket.
>
> That can combat the fragmentation list trap behavior.

That's a very good point!

I'm also thinking that after we remove HAS_CACHE, maybe we can improve
the lazy free policy or the scanning design, making use of the
better-defined swap allocation / freeing workflows.

Just a random idea for now; I'll keep your suggestion in mind.


> BTW, we can have some simple bucket statistics to see what is the
> distribution of fragmented clusters. The bucket class threshold can
> dynamically adjust using the overall fullness of the swapfile.
>
> Chris
>
> >
> > Test on a 48c96t system, build linux kernel using 10G ZRAM, make -j48,
> > defconfig with 768M cgroup memory limit, on top of tmpfs, 4K folio
> > only:
> >
> > Before: sys time: 4432.56s
> > After:  sys time: 4430.18s
> >
> > Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM:
> >
> > Before: sys time: 11609.69s  64kB/swpout: 1787051  64kB/swpout_fallback: 20917
> > After:  sys time: 5572.85s   64kB/swpout: 1797612  64kB/swpout_fallback: 19254
> >
> > Change to 8G ZRAM:
> >
> > Before: sys time: 21524.35s  64kB/swpout: 1687142  64kB/swpout_fallback: 128496
> > After:  sys time: 6278.45s   64kB/swpout: 1679127  64kB/swpout_fallback: 130942
> >
> > Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 7393.50s  64kB/swpout:1788246  swpout_fallback: 0
> > After:  sys time: 7399.88s  64kB/swpout:1784257  swpout_fallback: 0
> >
> > Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed:
> >
> > Before: sys time: 26292.26s 64kB/swpout:1645236  swpout_fallback: 138945
> > After:  sys time: 9463.16s  64kB/swpout:1581376  swpout_fallback: 259979
> >
> > The performance is a lot better for large folios, and the large order
> > allocation failure rate is only very slightly higher or unchanged even
> > for !SWP_SYNCHRONOUS_IO devices high pressure.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > Acked-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  mm/swapfile.c | 23 ++++++++---------------
> >  1 file changed, 8 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index b4f3cc712580..1f1110e37f68 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -926,32 +926,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> >                 swap_reclaim_full_clusters(si, false);
> >
> >         if (order < PMD_ORDER) {
> > -               unsigned int frags = 0, frags_existing;
> > -
> >                 while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> >                                                         order, usage);
> >                         if (found)
> >                                 goto done;
> > -                       /* Clusters failed to allocate are moved to frag_clusters */
> > -                       frags++;
> >                 }
> >
> > -               frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > -               while (frags < frags_existing &&
> > -                      (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) {
> > -                       atomic_long_dec(&si->frag_cluster_nr[order]);
> > -                       /*
> > -                        * Rotate the frag list to iterate, they were all
> > -                        * failing high order allocation or moved here due to
> > -                        * per-CPU usage, but they could contain newly released
> > -                        * reclaimable (eg. lazy-freed swap cache) slots.
> > -                        */
> > +               /*
> > +                * Scan only one fragment cluster is good enough. Order 0
> > +                * allocation will surely success, and large allocation
> > +                * failure is not critical. Scanning one cluster still
> > +                * keeps the list rotated and reclaimed (for HAS_CACHE).
> > +                */
> > +               ci = isolate_lock_cluster(si, &si->frag_clusters[order]);
> > +               if (ci) {
> >                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
> >                                                         order, usage);
> >                         if (found)
> >                                 goto done;
> > -                       frags++;
> >                 }
> >         }
> >
> > --
> > 2.50.1
> >
> >
>


* Re: [PATCH v2 2/3] mm, swap: remove fragment clusters counter
  2025-08-06 16:17 ` [PATCH v2 2/3] mm, swap: remove fragment clusters counter Kairui Song
  2025-08-06 16:40   ` Nhat Pham
  2025-08-07  5:34   ` Chris Li
@ 2025-09-02 12:17   ` Chris Li
  2 siblings, 0 replies; 9+ messages in thread
From: Chris Li @ 2025-09-02 12:17 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Huang, Ying, linux-kernel

Acked-by: Chris Li <chrisl@kernel.org>

Chris

On Wed, Aug 6, 2025 at 9:18 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> It was used for calculating the iteration number when the swap allocator
> wants to scan the whole fragment list. Now the allocator only scans one
> fragment cluster at a time, so no one uses this counter anymore.
>
> Remove it as a cleanup; the performance change is marginal:
>
> Build linux kernel using 10G ZRAM, make -j96, defconfig with 2G cgroup
> memory limit, on top of tmpfs, 64kB mTHP enabled:
>
> Before:  sys time: 6278.45s
> After:   sys time: 6176.34s
>
> Change to 8G ZRAM:
>
> Before:  sys time: 5572.85s
> After:   sys time: 5531.49s
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  include/linux/swap.h | 1 -
>  mm/swapfile.c        | 7 -------
>  2 files changed, 8 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2fe6ed2cc3fd..a060d102e0d1 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -310,7 +310,6 @@ struct swap_info_struct {
>                                         /* list of cluster that contains at least one free slot */
>         struct list_head frag_clusters[SWAP_NR_ORDERS];
>                                         /* list of cluster that are fragmented or contented */
> -       atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
>         unsigned int pages;             /* total of usable pages of swap */
>         atomic_long_t inuse_pages;      /* number of those currently in use */
>         struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 1f1110e37f68..5fdb3cb2b8b7 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -470,11 +470,6 @@ static void move_cluster(struct swap_info_struct *si,
>         else
>                 list_move_tail(&ci->list, list);
>         spin_unlock(&si->lock);
> -
> -       if (ci->flags == CLUSTER_FLAG_FRAG)
> -               atomic_long_dec(&si->frag_cluster_nr[ci->order]);
> -       else if (new_flags == CLUSTER_FLAG_FRAG)
> -               atomic_long_inc(&si->frag_cluster_nr[ci->order]);
>         ci->flags = new_flags;
>  }
>
> @@ -965,7 +960,6 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
>                  * allocation, but reclaim may drop si->lock and race with another user.
>                  */
>                 while ((ci = isolate_lock_cluster(si, &si->frag_clusters[o]))) {
> -                       atomic_long_dec(&si->frag_cluster_nr[o]);
>                         found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci),
>                                                         0, usage);
>                         if (found)
> @@ -3217,7 +3211,6 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si,
>         for (i = 0; i < SWAP_NR_ORDERS; i++) {
>                 INIT_LIST_HEAD(&si->nonfull_clusters[i]);
>                 INIT_LIST_HEAD(&si->frag_clusters[i]);
> -               atomic_long_set(&si->frag_cluster_nr[i], 0);
>         }
>
>         /*
> --
> 2.50.1
>

