* [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
@ 2024-07-31 6:49 Chris Li
2024-07-31 6:49 ` [PATCH v5 1/9] mm: swap: swap cluster switch to double link list Chris Li
` (11 more replies)
0 siblings, 12 replies; 32+ messages in thread
From: Chris Li @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
This is the short term solution "swap cluster order" listed
in slide 8 of my "Swap Abstraction" discussion at the recent
LSF/MM conference.
Since commit 845982eb264bc ("mm: swap: allow storage of all mTHP
orders") was introduced, mTHP swap entries are only allocated from
the empty cluster list. This leads to a fragmentation issue
reported by Barry:
https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
The reason is that all the empty clusters have been exhausted while
there are plenty of free swap entries left in clusters that are
not 100% free.
The fix is to remember the swap allocation order in the cluster and
keep track of a per-order nonfull cluster list for later allocation.
This series gives SSD swap allocation a new code path, separate
from the HDD allocation. The new allocator uses the cluster lists
only and no longer does an unlocked global scan of swap_map[].
This streamlines swap allocation for SSD and the code matches the
execution flow much better.
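The end result is per-order nonfull cluster lists built on the
standard list_head. A simplified sketch of the data structures from
the patches below (not the exact final layout; later patches in the
series add further lists):
    struct swap_cluster_info {
            spinlock_t lock;        /* protects count/flags/order and the
                                     * swap_map entries of this cluster */
            u16 count;              /* allocated entries in this cluster */
            u8 flags;               /* CLUSTER_FLAG_FREE / CLUSTER_FLAG_NONFULL */
            u8 order;               /* order this cluster is allocated for */
            struct list_head list;  /* free / nonfull[order] / discard list */
    };
    struct swap_info_struct {
            ...
            struct list_head free_clusters;
            struct list_head nonfull_clusters[SWAP_NR_ORDERS];
            struct list_head discard_clusters;
            ...
    };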
User impact: for users that allocate and free mixed-order mTHP
swap entries, this greatly improves the success rate of mTHP swap
allocation after the initial phase.
It also performs faster when the swapfile is close to full, because the
allocator can take a nonfull cluster from a list rather than scanning
a lot of swap_map entries.
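(For reference, the fallback percentage reported below works out to
swpout fallback inc / (swpout inc + swpout fallback inc), so lower
is better.)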
With Barry's mthp test program V2:
Without:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
...
Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
With: # with all 0.00% iterations filtered out
$ ./thp_swap_allocator_test -a | grep -v "0.00%"
$ # all results are 0.00%
$ ./thp_swap_allocator_test -a -s | grep -v "0.00%"
./thp_swap_allocator_test -a -s | grep -v "0.00%"
Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 19: swpout inc: 219, swpout fallback inc: 7, Fallback percentage: 3.10%
Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 29: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 34: swpout inc: 220, swpout fallback inc: 8, Fallback percentage: 3.51%
Iteration 35: swpout inc: 222, swpout fallback inc: 11, Fallback percentage: 4.72%
Iteration 38: swpout inc: 217, swpout fallback inc: 4, Fallback percentage: 1.81%
Iteration 40: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
Iteration 42: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 43: swpout inc: 215, swpout fallback inc: 7, Fallback percentage: 3.15%
Iteration 47: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 49: swpout inc: 217, swpout fallback inc: 1, Fallback percentage: 0.46%
Iteration 52: swpout inc: 221, swpout fallback inc: 8, Fallback percentage: 3.49%
Iteration 56: swpout inc: 224, swpout fallback inc: 4, Fallback percentage: 1.75%
Iteration 58: swpout inc: 214, swpout fallback inc: 5, Fallback percentage: 2.28%
Iteration 62: swpout inc: 220, swpout fallback inc: 3, Fallback percentage: 1.35%
Iteration 64: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 67: swpout inc: 221, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 75: swpout inc: 220, swpout fallback inc: 9, Fallback percentage: 3.93%
Iteration 82: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 86: swpout inc: 211, swpout fallback inc: 12, Fallback percentage: 5.38%
Iteration 89: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 93: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 94: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 96: swpout inc: 221, swpout fallback inc: 6, Fallback percentage: 2.64%
Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 99: swpout inc: 227, swpout fallback inc: 3, Fallback percentage: 1.30%
$ ./thp_swap_allocator_test
./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
$ ./thp_swap_allocator_test -s
./thp_swap_allocator_test -s
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 97, swpout fallback inc: 135, Fallback percentage: 58.19%
Iteration 3: swpout inc: 42, swpout fallback inc: 192, Fallback percentage: 82.05%
Iteration 4: swpout inc: 19, swpout fallback inc: 214, Fallback percentage: 91.85%
Iteration 5: swpout inc: 12, swpout fallback inc: 213, Fallback percentage: 94.67%
Iteration 6: swpout inc: 11, swpout fallback inc: 217, Fallback percentage: 95.18%
Iteration 7: swpout inc: 9, swpout fallback inc: 214, Fallback percentage: 95.96%
Iteration 8: swpout inc: 8, swpout fallback inc: 213, Fallback percentage: 96.38%
Iteration 9: swpout inc: 2, swpout fallback inc: 223, Fallback percentage: 99.11%
Iteration 10: swpout inc: 2, swpout fallback inc: 228, Fallback percentage: 99.13%
Iteration 11: swpout inc: 4, swpout fallback inc: 214, Fallback percentage: 98.17%
Iteration 12: swpout inc: 5, swpout fallback inc: 226, Fallback percentage: 97.84%
Iteration 13: swpout inc: 3, swpout fallback inc: 212, Fallback percentage: 98.60%
Iteration 14: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
Iteration 15: swpout inc: 3, swpout fallback inc: 222, Fallback percentage: 98.67%
Iteration 16: swpout inc: 4, swpout fallback inc: 223, Fallback percentage: 98.24%
=========
Kernel compile under tmpfs with cgroup memory.max = 470M.
12 cores, 24 hyperthreads, 32 jobs. 10 runs for each group.
SSD swap, average of 10 runs, 20G swap partition:
With:
user 2929.064
system 1479.381 : 1376.89 1398.22 1444.64 1477.39 1479.04 1497.27
1504.47 1531.4 1532.92 1551.57
real 1441.324
Without:
user 2910.872
system 1482.732 : 1440.01 1451.4 1462.01 1467.47 1467.51 1469.3
1470.19 1496.32 1544.1 1559.01
real 1580.822
Two zram swap devices: zram0 3.0G, zram1 20G.
The idea is to force zram0 almost full and then overflow to zram1:
With:
user 4320.301
system 4272.403 : 4236.24 4262.81 4264.75 4269.13 4269.44 4273.06
4279.85 4285.98 4289.64 4293.13
real 431.759
Without:
user 4301.393
system 4387.672 : 4374.47 4378.3 4380.95 4382.84 4383.06 4388.05
4389.76 4397.16 4398.23 4403.9
real 433.979
------ more test results from Kairui ----------
Test with building the Linux kernel using a 4G ZRAM and a 1G memory.max limit, on top of shmem:
System info: 32 Core AMD Zen2, 64G total memory.
Test 3 times using only 4K pages:
=================================
With:
-----
1838.74user 2411.21system 2:37.86elapsed 2692%CPU (0avgtext+0avgdata 847060maxresident)k
1839.86user 2465.77system 2:39.35elapsed 2701%CPU (0avgtext+0avgdata 847060maxresident)k
1840.26user 2454.68system 2:39.43elapsed 2693%CPU (0avgtext+0avgdata 847060maxresident)k
Summary (~4.6% improvement of system time):
User: 1839.62
System: 2443.89: 2465.77 2454.68 2411.21
Real: 158.88
Without:
--------
1837.99user 2575.95system 2:43.09elapsed 2706%CPU (0avgtext+0avgdata 846520maxresident)k
1838.32user 2555.15system 2:42.52elapsed 2709%CPU (0avgtext+0avgdata 846520maxresident)k
1843.02user 2561.55system 2:43.35elapsed 2702%CPU (0avgtext+0avgdata 846520maxresident)k
Summary:
User: 1839.78
System: 2564.22: 2575.95 2555.15 2561.55
Real: 162.99
Test 5 times with all mTHP sizes enabled:
==========================================
With:
-----
1796.44user 2937.33system 2:59.09elapsed 2643%CPU (0avgtext+0avgdata 846936maxresident)k
1802.55user 3002.32system 2:54.68elapsed 2750%CPU (0avgtext+0avgdata 847072maxresident)k
1806.59user 2986.53system 2:55.17elapsed 2736%CPU (0avgtext+0avgdata 847092maxresident)k
1803.27user 2982.40system 2:54.49elapsed 2742%CPU (0avgtext+0avgdata 846796maxresident)k
1807.43user 3036.08system 2:56.06elapsed 2751%CPU (0avgtext+0avgdata 846488maxresident)k
Summary (~8.4% improvement of system time):
User: 1803.25
System: 2988.93: 2937.33 3002.32 2986.53 2982.40 3036.08
Real: 175.90
mTHP swapout status:
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout:347721
/sys/kernel/mm/transparent_hugepage/hugepages-32kB/stats/swpout_fallback:3110
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout:3365
/sys/kernel/mm/transparent_hugepage/hugepages-512kB/stats/swpout_fallback:8269
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout:24
/sys/kernel/mm/transparent_hugepage/hugepages-2048kB/stats/swpout_fallback:3341
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout:145
/sys/kernel/mm/transparent_hugepage/hugepages-1024kB/stats/swpout_fallback:5038
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout:322737
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback:36808
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout:380455
/sys/kernel/mm/transparent_hugepage/hugepages-16kB/stats/swpout_fallback:1010
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout:24973
/sys/kernel/mm/transparent_hugepage/hugepages-256kB/stats/swpout_fallback:13223
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout:197348
/sys/kernel/mm/transparent_hugepage/hugepages-128kB/stats/swpout_fallback:80541
Without:
--------
1794.41user 3151.29system 3:05.97elapsed 2659%CPU (0avgtext+0avgdata 846704maxresident)k
1810.27user 3304.48system 3:05.38elapsed 2759%CPU (0avgtext+0avgdata 846636maxresident)k
1809.84user 3254.85system 3:03.83elapsed 2755%CPU (0avgtext+0avgdata 846952maxresident)k
1813.54user 3259.56system 3:04.28elapsed 2752%CPU (0avgtext+0avgdata 846848maxresident)k
1829.97user 3338.40system 3:07.32elapsed 2759%CPU (0avgtext+0avgdata 847024maxresident)k
Summary:
User: 1811.61
System: 3261.72 : 3151.29 3304.48 3254.85 3259.56 3338.40
Real: 185.356
mTHP swapout status:
hugepages-32kB/stats/swpout:35630
hugepages-32kB/stats/swpout_fallback:1809908
hugepages-512kB/stats/swpout:523
hugepages-512kB/stats/swpout_fallback:55235
hugepages-2048kB/stats/swpout:53
hugepages-2048kB/stats/swpout_fallback:17264
hugepages-1024kB/stats/swpout:85
hugepages-1024kB/stats/swpout_fallback:24979
hugepages-64kB/stats/swpout:30117
hugepages-64kB/stats/swpout_fallback:1825399
hugepages-16kB/stats/swpout:42775
hugepages-16kB/stats/swpout_fallback:1951123
hugepages-256kB/stats/swpout:2326
hugepages-256kB/stats/swpout_fallback:170165
hugepages-128kB/stats/swpout:17925
hugepages-128kB/stats/swpout_fallback:1309757
Reported-by: Barry Song <21cnbao@gmail.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v5:
- Suggestions and fixups from the v4 discussion thread from Ying and Ryan.
- Adding Kairui's swap cache reclaim patches on top of patch 3.
- Link to v4: https://lore.kernel.org/r/20240711-swap-allocator-v4-0-0295a4d4c7aa@kernel.org
Changes in v4:
- Remove a warning in patch 2.
- Allocate from the free cluster list before the nonfull list, reverting the v3 behavior.
- Add cluster_index and cluster_offset function.
- Patch 3 has a new allocating path for SSD.
- HDD swap allocation does not need to consider clusters any more.
Changes in v3:
- Using V1 as base.
- Rename "next" to "list" for the list field, suggested by Ying.
- Update comment for the locking rules for cluster fields and list,
suggested by Ying.
- Allocate from the nonfull list before attempting free list, suggested
by Kairui.
- Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org
Changes in v2:
- Abandoned.
- Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
---
Chris Li (3):
mm: swap: swap cluster switch to double link list
mm: swap: mTHP allocate swap entries from nonfull list
mm: swap: separate SSD allocation from scan_swap_map_slots()
Kairui Song (6):
mm: swap: clean up initialization helper
mm: swap: skip slot cache on freeing for mTHP
mm: swap: allow cache reclaim to skip slot cache
mm: swap: add a fragment cluster list
mm: swap: relaim the cached parts that got scanned
mm: swap: add a adaptive full cluster cache reclaim
include/linux/swap.h | 34 ++-
mm/swapfile.c | 840 ++++++++++++++++++++++++++++++---------------------
2 files changed, 514 insertions(+), 360 deletions(-)
---
base-commit: ff3a648ecb9409aff1448cf4f6aa41d78c69a3bc
change-id: 20240523-swap-allocator-1534c480ece4
Best regards,
--
Chris Li <chrisl@kernel.org>
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 1/9] mm: swap: swap cluster switch to double link list
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
@ 2024-07-31 6:49 ` Chris Li
2024-07-31 6:49 ` [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
` (10 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Previously, the swap cluster used a cluster index as a pointer
to construct a custom singly linked list type "swap_cluster_list".
The next cluster pointer is shared with cluster->count, which
prevents putting a non-free cluster on a list.
Change the cluster to use the standard doubly linked list instead.
This allows tracking the nonfull clusters in the follow up patch,
making it faster to get to a nonfull cluster of a given order.
Remove the cluster getters/setters for accessing the cluster
struct members.
The list operations are protected by swap_info_struct->lock.
Change the cluster code to use "struct swap_cluster_info *" to
reference a cluster rather than an index. That is more consistent
with the list manipulation and avoids repeatedly adding the index
to cluster_info. The code is easier to understand.
Remove the "cluster next pointer is NULL" flag; the doubly linked
list handles the empty list just fine.
The "swap_cluster_info" struct is two pointers bigger, but because
512 swap entries share one swap_cluster_info struct, it has very
little impact on the average memory usage per swap entry. For a
1TB swapfile, the swap cluster data structure increases from 8MB
to 24MB.
Other than the list conversion, there is no real functional change
in this patch.
Signed-off-by: Chris Li <chrisl@kernel.org>
---
include/linux/swap.h | 25 ++----
mm/swapfile.c | 226 ++++++++++++++-------------------------------------
2 files changed, 71 insertions(+), 180 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e473fe6cfb7a..edafd52d7ac4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -243,22 +243,20 @@ enum {
* free clusters are organized into a list. We fetch an entry from the list to
* get a free cluster.
*
- * The data field stores next cluster if the cluster is free or cluster usage
- * counter otherwise. The flags field determines if a cluster is free. This is
- * protected by swap_info_struct.lock.
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
*/
struct swap_cluster_info {
spinlock_t lock; /*
* Protect swap_cluster_info fields
- * and swap_info_struct->swap_map
- * elements correspond to the swap
- * cluster
+ * other than list, and swap_info_struct->swap_map
+ * elements corresponding to the swap cluster.
*/
- unsigned int data:24;
- unsigned int flags:8;
+ u16 count;
+ u8 flags;
+ struct list_head list;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
/*
* The first page in the swap file is the swap header, which is always marked
@@ -283,11 +281,6 @@ struct percpu_cluster {
unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};
-struct swap_cluster_list {
- struct swap_cluster_info head;
- struct swap_cluster_info tail;
-};
-
/*
* The in-memory structure used to track swap areas.
*/
@@ -301,7 +294,7 @@ struct swap_info_struct {
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
- struct swap_cluster_list free_clusters; /* free clusters list */
+ struct list_head free_clusters; /* free clusters list */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
@@ -332,7 +325,7 @@ struct swap_info_struct {
* list.
*/
struct work_struct discard_work; /* discard worker */
- struct swap_cluster_list discard_clusters; /* discard clusters list */
+ struct list_head discard_clusters; /* discard clusters list */
struct plist_node avail_lists[]; /*
* entries in swap_avail_heads, one
* entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f7224bc1320c..bceead7f9e3c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
#endif
#define LATENCY_LIMIT 256
-static inline void cluster_set_flag(struct swap_cluster_info *info,
- unsigned int flag)
-{
- info->flags = flag;
-}
-
-static inline unsigned int cluster_count(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_count(struct swap_cluster_info *info,
- unsigned int c)
-{
- info->data = c;
-}
-
-static inline void cluster_set_count_flag(struct swap_cluster_info *info,
- unsigned int c, unsigned int f)
-{
- info->flags = f;
- info->data = c;
-}
-
-static inline unsigned int cluster_next(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_next(struct swap_cluster_info *info,
- unsigned int n)
-{
- info->data = n;
-}
-
-static inline void cluster_set_next_flag(struct swap_cluster_info *info,
- unsigned int n, unsigned int f)
-{
- info->flags = f;
- info->data = n;
-}
-
static inline bool cluster_is_free(struct swap_cluster_info *info)
{
return info->flags & CLUSTER_FLAG_FREE;
}
-static inline bool cluster_is_null(struct swap_cluster_info *info)
-{
- return info->flags & CLUSTER_FLAG_NEXT_NULL;
-}
-
-static inline void cluster_set_null(struct swap_cluster_info *info)
+static inline unsigned int cluster_index(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
{
- info->flags = CLUSTER_FLAG_NEXT_NULL;
- info->data = 0;
+ return ci - si->cluster_info;
}
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
spin_unlock(&si->lock);
}
-static inline bool cluster_list_empty(struct swap_cluster_list *list)
-{
- return cluster_is_null(&list->head);
-}
-
-static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
-{
- return cluster_next(&list->head);
-}
-
-static void cluster_list_init(struct swap_cluster_list *list)
-{
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
-}
-
-static void cluster_list_add_tail(struct swap_cluster_list *list,
- struct swap_cluster_info *ci,
- unsigned int idx)
-{
- if (cluster_list_empty(list)) {
- cluster_set_next_flag(&list->head, idx, 0);
- cluster_set_next_flag(&list->tail, idx, 0);
- } else {
- struct swap_cluster_info *ci_tail;
- unsigned int tail = cluster_next(&list->tail);
-
- /*
- * Nested cluster lock, but both cluster locks are
- * only acquired when we held swap_info_struct->lock
- */
- ci_tail = ci + tail;
- spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
- cluster_set_next(ci_tail, idx);
- spin_unlock(&ci_tail->lock);
- cluster_set_next_flag(&list->tail, idx, 0);
- }
-}
-
-static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
- struct swap_cluster_info *ci)
-{
- unsigned int idx;
-
- idx = cluster_next(&list->head);
- if (cluster_next(&list->tail) == idx) {
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
- } else
- cluster_set_next_flag(&list->head,
- cluster_next(&ci[idx]), 0);
-
- return idx;
-}
-
/* Add a cluster to discard list and schedule it to do discard */
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
- unsigned int idx)
+ struct swap_cluster_info *ci)
{
+ unsigned int idx = cluster_index(si, ci);
/*
* If scan_swap_map_slots() can't find a free cluster, it will check
* si->swap_map directly. To make sure the discarding cluster isn't
@@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
- cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
-
+ list_add_tail(&ci->list, &si->discard_clusters);
schedule_work(&si->discard_work);
}
-static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info;
-
- cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&si->free_clusters, ci, idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->list, &si->free_clusters);
}
/*
@@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
static void swap_do_scheduled_discard(struct swap_info_struct *si)
{
- struct swap_cluster_info *info, *ci;
+ struct swap_cluster_info *ci;
unsigned int idx;
- info = si->cluster_info;
-
- while (!cluster_list_empty(&si->discard_clusters)) {
- idx = cluster_list_del_first(&si->discard_clusters, info);
+ while (!list_empty(&si->discard_clusters)) {
+ ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+ list_del(&ci->list);
+ idx = cluster_index(si, ci);
spin_unlock(&si->lock);
discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
SWAPFILE_CLUSTER);
spin_lock(&si->lock);
- ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
- __free_cluster(si, idx);
+
+ spin_lock(&ci->lock);
+ __free_cluster(si, ci);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
0, SWAPFILE_CLUSTER);
- unlock_cluster(ci);
+ spin_unlock(&ci->lock);
}
}
@@ -521,20 +418,21 @@ static void swap_users_ref_free(struct percpu_ref *ref)
complete(&si->comp);
}
-static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
{
- struct swap_cluster_info *ci = si->cluster_info;
+ struct swap_cluster_info *ci = list_first_entry(&si->free_clusters,
+ struct swap_cluster_info, list);
- VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
- cluster_list_del_first(&si->free_clusters, ci);
- cluster_set_count_flag(ci + idx, 0, 0);
+ VM_BUG_ON(cluster_index(si, ci) != idx);
+ list_del(&ci->list);
+ ci->count = 0;
+ ci->flags = 0;
+ return ci;
}
-static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info + idx;
-
- VM_BUG_ON(cluster_count(ci) != 0);
+ VM_BUG_ON(ci->count != 0);
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -542,11 +440,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
(SWP_WRITEOK | SWP_PAGE_DISCARD)) {
- swap_cluster_schedule_discard(si, idx);
+ swap_cluster_schedule_discard(si, ci);
return;
}
- __free_cluster(si, idx);
+ __free_cluster(si, ci);
}
/*
@@ -559,15 +457,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
unsigned long count)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+ struct swap_cluster_info *ci = cluster_info + idx;
if (!cluster_info)
return;
- if (cluster_is_free(&cluster_info[idx]))
+ if (cluster_is_free(ci))
alloc_cluster(p, idx);
- VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) + count);
+ VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
+ ci->count += count;
}
/*
@@ -581,24 +479,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
}
/*
- * The cluster corresponding to page_nr decreases one usage. If the usage
- * counter becomes 0, which means no page in the cluster is in using, we can
- * optionally discard the cluster and add it to free cluster list.
+ * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * which means no page in the cluster is in use, we can optionally discard
+ * the cluster and add it to free cluster list.
*/
-static void dec_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
{
- unsigned long idx = page_nr / SWAPFILE_CLUSTER;
-
- if (!cluster_info)
+ if (!p->cluster_info)
return;
- VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) - 1);
+ VM_BUG_ON(ci->count == 0);
+ ci->count--;
- if (cluster_count(&cluster_info[idx]) == 0)
- free_cluster(p, idx);
+ if (!ci->count)
+ free_cluster(p, ci);
}
/*
@@ -611,10 +505,12 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
{
struct percpu_cluster *percpu_cluster;
bool conflict;
+ struct swap_cluster_info *first = list_first_entry(&si->free_clusters,
+ struct swap_cluster_info, list);
offset /= SWAPFILE_CLUSTER;
- conflict = !cluster_list_empty(&si->free_clusters) &&
- offset != cluster_list_first(&si->free_clusters) &&
+ conflict = !list_empty(&si->free_clusters) &&
+ offset != cluster_index(si, first) &&
cluster_is_free(&si->cluster_info[offset]);
if (!conflict)
@@ -655,10 +551,10 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
cluster = this_cpu_ptr(si->percpu_cluster);
tmp = cluster->next[order];
if (tmp == SWAP_NEXT_INVALID) {
- if (!cluster_list_empty(&si->free_clusters)) {
- tmp = cluster_next(&si->free_clusters.head) *
- SWAPFILE_CLUSTER;
- } else if (!cluster_list_empty(&si->discard_clusters)) {
+ if (!list_empty(&si->free_clusters)) {
+ ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->discard_clusters)) {
/*
* we don't have free cluster but have some clusters in
* discarding, do discard now and reclaim them, then
@@ -1070,8 +966,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
- cluster_set_count_flag(ci, 0, 0);
- free_cluster(si, idx);
+ ci->count = 0;
+ ci->flags = 0;
+ free_cluster(si, ci);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
}
@@ -1344,7 +1241,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
count = p->swap_map[offset];
VM_BUG_ON(count != SWAP_HAS_CACHE);
p->swap_map[offset] = 0;
- dec_cluster_info_page(p, p->cluster_info, offset);
+ dec_cluster_info_page(p, ci);
unlock_cluster(ci);
mem_cgroup_uncharge_swap(entry, 1);
@@ -3022,8 +2919,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
nr_good_pages = maxpages - 1; /* omit header page */
- cluster_list_init(&p->free_clusters);
- cluster_list_init(&p->discard_clusters);
+ INIT_LIST_HEAD(&p->free_clusters);
+ INIT_LIST_HEAD(&p->discard_clusters);
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
@@ -3074,14 +2971,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
j = (k + col) % SWAP_CLUSTER_COLS;
for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+ struct swap_cluster_info *ci;
idx = i * SWAP_CLUSTER_COLS + j;
+ ci = cluster_info + idx;
if (idx >= nr_clusters)
continue;
- if (cluster_count(&cluster_info[idx]))
+ if (ci->count)
continue;
- cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&p->free_clusters, cluster_info,
- idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->list, &p->free_clusters);
}
}
return nr_extents;
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-07-31 6:49 ` [PATCH v5 1/9] mm: swap: swap cluster switch to double link list Chris Li
@ 2024-07-31 6:49 ` Chris Li
[not found] ` <87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com>
2024-07-31 6:49 ` [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots() Chris Li
` (9 subsequent siblings)
11 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Track the nonfull clusters as well as the empty clusters
on lists. Each order has one nonfull cluster list.
The cluster will remember which order it was used for during
new cluster allocation.
When the cluster has a free entry, add it to the nonfull[order]
list. When the free cluster list is empty, also allocate
from the nonfull list of that order.
This improves the mTHP swap allocation success rate.
There are limitations if the distribution of the different mTHP
orders changes a lot, e.g. a lot of nonfull clusters are assigned
to order A while later there are many order B allocations and very
little order A allocation. Currently a cluster used by order A will
not be reused by order B unless the cluster becomes 100% empty.
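In short, the lists added by this patch behave roughly as follows
(a simplified summary of the diff below, not extra code in the patch):
    /*
     * si->free_clusters           - fully free clusters, usable by any order
     * si->nonfull_clusters[order] - clusters with free entries that were
     *                               last allocated with this order
     *
     * Allocation for a given order takes a free cluster first; if none
     * is left it falls back to nonfull_clusters[order]. A cluster only
     * returns to free_clusters, and so becomes usable by other orders,
     * once its count drops back to zero.
     */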
Signed-off-by: Chris Li <chrisl@kernel.org>
---
include/linux/swap.h | 4 ++++
mm/swapfile.c | 38 +++++++++++++++++++++++++++++++++++---
2 files changed, 39 insertions(+), 3 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index edafd52d7ac4..6716ef236766 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,9 +254,11 @@ struct swap_cluster_info {
*/
u16 count;
u8 flags;
+ u8 order;
struct list_head list;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
+#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
/*
* The first page in the swap file is the swap header, which is always marked
@@ -295,6 +297,8 @@ struct swap_info_struct {
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
+ struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+ /* list of cluster that contains at least one free slot */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bceead7f9e3c..dcf09eb549db 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -361,14 +361,22 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
- list_add_tail(&ci->list, &si->discard_clusters);
+ VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
+ if (ci->flags & CLUSTER_FLAG_NONFULL)
+ list_move_tail(&ci->list, &si->discard_clusters);
+ else
+ list_add_tail(&ci->list, &si->discard_clusters);
+ ci->flags = 0;
schedule_work(&si->discard_work);
}
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ if (ci->flags & CLUSTER_FLAG_NONFULL)
+ list_move_tail(&ci->list, &si->free_clusters);
+ else
+ list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
- list_add_tail(&ci->list, &si->free_clusters);
}
/*
@@ -491,8 +499,15 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
VM_BUG_ON(ci->count == 0);
ci->count--;
- if (!ci->count)
+ if (!ci->count) {
free_cluster(p, ci);
+ return;
+ }
+
+ if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+ list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ ci->flags |= CLUSTER_FLAG_NONFULL;
+ }
}
/*
@@ -553,6 +568,19 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
if (tmp == SWAP_NEXT_INVALID) {
if (!list_empty(&si->free_clusters)) {
ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ list_del(&ci->list);
+ spin_lock(&ci->lock);
+ ci->order = order;
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
+ tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->nonfull_clusters[order])) {
+ ci = list_first_entry(&si->nonfull_clusters[order],
+ struct swap_cluster_info, list);
+ list_del(&ci->list);
+ spin_lock(&ci->lock);
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
} else if (!list_empty(&si->discard_clusters)) {
/*
@@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
ci->count = 0;
+ ci->order = 0;
ci->flags = 0;
free_cluster(si, ci);
unlock_cluster(ci);
@@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
INIT_LIST_HEAD(&p->free_clusters);
INIT_LIST_HEAD(&p->discard_clusters);
+ for (i = 0; i < SWAP_NR_ORDERS; i++)
+ INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr == 0 || page_nr > swap_header->info.last_page)
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots()
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-07-31 6:49 ` [PATCH v5 1/9] mm: swap: swap cluster switch to double link list Chris Li
2024-07-31 6:49 ` [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
@ 2024-07-31 6:49 ` Chris Li
2024-07-31 6:49 ` [PATCH v5 4/9] mm: swap: clean up initialization helper chrisl
` (8 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Previously the SSD and HDD cases shared the same swap_map scan loop
in scan_swap_map_slots(). This function is complex and its
execution flow is hard to follow.
scan_swap_map_try_ssd_cluster() can already do most of the heavy
lifting to locate the candidate swap range in the cluster. However,
it needs to go back to scan_swap_map_slots() to check for conflicts
and then perform the allocation.
When scan_swap_map_try_ssd_cluster() fails, it still depends on
scan_swap_map_slots() to do a brute force scan of the swap_map.
When the swapfile is large and almost full, it takes some CPU
time to go through the swap_map array.
Get rid of the cluster allocation dependency on the swap_map scan
loop in scan_swap_map_slots(). Streamline the cluster allocation
code path. No more conflict checks.
For order 0 swap entries, when the free and nonfull lists run out,
allocate from the higher order nonfull cluster lists.
Users should see less CPU time spent searching for free swap
slots when the swapfile is almost full.
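The resulting SSD allocation order is roughly (a condensed summary
of cluster_alloc_swap_entry() in the diff below):
    /*
     * 1. this CPU's percpu cluster->next[order] hint
     * 2. a cluster from si->free_clusters
     * 3. a cluster from si->nonfull_clusters[order] (orders below PMD_ORDER)
     * 4. discard pending clusters, then retry
     * 5. order 0 only: scan a higher order nonfull cluster
     */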
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 300 ++++++++++++++++++++++++++++++++--------------------------
1 file changed, 168 insertions(+), 132 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dcf09eb549db..8a72c8a9aafd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,6 +53,8 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
+static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+ unsigned int nr_entries);
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -301,6 +303,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
return ci - si->cluster_info;
}
+static inline unsigned int cluster_offset(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
+{
+ return cluster_index(si, ci) * SWAPFILE_CLUSTER;
+}
+
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
unsigned long offset)
{
@@ -372,11 +380,15 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ lockdep_assert_held(&si->lock);
+ lockdep_assert_held(&ci->lock);
+
if (ci->flags & CLUSTER_FLAG_NONFULL)
list_move_tail(&ci->list, &si->free_clusters);
else
list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
+ ci->order = 0;
}
/*
@@ -431,9 +443,11 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi
struct swap_cluster_info *ci = list_first_entry(&si->free_clusters,
struct swap_cluster_info, list);
+ lockdep_assert_held(&si->lock);
+ lockdep_assert_held(&ci->lock);
VM_BUG_ON(cluster_index(si, ci) != idx);
+ VM_BUG_ON(ci->count);
list_del(&ci->list);
- ci->count = 0;
ci->flags = 0;
return ci;
}
@@ -441,6 +455,8 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi
static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
VM_BUG_ON(ci->count != 0);
+ lockdep_assert_held(&si->lock);
+ lockdep_assert_held(&ci->lock);
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -497,6 +513,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
return;
VM_BUG_ON(ci->count == 0);
+ VM_BUG_ON(cluster_is_free(ci));
+ lockdep_assert_held(&p->lock);
+ lockdep_assert_held(&ci->lock);
ci->count--;
if (!ci->count) {
@@ -505,48 +524,88 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
}
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+ VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- ci->flags |= CLUSTER_FLAG_NONFULL;
+ ci->flags = CLUSTER_FLAG_NONFULL;
}
}
-/*
- * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
- * cluster list. Avoiding such abuse to avoid list corruption.
- */
-static bool
-scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
- unsigned long offset, int order)
+static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start,
+ unsigned int nr_pages)
{
- struct percpu_cluster *percpu_cluster;
- bool conflict;
- struct swap_cluster_info *first = list_first_entry(&si->free_clusters,
- struct swap_cluster_info, list);
-
- offset /= SWAPFILE_CLUSTER;
- conflict = !list_empty(&si->free_clusters) &&
- offset != cluster_index(si, first) &&
- cluster_is_free(&si->cluster_info[offset]);
+ unsigned char *p = si->swap_map + start;
+ unsigned char *end = p + nr_pages;
- if (!conflict)
- return false;
+ while (p < end)
+ if (*p++)
+ return false;
- percpu_cluster = this_cpu_ptr(si->percpu_cluster);
- percpu_cluster->next[order] = SWAP_NEXT_INVALID;
return true;
}
-static inline bool swap_range_empty(char *swap_map, unsigned int start,
- unsigned int nr_pages)
+
+static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
+ unsigned int start, unsigned char usage,
+ unsigned int order)
{
- unsigned int i;
+ unsigned int nr_pages = 1 << order;
- for (i = 0; i < nr_pages; i++) {
- if (swap_map[start + i])
- return false;
+ if (cluster_is_free(ci)) {
+ if (nr_pages < SWAPFILE_CLUSTER) {
+ list_move_tail(&ci->list, &si->nonfull_clusters[order]);
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ }
+ ci->order = order;
}
- return true;
+ memset(si->swap_map + start, usage, nr_pages);
+ swap_range_alloc(si, start, nr_pages);
+ ci->count += nr_pages;
+
+ if (ci->count == SWAPFILE_CLUSTER) {
+ VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL)));
+ list_del(&ci->list);
+ ci->flags = 0;
+ }
+}
+
+static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
+ unsigned int *foundp, unsigned int order,
+ unsigned char usage)
+{
+ unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
+ unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
+ unsigned int nr_pages = 1 << order;
+ struct swap_cluster_info *ci;
+
+ if (end < nr_pages)
+ return SWAP_NEXT_INVALID;
+ end -= nr_pages;
+
+ ci = lock_cluster(si, offset);
+ if (ci->count + nr_pages > SWAPFILE_CLUSTER) {
+ offset = SWAP_NEXT_INVALID;
+ goto done;
+ }
+
+ while (offset <= end) {
+ if (cluster_scan_range(si, offset, nr_pages)) {
+ cluster_alloc_range(si, ci, offset, usage, order);
+ *foundp = offset;
+ if (ci->count == SWAPFILE_CLUSTER) {
+ offset = SWAP_NEXT_INVALID;
+ goto done;
+ }
+ offset += nr_pages;
+ break;
+ }
+ offset += nr_pages;
+ }
+ if (offset > end)
+ offset = SWAP_NEXT_INVALID;
+done:
+ unlock_cluster(ci);
+ return offset;
}
/*
@@ -554,72 +613,66 @@ static inline bool swap_range_empty(char *swap_map, unsigned int start,
* pool (a cluster). This might involve allocating a new cluster for current CPU
* too.
*/
-static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
- unsigned long *offset, unsigned long *scan_base, int order)
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+ unsigned char usage)
{
- unsigned int nr_pages = 1 << order;
struct percpu_cluster *cluster;
- struct swap_cluster_info *ci;
- unsigned int tmp, max;
+ struct swap_cluster_info *ci, *n;
+ unsigned int offset, found = 0;
new_cluster:
+ lockdep_assert_held(&si->lock);
cluster = this_cpu_ptr(si->percpu_cluster);
- tmp = cluster->next[order];
- if (tmp == SWAP_NEXT_INVALID) {
- if (!list_empty(&si->free_clusters)) {
- ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
- list_del(&ci->list);
- spin_lock(&ci->lock);
- ci->order = order;
- ci->flags = 0;
- spin_unlock(&ci->lock);
- tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
- } else if (!list_empty(&si->nonfull_clusters[order])) {
- ci = list_first_entry(&si->nonfull_clusters[order],
- struct swap_cluster_info, list);
- list_del(&ci->list);
- spin_lock(&ci->lock);
- ci->flags = 0;
- spin_unlock(&ci->lock);
- tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
- } else if (!list_empty(&si->discard_clusters)) {
- /*
- * we don't have free cluster but have some clusters in
- * discarding, do discard now and reclaim them, then
- * reread cluster_next_cpu since we dropped si->lock
- */
- swap_do_scheduled_discard(si);
- *scan_base = this_cpu_read(*si->cluster_next_cpu);
- *offset = *scan_base;
- goto new_cluster;
- } else
- return false;
+ offset = cluster->next[order];
+ if (offset) {
+ offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
+ if (found)
+ goto done;
}
- /*
- * Other CPUs can use our cluster if they can't find a free cluster,
- * check if there is still free entry in the cluster, maintaining
- * natural alignment.
- */
- max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
- if (tmp < max) {
- ci = lock_cluster(si, tmp);
- while (tmp < max) {
- if (swap_range_empty(si->swap_map, tmp, nr_pages))
- break;
- tmp += nr_pages;
+ if (!list_empty(&si->free_clusters)) {
+ ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
+ VM_BUG_ON(!found);
+ goto done;
+ }
+
+ if (order < PMD_ORDER) {
+ list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, order, usage);
+ if (found)
+ goto done;
}
- unlock_cluster(ci);
}
- if (tmp >= max) {
- cluster->next[order] = SWAP_NEXT_INVALID;
+
+ if (!list_empty(&si->discard_clusters)) {
+ /*
+ * we don't have free cluster but have some clusters in
+ * discarding, do discard now and reclaim them, then
+ * reread cluster_next_cpu since we dropped si->lock
+ */
+ swap_do_scheduled_discard(si);
goto new_cluster;
}
- *offset = tmp;
- *scan_base = tmp;
- tmp += nr_pages;
- cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
- return true;
+
+ if (order)
+ goto done;
+
+ for (int o = 1; o < PMD_ORDER; o++) {
+ if (!list_empty(&si->nonfull_clusters[o])) {
+ ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info,
+ list);
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, 0, usage);
+ VM_BUG_ON(!found);
+ goto done;
+ }
+ }
+
+done:
+ cluster->next[order] = offset;
+ return found;
}
static void __del_from_avail_list(struct swap_info_struct *p)
@@ -754,11 +807,29 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
return false;
}
+static int cluster_alloc_swap(struct swap_info_struct *si,
+ unsigned char usage, int nr,
+ swp_entry_t slots[], int order)
+{
+ int n_ret = 0;
+
+ VM_BUG_ON(!si->cluster_info);
+
+ while (n_ret < nr) {
+ unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
+
+ if (!offset)
+ break;
+ slots[n_ret++] = swp_entry(si->type, offset);
+ }
+
+ return n_ret;
+}
+
static int scan_swap_map_slots(struct swap_info_struct *si,
unsigned char usage, int nr,
swp_entry_t slots[], int order)
{
- struct swap_cluster_info *ci;
unsigned long offset;
unsigned long scan_base;
unsigned long last_in_cluster = 0;
@@ -797,26 +868,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
return 0;
}
+ if (si->cluster_info)
+ return cluster_alloc_swap(si, usage, nr, slots, order);
+
si->flags += SWP_SCANNING;
- /*
- * Use percpu scan base for SSD to reduce lock contention on
- * cluster and swap cache. For HDD, sequential access is more
- * important.
- */
- if (si->flags & SWP_SOLIDSTATE)
- scan_base = this_cpu_read(*si->cluster_next_cpu);
- else
- scan_base = si->cluster_next;
+
+ /* For HDD, sequential access is more important. */
+ scan_base = si->cluster_next;
offset = scan_base;
- /* SSD algorithm */
- if (si->cluster_info) {
- if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
- if (order > 0)
- goto no_page;
- goto scan;
- }
- } else if (unlikely(!si->cluster_nr--)) {
+ if (unlikely(!si->cluster_nr--)) {
if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
si->cluster_nr = SWAPFILE_CLUSTER - 1;
goto checks;
@@ -827,8 +888,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
/*
* If seek is expensive, start searching for new cluster from
* start of partition, to minimize the span of allocated swap.
- * If seek is cheap, that is the SWP_SOLIDSTATE si->cluster_info
- * case, just handled by scan_swap_map_try_ssd_cluster() above.
*/
scan_base = offset = si->lowest_bit;
last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
@@ -856,19 +915,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}
checks:
- if (si->cluster_info) {
- while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
- /* take a break if we already got some slots */
- if (n_ret)
- goto done;
- if (!scan_swap_map_try_ssd_cluster(si, &offset,
- &scan_base, order)) {
- if (order > 0)
- goto no_page;
- goto scan;
- }
- }
- }
if (!(si->flags & SWP_WRITEOK))
goto no_page;
if (!si->highest_bit)
@@ -876,11 +922,9 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
if (offset > si->highest_bit)
scan_base = offset = si->lowest_bit;
- ci = lock_cluster(si, offset);
/* reuse swap entry of cache-only swap if not busy. */
if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
int swap_was_freed;
- unlock_cluster(ci);
spin_unlock(&si->lock);
swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
spin_lock(&si->lock);
@@ -891,15 +935,12 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}
if (si->swap_map[offset]) {
- unlock_cluster(ci);
if (!n_ret)
goto scan;
else
goto done;
}
memset(si->swap_map + offset, usage, nr_pages);
- add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
- unlock_cluster(ci);
swap_range_alloc(si, offset, nr_pages);
slots[n_ret++] = swp_entry(si->type, offset);
@@ -920,13 +961,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
latency_ration = LATENCY_LIMIT;
}
- /* try to get more slots in cluster */
- if (si->cluster_info) {
- if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
- goto checks;
- if (order > 0)
- goto done;
- } else if (si->cluster_nr && !si->swap_map[++offset]) {
+ if (si->cluster_nr && !si->swap_map[++offset]) {
/* non-ssd case, still more slots in cluster? */
--si->cluster_nr;
goto checks;
@@ -995,8 +1030,6 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
ci->count = 0;
- ci->order = 0;
- ci->flags = 0;
free_cluster(si, ci);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
@@ -3008,8 +3041,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
ci = cluster_info + idx;
if (idx >= nr_clusters)
continue;
- if (ci->count)
+ if (ci->count) {
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ list_add_tail(&ci->list, &p->nonfull_clusters[0]);
continue;
+ }
ci->flags = CLUSTER_FLAG_FREE;
list_add_tail(&ci->list, &p->free_clusters);
}
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 4/9] mm: swap: clean up initialization helper
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (2 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots() Chris Li
@ 2024-07-31 6:49 ` chrisl
2024-07-31 6:49 ` [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP chrisl
` (7 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
At this point, alloc_cluster is no longer called, and
inc_cluster_info_page is only called during initialization, so a
lot of dead code can be dropped.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 44 ++++++++++----------------------------------
1 file changed, 10 insertions(+), 34 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8a72c8a9aafd..34e6ea13e8e4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -438,20 +438,6 @@ static void swap_users_ref_free(struct percpu_ref *ref)
complete(&si->comp);
}
-static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
-{
- struct swap_cluster_info *ci = list_first_entry(&si->free_clusters,
- struct swap_cluster_info, list);
-
- lockdep_assert_held(&si->lock);
- lockdep_assert_held(&ci->lock);
- VM_BUG_ON(cluster_index(si, ci) != idx);
- VM_BUG_ON(ci->count);
- list_del(&ci->list);
- ci->flags = 0;
- return ci;
-}
-
static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
VM_BUG_ON(ci->count != 0);
@@ -472,34 +458,24 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
}
/*
- * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased by
- * count.
+ * The cluster corresponding to page_nr will be used. The cluster will not be
+ * added to free cluster list and its usage counter will be increased by 1.
+ * Only used for initialization.
*/
-static void add_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr,
- unsigned long count)
+static void inc_cluster_info_page(struct swap_info_struct *p,
+ struct swap_cluster_info *cluster_info, unsigned long page_nr)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
- struct swap_cluster_info *ci = cluster_info + idx;
+ struct swap_cluster_info *ci;
if (!cluster_info)
return;
- if (cluster_is_free(ci))
- alloc_cluster(p, idx);
- VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
- ci->count += count;
-}
+ ci = cluster_info + idx;
+ ci->count++;
-/*
- * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased by 1.
- */
-static void inc_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr)
-{
- add_cluster_info_page(p, cluster_info, page_nr, 1);
+ VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
+ VM_BUG_ON(ci->flags);
}
/*
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (3 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 4/9] mm: swap: clean up initialization helper chrisl
@ 2024-07-31 6:49 ` chrisl
2024-08-03 9:11 ` Barry Song
2024-07-31 6:49 ` [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache chrisl
` (6 subsequent siblings)
11 siblings, 1 reply; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Currently when we are freeing mTHP folios from the swap cache, we free
them one by one and put each entry into the swap slot cache. The slot
cache is designed to reduce the overhead by batching the freeing, but
mTHP swap entries are already contiguous so they can be batch freed
without it; using it saves little overhead, or even adds overhead for
larger mTHP.
What's more, mTHP entries could stay in the swap cache for a while.
Contiguous swap entries are a rather rare resource, so releasing them
directly can help improve the mTHP allocation success rate under
pressure.
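A rough user-space sketch (plain C; the struct, names and the locking
model are invented for illustration, this is not kernel code) of why
batching the free helps: nr contiguous entries always live in the same
cluster, so one lock round trip and one counter update can replace nr
of them. In the patch this shows up as swap_entry_range_free() clearing
the whole map range and decrementing the cluster count once.
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#define CLUSTER_SIZE 512
#define HAS_CACHE 0x40
struct cluster {
	pthread_mutex_t lock;
	unsigned char map[CLUSTER_SIZE];
	int count;		/* slots in use */
};
/* One-by-one freeing: one lock/unlock round trip per entry. */
static void free_one(struct cluster *c, int off)
{
	pthread_mutex_lock(&c->lock);
	assert(c->map[off] == HAS_CACHE);
	c->map[off] = 0;
	c->count--;
	pthread_mutex_unlock(&c->lock);
}
/* Batched freeing: contiguous entries share the cluster, lock it once. */
static void free_range(struct cluster *c, int off, int nr)
{
	pthread_mutex_lock(&c->lock);
	for (int i = 0; i < nr; i++) {
		assert(c->map[off + i] == HAS_CACHE);
		c->map[off + i] = 0;
	}
	c->count -= nr;
	pthread_mutex_unlock(&c->lock);
}
int main(void)
{
	struct cluster c = { .lock = PTHREAD_MUTEX_INITIALIZER };
	for (int i = 0; i < 16; i++) {
		c.map[i] = HAS_CACHE;
		c.count++;
	}
	for (int i = 0; i < 8; i++)
		free_one(&c, i);	/* 8 lock round trips */
	free_range(&c, 8, 8);		/* 1 lock round trip */
	printf("slots still in use: %d\n", c.count);
	return 0;
}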
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 59 ++++++++++++++++++++++++++---------------------------------
1 file changed, 26 insertions(+), 33 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 34e6ea13e8e4..9b63b2262cc2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
}
/*
- * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
* which means no page in the cluster is in use, we can optionally discard
* the cluster and add it to free cluster list.
*/
-static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
+static void dec_cluster_info_page(struct swap_info_struct *p,
+ struct swap_cluster_info *ci, int nr_pages)
{
if (!p->cluster_info)
return;
- VM_BUG_ON(ci->count == 0);
+ VM_BUG_ON(ci->count < nr_pages);
VM_BUG_ON(cluster_is_free(ci));
lockdep_assert_held(&p->lock);
lockdep_assert_held(&ci->lock);
- ci->count--;
+ ci->count -= nr_pages;
if (!ci->count) {
free_cluster(p, ci);
@@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
return n_ret;
}
-static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
-{
- unsigned long offset = idx * SWAPFILE_CLUSTER;
- struct swap_cluster_info *ci;
-
- ci = lock_cluster(si, offset);
- memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
- ci->count = 0;
- free_cluster(si, ci);
- unlock_cluster(ci);
- swap_range_free(si, offset, SWAPFILE_CLUSTER);
-}
-
int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
{
int order = swap_entry_order(entry_order);
@@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
return usage;
}
-static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
+/*
+ * Drop the last HAS_CACHE flag of swap entries, caller have to
+ * ensure all entries belong to the same cgroup.
+ */
+static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
+ unsigned int nr_pages)
{
- struct swap_cluster_info *ci;
unsigned long offset = swp_offset(entry);
- unsigned char count;
+ unsigned char *map = p->swap_map + offset;
+ unsigned char *map_end = map + nr_pages;
+ struct swap_cluster_info *ci;
ci = lock_cluster(p, offset);
- count = p->swap_map[offset];
- VM_BUG_ON(count != SWAP_HAS_CACHE);
- p->swap_map[offset] = 0;
- dec_cluster_info_page(p, ci);
+ do {
+ VM_BUG_ON(*map != SWAP_HAS_CACHE);
+ *map = 0;
+ } while (++map < map_end);
+ dec_cluster_info_page(p, ci, nr_pages);
unlock_cluster(ci);
- mem_cgroup_uncharge_swap(entry, 1);
- swap_range_free(p, offset, 1);
+ mem_cgroup_uncharge_swap(entry, nr_pages);
+ swap_range_free(p, offset, nr_pages);
}
static void cluster_swap_free_nr(struct swap_info_struct *sis,
@@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
unsigned long offset = swp_offset(entry);
- unsigned long idx = offset / SWAPFILE_CLUSTER;
struct swap_cluster_info *ci;
struct swap_info_struct *si;
unsigned char *map;
@@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
return;
ci = lock_cluster_or_swap_info(si, offset);
- if (size == SWAPFILE_CLUSTER) {
+ if (size > 1) {
map = si->swap_map + offset;
- for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+ for (i = 0; i < size; i++) {
val = map[i];
VM_BUG_ON(!(val & SWAP_HAS_CACHE));
if (val == SWAP_HAS_CACHE)
free_entries++;
}
- if (free_entries == SWAPFILE_CLUSTER) {
+ if (free_entries == size) {
unlock_cluster_or_swap_info(si, ci);
spin_lock(&si->lock);
- mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
- swap_free_cluster(si, idx);
+ swap_entry_range_free(si, entry, size);
spin_unlock(&si->lock);
return;
}
@@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
for (i = 0; i < n; ++i) {
p = swap_info_get_cont(entries[i], prev);
if (p)
- swap_entry_free(p, entries[i]);
+ swap_entry_range_free(p, entries[i], 1);
prev = p;
}
if (p)
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (4 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP chrisl
@ 2024-07-31 6:49 ` chrisl
2024-08-03 10:38 ` Barry Song
2024-07-31 6:49 ` [PATCH v5 7/9] mm: swap: add a fragment cluster list chrisl
` (5 subsequent siblings)
11 siblings, 1 reply; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Currently we free the reclaimed slots through the slot cache even
if the slot is required to be empty immediately. As a result the
reclaim caller will see the slot still occupied even after a
successful reclaim, and needs to keep reclaiming until the slot cache
gets flushed. This causes ineffective or excessive reclaim when swap
is under stress.
So introduce a new flag allowing the slot to be emptied, bypassing
the slot cache.
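A toy user-space model (the TTRS flag values mirror the patch; the map,
the slot cache and the helpers here are invented) of what the new flag
means for a caller that needs the slot to be reusable immediately. In
the diff below, scan_swap_map_slots() passes TTRS_ANYWAY | TTRS_DIRECT
so a successfully reclaimed slot can be retried right away.
#include <stdbool.h>
#include <stdio.h>
#define TTRS_ANYWAY 0x1
#define TTRS_DIRECT 0x8		/* bypass the slot cache, empty the slot now */
#define CACHE_SLOTS 4
static unsigned char swap_map[16];		/* 0 means the slot is empty */
static int slot_cache[CACHE_SLOTS], cached;	/* deferred frees */
static void cache_flush(void)
{
	while (cached)
		swap_map[slot_cache[--cached]] = 0;
}
/* Reclaim one cache-only slot; returns true if it can be reused right away. */
static bool reclaim(int off, int flags)
{
	if (!(flags & TTRS_ANYWAY))
		return false;
	if (flags & TTRS_DIRECT) {
		swap_map[off] = 0;		/* emptied immediately */
		return true;
	}
	slot_cache[cached++] = off;		/* emptied only when the cache flushes */
	if (cached == CACHE_SLOTS)
		cache_flush();
	return swap_map[off] == 0;
}
int main(void)
{
	swap_map[3] = swap_map[4] = 0x40;	/* pretend: swap-cache-only entries */
	printf("via slot cache: reusable now? %d\n", reclaim(3, TTRS_ANYWAY));
	printf("direct:         reusable now? %d\n", reclaim(4, TTRS_ANYWAY | TTRS_DIRECT));
	return 0;
}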
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 109 insertions(+), 43 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9b63b2262cc2..4c0fc0409d3c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,8 +53,15 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
+static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
+ unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries);
+static bool folio_swapcache_freeable(struct folio *folio);
+static struct swap_cluster_info *lock_cluster_or_swap_info(
+ struct swap_info_struct *si, unsigned long offset);
+static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
+ struct swap_cluster_info *ci);
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
* corresponding page
*/
#define TTRS_UNMAPPED 0x2
-/* Reclaim the swap entry if swap is getting full*/
+/* Reclaim the swap entry if swap is getting full */
#define TTRS_FULL 0x4
+/* Reclaim directly, bypass the slot cache and don't touch device lock */
+#define TTRS_DIRECT 0x8
+
+static bool swap_is_has_cache(struct swap_info_struct *si,
+ unsigned long offset, int nr_pages)
+{
+ unsigned char *map = si->swap_map + offset;
+ unsigned char *map_end = map + nr_pages;
+
+ do {
+ VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
+ if (*map != SWAP_HAS_CACHE)
+ return false;
+ } while (++map < map_end);
+
+ return true;
+}
/*
* returns number of pages in the folio that backs the swap entry. If positive,
@@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
swp_entry_t entry = swp_entry(si->type, offset);
+ struct address_space *address_space = swap_address_space(entry);
+ struct swap_cluster_info *ci;
struct folio *folio;
- int ret = 0;
+ int ret, nr_pages;
+ bool need_reclaim;
- folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
+ folio = filemap_get_folio(address_space, swap_cache_index(entry));
if (IS_ERR(folio))
return 0;
+
+ /* offset could point to the middle of a large folio */
+ entry = folio->swap;
+ offset = swp_offset(entry);
+ nr_pages = folio_nr_pages(folio);
+ ret = -nr_pages;
+
/*
* When this function is called from scan_swap_map_slots() and it's
* called by vmscan.c at reclaiming folios. So we hold a folio lock
@@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* case and you should use folio_free_swap() with explicit folio_lock()
* in usual operations.
*/
- if (folio_trylock(folio)) {
- if ((flags & TTRS_ANYWAY) ||
- ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
- ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
- ret = folio_free_swap(folio);
- folio_unlock(folio);
+ if (!folio_trylock(folio))
+ goto out;
+
+ need_reclaim = ((flags & TTRS_ANYWAY) ||
+ ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
+ ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
+ if (!need_reclaim || !folio_swapcache_freeable(folio))
+ goto out_unlock;
+
+ /*
+ * It's safe to delete the folio from swap cache only if the folio's
+ * swap_map is HAS_CACHE only, which means the slots have no page table
+ * reference or pending writeback, and can't be allocated to others.
+ */
+ ci = lock_cluster_or_swap_info(si, offset);
+ need_reclaim = swap_is_has_cache(si, offset, nr_pages);
+ unlock_cluster_or_swap_info(si, ci);
+ if (!need_reclaim)
+ goto out_unlock;
+
+ if (!(flags & TTRS_DIRECT)) {
+ /* Free through slot cache */
+ delete_from_swap_cache(folio);
+ folio_set_dirty(folio);
+ ret = nr_pages;
+ goto out_unlock;
}
- ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
+
+ xa_lock_irq(&address_space->i_pages);
+ __delete_from_swap_cache(folio, entry, NULL);
+ xa_unlock_irq(&address_space->i_pages);
+ folio_ref_sub(folio, nr_pages);
+ folio_set_dirty(folio);
+
+ spin_lock(&si->lock);
+ /* Only single page folios can be backed by zswap */
+ if (!nr_pages)
+ zswap_invalidate(entry);
+ swap_entry_range_free(si, entry, nr_pages);
+ spin_unlock(&si->lock);
+ ret = nr_pages;
+out_unlock:
+ folio_unlock(folio);
+out:
folio_put(folio);
return ret;
}
@@ -903,7 +973,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
int swap_was_freed;
spin_unlock(&si->lock);
- swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+ swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
spin_lock(&si->lock);
/* entry was freed successfully, try to use this again */
if (swap_was_freed > 0)
@@ -1340,9 +1410,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
unsigned long offset = swp_offset(entry);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
- unsigned char *map;
- unsigned int i, free_entries = 0;
- unsigned char val;
int size = 1 << swap_entry_order(folio_order(folio));
si = _swap_info_get(entry);
@@ -1350,23 +1417,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
return;
ci = lock_cluster_or_swap_info(si, offset);
- if (size > 1) {
- map = si->swap_map + offset;
- for (i = 0; i < size; i++) {
- val = map[i];
- VM_BUG_ON(!(val & SWAP_HAS_CACHE));
- if (val == SWAP_HAS_CACHE)
- free_entries++;
- }
- if (free_entries == size) {
- unlock_cluster_or_swap_info(si, ci);
- spin_lock(&si->lock);
- swap_entry_range_free(si, entry, size);
- spin_unlock(&si->lock);
- return;
- }
+ if (size > 1 && swap_is_has_cache(si, offset, size)) {
+ unlock_cluster_or_swap_info(si, ci);
+ spin_lock(&si->lock);
+ swap_entry_range_free(si, entry, size);
+ spin_unlock(&si->lock);
+ return;
}
- for (i = 0; i < size; i++, entry.val++) {
+ for (int i = 0; i < size; i++, entry.val++) {
if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
unlock_cluster_or_swap_info(si, ci);
free_swap_slot(entry);
@@ -1526,16 +1584,7 @@ static bool folio_swapped(struct folio *folio)
return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
}
-/**
- * folio_free_swap() - Free the swap space used for this folio.
- * @folio: The folio to remove.
- *
- * If swap is getting full, or if there are no more mappings of this folio,
- * then call folio_free_swap to free its swap space.
- *
- * Return: true if we were able to release the swap space.
- */
-bool folio_free_swap(struct folio *folio)
+static bool folio_swapcache_freeable(struct folio *folio)
{
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -1543,8 +1592,6 @@ bool folio_free_swap(struct folio *folio)
return false;
if (folio_test_writeback(folio))
return false;
- if (folio_swapped(folio))
- return false;
/*
* Once hibernation has begun to create its image of memory,
@@ -1564,6 +1611,25 @@ bool folio_free_swap(struct folio *folio)
if (pm_suspended_storage())
return false;
+ return true;
+}
+
+/**
+ * folio_free_swap() - Free the swap space used for this folio.
+ * @folio: The folio to remove.
+ *
+ * If swap is getting full, or if there are no more mappings of this folio,
+ * then call folio_free_swap to free its swap space.
+ *
+ * Return: true if we were able to release the swap space.
+ */
+bool folio_free_swap(struct folio *folio)
+{
+ if (!folio_swapcache_freeable(folio))
+ return false;
+ if (folio_swapped(folio))
+ return false;
+
delete_from_swap_cache(folio);
folio_set_dirty(folio);
return true;
@@ -1640,7 +1706,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
* to the next boundary.
*/
nr = __try_to_reclaim_swap(si, offset,
- TTRS_UNMAPPED | TTRS_FULL);
+ TTRS_UNMAPPED | TTRS_FULL);
if (nr == 0)
nr = 1;
else if (nr < 0)
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 7/9] mm: swap: add a fragment cluster list
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (5 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache chrisl
@ 2024-07-31 6:49 ` chrisl
2024-07-31 6:49 ` [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned chrisl
` (4 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Now the swap cluster allocator arranges the clusters in LRU style, so
the "cold" clusters staying at the head of the nonfull lists are the
ones that were used for allocation a long time ago and are still only
partially occupied. If the allocator can't find enough contiguous slots
in them to satisfy a high order allocation, it's unlikely that slots
will be freed there to satisfy the allocation, at least not in a short
period.
As a result, nonfull cluster scanning would waste time repeatedly
scanning the unusable head of the list.
Also, multiple CPUs could contend on the same head cluster of the
nonfull list. Unlike free clusters, which are removed from the list
when a CPU starts using them, nonfull clusters stay on the head.
So introduce a new list, the frag list: all scanned nonfull clusters
are moved to this list, both to avoid repeated scanning and to avoid
contention.
The frag list is still used as a fallback for allocations, so if one
CPU fails to allocate one order of slots, it can still steal other
CPUs' clusters. And order 0 will favor the fragmented clusters, to
better protect the nonfull clusters.
If any slot on a fragmented cluster is freed, move the cluster back to
the nonfull list, indicating it is worth another scan. Compared to
scanning upon freeing a slot, this keeps the scanning lazy and saves
some CPU if there are still other clusters to use.
It may seem unnecessary to keep the fragmented clusters on a list at
all if they can't be used for a specific order of allocation, but this
will start to make sense once reclaim during scanning is ready.
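A compact state-machine sketch of the list transitions described above
(the enum, names and helpers are invented for illustration; the real
code moves struct swap_cluster_info entries between list_heads under
si->lock and the cluster lock):
#include <stdio.h>
/* Which list a cluster currently sits on. */
enum cluster_list { LIST_FREE, LIST_NONFULL, LIST_FRAG, LIST_NONE };
struct cluster {
	enum cluster_list list;
	int free_slots;
};
/* A high order scan failed on this cluster: park it on the frag list so
 * later scans (and other CPUs) stop hammering the nonfull list head. */
static void scan_failed(struct cluster *c)
{
	if (c->list == LIST_NONFULL)
		c->list = LIST_FRAG;
}
/* A slot was freed in this cluster: it is worth another scan, so move it
 * back to the nonfull list lazily, at free time rather than scan time. */
static void slot_freed(struct cluster *c)
{
	c->free_slots++;
	if (c->list != LIST_NONFULL && c->list != LIST_FREE)
		c->list = LIST_NONFULL;
}
static const char *name(enum cluster_list l)
{
	static const char * const n[] = { "free", "nonfull", "frag", "none" };
	return n[l];
}
int main(void)
{
	struct cluster c = { .list = LIST_NONFULL, .free_slots = 2 };
	scan_failed(&c);	/* could not fit a high order allocation here */
	printf("after a failed scan: %s\n", name(c.list));
	slot_freed(&c);		/* somebody freed a slot in it */
	printf("after a free:        %s\n", name(c.list));
	return 0;
}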
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 3 +++
mm/swapfile.c | 41 +++++++++++++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 4 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6716ef236766..5a14b6c65949 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,6 +259,7 @@ struct swap_cluster_info {
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
+#define CLUSTER_FLAG_FRAG 4 /* This cluster is on frag list */
/*
* The first page in the swap file is the swap header, which is always marked
@@ -299,6 +300,8 @@ struct swap_info_struct {
struct list_head free_clusters; /* free clusters list */
struct list_head nonfull_clusters[SWAP_NR_ORDERS];
/* list of cluster that contains at least one free slot */
+ struct list_head frag_clusters[SWAP_NR_ORDERS];
+ /* list of cluster that are fragmented or contended */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4c0fc0409d3c..eb3e387e86b2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -572,7 +572,10 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ if (ci->flags & CLUSTER_FLAG_FRAG)
+ list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ else
+ list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
ci->flags = CLUSTER_FLAG_NONFULL;
}
}
@@ -610,7 +613,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_
ci->count += nr_pages;
if (ci->count == SWAPFILE_CLUSTER) {
- VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL)));
+ VM_BUG_ON(!(ci->flags &
+ (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
list_del(&ci->list);
ci->flags = 0;
}
@@ -666,6 +670,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
struct percpu_cluster *cluster;
struct swap_cluster_info *ci, *n;
unsigned int offset, found = 0;
+ LIST_HEAD(fraged);
new_cluster:
lockdep_assert_held(&si->lock);
@@ -686,13 +691,29 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (order < PMD_ORDER) {
list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
+ list_move_tail(&ci->list, &fraged);
+ ci->flags = CLUSTER_FLAG_FRAG;
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, order, usage);
if (found)
- goto done;
+ break;
}
+
+ if (!found) {
+ list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) {
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, order, usage);
+ if (found)
+ break;
+ }
+ }
+
+ list_splice_tail(&fraged, &si->frag_clusters[order]);
}
+ if (found)
+ goto done;
+
if (!list_empty(&si->discard_clusters)) {
/*
* we don't have free cluster but have some clusters in
@@ -706,7 +727,17 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (order)
goto done;
+ /* Order 0 stealing from higher order */
for (int o = 1; o < PMD_ORDER; o++) {
+ if (!list_empty(&si->frag_clusters[o])) {
+ ci = list_first_entry(&si->frag_clusters[o],
+ struct swap_cluster_info, list);
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
+ 0, usage);
+ VM_BUG_ON(!found);
+ goto done;
+ }
+
if (!list_empty(&si->nonfull_clusters[o])) {
ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info,
list);
@@ -3019,8 +3050,10 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
INIT_LIST_HEAD(&p->free_clusters);
INIT_LIST_HEAD(&p->discard_clusters);
- for (i = 0; i < SWAP_NR_ORDERS; i++)
+ for (i = 0; i < SWAP_NR_ORDERS; i++) {
INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+ INIT_LIST_HEAD(&p->frag_clusters[i]);
+ }
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (6 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 7/9] mm: swap: add a fragment cluster list chrisl
@ 2024-07-31 6:49 ` chrisl
2024-07-31 6:49 ` [PATCH v5 9/9] mm: swap: add an adaptive full cluster cache reclaim chrisl
` (3 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
This commit implements reclaim during scan for the cluster allocator.
Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which could
result in a low allocation success rate or an early OOM.
So to ensure the maximum allocation success rate, integrate reclaiming
with scanning: if a suitable range of swap slots is found but it is
fragmented due to HAS_CACHE, just try to reclaim those slots.
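A simplified user-space rendering of the scan-plus-reclaim decision
(the map values mirror the patch; swap_nearly_full() and
try_reclaim_cached() are invented stand-ins for vm_swap_full() and the
real reclaim, and the real cluster_reclaim_range() in the diff below
additionally rechecks the range after it drops and retakes the locks):
#include <stdbool.h>
#include <stdio.h>
#define SWAP_HAS_CACHE 0x40
static bool swap_nearly_full(void)      { return true; }
static bool try_reclaim_cached(int off) { return off != 7; }	/* pretend slot 7 is pinned */
/*
 * Scan [start, start + nr): empty slots are fine, cache-only slots are
 * worth reclaiming when swap is under pressure, anything with a real
 * swap count makes the range unusable for this high order allocation.
 */
static bool range_usable(const unsigned char *map, int start, int nr)
{
	bool need_reclaim = false;
	for (int off = start; off < start + nr; off++) {
		if (map[off] == 0)
			continue;
		if (map[off] == SWAP_HAS_CACHE) {
			if (!swap_nearly_full())
				return false;
			need_reclaim = true;
			continue;
		}
		return false;
	}
	if (need_reclaim)
		for (int off = start; off < start + nr; off++)
			if (map[off] == SWAP_HAS_CACHE && !try_reclaim_cached(off))
				return false;
	return true;
}
int main(void)
{
	unsigned char map[16] = { 0 };
	map[2] = SWAP_HAS_CACHE;	/* fragmented only by a swap cache entry */
	printf("order-2 range at offset 0 usable: %d\n", range_usable(map, 0, 4));
	map[5] = 0x01;			/* a real swap count: not reclaimable */
	printf("order-2 range at offset 4 usable: %d\n", range_usable(map, 4, 4));
	return 0;
}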
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 140 +++++++++++++++++++++++++++++++++++++++------------
2 files changed, 110 insertions(+), 31 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5a14b6c65949..9eb740563d63 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -302,6 +302,7 @@ struct swap_info_struct {
/* list of cluster that contains at least one free slot */
struct list_head frag_clusters[SWAP_NR_ORDERS];
/* list of cluster that are fragmented or contended */
+ unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index eb3e387e86b2..50e7f600a9a1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -513,6 +513,10 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
VM_BUG_ON(ci->count != 0);
lockdep_assert_held(&si->lock);
lockdep_assert_held(&ci->lock);
+
+ if (ci->flags & CLUSTER_FLAG_FRAG)
+ si->frag_cluster_nr[ci->order]--;
+
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -572,31 +576,84 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- if (ci->flags & CLUSTER_FLAG_FRAG)
+ if (ci->flags & CLUSTER_FLAG_FRAG) {
+ p->frag_cluster_nr[ci->order]--;
list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- else
+ } else {
list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ }
ci->flags = CLUSTER_FLAG_NONFULL;
}
}
-static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start,
- unsigned int nr_pages)
+static bool cluster_reclaim_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long start, unsigned long end)
{
- unsigned char *p = si->swap_map + start;
- unsigned char *end = p + nr_pages;
+ unsigned char *map = si->swap_map;
+ unsigned long offset;
+
+ spin_unlock(&ci->lock);
+ spin_unlock(&si->lock);
+
+ for (offset = start; offset < end; offset++) {
+ switch (READ_ONCE(map[offset])) {
+ case 0:
+ continue;
+ case SWAP_HAS_CACHE:
+ if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
+ continue;
+ goto out;
+ default:
+ goto out;
+ }
+ }
+out:
+ spin_lock(&si->lock);
+ spin_lock(&ci->lock);
- while (p < end)
- if (*p++)
+ /*
+ * Recheck the range no matter reclaim succeeded or not, the slot
+ * could have been freed while we are not holding the lock.
+ */
+ for (offset = start; offset < end; offset++)
+ if (READ_ONCE(map[offset]))
return false;
return true;
}
+static bool cluster_scan_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long start, unsigned int nr_pages)
+{
+ unsigned long offset, end = start + nr_pages;
+ unsigned char *map = si->swap_map;
+ bool need_reclaim = false;
-static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
- unsigned int start, unsigned char usage,
- unsigned int order)
+ for (offset = start; offset < end; offset++) {
+ switch (READ_ONCE(map[offset])) {
+ case 0:
+ continue;
+ case SWAP_HAS_CACHE:
+ if (!vm_swap_full())
+ return false;
+ need_reclaim = true;
+ continue;
+ default:
+ return false;
+ }
+ }
+
+ if (need_reclaim)
+ return cluster_reclaim_range(si, ci, start, end);
+
+ return true;
+}
+
+static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
+ unsigned int start, unsigned char usage,
+ unsigned int order)
{
unsigned int nr_pages = 1 << order;
@@ -615,6 +672,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_
if (ci->count == SWAPFILE_CLUSTER) {
VM_BUG_ON(!(ci->flags &
(CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
+ if (ci->flags & CLUSTER_FLAG_FRAG)
+ si->frag_cluster_nr[ci->order]--;
list_del(&ci->list);
ci->flags = 0;
}
@@ -640,7 +699,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
}
while (offset <= end) {
- if (cluster_scan_range(si, offset, nr_pages)) {
+ if (cluster_scan_range(si, ci, offset, nr_pages)) {
cluster_alloc_range(si, ci, offset, usage, order);
*foundp = offset;
if (ci->count == SWAPFILE_CLUSTER) {
@@ -668,9 +727,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
unsigned char usage)
{
struct percpu_cluster *cluster;
- struct swap_cluster_info *ci, *n;
+ struct swap_cluster_info *ci;
unsigned int offset, found = 0;
- LIST_HEAD(fraged);
new_cluster:
lockdep_assert_held(&si->lock);
@@ -690,25 +748,42 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
}
if (order < PMD_ORDER) {
- list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
- list_move_tail(&ci->list, &fraged);
+ unsigned int frags = 0;
+
+ while (!list_empty(&si->nonfull_clusters[order])) {
+ ci = list_first_entry(&si->nonfull_clusters[order],
+ struct swap_cluster_info, list);
+ list_move_tail(&ci->list, &si->frag_clusters[order]);
ci->flags = CLUSTER_FLAG_FRAG;
+ si->frag_cluster_nr[order]++;
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, order, usage);
+ frags++;
if (found)
break;
}
if (!found) {
- list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) {
+ /*
+ * Nonfull clusters are moved to frag tail if we reached
+ * here, count them too, don't over scan the frag list.
+ */
+ while (frags < si->frag_cluster_nr[order]) {
+ ci = list_first_entry(&si->frag_clusters[order],
+ struct swap_cluster_info, list);
+ /*
+ * Rotate the frag list to iterate, they were all failing
+ * high order allocation or moved here due to per-CPU usage,
+ * this help keeping usable cluster ahead.
+ */
+ list_move_tail(&ci->list, &si->frag_clusters[order]);
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, order, usage);
+ frags++;
if (found)
break;
}
}
-
- list_splice_tail(&fraged, &si->frag_clusters[order]);
}
if (found)
@@ -729,25 +804,28 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
/* Order 0 stealing from higher order */
for (int o = 1; o < PMD_ORDER; o++) {
- if (!list_empty(&si->frag_clusters[o])) {
+ /*
+ * Clusters here have at least one usable slots and can't fail order 0
+ * allocation, but reclaim may drop si->lock and race with another user.
+ */
+ while (!list_empty(&si->frag_clusters[o])) {
ci = list_first_entry(&si->frag_clusters[o],
struct swap_cluster_info, list);
- offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
- 0, usage);
- VM_BUG_ON(!found);
- goto done;
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, 0, usage);
+ if (found)
+ goto done;
}
- if (!list_empty(&si->nonfull_clusters[o])) {
- ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info,
- list);
+ while (!list_empty(&si->nonfull_clusters[o])) {
+ ci = list_first_entry(&si->nonfull_clusters[o],
+ struct swap_cluster_info, list);
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, 0, usage);
- VM_BUG_ON(!found);
- goto done;
+ if (found)
+ goto done;
}
}
-
done:
cluster->next[order] = offset;
return found;
@@ -3053,6 +3131,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
for (i = 0; i < SWAP_NR_ORDERS; i++) {
INIT_LIST_HEAD(&p->nonfull_clusters[i]);
INIT_LIST_HEAD(&p->frag_clusters[i]);
+ p->frag_cluster_nr[i] = 0;
}
for (i = 0; i < swap_header->info.nr_badpages; i++) {
@@ -3096,7 +3175,6 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
if (!cluster_info)
return nr_extents;
-
/*
* Reduce false cache line sharing between cluster_info and
* sharing same address space.
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 9/9] mm: swap: add an adaptive full cluster cache reclaim
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (7 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned chrisl
@ 2024-07-31 6:49 ` chrisl
2024-08-01 9:14 ` [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order David Hildenbrand
` (2 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Link all full clusters on one full list, and reclaim from it when the
allocator has run out of all other usable clusters.
There are many reasons a folio can end up in the swap cache while
having no swap count reference, so the best way to search for such
slots is still to iterate over the swap clusters.
With the list kept in LRU order, iterating from the oldest cluster and
keeping them rotating is a doable and clean way to free up clusters
that are potentially not in use.
On any allocation failure, try to reclaim and rotate only one cluster.
This is adaptive: high order allocations can tolerate the fallback, so
it avoids latency and gives the full cluster list a fair chance to get
reclaimed. It relieves the pressure for the fallback order 0
allocation and for follow-up high order allocations.
If the swap device is getting very full, reclaim more aggressively to
ensure no OOM will happen. This ensures an order 0 heavy workload
won't go OOM, since order 0 won't fail as long as any cluster still
has free space.
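A small sketch of the adaptive part (struct and field names invented;
the real code reads nr_swap_pages and si->inuse_pages): scan just one
full cluster per failed allocation so callers are not stalled, but be
ready to walk all of them when the device is nearly out of space. This
matches the to_scan logic in swap_reclaim_full_clusters() below.
#include <stdio.h>
#define CLUSTER_SIZE 512
struct swap_dev {
	long nr_free;		/* free slots left on the device */
	long inuse_pages;	/* slots currently in use */
};
static long full_clusters_to_scan(const struct swap_dev *si)
{
	if (si->nr_free <= CLUSTER_SIZE)
		return si->inuse_pages / CLUSTER_SIZE;	/* nearly full: scan everything */
	return 1;					/* otherwise: rotate the LRU a little */
}
int main(void)
{
	struct swap_dev relaxed = { .nr_free = 40960, .inuse_pages = 204800 };
	struct swap_dev nearly_full = { .nr_free = 256, .inuse_pages = 245504 };
	printf("plenty of space: scan %ld full cluster(s)\n", full_clusters_to_scan(&relaxed));
	printf("nearly full:     scan %ld full cluster(s)\n", full_clusters_to_scan(&nearly_full));
	return 0;
}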
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 68 +++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9eb740563d63..145e796dab84 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -298,6 +298,7 @@ struct swap_info_struct {
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
+ struct list_head full_clusters; /* full clusters list */
struct list_head nonfull_clusters[SWAP_NR_ORDERS];
/* list of cluster that contains at least one free slot */
struct list_head frag_clusters[SWAP_NR_ORDERS];
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 50e7f600a9a1..9872e0dbfc72 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -440,10 +440,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- if (ci->flags & CLUSTER_FLAG_NONFULL)
- list_move_tail(&ci->list, &si->discard_clusters);
- else
- list_add_tail(&ci->list, &si->discard_clusters);
+ list_move_tail(&ci->list, &si->discard_clusters);
ci->flags = 0;
schedule_work(&si->discard_work);
}
@@ -453,10 +450,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
lockdep_assert_held(&si->lock);
lockdep_assert_held(&ci->lock);
- if (ci->flags & CLUSTER_FLAG_NONFULL)
- list_move_tail(&ci->list, &si->free_clusters);
- else
- list_add_tail(&ci->list, &si->free_clusters);
+ list_move_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
ci->order = 0;
}
@@ -576,12 +570,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- if (ci->flags & CLUSTER_FLAG_FRAG) {
+ if (ci->flags & CLUSTER_FLAG_FRAG)
p->frag_cluster_nr[ci->order]--;
- list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- } else {
- list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- }
+ list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
ci->flags = CLUSTER_FLAG_NONFULL;
}
}
@@ -674,7 +665,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
(CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
if (ci->flags & CLUSTER_FLAG_FRAG)
si->frag_cluster_nr[ci->order]--;
- list_del(&ci->list);
+ list_move_tail(&ci->list, &si->full_clusters);
ci->flags = 0;
}
}
@@ -718,6 +709,46 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
return offset;
}
+static void swap_reclaim_full_clusters(struct swap_info_struct *si)
+{
+ long to_scan = 1;
+ unsigned long offset, end;
+ struct swap_cluster_info *ci;
+ unsigned char *map = si->swap_map;
+ int nr_reclaim, total_reclaimed = 0;
+
+ if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER)
+ to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
+
+ while (!list_empty(&si->full_clusters)) {
+ ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
+ list_move_tail(&ci->list, &si->full_clusters);
+ offset = cluster_offset(si, ci);
+ end = min(si->max, offset + SWAPFILE_CLUSTER);
+ to_scan--;
+
+ while (offset < end) {
+ if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+ spin_unlock(&si->lock);
+ nr_reclaim = __try_to_reclaim_swap(si, offset,
+ TTRS_ANYWAY | TTRS_DIRECT);
+ spin_lock(&si->lock);
+ if (nr_reclaim > 0) {
+ offset += nr_reclaim;
+ total_reclaimed += nr_reclaim;
+ continue;
+ } else if (nr_reclaim < 0) {
+ offset += -nr_reclaim;
+ continue;
+ }
+ }
+ offset++;
+ }
+ if (to_scan <= 0 || total_reclaimed)
+ break;
+ }
+}
+
/*
* Try to get swap entries with specified order from current cpu's swap entry
* pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -826,7 +857,15 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
goto done;
}
}
+
done:
+ /* Try reclaim from full clusters if device is nearfull */
+ if (vm_swap_full() && (!found || (si->pages - si->inuse_pages) < SWAPFILE_CLUSTER)) {
+ swap_reclaim_full_clusters(si);
+ if (!found && !order && si->pages != si->inuse_pages)
+ goto new_cluster;
+ }
+
cluster->next[order] = offset;
return found;
}
@@ -3126,6 +3165,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
nr_good_pages = maxpages - 1; /* omit header page */
INIT_LIST_HEAD(&p->free_clusters);
+ INIT_LIST_HEAD(&p->full_clusters);
INIT_LIST_HEAD(&p->discard_clusters);
for (i = 0; i < SWAP_NR_ORDERS; i++) {
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (8 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 9/9] mm: swap: add an adaptive full cluster cache reclaim chrisl
@ 2024-08-01 9:14 ` David Hildenbrand
2024-08-01 9:59 ` Kairui Song
[not found] ` <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>
2024-09-02 1:20 ` Andrew Morton
11 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-01 9:14 UTC (permalink / raw)
To: Chris Li, Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On 31.07.24 08:49, Chris Li wrote:
> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
>
Running the cow.c selftest on mm/mm-unstable, I get:
# [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
[ 51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
[ 51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
[ 51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[ 51.871298] Workqueue: events swap_discard_work
[ 51.872211] RIP: 0010:__free_cluster+0x27/0x90
[ 51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
[ 51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
[ 51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
[ 51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
[ 51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
[ 51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
[ 51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
[ 51.884827] FS: 0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
[ 51.886412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
[ 51.888931] PKRU: 55555554
[ 51.889471] Call Trace:
[ 51.889964] <TASK>
[ 51.890391] ? __die_body.cold+0x19/0x27
[ 51.891174] ? die_addr+0x3c/0x60
[ 51.891824] ? exc_general_protection+0x14f/0x430
[ 51.892754] ? asm_exc_general_protection+0x26/0x30
[ 51.893717] ? __free_cluster+0x27/0x90
[ 51.894483] ? __free_cluster+0x7e/0x90
[ 51.895245] swap_do_scheduled_discard+0x142/0x1b0
[ 51.896189] swap_discard_work+0x26/0x30
[ 51.896958] process_one_work+0x211/0x5a0
[ 51.897750] ? srso_alias_return_thunk+0x5/0xfbef5
[ 51.898693] worker_thread+0x1c9/0x3c0
[ 51.899438] ? __pfx_worker_thread+0x10/0x10
[ 51.900287] kthread+0xe3/0x110
[ 51.900913] ? __pfx_kthread+0x10/0x10
[ 51.901656] ret_from_fork+0x34/0x50
[ 51.902377] ? __pfx_kthread+0x10/0x10
[ 51.903114] ret_from_fork_asm+0x1a/0x30
[ 51.903896] </TASK>
Maybe related to this series?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-01 9:14 ` [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order David Hildenbrand
@ 2024-08-01 9:59 ` Kairui Song
2024-08-01 10:06 ` Kairui Song
[not found] ` <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>
0 siblings, 2 replies; 32+ messages in thread
From: Kairui Song @ 2024-08-01 9:59 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Chris Li
Cc: Hugh Dickins, Ryan Roberts, Huang, Ying, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]
On Thu, Aug 1, 2024 at 5:15 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.07.24 08:49, Chris Li wrote:
> > This is the short term solutions "swap cluster order" listed
> > in my "Swap Abstraction" discussion slice 8 in the recent
> > LSF/MM conference.
> >
>
> Running the cow.c selftest on mm/mm-unstable, I get:
Hi David, thanks very much for the test and report!
>
> # [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
> [ 51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
> [ 51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
> [ 51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> [ 51.871298] Workqueue: events swap_discard_work
> [ 51.872211] RIP: 0010:__free_cluster+0x27/0x90
> [ 51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
> [ 51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
> [ 51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
> [ 51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
> [ 51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
> [ 51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
> [ 51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
> [ 51.884827] FS: 0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
> [ 51.886412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
> [ 51.888931] PKRU: 55555554
> [ 51.889471] Call Trace:
> [ 51.889964] <TASK>
> [ 51.890391] ? __die_body.cold+0x19/0x27
> [ 51.891174] ? die_addr+0x3c/0x60
> [ 51.891824] ? exc_general_protection+0x14f/0x430
> [ 51.892754] ? asm_exc_general_protection+0x26/0x30
> [ 51.893717] ? __free_cluster+0x27/0x90
> [ 51.894483] ? __free_cluster+0x7e/0x90
> [ 51.895245] swap_do_scheduled_discard+0x142/0x1b0
> [ 51.896189] swap_discard_work+0x26/0x30
> [ 51.896958] process_one_work+0x211/0x5a0
> [ 51.897750] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 51.898693] worker_thread+0x1c9/0x3c0
> [ 51.899438] ? __pfx_worker_thread+0x10/0x10
> [ 51.900287] kthread+0xe3/0x110
> [ 51.900913] ? __pfx_kthread+0x10/0x10
> [ 51.901656] ret_from_fork+0x34/0x50
> [ 51.902377] ? __pfx_kthread+0x10/0x10
> [ 51.903114] ret_from_fork_asm+0x1a/0x30
> [ 51.903896] </TASK>
>
>
> Maybe related to this series?
Right, I can reproduce your problem and I believe this patch can fix
it, see the attachment.
Hi Andrew, can you pick this patch too?
[-- Attachment #2: 0001-SQUASH-Fix-discard-of-full-cluster.patch --]
[-- Type: application/octet-stream, Size: 1968 bytes --]
From 31365f8d8fa5d7a30920166b8c3a60130c24412e Mon Sep 17 00:00:00 2001
From: Kairui Song <kasong@tencent.com>
Date: Thu, 1 Aug 2024 17:29:48 +0800
Subject: [PATCH] SQUASH: Fix discard of full cluster
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 8 +++++---
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 145e796dab84..3543ffaf982e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,7 @@ struct swap_cluster_info {
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
#define CLUSTER_FLAG_FRAG 4 /* This cluster is on frag list */
+#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */
/*
* The first page in the swap file is the swap header, which is always marked
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9872e0dbfc72..5ab06a45565a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
lockdep_assert_held(&si->lock);
lockdep_assert_held(&ci->lock);
- list_move_tail(&ci->list, &si->free_clusters);
+ if (ci->flags)
+ list_move_tail(&ci->list, &si->free_clusters);
+ else
+ list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
ci->order = 0;
}
@@ -474,7 +477,6 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
SWAPFILE_CLUSTER);
spin_lock(&si->lock);
-
spin_lock(&ci->lock);
__free_cluster(si, ci);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
@@ -666,7 +668,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
if (ci->flags & CLUSTER_FLAG_FRAG)
si->frag_cluster_nr[ci->order]--;
list_move_tail(&ci->list, &si->full_clusters);
- ci->flags = 0;
+ ci->flags = CLUSTER_FLAG_FULL;
}
}
--
2.45.2
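A likely explanation for the oops above (inferred from the trace, not
stated explicitly in the thread): the discard worker takes the cluster
off the discard list with list_del() before calling __free_cluster(),
so by then its list pointers already hold the LIST_POISON values; the
now unconditional list_move_tail() starts by unlinking the entry and
writes through LIST_POISON1, which is exactly the non-canonical
0xdead000000000108 (LIST_POISON1 + 8) in the report. Keeping
list_add_tail() for the flags == 0 case, as the squashed patch does,
avoids that write. A self-contained user-space model of the failure
mode, with simplified stand-ins for the kernel list helpers:
#include <stdio.h>
struct list_head { struct list_head *next, *prev; };
#define LIST_POISON1 ((struct list_head *)0xdead000000000100)
#define LIST_POISON2 ((struct list_head *)0xdead000000000122)
/* Like the kernel's list_del(): unlink, then poison the pointers. */
static void toy_list_del(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->next = LIST_POISON1;
	e->prev = LIST_POISON2;
}
/* Like list_move_tail(): unlink from the current list, then add to tail. */
static void toy_list_move_tail(struct list_head *e, struct list_head *head)
{
	if (e->next == LIST_POISON1) {
		/* The kernel faults here: it writes e->next->prev, i.e.
		 * LIST_POISON1 + 8 = 0xdead000000000108, a non-canonical address. */
		printf("would fault writing to %#lx\n",
		       (unsigned long)LIST_POISON1 + sizeof(void *));
		return;
	}
	e->prev->next = e->next;
	e->next->prev = e->prev;
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}
int main(void)
{
	struct list_head discard = { &discard, &discard };
	struct list_head free_list = { &free_list, &free_list };
	struct list_head cluster;
	/* schedule_discard puts the cluster on the discard list ... */
	cluster.next = &discard;
	cluster.prev = &discard;
	discard.next = &cluster;
	discard.prev = &cluster;
	/* ... the discard worker pops it off with list_del() ... */
	toy_list_del(&cluster);
	/* ... then __free_cluster() tries to list_move_tail() it. */
	toy_list_move_tail(&cluster, &free_list);
	return 0;
}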
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-01 9:59 ` Kairui Song
@ 2024-08-01 10:06 ` Kairui Song
[not found] ` <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>
1 sibling, 0 replies; 32+ messages in thread
From: Kairui Song @ 2024-08-01 10:06 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Chris Li
Cc: Hugh Dickins, Ryan Roberts, Huang, Ying, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On Thu, Aug 1, 2024 at 5:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 5:15 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 31.07.24 08:49, Chris Li wrote:
> > > This is the short term solutions "swap cluster order" listed
> > > in my "Swap Abstraction" discussion slice 8 in the recent
> > > LSF/MM conference.
> > >
> >
> > Running the cow.c selftest on mm/mm-unstable, I get:
>
> Hi David, thanks very much for the test and report!
>
> >
> > # [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
> > [ 51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
> > [ 51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
> > [ 51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> > [ 51.871298] Workqueue: events swap_discard_work
> > [ 51.872211] RIP: 0010:__free_cluster+0x27/0x90
> > [ 51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
> > [ 51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
> > [ 51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
> > [ 51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
> > [ 51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
> > [ 51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
> > [ 51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
> > [ 51.884827] FS: 0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
> > [ 51.886412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
> > [ 51.888931] PKRU: 55555554
> > [ 51.889471] Call Trace:
> > [ 51.889964] <TASK>
> > [ 51.890391] ? __die_body.cold+0x19/0x27
> > [ 51.891174] ? die_addr+0x3c/0x60
> > [ 51.891824] ? exc_general_protection+0x14f/0x430
> > [ 51.892754] ? asm_exc_general_protection+0x26/0x30
> > [ 51.893717] ? __free_cluster+0x27/0x90
> > [ 51.894483] ? __free_cluster+0x7e/0x90
> > [ 51.895245] swap_do_scheduled_discard+0x142/0x1b0
> > [ 51.896189] swap_discard_work+0x26/0x30
> > [ 51.896958] process_one_work+0x211/0x5a0
> > [ 51.897750] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 51.898693] worker_thread+0x1c9/0x3c0
> > [ 51.899438] ? __pfx_worker_thread+0x10/0x10
> > [ 51.900287] kthread+0xe3/0x110
> > [ 51.900913] ? __pfx_kthread+0x10/0x10
> > [ 51.901656] ret_from_fork+0x34/0x50
> > [ 51.902377] ? __pfx_kthread+0x10/0x10
> > [ 51.903114] ret_from_fork_asm+0x1a/0x30
> > [ 51.903896] </TASK>
> >
> >
> > Maybe related to this series?
>
> Right, I can reproduce your problem and I believe this patch can fix
> it, see the attachment.
>
> Hi Andrew, can you pick this patch too?
This issue is caused by my PATCH 9/9, and the attached patch can be
squashed into it. Very sorry for the problem.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
2024-07-31 6:49 ` [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP chrisl
@ 2024-08-03 9:11 ` Barry Song
2024-08-03 10:57 ` Barry Song
0 siblings, 1 reply; 32+ messages in thread
From: Barry Song @ 2024-08-03 9:11 UTC (permalink / raw)
To: chrisl
Cc: akpm, baohua, hughd, kaleshsingh, kasong, linux-kernel, linux-mm,
ryan.roberts, ying.huang, Barry Song
On Wed, Jul 31, 2024 at 6:49 PM <chrisl@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently when we are freeing mTHP folios from the swap cache, we free
> them one by one and put each entry into the swap slot cache. The slot
> cache is designed to reduce the overhead by batching the freeing, but
> mTHP swap entries are already contiguous so they can be batch freed
> without it; using it saves little overhead, or even adds overhead for
> larger mTHP.
>
> What's more, mTHP entries could stay in the swap cache for a while.
> Contiguous swap entries are a rather rare resource, so releasing them
> directly can help improve the mTHP allocation success rate under
> pressure.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Barry Song <baohua@kernel.org>
I believe this is the right direction to take. Currently, entries are released
one by one, even when they are contiguous in the swap file (those nr_pages
entries are definitely in the same cluster and same si), leading to numerous
lock and unlock operations.
This approach provides batched support.
free_swap_and_cache_nr() has the same issue, so I drafted a patch based on
your code. I wonder if you can also help test and review before I send it
officially:
From 4bed5c08bc0f7769ee2849812acdad70c4e32ead Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Sat, 3 Aug 2024 20:21:14 +1200
Subject: [PATCH RFC] mm: attempt to batch free swap entries for zap_pte_range()
Zhiguo reported that swap release could be a serious bottleneck
during process exits[1]. With mTHP, we have the opportunity to
batch free swaps.
Thanks to the work of Chris and Kairui[2], I was able to achieve
this optimization with minimal code changes by building on their
efforts.
[1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/
[2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/swapfile.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ea023fc25d08..9def6dba8d26 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
return true;
}
+static bool swap_is_last_map(struct swap_info_struct *si,
+ unsigned long offset, int nr_pages,
+ bool *any_only_cache)
+{
+ unsigned char *map = si->swap_map + offset;
+ unsigned char *map_end = map + nr_pages;
+ bool cached = false;
+
+ do {
+ if ((*map & ~SWAP_HAS_CACHE) != 1)
+ return false;
+ if (*map & SWAP_HAS_CACHE)
+ cached = true;
+ } while (++map < map_end);
+
+ *any_only_cache = cached;
+ return true;
+}
+
/*
* returns number of pages in the folio that backs the swap entry. If positive,
* the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
@@ -1808,6 +1827,29 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
if (WARN_ON(end_offset > si->max))
goto out;
+ if (nr > 1) {
+ struct swap_cluster_info *ci;
+ bool batched_free;
+ int i;
+
+ ci = lock_cluster_or_swap_info(si, start_offset);
+ if ((batched_free = swap_is_last_map(si, start_offset, nr, &any_only_cache))) {
+ for (i = 0; i < nr; i++)
+ WRITE_ONCE(si->swap_map[start_offset + i], SWAP_HAS_CACHE);
+ }
+ unlock_cluster_or_swap_info(si, ci);
+
+ if (batched_free) {
+ spin_lock(&si->lock);
+ pr_err("%s offset:%lx nr:%lx\n", __func__,start_offset, nr);
+ swap_entry_range_free(si, entry, nr);
+ spin_unlock(&si->lock);
+ if (any_only_cache)
+ goto reclaim;
+ goto out;
+ }
+ }
+
/*
* First free all entries in the range.
*/
@@ -1828,6 +1870,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
if (!any_only_cache)
goto out;
+reclaim:
/*
* Now go back over the range trying to reclaim the swap cache. This is
* more efficient for large folios because we will only try to reclaim
--
2.34.1
> ---
> mm/swapfile.c | 59 ++++++++++++++++++++++++++---------------------------------
> 1 file changed, 26 insertions(+), 33 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 34e6ea13e8e4..9b63b2262cc2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> }
>
> /*
> - * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
> * which means no page in the cluster is in use, we can optionally discard
> * the cluster and add it to free cluster list.
> */
> -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> +static void dec_cluster_info_page(struct swap_info_struct *p,
> + struct swap_cluster_info *ci, int nr_pages)
> {
> if (!p->cluster_info)
> return;
>
> - VM_BUG_ON(ci->count == 0);
> + VM_BUG_ON(ci->count < nr_pages);
> VM_BUG_ON(cluster_is_free(ci));
> lockdep_assert_held(&p->lock);
> lockdep_assert_held(&ci->lock);
> - ci->count--;
> + ci->count -= nr_pages;
>
> if (!ci->count) {
> free_cluster(p, ci);
> @@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> return n_ret;
> }
>
> -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> -{
> - unsigned long offset = idx * SWAPFILE_CLUSTER;
> - struct swap_cluster_info *ci;
> -
> - ci = lock_cluster(si, offset);
> - memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> - ci->count = 0;
> - free_cluster(si, ci);
> - unlock_cluster(ci);
> - swap_range_free(si, offset, SWAPFILE_CLUSTER);
> -}
> -
> int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> {
> int order = swap_entry_order(entry_order);
> @@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
> return usage;
> }
>
> -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> +/*
> + * Drop the last HAS_CACHE flag of swap entries, caller have to
> + * ensure all entries belong to the same cgroup.
> + */
> +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
> + unsigned int nr_pages)
> {
> - struct swap_cluster_info *ci;
> unsigned long offset = swp_offset(entry);
> - unsigned char count;
> + unsigned char *map = p->swap_map + offset;
> + unsigned char *map_end = map + nr_pages;
> + struct swap_cluster_info *ci;
>
> ci = lock_cluster(p, offset);
> - count = p->swap_map[offset];
> - VM_BUG_ON(count != SWAP_HAS_CACHE);
> - p->swap_map[offset] = 0;
> - dec_cluster_info_page(p, ci);
> + do {
> + VM_BUG_ON(*map != SWAP_HAS_CACHE);
> + *map = 0;
> + } while (++map < map_end);
> + dec_cluster_info_page(p, ci, nr_pages);
> unlock_cluster(ci);
>
> - mem_cgroup_uncharge_swap(entry, 1);
> - swap_range_free(p, offset, 1);
> + mem_cgroup_uncharge_swap(entry, nr_pages);
> + swap_range_free(p, offset, nr_pages);
> }
>
> static void cluster_swap_free_nr(struct swap_info_struct *sis,
> @@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
> unsigned long offset = swp_offset(entry);
> - unsigned long idx = offset / SWAPFILE_CLUSTER;
> struct swap_cluster_info *ci;
> struct swap_info_struct *si;
> unsigned char *map;
> @@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> return;
>
> ci = lock_cluster_or_swap_info(si, offset);
> - if (size == SWAPFILE_CLUSTER) {
> + if (size > 1) {
> map = si->swap_map + offset;
> - for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> + for (i = 0; i < size; i++) {
> val = map[i];
> VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> if (val == SWAP_HAS_CACHE)
> free_entries++;
> }
> - if (free_entries == SWAPFILE_CLUSTER) {
> + if (free_entries == size) {
> unlock_cluster_or_swap_info(si, ci);
> spin_lock(&si->lock);
> - mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> - swap_free_cluster(si, idx);
> + swap_entry_range_free(si, entry, size);
> spin_unlock(&si->lock);
> return;
> }
> @@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> for (i = 0; i < n; ++i) {
> p = swap_info_get_cont(entries[i], prev);
> if (p)
> - swap_entry_free(p, entries[i]);
> + swap_entry_range_free(p, entries[i], 1);
> prev = p;
> }
> if (p)
>
> --
> 2.46.0.rc1.232.g9752f9e123-goog
>
Thanks
Barry
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-07-31 6:49 ` [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache chrisl
@ 2024-08-03 10:38 ` Barry Song
2024-08-03 12:18 ` Kairui Song
0 siblings, 1 reply; 32+ messages in thread
From: Barry Song @ 2024-08-03 10:38 UTC (permalink / raw)
To: chrisl
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Huang, Ying, Kalesh Singh, linux-kernel, linux-mm
On Wed, Jul 31, 2024 at 2:49 PM <chrisl@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently we free the reclaimed slots through slot cache even
> if the slot is required to be empty immediately. As a result
> the reclaim caller will see the slot still occupied even after a
> successful reclaim, and need to keep reclaiming until slot cache
> get flushed. This caused ineffective or over reclaim when SWAP is
> under stress.
>
> So introduce a new flag allowing the slot to be emptied bypassing
> the slot cache.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 109 insertions(+), 43 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9b63b2262cc2..4c0fc0409d3c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -53,8 +53,15 @@
> static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
> unsigned char);
> static void free_swap_count_continuations(struct swap_info_struct *);
> +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
> + unsigned int nr_pages);
> static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> unsigned int nr_entries);
> +static bool folio_swapcache_freeable(struct folio *folio);
> +static struct swap_cluster_info *lock_cluster_or_swap_info(
> + struct swap_info_struct *si, unsigned long offset);
> +static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> + struct swap_cluster_info *ci);
>
> static DEFINE_SPINLOCK(swap_lock);
> static unsigned int nr_swapfiles;
> @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
> * corresponding page
> */
> #define TTRS_UNMAPPED 0x2
> -/* Reclaim the swap entry if swap is getting full*/
> +/* Reclaim the swap entry if swap is getting full */
> #define TTRS_FULL 0x4
> +/* Reclaim directly, bypass the slot cache and don't touch device lock */
> +#define TTRS_DIRECT 0x8
> +
> +static bool swap_is_has_cache(struct swap_info_struct *si,
> + unsigned long offset, int nr_pages)
> +{
> + unsigned char *map = si->swap_map + offset;
> + unsigned char *map_end = map + nr_pages;
> +
> + do {
> + VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> + if (*map != SWAP_HAS_CACHE)
> + return false;
> + } while (++map < map_end);
> +
> + return true;
> +}
>
> /*
> * returns number of pages in the folio that backs the swap entry. If positive,
> @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> unsigned long offset, unsigned long flags)
> {
> swp_entry_t entry = swp_entry(si->type, offset);
> + struct address_space *address_space = swap_address_space(entry);
> + struct swap_cluster_info *ci;
> struct folio *folio;
> - int ret = 0;
> + int ret, nr_pages;
> + bool need_reclaim;
>
> - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> + folio = filemap_get_folio(address_space, swap_cache_index(entry));
> if (IS_ERR(folio))
> return 0;
> +
> + /* offset could point to the middle of a large folio */
> + entry = folio->swap;
> + offset = swp_offset(entry);
> + nr_pages = folio_nr_pages(folio);
> + ret = -nr_pages;
> +
> /*
> * When this function is called from scan_swap_map_slots() and it's
> * called by vmscan.c at reclaiming folios. So we hold a folio lock
> @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> * case and you should use folio_free_swap() with explicit folio_lock()
> * in usual operations.
> */
> - if (folio_trylock(folio)) {
> - if ((flags & TTRS_ANYWAY) ||
> - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
> - ret = folio_free_swap(folio);
> - folio_unlock(folio);
> + if (!folio_trylock(folio))
> + goto out;
> +
> + need_reclaim = ((flags & TTRS_ANYWAY) ||
> + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
> + if (!need_reclaim || !folio_swapcache_freeable(folio))
> + goto out_unlock;
> +
> + /*
> + * It's safe to delete the folio from swap cache only if the folio's
> + * swap_map is HAS_CACHE only, which means the slots have no page table
> + * reference or pending writeback, and can't be allocated to others.
> + */
> + ci = lock_cluster_or_swap_info(si, offset);
> + need_reclaim = swap_is_has_cache(si, offset, nr_pages);
> + unlock_cluster_or_swap_info(si, ci);
> + if (!need_reclaim)
> + goto out_unlock;
> +
> + if (!(flags & TTRS_DIRECT)) {
> + /* Free through slot cache */
> + delete_from_swap_cache(folio);
> + folio_set_dirty(folio);
> + ret = nr_pages;
> + goto out_unlock;
> }
> - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
> +
> + xa_lock_irq(&address_space->i_pages);
> + __delete_from_swap_cache(folio, entry, NULL);
> + xa_unlock_irq(&address_space->i_pages);
> + folio_ref_sub(folio, nr_pages);
> + folio_set_dirty(folio);
> +
> + spin_lock(&si->lock);
> + /* Only sinple page folio can be backed by zswap */
> + if (!nr_pages)
> + zswap_invalidate(entry);
I am trying to figure out if I am mad :-) Does nr_pages == 0 mean a
single-page folio?
> + swap_entry_range_free(si, entry, nr_pages);
> + spin_unlock(&si->lock);
> + ret = nr_pages;
> +out_unlock:
> + folio_unlock(folio);
> +out:
> folio_put(folio);
> return ret;
> }
> @@ -903,7 +973,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> int swap_was_freed;
> spin_unlock(&si->lock);
> - swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> + swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
> spin_lock(&si->lock);
> /* entry was freed successfully, try to use this again */
> if (swap_was_freed > 0)
> @@ -1340,9 +1410,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> unsigned long offset = swp_offset(entry);
> struct swap_cluster_info *ci;
> struct swap_info_struct *si;
> - unsigned char *map;
> - unsigned int i, free_entries = 0;
> - unsigned char val;
> int size = 1 << swap_entry_order(folio_order(folio));
>
> si = _swap_info_get(entry);
> @@ -1350,23 +1417,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> return;
>
> ci = lock_cluster_or_swap_info(si, offset);
> - if (size > 1) {
> - map = si->swap_map + offset;
> - for (i = 0; i < size; i++) {
> - val = map[i];
> - VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> - if (val == SWAP_HAS_CACHE)
> - free_entries++;
> - }
> - if (free_entries == size) {
> - unlock_cluster_or_swap_info(si, ci);
> - spin_lock(&si->lock);
> - swap_entry_range_free(si, entry, size);
> - spin_unlock(&si->lock);
> - return;
> - }
> + if (size > 1 && swap_is_has_cache(si, offset, size)) {
> + unlock_cluster_or_swap_info(si, ci);
> + spin_lock(&si->lock);
> + swap_entry_range_free(si, entry, size);
> + spin_unlock(&si->lock);
> + return;
> }
> - for (i = 0; i < size; i++, entry.val++) {
> + for (int i = 0; i < size; i++, entry.val++) {
> if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> unlock_cluster_or_swap_info(si, ci);
> free_swap_slot(entry);
> @@ -1526,16 +1584,7 @@ static bool folio_swapped(struct folio *folio)
> return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
> }
>
> -/**
> - * folio_free_swap() - Free the swap space used for this folio.
> - * @folio: The folio to remove.
> - *
> - * If swap is getting full, or if there are no more mappings of this folio,
> - * then call folio_free_swap to free its swap space.
> - *
> - * Return: true if we were able to release the swap space.
> - */
> -bool folio_free_swap(struct folio *folio)
> +static bool folio_swapcache_freeable(struct folio *folio)
> {
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>
> @@ -1543,8 +1592,6 @@ bool folio_free_swap(struct folio *folio)
> return false;
> if (folio_test_writeback(folio))
> return false;
> - if (folio_swapped(folio))
> - return false;
>
> /*
> * Once hibernation has begun to create its image of memory,
> @@ -1564,6 +1611,25 @@ bool folio_free_swap(struct folio *folio)
> if (pm_suspended_storage())
> return false;
>
> + return true;
> +}
> +
> +/**
> + * folio_free_swap() - Free the swap space used for this folio.
> + * @folio: The folio to remove.
> + *
> + * If swap is getting full, or if there are no more mappings of this folio,
> + * then call folio_free_swap to free its swap space.
> + *
> + * Return: true if we were able to release the swap space.
> + */
> +bool folio_free_swap(struct folio *folio)
> +{
> + if (!folio_swapcache_freeable(folio))
> + return false;
> + if (folio_swapped(folio))
> + return false;
> +
> delete_from_swap_cache(folio);
> folio_set_dirty(folio);
> return true;
> @@ -1640,7 +1706,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> * to the next boundary.
> */
> nr = __try_to_reclaim_swap(si, offset,
> - TTRS_UNMAPPED | TTRS_FULL);
> + TTRS_UNMAPPED | TTRS_FULL);
> if (nr == 0)
> nr = 1;
> else if (nr < 0)
>
> --
> 2.46.0.rc1.232.g9752f9e123-goog
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
2024-08-03 9:11 ` Barry Song
@ 2024-08-03 10:57 ` Barry Song
0 siblings, 0 replies; 32+ messages in thread
From: Barry Song @ 2024-08-03 10:57 UTC (permalink / raw)
To: chrisl
Cc: akpm, hughd, kaleshsingh, kasong, linux-kernel, linux-mm,
ryan.roberts, ying.huang, Barry Song
On Sat, Aug 3, 2024 at 9:11 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 6:49 PM <chrisl@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently when we are freeing mTHP folios from the swap cache, we free
> > them one by one and put each entry into the swap slot cache. The slot
> > cache is designed to reduce the overhead by batching the freeing,
> > but mTHP swap entries are already contiguous so they can be batch
> > freed without it; going through the slot cache saves little overhead,
> > or even increases overhead for larger mTHP.
> >
> > What's more, mTHP entries could stay in the swap cache for a while.
> > Contiguous swap entries are a rather rare resource, so releasing them
> > directly can help improve the mTHP allocation success rate when under
> > pressure.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Acked-by: Barry Song <baohua@kernel.org>
>
> I believe this is the right direction to take. Currently, entries are released
> one by one, even when they are contiguous in the swap file (those nr_pages
> entries are definitely in the same cluster and the same si), leading to numerous
> lock and unlock operations.
> This approach provides batched support.
>
> free_swap_and_cache_nr() has the same issue, so I drafted a patch based on
> your code. I wonder if you can also help test and review before I send it
> officially:
>
> From 4bed5c08bc0f7769ee2849812acdad70c4e32ead Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@oppo.com>
> Date: Sat, 3 Aug 2024 20:21:14 +1200
> Subject: [PATCH RFC] mm: attempt to batch free swap entries for zap_pte_range()
>
> Zhiguo reported that swap release could be a serious bottleneck
> during process exits[1]. With mTHP, we have the opportunity to
> batch free swaps.
> Thanks to the work of Chris and Kairui[2], I was able to achieve
> this optimization with minimal code changes by building on their
> efforts.
>
> [1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/
> [2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/swapfile.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ea023fc25d08..9def6dba8d26 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
> return true;
> }
>
> +static bool swap_is_last_map(struct swap_info_struct *si,
> + unsigned long offset, int nr_pages,
> + bool *any_only_cache)
> +{
> + unsigned char *map = si->swap_map + offset;
> + unsigned char *map_end = map + nr_pages;
> + bool cached = false;
> +
> + do {
> + if ((*map & ~SWAP_HAS_CACHE) != 1)
> + return false;
> + if (*map & SWAP_HAS_CACHE)
> + cached = true;
> + } while (++map < map_end);
> +
> + *any_only_cache = cached;
> + return true;
> +}
> +
> /*
> * returns number of pages in the folio that backs the swap entry. If positive,
> * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
> @@ -1808,6 +1827,29 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> if (WARN_ON(end_offset > si->max))
> goto out;
>
> + if (nr > 1) {
> + struct swap_cluster_info *ci;
> + bool batched_free;
> + int i;
> +
> + ci = lock_cluster_or_swap_info(si, start_offset);
> + if ((batched_free = swap_is_last_map(si, start_offset, nr, &any_only_cache))) {
> + for (i = 0; i < nr; i++)
> + WRITE_ONCE(si->swap_map[start_offset + i], SWAP_HAS_CACHE);
> + }
> + unlock_cluster_or_swap_info(si, ci);
> +
> + if (batched_free) {
> + spin_lock(&si->lock);
> + pr_err("%s offset:%lx nr:%lx\n", __func__,start_offset, nr);
> + swap_entry_range_free(si, entry, nr);
> + spin_unlock(&si->lock);
> + if (any_only_cache)
> + goto reclaim;
> + goto out;
> + }
Sorry, what I actually meant was that the two gotos are reversed.
if (batched_free) {
        if (any_only_cache)
                goto reclaim;
        spin_lock(&si->lock);
        swap_entry_range_free(si, entry, nr);
        spin_unlock(&si->lock);
        goto out;
}
> + }
> +
> /*
> * First free all entries in the range.
> */
> @@ -1828,6 +1870,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> if (!any_only_cache)
> goto out;
>
> +reclaim:
> /*
> * Now go back over the range trying to reclaim the swap cache. This is
> * more efficient for large folios because we will only try to reclaim
> --
> 2.34.1
>
>
>
> > ---
> > mm/swapfile.c | 59 ++++++++++++++++++++++++++---------------------------------
> > 1 file changed, 26 insertions(+), 33 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 34e6ea13e8e4..9b63b2262cc2 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> > }
> >
> > /*
> > - * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
> > * which means no page in the cluster is in use, we can optionally discard
> > * the cluster and add it to free cluster list.
> > */
> > -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> > +static void dec_cluster_info_page(struct swap_info_struct *p,
> > + struct swap_cluster_info *ci, int nr_pages)
> > {
> > if (!p->cluster_info)
> > return;
> >
> > - VM_BUG_ON(ci->count == 0);
> > + VM_BUG_ON(ci->count < nr_pages);
> > VM_BUG_ON(cluster_is_free(ci));
> > lockdep_assert_held(&p->lock);
> > lockdep_assert_held(&ci->lock);
> > - ci->count--;
> > + ci->count -= nr_pages;
> >
> > if (!ci->count) {
> > free_cluster(p, ci);
> > @@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> > return n_ret;
> > }
> >
> > -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > -{
> > - unsigned long offset = idx * SWAPFILE_CLUSTER;
> > - struct swap_cluster_info *ci;
> > -
> > - ci = lock_cluster(si, offset);
> > - memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > - ci->count = 0;
> > - free_cluster(si, ci);
> > - unlock_cluster(ci);
> > - swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > -}
> > -
> > int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > {
> > int order = swap_entry_order(entry_order);
> > @@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
> > return usage;
> > }
> >
> > -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > +/*
> > + * Drop the last HAS_CACHE flag of swap entries, caller have to
> > + * ensure all entries belong to the same cgroup.
> > + */
> > +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
> > + unsigned int nr_pages)
> > {
> > - struct swap_cluster_info *ci;
> > unsigned long offset = swp_offset(entry);
> > - unsigned char count;
> > + unsigned char *map = p->swap_map + offset;
> > + unsigned char *map_end = map + nr_pages;
> > + struct swap_cluster_info *ci;
> >
> > ci = lock_cluster(p, offset);
> > - count = p->swap_map[offset];
> > - VM_BUG_ON(count != SWAP_HAS_CACHE);
> > - p->swap_map[offset] = 0;
> > - dec_cluster_info_page(p, ci);
> > + do {
> > + VM_BUG_ON(*map != SWAP_HAS_CACHE);
> > + *map = 0;
> > + } while (++map < map_end);
> > + dec_cluster_info_page(p, ci, nr_pages);
> > unlock_cluster(ci);
> >
> > - mem_cgroup_uncharge_swap(entry, 1);
> > - swap_range_free(p, offset, 1);
> > + mem_cgroup_uncharge_swap(entry, nr_pages);
> > + swap_range_free(p, offset, nr_pages);
> > }
> >
> > static void cluster_swap_free_nr(struct swap_info_struct *sis,
> > @@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > {
> > unsigned long offset = swp_offset(entry);
> > - unsigned long idx = offset / SWAPFILE_CLUSTER;
> > struct swap_cluster_info *ci;
> > struct swap_info_struct *si;
> > unsigned char *map;
> > @@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > return;
> >
> > ci = lock_cluster_or_swap_info(si, offset);
> > - if (size == SWAPFILE_CLUSTER) {
> > + if (size > 1) {
> > map = si->swap_map + offset;
> > - for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> > + for (i = 0; i < size; i++) {
> > val = map[i];
> > VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> > if (val == SWAP_HAS_CACHE)
> > free_entries++;
> > }
> > - if (free_entries == SWAPFILE_CLUSTER) {
> > + if (free_entries == size) {
> > unlock_cluster_or_swap_info(si, ci);
> > spin_lock(&si->lock);
> > - mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> > - swap_free_cluster(si, idx);
> > + swap_entry_range_free(si, entry, size);
> > spin_unlock(&si->lock);
> > return;
> > }
> > @@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> > for (i = 0; i < n; ++i) {
> > p = swap_info_get_cont(entries[i], prev);
> > if (p)
> > - swap_entry_free(p, entries[i]);
> > + swap_entry_range_free(p, entries[i], 1);
> > prev = p;
> > }
> > if (p)
> >
> > --
> > 2.46.0.rc1.232.g9752f9e123-goog
> >
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-08-03 10:38 ` Barry Song
@ 2024-08-03 12:18 ` Kairui Song
2024-08-04 18:06 ` Chris Li
0 siblings, 1 reply; 32+ messages in thread
From: Kairui Song @ 2024-08-03 12:18 UTC (permalink / raw)
To: Barry Song
Cc: chrisl, Andrew Morton, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm
On Sat, Aug 3, 2024 at 6:39 PM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Jul 31, 2024 at 2:49 PM <chrisl@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently we free the reclaimed slots through slot cache even
> > if the slot is required to be empty immediately. As a result
> > the reclaim caller will see the slot still occupied even after a
> > successful reclaim, and need to keep reclaiming until slot cache
> > get flushed. This caused ineffective or over reclaim when SWAP is
> > under stress.
> >
> > So introduce a new flag allowing the slot to be emptied bypassing
> > the slot cache.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 109 insertions(+), 43 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9b63b2262cc2..4c0fc0409d3c 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -53,8 +53,15 @@
> > static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
> > unsigned char);
> > static void free_swap_count_continuations(struct swap_info_struct *);
> > +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
> > + unsigned int nr_pages);
> > static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> > unsigned int nr_entries);
> > +static bool folio_swapcache_freeable(struct folio *folio);
> > +static struct swap_cluster_info *lock_cluster_or_swap_info(
> > + struct swap_info_struct *si, unsigned long offset);
> > +static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > + struct swap_cluster_info *ci);
> >
> > static DEFINE_SPINLOCK(swap_lock);
> > static unsigned int nr_swapfiles;
> > @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
> > * corresponding page
> > */
> > #define TTRS_UNMAPPED 0x2
> > -/* Reclaim the swap entry if swap is getting full*/
> > +/* Reclaim the swap entry if swap is getting full */
> > #define TTRS_FULL 0x4
> > +/* Reclaim directly, bypass the slot cache and don't touch device lock */
> > +#define TTRS_DIRECT 0x8
> > +
> > +static bool swap_is_has_cache(struct swap_info_struct *si,
> > + unsigned long offset, int nr_pages)
> > +{
> > + unsigned char *map = si->swap_map + offset;
> > + unsigned char *map_end = map + nr_pages;
> > +
> > + do {
> > + VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> > + if (*map != SWAP_HAS_CACHE)
> > + return false;
> > + } while (++map < map_end);
> > +
> > + return true;
> > +}
> >
> > /*
> > * returns number of pages in the folio that backs the swap entry. If positive,
> > @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > unsigned long offset, unsigned long flags)
> > {
> > swp_entry_t entry = swp_entry(si->type, offset);
> > + struct address_space *address_space = swap_address_space(entry);
> > + struct swap_cluster_info *ci;
> > struct folio *folio;
> > - int ret = 0;
> > + int ret, nr_pages;
> > + bool need_reclaim;
> >
> > - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > + folio = filemap_get_folio(address_space, swap_cache_index(entry));
> > if (IS_ERR(folio))
> > return 0;
> > +
> > + /* offset could point to the middle of a large folio */
> > + entry = folio->swap;
> > + offset = swp_offset(entry);
> > + nr_pages = folio_nr_pages(folio);
> > + ret = -nr_pages;
> > +
> > /*
> > * When this function is called from scan_swap_map_slots() and it's
> > * called by vmscan.c at reclaiming folios. So we hold a folio lock
> > @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > * case and you should use folio_free_swap() with explicit folio_lock()
> > * in usual operations.
> > */
> > - if (folio_trylock(folio)) {
> > - if ((flags & TTRS_ANYWAY) ||
> > - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
> > - ret = folio_free_swap(folio);
> > - folio_unlock(folio);
> > + if (!folio_trylock(folio))
> > + goto out;
> > +
> > + need_reclaim = ((flags & TTRS_ANYWAY) ||
> > + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
> > + if (!need_reclaim || !folio_swapcache_freeable(folio))
> > + goto out_unlock;
> > +
> > + /*
> > + * It's safe to delete the folio from swap cache only if the folio's
> > + * swap_map is HAS_CACHE only, which means the slots have no page table
> > + * reference or pending writeback, and can't be allocated to others.
> > + */
> > + ci = lock_cluster_or_swap_info(si, offset);
> > + need_reclaim = swap_is_has_cache(si, offset, nr_pages);
> > + unlock_cluster_or_swap_info(si, ci);
> > + if (!need_reclaim)
> > + goto out_unlock;
> > +
> > + if (!(flags & TTRS_DIRECT)) {
> > + /* Free through slot cache */
> > + delete_from_swap_cache(folio);
> > + folio_set_dirty(folio);
> > + ret = nr_pages;
> > + goto out_unlock;
> > }
> > - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
> > +
> > + xa_lock_irq(&address_space->i_pages);
> > + __delete_from_swap_cache(folio, entry, NULL);
> > + xa_unlock_irq(&address_space->i_pages);
> > + folio_ref_sub(folio, nr_pages);
> > + folio_set_dirty(folio);
> > +
> > + spin_lock(&si->lock);
> > + /* Only sinple page folio can be backed by zswap */
> > + if (!nr_pages)
> > + zswap_invalidate(entry);
>
> I am trying to figure out if I am mad :-) Does nr_pages == 0 means single
> page folio?
>
Hi Barry
I'm sorry, this should be nr_pages == 1, I messed up order and nr, as
zswap only works for single page folios.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-08-03 12:18 ` Kairui Song
@ 2024-08-04 18:06 ` Chris Li
2024-08-05 1:53 ` Barry Song
0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-08-04 18:06 UTC (permalink / raw)
To: Kairui Song
Cc: Barry Song, Andrew Morton, Hugh Dickins, Ryan Roberts,
Huang, Ying, Kalesh Singh, linux-kernel, linux-mm
On Sat, Aug 3, 2024 at 5:19 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 3, 2024 at 6:39 PM Barry Song <baohua@kernel.org> wrote:
> >
> > On Wed, Jul 31, 2024 at 2:49 PM <chrisl@kernel.org> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Currently we free the reclaimed slots through slot cache even
> > > if the slot is required to be empty immediately. As a result
> > > the reclaim caller will see the slot still occupied even after a
> > > successful reclaim, and need to keep reclaiming until slot cache
> > > get flushed. This caused ineffective or over reclaim when SWAP is
> > > under stress.
> > >
> > > So introduce a new flag allowing the slot to be emptied bypassing
> > > the slot cache.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
> > > 1 file changed, 109 insertions(+), 43 deletions(-)
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 9b63b2262cc2..4c0fc0409d3c 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -53,8 +53,15 @@
> > > static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
> > > unsigned char);
> > > static void free_swap_count_continuations(struct swap_info_struct *);
> > > +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
> > > + unsigned int nr_pages);
> > > static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> > > unsigned int nr_entries);
> > > +static bool folio_swapcache_freeable(struct folio *folio);
> > > +static struct swap_cluster_info *lock_cluster_or_swap_info(
> > > + struct swap_info_struct *si, unsigned long offset);
> > > +static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > > + struct swap_cluster_info *ci);
> > >
> > > static DEFINE_SPINLOCK(swap_lock);
> > > static unsigned int nr_swapfiles;
> > > @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
> > > * corresponding page
> > > */
> > > #define TTRS_UNMAPPED 0x2
> > > -/* Reclaim the swap entry if swap is getting full*/
> > > +/* Reclaim the swap entry if swap is getting full */
> > > #define TTRS_FULL 0x4
> > > +/* Reclaim directly, bypass the slot cache and don't touch device lock */
> > > +#define TTRS_DIRECT 0x8
> > > +
> > > +static bool swap_is_has_cache(struct swap_info_struct *si,
> > > + unsigned long offset, int nr_pages)
> > > +{
> > > + unsigned char *map = si->swap_map + offset;
> > > + unsigned char *map_end = map + nr_pages;
> > > +
> > > + do {
> > > + VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> > > + if (*map != SWAP_HAS_CACHE)
> > > + return false;
> > > + } while (++map < map_end);
> > > +
> > > + return true;
> > > +}
> > >
> > > /*
> > > * returns number of pages in the folio that backs the swap entry. If positive,
> > > @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > > unsigned long offset, unsigned long flags)
> > > {
> > > swp_entry_t entry = swp_entry(si->type, offset);
> > > + struct address_space *address_space = swap_address_space(entry);
> > > + struct swap_cluster_info *ci;
> > > struct folio *folio;
> > > - int ret = 0;
> > > + int ret, nr_pages;
> > > + bool need_reclaim;
> > >
> > > - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > + folio = filemap_get_folio(address_space, swap_cache_index(entry));
> > > if (IS_ERR(folio))
> > > return 0;
> > > +
> > > + /* offset could point to the middle of a large folio */
> > > + entry = folio->swap;
> > > + offset = swp_offset(entry);
> > > + nr_pages = folio_nr_pages(folio);
> > > + ret = -nr_pages;
> > > +
> > > /*
> > > * When this function is called from scan_swap_map_slots() and it's
> > > * called by vmscan.c at reclaiming folios. So we hold a folio lock
> > > @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > > * case and you should use folio_free_swap() with explicit folio_lock()
> > > * in usual operations.
> > > */
> > > - if (folio_trylock(folio)) {
> > > - if ((flags & TTRS_ANYWAY) ||
> > > - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > > - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
> > > - ret = folio_free_swap(folio);
> > > - folio_unlock(folio);
> > > + if (!folio_trylock(folio))
> > > + goto out;
> > > +
> > > + need_reclaim = ((flags & TTRS_ANYWAY) ||
> > > + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > > + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
> > > + if (!need_reclaim || !folio_swapcache_freeable(folio))
> > > + goto out_unlock;
> > > +
> > > + /*
> > > + * It's safe to delete the folio from swap cache only if the folio's
> > > + * swap_map is HAS_CACHE only, which means the slots have no page table
> > > + * reference or pending writeback, and can't be allocated to others.
> > > + */
> > > + ci = lock_cluster_or_swap_info(si, offset);
> > > + need_reclaim = swap_is_has_cache(si, offset, nr_pages);
> > > + unlock_cluster_or_swap_info(si, ci);
> > > + if (!need_reclaim)
> > > + goto out_unlock;
> > > +
> > > + if (!(flags & TTRS_DIRECT)) {
> > > + /* Free through slot cache */
> > > + delete_from_swap_cache(folio);
> > > + folio_set_dirty(folio);
> > > + ret = nr_pages;
> > > + goto out_unlock;
> > > }
> > > - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
> > > +
> > > + xa_lock_irq(&address_space->i_pages);
> > > + __delete_from_swap_cache(folio, entry, NULL);
> > > + xa_unlock_irq(&address_space->i_pages);
> > > + folio_ref_sub(folio, nr_pages);
> > > + folio_set_dirty(folio);
> > > +
> > > + spin_lock(&si->lock);
> > > + /* Only sinple page folio can be backed by zswap */
> > > + if (!nr_pages)
> > > + zswap_invalidate(entry);
> >
> > I am trying to figure out if I am mad :-) Does nr_pages == 0 means single
> > page folio?
> >
>
> Hi Barry
>
> I'm sorry, this should be nr_pages == 1, I messed up order and nr, as
> zswap only works for single page folios.
Ack. Should be nr_pages == 1.
Barry, thanks for catching that.
Chris
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-08-04 18:06 ` Chris Li
@ 2024-08-05 1:53 ` Barry Song
0 siblings, 0 replies; 32+ messages in thread
From: Barry Song @ 2024-08-05 1:53 UTC (permalink / raw)
To: chrisl
Cc: akpm, baohua, hughd, kaleshsingh, linux-kernel, linux-mm,
ryan.roberts, ryncsn, ying.huang
On Mon, Aug 5, 2024 at 6:07 AM Chris Li <chrisl@kernel.org> wrote:
> > > > +
> > > > + spin_lock(&si->lock);
> > > > + /* Only sinple page folio can be backed by zswap */
> > > > + if (!nr_pages)
> > > > + zswap_invalidate(entry);
> > >
> > > I am trying to figure out if I am mad :-) Does nr_pages == 0 means single
> > > page folio?
> > >
> >
> > Hi Barry
> >
> > I'm sorry, this should be nr_pages == 1, I messed up order and nr, as
> > zswap only works for single page folios.
> Ack. Should be nr_pages == 1.
>
Yes. Andrew, can you please help squash the fix below?
Small folios should have nr_pages == 1, not nr_pages == 0:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ea023fc25d08..6de12d712c7e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -224,7 +224,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
         spin_lock(&si->lock);
         /* Only sinple page folio can be backed by zswap */
-        if (!nr_pages)
+        if (nr_pages == 1)
                 zswap_invalidate(entry);
         swap_entry_range_free(si, entry, nr_pages);
         spin_unlock(&si->lock);
> Barry, thanks for catching that.
>
> Chris
Thanks
Barry
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
[not found] ` <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2024-08-16 7:36 ` Chris Li
2024-08-17 17:47 ` Kairui Song
0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-08-16 7:36 UTC (permalink / raw)
To: Huang, Ying
Cc: Kairui Song, David Hildenbrand, Andrew Morton, Hugh Dickins,
Ryan Roberts, Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Aug 8, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> [snip]
>
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> > lockdep_assert_held(&si->lock);
> > lockdep_assert_held(&ci->lock);
> >
> > - list_move_tail(&ci->list, &si->free_clusters);
> > + if (ci->flags)
> > + list_move_tail(&ci->list, &si->free_clusters);
> > + else
> > + list_add_tail(&ci->list, &si->free_clusters);
>
> If we use list_del_init() to delete the cluster, we can always use
> list_move_tail()? If so, the logic can be simplified.
Thanks for the suggestion.
I feel that list_del_init() generates more instructions than necessary.
It is my bad that I left the discard list without a list flag bit
for it.
I do want to clean this up. While we are at it, because a cluster
can only belong to one list at a time, we can use a list indicator as
an integer rather than a bit mask. If we give the discard list the same
treatment, that should remove the special condition for adding a cluster
to another list as well.
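
For illustration, a rough sketch of what such a list indicator could
look like (the names below are hypothetical, not part of the posted
patches):

enum swap_cluster_list_id {
        CLUSTER_LIST_NONE = 0,  /* off-list, e.g. while a discard is issued */
        CLUSTER_LIST_FREE,
        CLUSTER_LIST_NONFULL,
        CLUSTER_LIST_FULL,
        CLUSTER_LIST_DISCARD,
};

static void cluster_move(struct swap_cluster_info *ci,
                         struct list_head *head, u8 list_id)
{
        /* Every cluster is already on exactly one list, so always move. */
        list_move_tail(&ci->list, head);
        ci->flags = list_id;    /* the flags byte reused as a list id */
}
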
Chris
> > ci->flags = CLUSTER_FLAG_FREE;
> > ci->order = 0;
> > }
> > @@ -474,7 +477,6 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > SWAPFILE_CLUSTER);
> >
> > spin_lock(&si->lock);
> > -
> > spin_lock(&ci->lock);
> > __free_cluster(si, ci);
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > @@ -666,7 +668,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> > if (ci->flags & CLUSTER_FLAG_FRAG)
> > si->frag_cluster_nr[ci->order]--;
> > list_move_tail(&ci->list, &si->full_clusters);
> > - ci->flags = 0;
> > + ci->flags = CLUSTER_FLAG_FULL;
> > }
> > }
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
[not found] ` <87sevfza3w.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2024-08-16 7:47 ` Chris Li
2024-08-18 16:59 ` Kairui Song
2024-08-19 8:39 ` Huang, Ying
0 siblings, 2 replies; 32+ messages in thread
From: Chris Li @ 2024-08-16 7:47 UTC (permalink / raw)
To: Huang, Ying
Cc: Hugh Dickins, Andrew Morton, Kairui Song, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Chris,
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > This is the short term solutions "swap cluster order" listed
> >> > in my "Swap Abstraction" discussion slice 8 in the recent
> >> > LSF/MM conference.
> >> >
> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> > orders" is introduced, it only allocates the mTHP swap entries
> >> > from the new empty cluster list. It has a fragmentation issue
> >> > reported by Barry.
> >> >
> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >> >
> >> > The reason is that all the empty clusters have been exhausted while
> >> > there are plenty of free swap entries in the cluster that are
> >> > not 100% free.
> >> >
> >> > Remember the swap allocation order in the cluster.
> >> > Keep track of the per order non full cluster list for later allocation.
> >> >
> >> > This series gives the swap SSD allocation a new separate code path
> >> > from the HDD allocation. The new allocator use cluster list only
> >> > and do not global scan swap_map[] without lock any more.
> >>
> >> This sounds good. Can we use SSD allocation method for HDD too?
> >> We may not need a swap entry allocator optimized for HDD.
> >
> > Yes, that is the plan as well. That way we can completely get rid of
> > the old scan_swap_map_slots() code.
>
> Good!
>
> > However, considering the size of the series, let's focus on the
> > cluster allocation path first, get it tested and reviewed.
>
> OK.
>
> > For HDD optimization, mostly just the new block allocations portion
> > need some separate code path from the new cluster allocator to not do
> > the per cpu allocation. Allocating from the non free list doesn't
> > need to change too
>
> I suggest not consider HDD optimization at all. Just use SSD algorithm
> to simplify.
Adding a global next-allocating CI rather than the per-CPU next CI
pointer is pretty trivial as well. It is just a different way to fetch
the next cluster pointer.
>
> >>
> >> Hi, Hugh,
> >>
> >> What do you think about this?
> >>
> >> > This streamline the swap allocation for SSD. The code matches the
> >> > execution flow much better.
> >> >
> >> > User impact: For users that allocate and free mix order mTHP swapping,
> >> > It greatly improves the success rate of the mTHP swap allocation after the
> >> > initial phase.
> >> >
> >> > It also performs faster when the swapfile is close to full, because the
> >> > allocator can get the non full cluster from a list rather than scanning
> >> > a lot of swap_map entries.
> >>
> >> Do you have some test results to prove this? Or which test below can
> >> prove this?
> >
> > The two zram tests are already proving this. The system time
> > improvement is about 2% on my low CPU count machine.
> > Kairui has a higher core count machine and the difference is higher
> > there. The theory is that higher CPU count has higher contentions.
>
> I will interpret this as the performance is better in theory. But
> there's almost no measurable results so far.
I am trying to understand why you don't see the performance improvement
in the zram setup in my cover letter as a measurable result.
>
> > The 2% system time number does not sound like much. But consider this
> > two factors:
> > 1) swap allocator only takes a small percentage of the overall workload.
> > 2) The new allocator does more work.
> > The old allocator has a time tick budget. It will abort and fail to
> > find an entry when it runs out of time budget, even though there are
> > still some free entries on the swapfile.
>
> What is the time tick budget you mentioned?
I was under the impression that the previous swap entry allocation
code would not scan 100% of the swapfile even when only one free entry
is left.
Please let me know if my understanding is not correct.
/* time to take a break? */
if (unlikely(--latency_ration < 0)) {
        if (n_ret)
                goto done;
        spin_unlock(&si->lock);
        cond_resched();
        spin_lock(&si->lock);
        latency_ration = LATENCY_LIMIT;
}
>
> > The new allocator can get to the last few free swap entries if it is
> > available. If not then, the new swap allocator will work harder on
> > swap cache reclaim.
> >
> > From the swap cache reclaim aspect, it is very hard to optimize the
> > swap cache reclaim in the old allocation path because the scan
> > position is randomized.
> > The full list and frag list both design to help reduce the repeat
> > reclaim attempt of the swap cache.
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
[not found] ` <87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2024-08-16 8:01 ` Chris Li
2024-08-19 8:08 ` Huang, Ying
0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-08-16 8:01 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Wed, Aug 7, 2024 at 6:14 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Track the nonfull cluster as well as the empty cluster
> > on lists. Each order has one nonfull cluster list.
> >
> > The cluster will remember which order it was used during
> > new cluster allocation.
> >
> > When the cluster has free entry, add to the nonfull[order]
> > list. When the free cluster list is empty, also allocate
> > from the nonempty list of that order.
> >
> > This improves the mTHP swap allocation success rate.
> >
> > There are limitations if the distribution of numbers of
> > different orders of mTHP changes a lot. e.g. there are a lot
> > of nonfull cluster assign to order A while later time there
> > are a lot of order B allocation while very little allocation
> > in order A. Currently the cluster used by order A will not
> > reused by order B unless the cluster is 100% empty.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > include/linux/swap.h | 4 ++++
> > mm/swapfile.c | 38 +++++++++++++++++++++++++++++++++++---
> > 2 files changed, 39 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index edafd52d7ac4..6716ef236766 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -254,9 +254,11 @@ struct swap_cluster_info {
> > */
> > u16 count;
> > u8 flags;
> > + u8 order;
> > struct list_head list;
> > };
> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >
> > /*
> > * The first page in the swap file is the swap header, which is always marked
> > @@ -295,6 +297,8 @@ struct swap_info_struct {
> > unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > struct list_head free_clusters; /* free clusters list */
> > + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> > + /* list of cluster that contains at least one free slot */
> > unsigned int lowest_bit; /* index of first free in swap_map */
> > unsigned int highest_bit; /* index of last free in swap_map */
> > unsigned int pages; /* total of usable pages of swap */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index bceead7f9e3c..dcf09eb549db 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -361,14 +361,22 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > - list_add_tail(&ci->list, &si->discard_clusters);
> > + VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
> > + if (ci->flags & CLUSTER_FLAG_NONFULL)
> > + list_move_tail(&ci->list, &si->discard_clusters);
> > + else
> > + list_add_tail(&ci->list, &si->discard_clusters);
> > + ci->flags = 0;
>
> As Ryan pointed out before, it's better to clear the specific bit
> instead of assigning 0. This will make code future proof.
Ack.
BTW, I plan to change the bit to a list number in the future. I can
define a list bit mask for now.
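
A minimal sketch of the difference being discussed (not the exact
patch, just the intent):

        /* Clear only the bit this path owns ... */
        ci->flags &= ~CLUSTER_FLAG_NONFULL;

        /* ... rather than wiping every flag, which any future flag bit
         * (or a list id stored in the same byte) would not survive:
         */
        ci->flags = 0;
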
>
> > schedule_work(&si->discard_work);
> > }
> >
> > static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > + if (ci->flags & CLUSTER_FLAG_NONFULL)
> > + list_move_tail(&ci->list, &si->free_clusters);
> > + else
> > + list_add_tail(&ci->list, &si->free_clusters);
> > ci->flags = CLUSTER_FLAG_FREE;
> > - list_add_tail(&ci->list, &si->free_clusters);
> > }
> >
> > /*
> > @@ -491,8 +499,15 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> > VM_BUG_ON(ci->count == 0);
> > ci->count--;
> >
> > - if (!ci->count)
> > + if (!ci->count) {
> > free_cluster(p, ci);
> > + return;
> > + }
> > +
> > + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> > + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> > + ci->flags |= CLUSTER_FLAG_NONFULL;
> > + }
> > }
> >
> > /*
> > @@ -553,6 +568,19 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > if (tmp == SWAP_NEXT_INVALID) {
> > if (!list_empty(&si->free_clusters)) {
> > ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > + list_del(&ci->list);
> > + spin_lock(&ci->lock);
> > + ci->order = order;
> > + ci->flags = 0;
> > + spin_unlock(&ci->lock);
> > + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> > + } else if (!list_empty(&si->nonfull_clusters[order])) {
> > + ci = list_first_entry(&si->nonfull_clusters[order],
> > + struct swap_cluster_info, list);
> > + list_del(&ci->list);
> > + spin_lock(&ci->lock);
> > + ci->flags = 0;
> > + spin_unlock(&ci->lock);
> > tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> > } else if (!list_empty(&si->discard_clusters)) {
>
> We should check discard_clusters before nonfull clusters.
And the reason behind that is?
I see that discarding a cluster can take a long time. It requires a
synchronous wait while issuing the discard command. Why not just use
the nonfull list and return immediately? When the discard command
finishes, the cluster will show up in the free list anyway.
BTW, what is your take on my previous analysis that the current SSD
preference for writing new clusters can wear out the SSD faster?
I think it might be useful to provide users an option to choose to
allocate from the nonfull list first. That trade-off is friendlier to
SSD wear than preferring to write new blocks. If you keep swapping
long enough, there will be no new free cluster anyway.
The example I give in this email:
https://lore.kernel.org/linux-mm/CACePvbXGBNC9WzzL4s2uB2UciOkV6nb4bKKkc5TBZP6QuHS_aQ@mail.gmail.com/
Chris
>
> > /*
> > @@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > ci = lock_cluster(si, offset);
> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > ci->count = 0;
> > + ci->order = 0;
> > ci->flags = 0;
> > free_cluster(si, ci);
> > unlock_cluster(ci);
> > @@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> > INIT_LIST_HEAD(&p->free_clusters);
> > INIT_LIST_HEAD(&p->discard_clusters);
> >
> > + for (i = 0; i < SWAP_NR_ORDERS; i++)
> > + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> > +
> > for (i = 0; i < swap_header->info.nr_badpages; i++) {
> > unsigned int page_nr = swap_header->info.badpages[i];
> > if (page_nr == 0 || page_nr > swap_header->info.last_page)
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-16 7:36 ` Chris Li
@ 2024-08-17 17:47 ` Kairui Song
0 siblings, 0 replies; 32+ messages in thread
From: Kairui Song @ 2024-08-17 17:47 UTC (permalink / raw)
To: Chris Li
Cc: Huang, Ying, David Hildenbrand, Andrew Morton, Hugh Dickins,
Ryan Roberts, Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Fri, Aug 16, 2024 at 3:37 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Aug 8, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > [snip]
> >
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> > > lockdep_assert_held(&si->lock);
> > > lockdep_assert_held(&ci->lock);
> > >
> > > - list_move_tail(&ci->list, &si->free_clusters);
> > > + if (ci->flags)
> > > + list_move_tail(&ci->list, &si->free_clusters);
> > > + else
> > > + list_add_tail(&ci->list, &si->free_clusters);
> >
> > If we use list_del_init() to delete the cluster, we can always use
> > list_move_tail()? If so, the logic can be simplified.
>
> Thanks for the suggestion.
Hi All, thanks for the review and discussion.
> I feel that list_del_init() generates more instruction than necessary.
> It is my bad that I leave the discard list without not a list flag bit
> for it.
Right, list_del_init is indeed a little more noisy than list_del.
But considering that after this patch all non-discard clusters are always
on a list (free/nonfull/full), and a cluster will be dangling
only while it is being removed from the discard list (when doing the
discard work, we need to unlock si->lock, so the cluster has to be hidden
from other racers),
I think it's good to use list_del_init when deleting from the discard
list in this patch; then list_move_tail can always be used when
changing the list of a cluster.
Discard should be a much less common operation, so this should
be OK.
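
A minimal sketch of why that works (kernel list API semantics, not the
actual patch):

        /* Hide the cluster while si->lock is dropped for the discard.
         * list_del_init() leaves ci->list self-linked (still a valid
         * list_head), whereas plain list_del() poisons the pointers.
         */
        list_del_init(&ci->list);

        /* ... si->lock dropped, discard issued, si->lock retaken ... */

        /* list_move_tail() unlinks and re-adds in one step; that is safe
         * on a self-linked node, so every list transition can use it.
         */
        list_move_tail(&ci->list, &si->free_clusters);
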
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-16 7:47 ` Chris Li
@ 2024-08-18 16:59 ` Kairui Song
2024-08-19 8:27 ` Huang, Ying
2024-08-19 8:39 ` Huang, Ying
1 sibling, 1 reply; 32+ messages in thread
From: Kairui Song @ 2024-08-18 16:59 UTC (permalink / raw)
To: Chris Li, Huang, Ying
Cc: Hugh Dickins, Andrew Morton, Ryan Roberts, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Chris Li <chrisl@kernel.org> writes:
> >
> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Hi, Chris,
> > >>
> > >> Chris Li <chrisl@kernel.org> writes:
> > >>
> > >> > This is the short term solutions "swap cluster order" listed
> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
> > >> > LSF/MM conference.
> > >> >
> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > >> > orders" is introduced, it only allocates the mTHP swap entries
> > >> > from the new empty cluster list. It has a fragmentation issue
> > >> > reported by Barry.
> > >> >
> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > >> >
> > >> > The reason is that all the empty clusters have been exhausted while
> > >> > there are plenty of free swap entries in the cluster that are
> > >> > not 100% free.
> > >> >
> > >> > Remember the swap allocation order in the cluster.
> > >> > Keep track of the per order non full cluster list for later allocation.
> > >> >
> > >> > This series gives the swap SSD allocation a new separate code path
> > >> > from the HDD allocation. The new allocator use cluster list only
> > >> > and do not global scan swap_map[] without lock any more.
> > >>
> > >> This sounds good. Can we use SSD allocation method for HDD too?
> > >> We may not need a swap entry allocator optimized for HDD.
> > >
> > > Yes, that is the plan as well. That way we can completely get rid of
> > > the old scan_swap_map_slots() code.
> >
> > Good!
> >
> > > However, considering the size of the series, let's focus on the
> > > cluster allocation path first, get it tested and reviewed.
> >
> > OK.
> >
> > > For HDD optimization, mostly just the new block allocations portion
> > > need some separate code path from the new cluster allocator to not do
> > > the per cpu allocation. Allocating from the non free list doesn't
> > > need to change too
> >
> > I suggest not consider HDD optimization at all. Just use SSD algorithm
> > to simplify.
>
> Adding a global next allocating CI rather than the per CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.
Yes, if we enable the new cluster-based allocator for HDD, we can
enable THP and mTHP for HDD too, and use a global cluster_next instead
of the per-CPU one for it.
It's easy to do with minimal changes, and should actually boost
performance for HDD swap. I am currently testing this locally.
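
A rough sketch of the idea (next_ci and global_next_ci are hypothetical
field names, not the actual patch): SSD keeps the per-CPU hint, while
rotating media share one device-wide hint under si->lock so allocations
stay roughly sequential on disk.

        struct swap_cluster_info *ci;

        /* Caller holds si->lock. */
        if (si->flags & SWP_SOLIDSTATE)
                ci = this_cpu_ptr(si->percpu_cluster)->next_ci;  /* hypothetical */
        else
                ci = si->global_next_ci;                         /* hypothetical */
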
> > >>
> > >> Hi, Hugh,
> > >>
> > >> What do you think about this?
> > >>
> > >> > This streamline the swap allocation for SSD. The code matches the
> > >> > execution flow much better.
> > >> >
> > >> > User impact: For users that allocate and free mix order mTHP swapping,
> > >> > It greatly improves the success rate of the mTHP swap allocation after the
> > >> > initial phase.
> > >> >
> > >> > It also performs faster when the swapfile is close to full, because the
> > >> > allocator can get the non full cluster from a list rather than scanning
> > >> > a lot of swap_map entries.
> > >>
> > >> Do you have some test results to prove this? Or which test below can
> > >> prove this?
> > >
> > > The two zram tests are already proving this. The system time
> > > improvement is about 2% on my low CPU count machine.
> > > Kairui has a higher core count machine and the difference is higher
> > > there. The theory is that higher CPU count has higher contentions.
> >
> > I will interpret this as the performance is better in theory. But
> > there's almost no measurable results so far.
>
> I am trying to understand why don't see the performance improvement in
> the zram setup in my cover letter as a measurable result?
Hi Ying, you can check the test with the 32-core AMD machine in the
cover letter; as Chris pointed out, the performance gain is higher as
the core count grows. The performance gain is still not much (yet:
based on this design, things can go much faster once the HDD code is
dropped, which enables many other optimizations; this series
is mainly focusing on the fragmentation issue), but I think a
stable ~4-8% improvement with a Linux kernel build test
could be considered measurable?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
2024-08-16 8:01 ` Chris Li
@ 2024-08-19 8:08 ` Huang, Ying
2024-08-26 21:26 ` Chris Li
0 siblings, 1 reply; 32+ messages in thread
From: Huang, Ying @ 2024-08-19 8:08 UTC (permalink / raw)
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Wed, Aug 7, 2024 at 6:14 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
[snip]
>> >
>> > /*
>> > @@ -553,6 +568,19 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> > if (tmp == SWAP_NEXT_INVALID) {
>> > if (!list_empty(&si->free_clusters)) {
>> > ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> > + list_del(&ci->list);
>> > + spin_lock(&ci->lock);
>> > + ci->order = order;
>> > + ci->flags = 0;
>> > + spin_unlock(&ci->lock);
>> > + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> > + } else if (!list_empty(&si->nonfull_clusters[order])) {
>> > + ci = list_first_entry(&si->nonfull_clusters[order],
>> > + struct swap_cluster_info, list);
>> > + list_del(&ci->list);
>> > + spin_lock(&ci->lock);
>> > + ci->flags = 0;
>> > + spin_unlock(&ci->lock);
>> > tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> > } else if (!list_empty(&si->discard_clusters)) {
>>
>> We should check discard_clusters before nonfull clusters.
>
> And the reason behind that is?
>
> I see the discard_cluster can take a long time. It will take a
> synchronous wait for the issuing the discard command. Why not just use
> the nonfull list and return immediately. When the discard command
> finished. It will show up in the free list anyway.
I think that you are right. We don't need to wait for discard here.
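For clarity, the preference order being settled on here is roughly the
following (an illustrative sketch reusing the names from the diff above,
not the exact patch code):

/*
 * Illustrative only: pick the next cluster for an order-`order`
 * allocation. Completely free clusters are preferred, then per-order
 * nonfull clusters; discard is only considered when both are empty.
 */
static struct swap_cluster_info *pick_next_cluster(struct swap_info_struct *si,
						   int order)
{
	if (!list_empty(&si->free_clusters))
		return list_first_entry(&si->free_clusters,
					struct swap_cluster_info, list);
	if (!list_empty(&si->nonfull_clusters[order]))
		return list_first_entry(&si->nonfull_clusters[order],
					struct swap_cluster_info, list);
	/* else: fall back to discard_clusters / reclaim, or fail */
	return NULL;
}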
> BTW, what is your take on my previous analysis of the current SSD
> prefer write new cluster can wear out the SSD faster?
No. I don't agree with you on that. However, my knowledge of SSD
wear-out internals is quite limited.
> I think it might be useful to provide users an option to choose to
> write a non full list first. The trade off is more friendly to SSD
> wear out than preferring to write new blocks. If you keep doing the
> swap long enough, there will be no new free cluster anyway.
It depends on workloads. Some workloads may demonstrate better spatial
locality.
> The example I give in this email:
>
> https://lore.kernel.org/linux-mm/CACePvbXGBNC9WzzL4s2uB2UciOkV6nb4bKKkc5TBZP6QuHS_aQ@mail.gmail.com/
>
> Chris
>>
>> > /*
>> > @@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> > ci = lock_cluster(si, offset);
>> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>> > ci->count = 0;
>> > + ci->order = 0;
>> > ci->flags = 0;
>> > free_cluster(si, ci);
>> > unlock_cluster(ci);
>> > @@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>> > INIT_LIST_HEAD(&p->free_clusters);
>> > INIT_LIST_HEAD(&p->discard_clusters);
>> >
>> > + for (i = 0; i < SWAP_NR_ORDERS; i++)
>> > + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
>> > +
>> > for (i = 0; i < swap_header->info.nr_badpages; i++) {
>> > unsigned int page_nr = swap_header->info.badpages[i];
>> > if (page_nr == 0 || page_nr > swap_header->info.last_page)
--
Best Regards,
Huang, Ying
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-18 16:59 ` Kairui Song
@ 2024-08-19 8:27 ` Huang, Ying
2024-08-19 8:47 ` Kairui Song
0 siblings, 1 reply; 32+ messages in thread
From: Huang, Ying @ 2024-08-19 8:27 UTC (permalink / raw)
To: Kairui Song
Cc: Chris Li, Hugh Dickins, Andrew Morton, Ryan Roberts, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
Kairui Song <ryncsn@gmail.com> writes:
> On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
>>
>> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > Chris Li <chrisl@kernel.org> writes:
>> >
>> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > >>
>> > >> Hi, Chris,
>> > >>
>> > >> Chris Li <chrisl@kernel.org> writes:
>> > >>
>> > >> > This is the short term solutions "swap cluster order" listed
>> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
>> > >> > LSF/MM conference.
>> > >> >
>> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> > >> > orders" is introduced, it only allocates the mTHP swap entries
>> > >> > from the new empty cluster list. It has a fragmentation issue
>> > >> > reported by Barry.
>> > >> >
>> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> > >> >
>> > >> > The reason is that all the empty clusters have been exhausted while
>> > >> > there are plenty of free swap entries in the cluster that are
>> > >> > not 100% free.
>> > >> >
>> > >> > Remember the swap allocation order in the cluster.
>> > >> > Keep track of the per order non full cluster list for later allocation.
>> > >> >
>> > >> > This series gives the swap SSD allocation a new separate code path
>> > >> > from the HDD allocation. The new allocator use cluster list only
>> > >> > and do not global scan swap_map[] without lock any more.
>> > >>
>> > >> This sounds good. Can we use SSD allocation method for HDD too?
>> > >> We may not need a swap entry allocator optimized for HDD.
>> > >
>> > > Yes, that is the plan as well. That way we can completely get rid of
>> > > the old scan_swap_map_slots() code.
>> >
>> > Good!
>> >
>> > > However, considering the size of the series, let's focus on the
>> > > cluster allocation path first, get it tested and reviewed.
>> >
>> > OK.
>> >
>> > > For HDD optimization, mostly just the new block allocations portion
>> > > need some separate code path from the new cluster allocator to not do
>> > > the per cpu allocation. Allocating from the non free list doesn't
>> > > need to change too
>> >
>> > I suggest not consider HDD optimization at all. Just use SSD algorithm
>> > to simplify.
>>
>> Adding a global next allocating CI rather than the per CPU next CI
>> pointer is pretty trivial as well. It is just a different way to fetch
>> the next cluster pointer.
>
> Yes, if we enable the new cluster based allocator for HDD, we can
> enable THP and mTHP for HDD too, and use a global cluster_next instead
> of Per-CPU for it.
> It's easy to do with minimal changes, and should actually boost
> performance for HDD SWAP. Currently testing this locally.
I think that it's better to start with the SSD algorithm. Then, you can
add HDD-specific optimizations on top of it with supporting data.
BTW, I don't know why HDD shouldn't use per-CPU clusters. Sequential
writing is more important for HDD.
>> > >>
>> > >> Hi, Hugh,
>> > >>
>> > >> What do you think about this?
>> > >>
>> > >> > This streamline the swap allocation for SSD. The code matches the
>> > >> > execution flow much better.
>> > >> >
>> > >> > User impact: For users that allocate and free mix order mTHP swapping,
>> > >> > It greatly improves the success rate of the mTHP swap allocation after the
>> > >> > initial phase.
>> > >> >
>> > >> > It also performs faster when the swapfile is close to full, because the
>> > >> > allocator can get the non full cluster from a list rather than scanning
>> > >> > a lot of swap_map entries.
>> > >>
>> > >> Do you have some test results to prove this? Or which test below can
>> > >> prove this?
>> > >
>> > > The two zram tests are already proving this. The system time
>> > > improvement is about 2% on my low CPU count machine.
>> > > Kairui has a higher core count machine and the difference is higher
>> > > there. The theory is that higher CPU count has higher contentions.
>> >
>> > I will interpret this as the performance is better in theory. But
>> > there's almost no measurable results so far.
>>
>> I am trying to understand why don't see the performance improvement in
>> the zram setup in my cover letter as a measurable result?
>
> Hi Ying, you can check the test with the 32 cores AMD machine in the
> cover letter, as Chris pointed out the performance gain is higher as
> core number grows. The performance gain is still not much (*yet, based
> on this design thing can go much faster after HDD codes are
> dropped which enables many other optimizations, this series
> is mainly focusing on the fragmentation issue), but I think a
> stable ~4 - 8% improvement with a build linux kernel test
> could be considered measurable?
Is this the test result for "when the swapfile is close to full"?
--
Best Regards,
Huang, Ying
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-16 7:47 ` Chris Li
2024-08-18 16:59 ` Kairui Song
@ 2024-08-19 8:39 ` Huang, Ying
1 sibling, 0 replies; 32+ messages in thread
From: Huang, Ying @ 2024-08-19 8:39 UTC (permalink / raw)
To: Chris Li
Cc: Hugh Dickins, Andrew Morton, Kairui Song, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Chris,
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > This is the short term solutions "swap cluster order" listed
>> >> > in my "Swap Abstraction" discussion slice 8 in the recent
>> >> > LSF/MM conference.
>> >> >
>> >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
>> >> > orders" is introduced, it only allocates the mTHP swap entries
>> >> > from the new empty cluster list. It has a fragmentation issue
>> >> > reported by Barry.
>> >> >
>> >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>> >> >
>> >> > The reason is that all the empty clusters have been exhausted while
>> >> > there are plenty of free swap entries in the cluster that are
>> >> > not 100% free.
>> >> >
>> >> > Remember the swap allocation order in the cluster.
>> >> > Keep track of the per order non full cluster list for later allocation.
>> >> >
>> >> > This series gives the swap SSD allocation a new separate code path
>> >> > from the HDD allocation. The new allocator use cluster list only
>> >> > and do not global scan swap_map[] without lock any more.
>> >>
>> >> This sounds good. Can we use SSD allocation method for HDD too?
>> >> We may not need a swap entry allocator optimized for HDD.
>> >
>> > Yes, that is the plan as well. That way we can completely get rid of
>> > the old scan_swap_map_slots() code.
>>
>> Good!
>>
>> > However, considering the size of the series, let's focus on the
>> > cluster allocation path first, get it tested and reviewed.
>>
>> OK.
>>
>> > For HDD optimization, mostly just the new block allocations portion
>> > need some separate code path from the new cluster allocator to not do
>> > the per cpu allocation. Allocating from the non free list doesn't
>> > need to change too
>>
>> I suggest not consider HDD optimization at all. Just use SSD algorithm
>> to simplify.
>
> Adding a global next allocating CI rather than the per CPU next CI
> pointer is pretty trivial as well. It is just a different way to fetch
> the next cluster pointer.
For HDD optimization, I mean the original path without struct swap_cluster_info, etc.
>>
>> >>
>> >> Hi, Hugh,
>> >>
>> >> What do you think about this?
>> >>
>> >> > This streamline the swap allocation for SSD. The code matches the
>> >> > execution flow much better.
>> >> >
>> >> > User impact: For users that allocate and free mix order mTHP swapping,
>> >> > It greatly improves the success rate of the mTHP swap allocation after the
>> >> > initial phase.
>> >> >
>> >> > It also performs faster when the swapfile is close to full, because the
>> >> > allocator can get the non full cluster from a list rather than scanning
>> >> > a lot of swap_map entries.
>> >>
>> >> Do you have some test results to prove this? Or which test below can
>> >> prove this?
>> >
>> > The two zram tests are already proving this. The system time
>> > improvement is about 2% on my low CPU count machine.
>> > Kairui has a higher core count machine and the difference is higher
>> > there. The theory is that higher CPU count has higher contentions.
>>
>> I will interpret this as the performance is better in theory. But
>> there's almost no measurable results so far.
>
> I am trying to understand why don't see the performance improvement in
> the zram setup in my cover letter as a measurable result?
IIUC, there's no benchmark score difference, just system time, and that
number is low too.
For Kairui's test, does all of the performance improvement come from the
"swapfile is close to full" case?
>>
>> > The 2% system time number does not sound like much. But consider this
>> > two factors:
>> > 1) swap allocator only takes a small percentage of the overall workload.
>> > 2) The new allocator does more work.
>> > The old allocator has a time tick budget. It will abort and fail to
>> > find an entry when it runs out of time budget, even though there are
>> > still some free entries on the swapfile.
>>
>> What is the time tick budget you mentioned?
>
> I was under the impression that the previous swap entry allocation
> code will not scan 100% of the swapfile if there is only one entry
> left.
> Please let me know if my understanding is not correct.
>
> /* time to take a break? */
> if (unlikely(--latency_ration < 0)) {
> if (n_ret)
> goto done;
> spin_unlock(&si->lock);
> cond_resched();
> spin_lock(&si->lock);
> latency_ration = LATENCY_LIMIT;
> }
IIUC, this is to reduce latency via cond_resched(). If n_ret != 0, we
have already allocated some swap entries successfully, so it's OK to
return early to reduce allocation latency.
>
>>
>> > The new allocator can get to the last few free swap entries if it is
>> > available. If not then, the new swap allocator will work harder on
>> > swap cache reclaim.
>> >
>> > From the swap cache reclaim aspect, it is very hard to optimize the
>> > swap cache reclaim in the old allocation path because the scan
>> > position is randomized.
>> > The full list and frag list both design to help reduce the repeat
>> > reclaim attempt of the swap cache.
>>
>> [snip]
>>
--
Best Regards,
Huang, Ying
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-19 8:27 ` Huang, Ying
@ 2024-08-19 8:47 ` Kairui Song
2024-08-19 21:27 ` Chris Li
0 siblings, 1 reply; 32+ messages in thread
From: Kairui Song @ 2024-08-19 8:47 UTC (permalink / raw)
To: Huang, Ying
Cc: Chris Li, Hugh Dickins, Andrew Morton, Ryan Roberts, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On Mon, Aug 19, 2024 at 4:31 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
> >>
> >> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > Chris Li <chrisl@kernel.org> writes:
> >> >
> >> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> >> > >>
> >> > >> Hi, Chris,
> >> > >>
> >> > >> Chris Li <chrisl@kernel.org> writes:
> >> > >>
> >> > >> > This is the short term solutions "swap cluster order" listed
> >> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
> >> > >> > LSF/MM conference.
> >> > >> >
> >> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> >> > >> > orders" is introduced, it only allocates the mTHP swap entries
> >> > >> > from the new empty cluster list. It has a fragmentation issue
> >> > >> > reported by Barry.
> >> > >> >
> >> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> >> > >> >
> >> > >> > The reason is that all the empty clusters have been exhausted while
> >> > >> > there are plenty of free swap entries in the cluster that are
> >> > >> > not 100% free.
> >> > >> >
> >> > >> > Remember the swap allocation order in the cluster.
> >> > >> > Keep track of the per order non full cluster list for later allocation.
> >> > >> >
> >> > >> > This series gives the swap SSD allocation a new separate code path
> >> > >> > from the HDD allocation. The new allocator use cluster list only
> >> > >> > and do not global scan swap_map[] without lock any more.
> >> > >>
> >> > >> This sounds good. Can we use SSD allocation method for HDD too?
> >> > >> We may not need a swap entry allocator optimized for HDD.
> >> > >
> >> > > Yes, that is the plan as well. That way we can completely get rid of
> >> > > the old scan_swap_map_slots() code.
> >> >
> >> > Good!
> >> >
> >> > > However, considering the size of the series, let's focus on the
> >> > > cluster allocation path first, get it tested and reviewed.
> >> >
> >> > OK.
> >> >
> >> > > For HDD optimization, mostly just the new block allocations portion
> >> > > need some separate code path from the new cluster allocator to not do
> >> > > the per cpu allocation. Allocating from the non free list doesn't
> >> > > need to change too
> >> >
> >> > I suggest not consider HDD optimization at all. Just use SSD algorithm
> >> > to simplify.
> >>
> >> Adding a global next allocating CI rather than the per CPU next CI
> >> pointer is pretty trivial as well. It is just a different way to fetch
> >> the next cluster pointer.
> >
> > Yes, if we enable the new cluster based allocator for HDD, we can
> > enable THP and mTHP for HDD too, and use a global cluster_next instead
> > of Per-CPU for it.
> > It's easy to do with minimal changes, and should actually boost
> > performance for HDD SWAP. Currently testing this locally.
>
> I think that it's better to start with SSD algorithm. Then, you can add
> HDD specific optimization on top of it with supporting data.
Yes, we have the same idea.
>
> BTW, I don't know why HDD shouldn't use per-CPU cluster. Sequential
> writing is more important for HDD.
> >> > >>
> >> > >> Hi, Hugh,
> >> > >>
> >> > >> What do you think about this?
> >> > >>
> >> > >> > This streamline the swap allocation for SSD. The code matches the
> >> > >> > execution flow much better.
> >> > >> >
> >> > >> > User impact: For users that allocate and free mix order mTHP swapping,
> >> > >> > It greatly improves the success rate of the mTHP swap allocation after the
> >> > >> > initial phase.
> >> > >> >
> >> > >> > It also performs faster when the swapfile is close to full, because the
> >> > >> > allocator can get the non full cluster from a list rather than scanning
> >> > >> > a lot of swap_map entries.
> >> > >>
> >> > >> Do you have some test results to prove this? Or which test below can
> >> > >> prove this?
> >> > >
> >> > > The two zram tests are already proving this. The system time
> >> > > improvement is about 2% on my low CPU count machine.
> >> > > Kairui has a higher core count machine and the difference is higher
> >> > > there. The theory is that higher CPU count has higher contentions.
> >> >
> >> > I will interpret this as the performance is better in theory. But
> >> > there's almost no measurable results so far.
> >>
> >> I am trying to understand why don't see the performance improvement in
> >> the zram setup in my cover letter as a measurable result?
> >
> > Hi Ying, you can check the test with the 32 cores AMD machine in the
> > cover letter, as Chris pointed out the performance gain is higher as
> > core number grows. The performance gain is still not much (*yet, based
> > on this design thing can go much faster after HDD codes are
> > dropped which enables many other optimizations, this series
> > is mainly focusing on the fragmentation issue), but I think a
> > stable ~4 - 8% improvement with a build linux kernel test
> > could be considered measurable?
>
> Is this the test result for "when the swapfile is close to full"?
Yes, it's about 60% to 90% full during the whole test process. If ZRAM
is completely full, the workload will go OOM, but testing with madvise
showed no performance drop.
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-19 8:47 ` Kairui Song
@ 2024-08-19 21:27 ` Chris Li
0 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2024-08-19 21:27 UTC (permalink / raw)
To: Kairui Song
Cc: Huang, Ying, Hugh Dickins, Andrew Morton, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Hi Kairui,
On Mon, Aug 19, 2024 at 1:48 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Aug 19, 2024 at 4:31 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > > On Fri, Aug 16, 2024 at 3:53 PM Chris Li <chrisl@kernel.org> wrote:
> > >>
> > >> On Thu, Aug 8, 2024 at 1:38 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >> >
> > >> > Chris Li <chrisl@kernel.org> writes:
> > >> >
> > >> > > On Wed, Aug 7, 2024 at 12:59 AM Huang, Ying <ying.huang@intel.com> wrote:
> > >> > >>
> > >> > >> Hi, Chris,
> > >> > >>
> > >> > >> Chris Li <chrisl@kernel.org> writes:
> > >> > >>
> > >> > >> > This is the short term solutions "swap cluster order" listed
> > >> > >> > in my "Swap Abstraction" discussion slice 8 in the recent
> > >> > >> > LSF/MM conference.
> > >> > >> >
> > >> > >> > When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> > >> > >> > orders" is introduced, it only allocates the mTHP swap entries
> > >> > >> > from the new empty cluster list. It has a fragmentation issue
> > >> > >> > reported by Barry.
> > >> > >> >
> > >> > >> > https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > >> > >> >
> > >> > >> > The reason is that all the empty clusters have been exhausted while
> > >> > >> > there are plenty of free swap entries in the cluster that are
> > >> > >> > not 100% free.
> > >> > >> >
> > >> > >> > Remember the swap allocation order in the cluster.
> > >> > >> > Keep track of the per order non full cluster list for later allocation.
> > >> > >> >
> > >> > >> > This series gives the swap SSD allocation a new separate code path
> > >> > >> > from the HDD allocation. The new allocator use cluster list only
> > >> > >> > and do not global scan swap_map[] without lock any more.
> > >> > >>
> > >> > >> This sounds good. Can we use SSD allocation method for HDD too?
> > >> > >> We may not need a swap entry allocator optimized for HDD.
> > >> > >
> > >> > > Yes, that is the plan as well. That way we can completely get rid of
> > >> > > the old scan_swap_map_slots() code.
> > >> >
> > >> > Good!
> > >> >
> > >> > > However, considering the size of the series, let's focus on the
> > >> > > cluster allocation path first, get it tested and reviewed.
> > >> >
> > >> > OK.
> > >> >
> > >> > > For HDD optimization, mostly just the new block allocations portion
> > >> > > need some separate code path from the new cluster allocator to not do
> > >> > > the per cpu allocation. Allocating from the non free list doesn't
> > >> > > need to change too
> > >> >
> > >> > I suggest not consider HDD optimization at all. Just use SSD algorithm
> > >> > to simplify.
> > >>
> > >> Adding a global next allocating CI rather than the per CPU next CI
> > >> pointer is pretty trivial as well. It is just a different way to fetch
> > >> the next cluster pointer.
> > >
> > > Yes, if we enable the new cluster based allocator for HDD, we can
> > > enable THP and mTHP for HDD too, and use a global cluster_next instead
> > > of Per-CPU for it.
> > > It's easy to do with minimal changes, and should actually boost
> > > performance for HDD SWAP. Currently testing this locally.
> >
> > I think that it's better to start with SSD algorithm. Then, you can add
> > HDD specific optimization on top of it with supporting data.
>
> Yes, we are having the same idea.
>
> >
> > BTW, I don't know why HDD shouldn't use per-CPU cluster. Sequential
> > writing is more important for HDD.
> > >> > >>
> > >> > >> Hi, Hugh,
> > >> > >>
> > >> > >> What do you think about this?
> > >> > >>
> > >> > >> > This streamline the swap allocation for SSD. The code matches the
> > >> > >> > execution flow much better.
> > >> > >> >
> > >> > >> > User impact: For users that allocate and free mix order mTHP swapping,
> > >> > >> > It greatly improves the success rate of the mTHP swap allocation after the
> > >> > >> > initial phase.
> > >> > >> >
> > >> > >> > It also performs faster when the swapfile is close to full, because the
> > >> > >> > allocator can get the non full cluster from a list rather than scanning
> > >> > >> > a lot of swap_map entries.
> > >> > >>
> > >> > >> Do you have some test results to prove this? Or which test below can
> > >> > >> prove this?
> > >> > >
> > >> > > The two zram tests are already proving this. The system time
> > >> > > improvement is about 2% on my low CPU count machine.
> > >> > > Kairui has a higher core count machine and the difference is higher
> > >> > > there. The theory is that higher CPU count has higher contentions.
> > >> >
> > >> > I will interpret this as the performance is better in theory. But
> > >> > there's almost no measurable results so far.
> > >>
> > >> I am trying to understand why don't see the performance improvement in
> > >> the zram setup in my cover letter as a measurable result?
> > >
> > > Hi Ying, you can check the test with the 32 cores AMD machine in the
> > > cover letter, as Chris pointed out the performance gain is higher as
> > > core number grows. The performance gain is still not much (*yet, based
> > > on this design thing can go much faster after HDD codes are
> > > dropped which enables many other optimizations, this series
> > > is mainly focusing on the fragmentation issue), but I think a
> > > stable ~4 - 8% improvement with a build linux kernel test
> > > could be considered measurable?
> >
> > Is this the test result for "when the swapfile is close to full"?
>
> Yes, it's about 60% to 90% full during the whole test process. If ZRAM
> is completely full the workload will go OOM, but testing with madvice
BTW, one trick to avoid OOM when ZRAM gets completely full is to have
two zram devices and assign them different priorities. Let the first
zram device get 100% full; the swap then overflows to the second zram
device, which has more swap entries, avoiding the OOM.
Chris
> showed no performance drop.
* Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
2024-08-19 8:08 ` Huang, Ying
@ 2024-08-26 21:26 ` Chris Li
2024-09-09 7:19 ` Huang, Ying
0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-08-26 21:26 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying <ying.huang@intel.com> wrote:
> > BTW, what is your take on my previous analysis of the current SSD
> > prefer write new cluster can wear out the SSD faster?
>
> No. I don't agree with you on that. However, my knowledge on SSD
> wearing out algorithm is quite limited.
Hi Ying,
Can you please clarify? You said you have limited knowledge of SSD
wear-out internals. Does that mean you have low confidence in your
verdict?
I would like to understand your reasoning for the disagreement,
starting from which part of my analysis you disagree with.
At the same time, we can consult someone who works in the SSD space
and understand the SSD internal wearing better.
I see this as a serious issue for using SSDs as swap in data center use
cases. In your laptop use case, you are not running LLM training 24/7,
right? So it still fits the usage model of an occasional swap file user,
and it might not be as big a deal. In a data center workload, e.g.
Google's, swap is written 24/7, and the amount of data swapped out is
much higher than typical laptop usage as well. There the SSD wear-out
issue would be much worse, because the SSD is under constant write with
much bigger swap usage.
I am claiming that *some* SSDs would have a higher internal write
amplification factor when doing random 4K writes all over the drive than
when doing random 4K writes to a small area of the drive.
I do believe having a different swap-out policy that controls the
preference for old vs. new clusters is beneficial to the data center SSD
swap use case.
It comes down to:
1) SSDs are slow to erase, so most SSDs erase at a huge erase-block
size.
2) The SSD remaps logical block addresses to internal erase blocks. Most
newly written data, regardless of its logical block address on the
drive, is grouped together and written to an erase block.
3) When new data overwrites an old logical address, the SSD firmware
marks the over-written data as obsolete. The discard command has a
similar effect without introducing new data.
4) When the SSD runs out of new erase blocks, it needs to GC the old
fragmented erase blocks and partially rewrite the old data to make room
for new erase blocks. This is where the discard command can be
beneficial: it tells the SSD firmware which parts of the old data the GC
process can just ignore and skip rewriting.
GC of obsolete logical blocks is a generally hard problem for SSDs.
I am not claiming every SSD has this kind of behavior, but it is common
enough to be worth providing an option.
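As a rough illustration of why scattered invalidation hurts (a
simplified greedy-GC model of write amplification; real firmware
behavior varies, so this is only a sketch, not a claim about any
specific drive):

#include <stdio.h>

/*
 * Simplified greedy-GC estimate: if the victim erase block still holds
 * a fraction `valid` of live data, reclaiming it rewrites that data, so
 * total flash writes per unit of host data are roughly 1 / (1 - valid).
 */
static double write_amplification(double valid)
{
	return 1.0 / (1.0 - valid);
}

int main(void)
{
	/* Hypothetical valid-data fractions, for illustration only. */
	printf("10%% valid: WAF ~ %.2f\n", write_amplification(0.10)); /* ~1.11 */
	printf("50%% valid: WAF ~ %.2f\n", write_amplification(0.50)); /* ~2.00 */
	printf("90%% valid: WAF ~ %.2f\n", write_amplification(0.90)); /* ~10.00 */
	return 0;
}

The more evenly the 4K invalidations are spread across the drive, the
higher the typical valid fraction of any victim block at GC time, and
the worse the write amplification in this model.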
> > I think it might be useful to provide users an option to choose to
> > write a non full list first. The trade off is more friendly to SSD
> > wear out than preferring to write new blocks. If you keep doing the
> > swap long enough, there will be no new free cluster anyway.
>
> It depends on workloads. Some workloads may demonstrate better spatial
> locality.
Yes, I agree that it may or may not happen, depending on the workload.
Randomly distributed swap entries are a common pattern we need to
consider as well, and the odds are against us. As in the quoted email
where I did the calculation, the odds of getting a whole cluster free in
the random model are very low, about 4.4e-15, even if we are only using
1/16 of the swap entries in the swapfile.
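For reference, that figure can be reproduced with a back-of-the-envelope
model: assuming 512 entries per cluster (a 2MB cluster of 4KB pages) and
independent, uniformly random occupancy at 1/16 utilization, the chance
that every entry of a given cluster is free is (1 - 1/16)^512. A minimal
sketch (illustrative only, not the exact calculation from the linked
email):

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Assumed model parameters, see the note above. */
	const int entries_per_cluster = 512;
	const double used_fraction = 1.0 / 16.0;

	/* Probability that all entries of one cluster are free. */
	double p = pow(1.0 - used_fraction, entries_per_cluster);
	printf("P(cluster fully free) ~= %.2e\n", p); /* ~4.5e-15 */
	return 0;
}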
Chris
>
> > The example I give in this email:
> >
> > https://lore.kernel.org/linux-mm/CACePvbXGBNC9WzzL4s2uB2UciOkV6nb4bKKkc5TBZP6QuHS_aQ@mail.gmail.com/
> >
> > Chris
> >>
> >> > /*
> >> > @@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >> > ci = lock_cluster(si, offset);
> >> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >> > ci->count = 0;
> >> > + ci->order = 0;
> >> > ci->flags = 0;
> >> > free_cluster(si, ci);
> >> > unlock_cluster(ci);
> >> > @@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> >> > INIT_LIST_HEAD(&p->free_clusters);
> >> > INIT_LIST_HEAD(&p->discard_clusters);
> >> >
> >> > + for (i = 0; i < SWAP_NR_ORDERS; i++)
> >> > + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> >> > +
> >> > for (i = 0; i < swap_header->info.nr_badpages; i++) {
> >> > unsigned int page_nr = swap_header->info.badpages[i];
> >> > if (page_nr == 0 || page_nr > swap_header->info.last_page)
>
> --
> Best Regards,
> Huang, Ying
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (10 preceding siblings ...)
[not found] ` <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2024-09-02 1:20 ` Andrew Morton
11 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2024-09-02 1:20 UTC (permalink / raw)
To: Chris Li
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Tue, 30 Jul 2024 23:49:12 -0700 Chris Li <chrisl@kernel.org> wrote:
> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
Well this patchset hasn't needed a -fix patch since Aug 5, which is
hopeful.
How do people feel about moving this series into mm-stable for the next
merge window?
A few notes I've made are
https://lkml.kernel.org/r/CACePvbX1EWQk03YcC47s7+vn40kEFb_3wp3D_GmJV-8Fn2j+=Q@mail.gmail.com
https://lkml.kernel.org/r/87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com
https://lkml.kernel.org/r/CACePvbU9NUgfD-QC-hdYEmeMbLc5whOKeW4hb2ijcPonZ4i6mg@mail.gmail.com
Have these all been resolved?
Thanks.
* Re: [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
2024-08-26 21:26 ` Chris Li
@ 2024-09-09 7:19 ` Huang, Ying
0 siblings, 0 replies; 32+ messages in thread
From: Huang, Ying @ 2024-09-09 7:19 UTC (permalink / raw)
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Mon, Aug 19, 2024 at 1:11 AM Huang, Ying <ying.huang@intel.com> wrote:
>> > BTW, what is your take on my previous analysis of the current SSD
>> > prefer write new cluster can wear out the SSD faster?
>>
>> No. I don't agree with you on that. However, my knowledge on SSD
>> wearing out algorithm is quite limited.
>
> Hi Ying,
>
> Can you please clarify. You said you have limited knowledge on SSD
> wearing internals. Does that mean you have low confidence in your
> verdict?
Yes.
> I would like to understand your reasoning for the disagreement.
> Starting from which part of my analysis you are disagreeing with.
>
> At the same time, we can consult someone who works in the SSD space
> and understand the SSD internal wearing better.
I think that is a good idea.
> I see this is a serious issue for using SSD as swapping for data
> center usage cases. In your laptop usage case, you are not using the
> LLM training 24/7 right? So it still fits the usage model of the
> occasional user of the swap file. It might not be as big a deal. In
> the data center workload, e.g. Google's swap write 24/7. The amount of
> data swapped out is much higher than typical laptop usage as well.
> There the SSD wearing out issue would be much higher because the SSD
> is under constant write and much bigger swap usage.
>
> I am claiming that *some* SSD would have a higher internal write
> amplification factor if doing random 4K write all over the drive, than
> random 4K write to a small area of the drive.
> I do believe having a different swap out policy controlling preferring
> old vs new clusters is beneficial to the data center SSD swap usage
> case.
> It come downs to:
> 1) SSD are slow to erase. So most of the SSD performance erases at a
> huge erase block size.
> 2) SSD remaps the logical block address to the internal erase block.
> Most of the new data rewritten, regardless of the logical block
> address of the SSD drive, grouped together and written to the erase
> block.
> 3) When new data is overridden to the old logical data address, SSD
> firmware marks those over-written data as obsolete. The discard
> command has the similar effect without introducing new data.
> 4) When the SSD driver runs out of new erase block, it would need to
> GC the old fragmented erased block and pertectial write out of old
> data to make room for new erase block. Where the discard command can
> be beneficial. It tells the SSD firmware which part of the old data
> the GC process can just ignore and skip rewriting.
>
> GC of the obsolete logical blocks is a general hard problem for the SSD.
>
> I am not claiming every SSD has this kind of behavior, but it is
> common enough to be worth providing an option.
>
>> > I think it might be useful to provide users an option to choose to
>> > write a non full list first. The trade off is more friendly to SSD
>> > wear out than preferring to write new blocks. If you keep doing the
>> > swap long enough, there will be no new free cluster anyway.
>>
>> It depends on workloads. Some workloads may demonstrate better spatial
>> locality.
>
> Yes, agree that it may happen or may not happen depending on the
> workload . The random distribution swap entry is a common pattern we
> need to consider as well. The odds are against us. As in the quoted
> email where I did the calculation, the odds of getting the whole
> cluster free in the random model is very low, 4.4E10-15 even if we are
> only using 1/16 swap entries in the swapfile.
Do you have real workloads? For example, some trace?
--
Best Regards,
Huang, Ying
end of thread
Thread overview: 32+ messages
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-07-31 6:49 ` [PATCH v5 1/9] mm: swap: swap cluster switch to double link list Chris Li
2024-07-31 6:49 ` [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
[not found] ` <87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com>
2024-08-16 8:01 ` Chris Li
2024-08-19 8:08 ` Huang, Ying
2024-08-26 21:26 ` Chris Li
2024-09-09 7:19 ` Huang, Ying
2024-07-31 6:49 ` [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots() Chris Li
2024-07-31 6:49 ` [PATCH v5 4/9] mm: swap: clean up initialization helper chrisl
2024-07-31 6:49 ` [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP chrisl
2024-08-03 9:11 ` Barry Song
2024-08-03 10:57 ` Barry Song
2024-07-31 6:49 ` [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache chrisl
2024-08-03 10:38 ` Barry Song
2024-08-03 12:18 ` Kairui Song
2024-08-04 18:06 ` Chris Li
2024-08-05 1:53 ` Barry Song
2024-07-31 6:49 ` [PATCH v5 7/9] mm: swap: add a fragment cluster list chrisl
2024-07-31 6:49 ` [PATCH v5 8/9] mm: swap: relaim the cached parts that got scanned chrisl
2024-07-31 6:49 ` [PATCH v5 9/9] mm: swap: add a adaptive full cluster cache reclaim chrisl
2024-08-01 9:14 ` [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order David Hildenbrand
2024-08-01 9:59 ` Kairui Song
2024-08-01 10:06 ` Kairui Song
[not found] ` <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>
2024-08-16 7:36 ` Chris Li
2024-08-17 17:47 ` Kairui Song
[not found] ` <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>
[not found] ` <CACePvbXH8b9SOePQ-Ld_UBbcAdJ3gdYtEkReMto5Hbq9WAL7JQ@mail.gmail.com>
[not found] ` <87sevfza3w.fsf@yhuang6-desk2.ccr.corp.intel.com>
2024-08-16 7:47 ` Chris Li
2024-08-18 16:59 ` Kairui Song
2024-08-19 8:27 ` Huang, Ying
2024-08-19 8:47 ` Kairui Song
2024-08-19 21:27 ` Chris Li
2024-08-19 8:39 ` Huang, Ying
2024-09-02 1:20 ` Andrew Morton