* [PATCH v4 0/3] mm: swap: mTHP swap allocator based on swap cluster order
@ 2024-07-11 7:29 Chris Li
2024-07-11 7:29 ` [PATCH v4 1/3] mm: swap: swap cluster switch to double link list Chris Li
` (4 more replies)
0 siblings, 5 replies; 43+ messages in thread
From: Chris Li @ 2024-07-11 7:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
This is the short term solution "swap cluster order" listed
in slide 8 of my "Swap Abstraction" discussion at the recent
LSF/MM conference.
When commit 845982eb264bc "mm: swap: allow storage of all mTHP
orders" was introduced, it only allocated mTHP swap entries
from the new empty cluster list. That has a fragmentation issue
reported by Barry:
https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
The reason is that all the empty clusters get exhausted while
there are still plenty of free swap entries in clusters that are
not 100% free.
Remember the swap allocation order in the cluster, and keep track
of a per-order nonfull cluster list for later allocations.
Patch 3 of this series gives the SSD swap allocation a code path
separate from the HDD allocation. The new allocator uses the
cluster lists only and no longer does a global scan of swap_map[]
without holding a lock.
This streamlines the swap allocation for SSD. The code matches the
execution flow much better.
User impact: for users that allocate and free mixed-order mTHP
swap entries, it greatly improves the success rate of mTHP swap
allocation after the initial phase.
It also performs faster when the swapfile is close to full, because
the allocator can take a nonfull cluster from a list rather than
scanning a lot of swap_map entries.
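As an illustration only, here is a minimal userspace sketch of the
allocation policy (the struct layout, sizes and array-based "lists" are
simplified assumptions, not the kernel code; only the preference of free
clusters before per-order nonfull clusters mirrors this series):

/* Sketch: prefer an empty cluster, then a nonfull cluster that was
 * already assigned to the requested order, instead of scanning the
 * whole swap_map[]. Illustrative names, not the kernel implementation. */
#include <stdio.h>

#define NR_CLUSTERS   4
#define CLUSTER_SLOTS 512                      /* like SWAPFILE_CLUSTER */

struct cluster {
	int free;                              /* free swap slots left */
	int order;                             /* order assigned at first use */
};

/* all clusters are fragmented: none is 100% free */
static struct cluster clusters[NR_CLUSTERS] = {
	{ 128, 2 }, { 16, 0 }, { 300, 2 }, { 4, 0 },
};

static struct cluster *alloc_cluster(int order)
{
	int need = 1 << order;
	int i;

	for (i = 0; i < NR_CLUSTERS; i++)      /* stand-in for free_clusters */
		if (clusters[i].free == CLUSTER_SLOTS) {
			clusters[i].order = order;
			return &clusters[i];
		}
	for (i = 0; i < NR_CLUSTERS; i++)      /* stand-in for nonfull_clusters[order] */
		if (clusters[i].order == order && clusters[i].free >= need)
			return &clusters[i];
	return NULL;                           /* caller would fall back */
}

int main(void)
{
	struct cluster *ci = alloc_cluster(2); /* order-2 mTHP needs 4 slots */

	if (ci) {
		ci->free -= 1 << 2;
		printf("order-2 entry allocated, %d slots left in cluster\n", ci->free);
	} else {
		printf("no usable cluster, fall back to small page swapout\n");
	}
	return 0;
}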
This series still lacks the swap cache reclaim feature. The reclaim
series of patches is under development and testing right now and
will be posted to the mailing list soon. For this reason, patch 3
is considered RFC and not ready to merge.
With Barry's mthp test program V2:
Without:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
...
Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
..
Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
With:
$ ./thp_swap_allocator_test -a
Iteration 1: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
...
Iteration 98: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 99: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 100: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
$ ./thp_swap_allocator_test -a -s
Iteration 1: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 7: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 8: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 9: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 10: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 11: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 12: swpout inc: 232, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 13: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
Iteration 15: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 16: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 17: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 18: swpout inc: 234, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 19: swpout inc: 220, swpout fallback inc: 6, Fallback percentage: 2.65%
Iteration 20: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 21: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 22: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 23: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 24: swpout inc: 232, swpout fallback inc: 1, Fallback percentage: 0.43%
Iteration 25: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 26: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 27: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 29: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 30: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 31: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 32: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 33: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 34: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 35: swpout inc: 230, swpout fallback inc: 3, Fallback percentage: 1.29%
Iteration 36: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 37: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 38: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 39: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
Iteration 40: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 41: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 42: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 43: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 44: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 45: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 46: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 47: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 48: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 49: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 50: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 51: swpout inc: 224, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 52: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 53: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 54: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 55: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 56: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
Iteration 57: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 58: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 59: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 60: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
Iteration 61: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 62: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 63: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 64: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 65: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 66: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 67: swpout inc: 220, swpout fallback inc: 2, Fallback percentage: 0.90%
Iteration 68: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 69: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 70: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 71: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 72: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 73: swpout inc: 218, swpout fallback inc: 5, Fallback percentage: 2.24%
Iteration 74: swpout inc: 223, swpout fallback inc: 5, Fallback percentage: 2.19%
Iteration 75: swpout inc: 222, swpout fallback inc: 7, Fallback percentage: 3.06%
Iteration 76: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 77: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 78: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 79: swpout inc: 223, swpout fallback inc: 2, Fallback percentage: 0.89%
Iteration 80: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 81: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 82: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 83: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
Iteration 84: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 85: swpout inc: 213, swpout fallback inc: 1, Fallback percentage: 0.47%
Iteration 86: swpout inc: 215, swpout fallback inc: 8, Fallback percentage: 3.59%
Iteration 87: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 88: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 89: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
Iteration 90: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 91: swpout inc: 214, swpout fallback inc: 1, Fallback percentage: 0.47%
Iteration 92: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 93: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 94: swpout inc: 223, swpout fallback inc: 2, Fallback percentage: 0.89%
Iteration 95: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
Iteration 96: swpout inc: 223, swpout fallback inc: 4, Fallback percentage: 1.76%
Iteration 97: swpout inc: 223, swpout fallback inc: 7, Fallback percentage: 3.04%
Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
Iteration 99: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
Iteration 100: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
$ ./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 134, swpout fallback inc: 98, Fallback percentage: 42.24%
Iteration 3: swpout inc: 72, swpout fallback inc: 154, Fallback percentage: 68.14%
Iteration 4: swpout inc: 40, swpout fallback inc: 183, Fallback percentage: 82.06%
Iteration 5: swpout inc: 27, swpout fallback inc: 199, Fallback percentage: 88.05%
Iteration 6: swpout inc: 22, swpout fallback inc: 202, Fallback percentage: 90.18%
Iteration 7: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 8: swpout inc: 14, swpout fallback inc: 214, Fallback percentage: 93.86%
Iteration 9: swpout inc: 5, swpout fallback inc: 221, Fallback percentage: 97.79%
Iteration 10: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
...
Iteration 97: swpout inc: 12, swpout fallback inc: 207, Fallback percentage: 94.52%
Iteration 98: swpout inc: 8, swpout fallback inc: 219, Fallback percentage: 96.48%
Iteration 99: swpout inc: 16, swpout fallback inc: 218, Fallback percentage: 93.16%
Iteration 100: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
$ ./thp_swap_allocator_test -s
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 84, swpout fallback inc: 148, Fallback percentage: 63.79%
Iteration 3: swpout inc: 39, swpout fallback inc: 195, Fallback percentage: 83.33%
Iteration 4: swpout inc: 16, swpout fallback inc: 217, Fallback percentage: 93.13%
Iteration 5: swpout inc: 11, swpout fallback inc: 214, Fallback percentage: 95.11%
Iteration 6: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
...
Iteration 96: swpout inc: 5, swpout fallback inc: 225, Fallback percentage: 97.83%
Iteration 97: swpout inc: 2, swpout fallback inc: 215, Fallback percentage: 99.08%
Iteration 98: swpout inc: 2, swpout fallback inc: 220, Fallback percentage: 99.10%
Iteration 99: swpout inc: 4, swpout fallback inc: 222, Fallback percentage: 98.23%
Iteration 100: swpout inc: 3, swpout fallback inc: 221, Fallback percentage: 98.66%
Kernel compile under tmpfs with cgroup memory.max = 2G.
12 cores / 24 hyperthreads, 32 jobs.
HDD swap 3 runs average, 20G swap file:
Without:
user 4186.290
system 421.743
real 597.317
With:
user 4113.897
system 413.123
real 659.543
SSD swap 10 runs average, 20G swap partition:
Without:
user 4736.810
system 500.921
real 250.243
With:
user 4729.478
system 500.265
real 249.633
Two zram swap devices:
zram0 1.4G, zram1 20G.
The idea is to force zram0 almost
full and then overflow to zram1:
Two zram 10 runs average:
Without:
user 4600.693
system 384.105
real 238.735
With:
user 4604.502
system 382.087
real 239.063
Reported-by: Barry Song <21cnbao@gmail.com>
Signed-off-by: Chris Li <chrisl@kernel.org>
---
Changes in v4:
- Remove a warning in patch 2.
- Allocate from the free cluster list before the nonfull list, reverting the v3 behavior.
- Add cluster_index() and cluster_offset() functions.
- Patch 3 adds a new allocation path for SSD.
- HDD swap allocation does not need to consider clusters any more.
Changes in v3:
- Using V1 as base.
- Rename "next" to "list" for the list field, suggested by Ying.
- Update comment for the locking rules for cluster fields and list,
suggested by Ying.
- Allocate from the nonfull list before attempting free list, suggested
by Kairui.
- Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org
Changes in v2:
- Abandoned.
- Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
---
Chris Li (3):
mm: swap: swap cluster switch to double link list
mm: swap: mTHP allocate swap entries from nonfull list
RFC: mm: swap: separate SSD allocation from scan_swap_map_slots()
include/linux/swap.h | 30 ++--
mm/swapfile.c | 490 +++++++++++++++++++++++----------------------------
2 files changed, 238 insertions(+), 282 deletions(-)
---
base-commit: ff3a648ecb9409aff1448cf4f6aa41d78c69a3bc
change-id: 20240523-swap-allocator-1534c480ece4
Best regards,
--
Chris Li <chrisl@kernel.org>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v4 1/3] mm: swap: swap cluster switch to double link list
2024-07-11 7:29 [PATCH v4 0/3] mm: swap: mTHP swap allocator based on swap cluster order Chris Li
@ 2024-07-11 7:29 ` Chris Li
2024-07-15 14:57 ` Ryan Roberts
2024-07-18 6:26 ` Huang, Ying
2024-07-11 7:29 ` [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
` (3 subsequent siblings)
4 siblings, 2 replies; 43+ messages in thread
From: Chris Li @ 2024-07-11 7:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Previously, the swap cluster used a cluster index as a pointer
to construct a custom singly linked list type "swap_cluster_list".
The next cluster pointer is shared with the cluster->count field,
which prevents putting a non-free cluster on a list.
Change the cluster to use the standard doubly linked list instead.
This allows tracking the nonfull clusters in the follow-up patch,
making it faster to get to a nonfull cluster of a given order.
Remove the cluster getters/setters for accessing the cluster
struct members.
The list operations are protected by swap_info_struct->lock.
Change the cluster code to use "struct swap_cluster_info *" to
reference a cluster rather than an index. That is more
consistent with the list manipulation and avoids repeatedly
adding the index to the cluster_info. The code is easier to understand.
Remove the "cluster next pointer is NULL" flag; the doubly linked
list handles the empty list just fine.
The "swap_cluster_info" struct is two pointers bigger, but because
512 swap entries share one swap_cluster_info struct, it has very
little impact on the average memory usage per swap entry. For a 1TB
swapfile, the swap cluster data structure increases from 8MB to 24MB.
Other than the list conversion, there is no real functional change
in this patch.
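As a rough illustration of the benefit, a small userspace sketch follows
(hand-rolled list, not <linux/list.h>, and the struct layout is a
simplified assumption): with an embedded list node a cluster can be
added to, removed from and moved between lists in O(1), while count
stays an independent field:

#include <stdio.h>

struct node { struct node *prev, *next; };

static void list_init(struct node *h) { h->prev = h->next = h; }

static void list_add_tail(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

static void list_del(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* simplified stand-in for struct swap_cluster_info */
struct cluster {
	unsigned short count;   /* usage count, no longer shared with "next" */
	unsigned char flags;
	struct node list;       /* can sit on free/discard/nonfull lists */
};

int main(void)
{
	struct node free_clusters, discard_clusters;
	struct cluster c = { .count = 0, .flags = 0 };

	list_init(&free_clusters);
	list_init(&discard_clusters);
	list_init(&c.list);

	list_add_tail(&c.list, &free_clusters);    /* on the free list */
	list_del(&c.list);                         /* O(1) removal ... */
	list_add_tail(&c.list, &discard_clusters); /* ... and re-insertion */
	printf("cluster moved between lists, count=%u untouched\n",
	       (unsigned)c.count);
	return 0;
}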
Signed-off-by: Chris Li <chrisl@kernel.org>
---
include/linux/swap.h | 26 +++---
mm/swapfile.c | 225 ++++++++++++++-------------------------------------
2 files changed, 70 insertions(+), 181 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e473fe6cfb7a..e9be95468fc7 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -243,22 +243,21 @@ enum {
* free clusters are organized into a list. We fetch an entry from the list to
* get a free cluster.
*
- * The data field stores next cluster if the cluster is free or cluster usage
- * counter otherwise. The flags field determines if a cluster is free. This is
- * protected by swap_info_struct.lock.
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
*/
struct swap_cluster_info {
spinlock_t lock; /*
* Protect swap_cluster_info fields
- * and swap_info_struct->swap_map
- * elements correspond to the swap
- * cluster
+ * other than list, and swap_info_struct->swap_map
+ * elements correspond to the swap cluster.
*/
- unsigned int data:24;
- unsigned int flags:8;
+ u16 count;
+ u8 flags;
+ struct list_head list;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
+
/*
* The first page in the swap file is the swap header, which is always marked
@@ -283,11 +282,6 @@ struct percpu_cluster {
unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};
-struct swap_cluster_list {
- struct swap_cluster_info head;
- struct swap_cluster_info tail;
-};
-
/*
* The in-memory structure used to track swap areas.
*/
@@ -301,7 +295,7 @@ struct swap_info_struct {
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
- struct swap_cluster_list free_clusters; /* free clusters list */
+ struct list_head free_clusters; /* free clusters list */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
@@ -332,7 +326,7 @@ struct swap_info_struct {
* list.
*/
struct work_struct discard_work; /* discard worker */
- struct swap_cluster_list discard_clusters; /* discard clusters list */
+ struct list_head discard_clusters; /* discard clusters list */
struct plist_node avail_lists[]; /*
* entries in swap_avail_heads, one
* entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f7224bc1320c..f70d25005d2c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
#endif
#define LATENCY_LIMIT 256
-static inline void cluster_set_flag(struct swap_cluster_info *info,
- unsigned int flag)
-{
- info->flags = flag;
-}
-
-static inline unsigned int cluster_count(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_count(struct swap_cluster_info *info,
- unsigned int c)
-{
- info->data = c;
-}
-
-static inline void cluster_set_count_flag(struct swap_cluster_info *info,
- unsigned int c, unsigned int f)
-{
- info->flags = f;
- info->data = c;
-}
-
-static inline unsigned int cluster_next(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_next(struct swap_cluster_info *info,
- unsigned int n)
-{
- info->data = n;
-}
-
-static inline void cluster_set_next_flag(struct swap_cluster_info *info,
- unsigned int n, unsigned int f)
-{
- info->flags = f;
- info->data = n;
-}
-
static inline bool cluster_is_free(struct swap_cluster_info *info)
{
return info->flags & CLUSTER_FLAG_FREE;
}
-static inline bool cluster_is_null(struct swap_cluster_info *info)
-{
- return info->flags & CLUSTER_FLAG_NEXT_NULL;
-}
-
-static inline void cluster_set_null(struct swap_cluster_info *info)
+static inline unsigned int cluster_index(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
{
- info->flags = CLUSTER_FLAG_NEXT_NULL;
- info->data = 0;
+ return ci - si->cluster_info;
}
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
spin_unlock(&si->lock);
}
-static inline bool cluster_list_empty(struct swap_cluster_list *list)
-{
- return cluster_is_null(&list->head);
-}
-
-static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
-{
- return cluster_next(&list->head);
-}
-
-static void cluster_list_init(struct swap_cluster_list *list)
-{
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
-}
-
-static void cluster_list_add_tail(struct swap_cluster_list *list,
- struct swap_cluster_info *ci,
- unsigned int idx)
-{
- if (cluster_list_empty(list)) {
- cluster_set_next_flag(&list->head, idx, 0);
- cluster_set_next_flag(&list->tail, idx, 0);
- } else {
- struct swap_cluster_info *ci_tail;
- unsigned int tail = cluster_next(&list->tail);
-
- /*
- * Nested cluster lock, but both cluster locks are
- * only acquired when we held swap_info_struct->lock
- */
- ci_tail = ci + tail;
- spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
- cluster_set_next(ci_tail, idx);
- spin_unlock(&ci_tail->lock);
- cluster_set_next_flag(&list->tail, idx, 0);
- }
-}
-
-static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
- struct swap_cluster_info *ci)
-{
- unsigned int idx;
-
- idx = cluster_next(&list->head);
- if (cluster_next(&list->tail) == idx) {
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
- } else
- cluster_set_next_flag(&list->head,
- cluster_next(&ci[idx]), 0);
-
- return idx;
-}
-
/* Add a cluster to discard list and schedule it to do discard */
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
- unsigned int idx)
+ struct swap_cluster_info *ci)
{
+ unsigned int idx = cluster_index(si, ci);
/*
* If scan_swap_map_slots() can't find a free cluster, it will check
* si->swap_map directly. To make sure the discarding cluster isn't
@@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
- cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
-
+ list_add_tail(&ci->list, &si->discard_clusters);
schedule_work(&si->discard_work);
}
-static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info;
-
- cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&si->free_clusters, ci, idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->list, &si->free_clusters);
}
/*
@@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
static void swap_do_scheduled_discard(struct swap_info_struct *si)
{
- struct swap_cluster_info *info, *ci;
+ struct swap_cluster_info *ci;
unsigned int idx;
- info = si->cluster_info;
-
- while (!cluster_list_empty(&si->discard_clusters)) {
- idx = cluster_list_del_first(&si->discard_clusters, info);
+ while (!list_empty(&si->discard_clusters)) {
+ ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+ list_del(&ci->list);
+ idx = cluster_index(si, ci);
spin_unlock(&si->lock);
discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
SWAPFILE_CLUSTER);
spin_lock(&si->lock);
- ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
- __free_cluster(si, idx);
+
+ spin_lock(&ci->lock);
+ __free_cluster(si, ci);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
0, SWAPFILE_CLUSTER);
- unlock_cluster(ci);
+ spin_unlock(&ci->lock);
}
}
@@ -521,20 +418,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
complete(&si->comp);
}
-static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
{
- struct swap_cluster_info *ci = si->cluster_info;
+ struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
- VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
- cluster_list_del_first(&si->free_clusters, ci);
- cluster_set_count_flag(ci + idx, 0, 0);
+ VM_BUG_ON(cluster_index(si, ci) != idx);
+ list_del(&ci->list);
+ ci->count = 0;
+ ci->flags = 0;
+ return ci;
}
-static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info + idx;
-
- VM_BUG_ON(cluster_count(ci) != 0);
+ VM_BUG_ON(ci->count != 0);
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -542,11 +439,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
(SWP_WRITEOK | SWP_PAGE_DISCARD)) {
- swap_cluster_schedule_discard(si, idx);
+ swap_cluster_schedule_discard(si, ci);
return;
}
- __free_cluster(si, idx);
+ __free_cluster(si, ci);
}
/*
@@ -559,15 +456,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
unsigned long count)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+ struct swap_cluster_info *ci = cluster_info + idx;
if (!cluster_info)
return;
- if (cluster_is_free(&cluster_info[idx]))
+ if (cluster_is_free(ci))
alloc_cluster(p, idx);
- VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) + count);
+ VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
+ ci->count += count;
}
/*
@@ -581,24 +478,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
}
/*
- * The cluster corresponding to page_nr decreases one usage. If the usage
- * counter becomes 0, which means no page in the cluster is in using, we can
- * optionally discard the cluster and add it to free cluster list.
+ * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * which means no page in the cluster is in using, we can optionally discard
+ * the cluster and add it to free cluster list.
*/
-static void dec_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
{
- unsigned long idx = page_nr / SWAPFILE_CLUSTER;
-
- if (!cluster_info)
+ if (!p->cluster_info)
return;
- VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) - 1);
+ VM_BUG_ON(ci->count == 0);
+ ci->count--;
- if (cluster_count(&cluster_info[idx]) == 0)
- free_cluster(p, idx);
+ if (!ci->count)
+ free_cluster(p, ci);
}
/*
@@ -611,10 +504,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
{
struct percpu_cluster *percpu_cluster;
bool conflict;
-
+ struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
offset /= SWAPFILE_CLUSTER;
- conflict = !cluster_list_empty(&si->free_clusters) &&
- offset != cluster_list_first(&si->free_clusters) &&
+ conflict = !list_empty(&si->free_clusters) &&
+ offset != first - si->cluster_info &&
cluster_is_free(&si->cluster_info[offset]);
if (!conflict)
@@ -655,10 +548,10 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
cluster = this_cpu_ptr(si->percpu_cluster);
tmp = cluster->next[order];
if (tmp == SWAP_NEXT_INVALID) {
- if (!cluster_list_empty(&si->free_clusters)) {
- tmp = cluster_next(&si->free_clusters.head) *
- SWAPFILE_CLUSTER;
- } else if (!cluster_list_empty(&si->discard_clusters)) {
+ if (!list_empty(&si->free_clusters)) {
+ ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->discard_clusters)) {
/*
* we don't have free cluster but have some clusters in
* discarding, do discard now and reclaim them, then
@@ -1070,8 +963,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
- cluster_set_count_flag(ci, 0, 0);
- free_cluster(si, idx);
+ ci->count = 0;
+ ci->flags = 0;
+ free_cluster(si, ci);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
}
@@ -1344,7 +1238,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
count = p->swap_map[offset];
VM_BUG_ON(count != SWAP_HAS_CACHE);
p->swap_map[offset] = 0;
- dec_cluster_info_page(p, p->cluster_info, offset);
+ dec_cluster_info_page(p, ci);
unlock_cluster(ci);
mem_cgroup_uncharge_swap(entry, 1);
@@ -3022,8 +2916,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
nr_good_pages = maxpages - 1; /* omit header page */
- cluster_list_init(&p->free_clusters);
- cluster_list_init(&p->discard_clusters);
+ INIT_LIST_HEAD(&p->free_clusters);
+ INIT_LIST_HEAD(&p->discard_clusters);
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
@@ -3074,14 +2968,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
j = (k + col) % SWAP_CLUSTER_COLS;
for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+ struct swap_cluster_info *ci;
idx = i * SWAP_CLUSTER_COLS + j;
+ ci = cluster_info + idx;
if (idx >= nr_clusters)
continue;
- if (cluster_count(&cluster_info[idx]))
+ if (ci->count)
continue;
- cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&p->free_clusters, cluster_info,
- idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->list, &p->free_clusters);
}
}
return nr_extents;
--
2.45.2.803.g4e1b14247a-goog
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-11 7:29 [PATCH v4 0/3] mm: swap: mTHP swap allocator based on swap cluster order Chris Li
2024-07-11 7:29 ` [PATCH v4 1/3] mm: swap: swap cluster switch to double link list Chris Li
@ 2024-07-11 7:29 ` Chris Li
2024-07-15 15:40 ` Ryan Roberts
2024-07-11 7:29 ` [PATCH v4 3/3] RFC: mm: swap: separate SSD allocation from scan_swap_map_slots()
` (2 subsequent siblings)
4 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-11 7:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Track the nonfull clusters as well as the empty clusters
on lists. Each order has one nonfull cluster list.
The cluster will remember which order it was used for during
new cluster allocation.
When the cluster has a free entry, add it to the nonfull[order]
list. When the free cluster list is empty, also allocate
from the nonfull list of that order.
This improves the mTHP swap allocation success rate.
There are limitations if the distribution of the different mTHP
orders changes a lot, e.g. a lot of nonfull clusters are assigned
to order A while later there are a lot of order B allocations and
very little allocation in order A. Currently a cluster used by
order A will not be reused by order B unless the cluster becomes
100% empty.
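A tiny userspace sketch of that limitation (simplified assumed layout,
not the kernel code): the free slots of a cluster assigned to one order
are invisible to requests of another order until the cluster becomes
completely empty again:

#include <stdbool.h>
#include <stdio.h>

#define CLUSTER_SLOTS 512

struct cluster {
	int free;     /* free slots left */
	int order;    /* order this nonfull cluster is assigned to */
};

static bool can_serve(const struct cluster *ci, int order)
{
	if (ci->free == CLUSTER_SLOTS)   /* 100% empty: usable by any order */
		return true;
	return ci->order == order && ci->free >= (1 << order);
}

int main(void)
{
	/* half-free cluster that was first used for order 4 ("order A") */
	struct cluster ci = { .free = 256, .order = 4 };

	printf("order 4 request: %s\n", can_serve(&ci, 4) ? "ok" : "fallback");
	printf("order 2 request: %s\n", can_serve(&ci, 2) ? "ok" : "fallback");
	return 0;
}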
Signed-off-by: Chris Li <chrisl@kernel.org>
---
include/linux/swap.h | 4 ++++
mm/swapfile.c | 34 +++++++++++++++++++++++++++++++---
2 files changed, 35 insertions(+), 3 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e9be95468fc7..db8d6000c116 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,9 +254,11 @@ struct swap_cluster_info {
*/
u16 count;
u8 flags;
+ u8 order;
struct list_head list;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
+#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
/*
@@ -296,6 +298,8 @@ struct swap_info_struct {
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
+ struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+ /* list of cluster that contains at least one free slot */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f70d25005d2c..e13a33664cfa 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
- list_add_tail(&ci->list, &si->discard_clusters);
+ if (ci->flags)
+ list_move_tail(&ci->list, &si->discard_clusters);
+ else
+ list_add_tail(&ci->list, &si->discard_clusters);
+ ci->flags = 0;
schedule_work(&si->discard_work);
}
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ if (ci->flags & CLUSTER_FLAG_NONFULL)
+ list_move_tail(&ci->list, &si->free_clusters);
+ else
+ list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
- list_add_tail(&ci->list, &si->free_clusters);
}
/*
@@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
ci->count--;
if (!ci->count)
- free_cluster(p, ci);
+ return free_cluster(p, ci);
+
+ if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+ list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ ci->flags |= CLUSTER_FLAG_NONFULL;
+ }
}
/*
@@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
if (tmp == SWAP_NEXT_INVALID) {
if (!list_empty(&si->free_clusters)) {
ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ list_del(&ci->list);
+ spin_lock(&ci->lock);
+ ci->order = order;
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
+ tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->nonfull_clusters[order])) {
+ ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
+ list_del(&ci->list);
+ spin_lock(&ci->lock);
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
} else if (!list_empty(&si->discard_clusters)) {
/*
@@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
ci->count = 0;
+ ci->order = 0;
ci->flags = 0;
free_cluster(si, ci);
unlock_cluster(ci);
@@ -2919,6 +2944,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
INIT_LIST_HEAD(&p->free_clusters);
INIT_LIST_HEAD(&p->discard_clusters);
+ for (i = 0; i < SWAP_NR_ORDERS; i++)
+ INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr == 0 || page_nr > swap_header->info.last_page)
--
2.45.2.803.g4e1b14247a-goog
^ permalink raw reply related [flat|nested] 43+ messages in thread
* [PATCH v4 3/3] RFC: mm: swap: separate SSD allocation from scan_swap_map_slots()
2024-07-11 7:29 [PATCH v4 0/3] mm: swap: mTHP swap allocator based on swap cluster order Chris Li
2024-07-11 7:29 ` [PATCH v4 1/3] mm: swap: swap cluster switch to double link list Chris Li
2024-07-11 7:29 ` [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
@ 2024-07-11 7:29 ` Chris Li
2024-07-11 10:02 ` [PATCH v4 0/3] mm: swap: mTHP swap allocator based on swap cluster order Ryan Roberts
2024-07-18 5:50 ` Huang, Ying
4 siblings, 0 replies; 43+ messages in thread
From: Chris Li @ 2024-07-11 7:29 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Previously the SSD and HDD shared the same swap_map scan loop in
scan_swap_map_slots(). This function is complex, and the execution
flow is hard to follow.
scan_swap_map_try_ssd_cluster() can already do most of the heavy
lifting to locate the candidate swap range in the cluster. However
it needs to go back to scan_swap_map_slots() to check for conflicts
and then perform the allocation.
When scan_swap_map_try_ssd_cluster() fails, it still depends on
scan_swap_map_slots() to do a brute force scan of the swap_map.
When the swapfile is large and almost full, it takes some CPU
time to go through the swap_map array.
Get rid of the cluster allocation dependency on the swap_map scan
loop in scan_swap_map_slots(). Streamline the cluster allocation
code path. No more conflict checks.
For order 0 swap entries, when the free and nonfull lists run out,
allocate from the higher order nonfull cluster lists.
Users should see less CPU time spent on searching for a free swap
slot when the swapfile is almost full.
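A minimal userspace sketch of just that order 0 fallback (the array is
a stand-in assumption for the real nonfull_clusters[] lists, not the
kernel code): an order 0 request only needs a single slot, so it can be
served from a nonfull cluster of any order before giving up:

#include <stdio.h>

#define NR_ORDERS 4   /* stand-in for SWAP_NR_ORDERS */

/* free slots on the first cluster of each nonfull[order] list,
 * 0 meaning that list is empty */
static int nonfull_free_slots[NR_ORDERS] = { 0, 3, 0, 7 };

static int alloc_order0_slot(void)
{
	int o;

	for (o = 0; o < NR_ORDERS; o++) {   /* own order first, then higher */
		if (nonfull_free_slots[o] > 0) {
			nonfull_free_slots[o]--;
			return o;           /* served from nonfull[o] */
		}
	}
	return -1;                          /* nothing left at all */
}

int main(void)
{
	int o = alloc_order0_slot();

	if (o >= 0)
		printf("order-0 entry taken from a nonfull order-%d cluster\n", o);
	else
		printf("swap device is truly full\n");
	return 0;
}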
Signed-off-by: Chris Li <chrisl@kernel.org>
---
mm/swapfile.c | 297 ++++++++++++++++++++++++++++++++--------------------------
1 file changed, 166 insertions(+), 131 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e13a33664cfa..b967e628ae65 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,6 +53,8 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
+static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+ unsigned int nr_entries);
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -301,6 +303,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
return ci - si->cluster_info;
}
+static inline unsigned int cluster_offset(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
+{
+ return cluster_index(si, ci) * SWAPFILE_CLUSTER;
+}
+
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
unsigned long offset)
{
@@ -371,11 +379,15 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ VM_BUG_ON(!spin_is_locked(&si->lock));
+ VM_BUG_ON(!spin_is_locked(&ci->lock));
+
if (ci->flags & CLUSTER_FLAG_NONFULL)
list_move_tail(&ci->list, &si->free_clusters);
else
list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
+ ci->order = 0;
}
/*
@@ -430,8 +442,10 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi
struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
VM_BUG_ON(cluster_index(si, ci) != idx);
+ VM_BUG_ON(!spin_is_locked(&si->lock));
+ VM_BUG_ON(!spin_is_locked(&ci->lock));
+ VM_BUG_ON(ci->count);
list_del(&ci->list);
- ci->count = 0;
ci->flags = 0;
return ci;
}
@@ -439,6 +453,8 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi
static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
VM_BUG_ON(ci->count != 0);
+ VM_BUG_ON(!spin_is_locked(&si->lock));
+ VM_BUG_ON(!spin_is_locked(&ci->lock));
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -495,52 +511,96 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
return;
VM_BUG_ON(ci->count == 0);
+ VM_BUG_ON(cluster_is_free(ci));
+ VM_BUG_ON(!spin_is_locked(&p->lock));
+ VM_BUG_ON(!spin_is_locked(&ci->lock));
ci->count--;
if (!ci->count)
return free_cluster(p, ci);
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+ VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- ci->flags |= CLUSTER_FLAG_NONFULL;
+ ci->flags = CLUSTER_FLAG_NONFULL;
}
}
-/*
- * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
- * cluster list. Avoiding such abuse to avoid list corruption.
- */
-static bool
-scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
- unsigned long offset, int order)
-{
- struct percpu_cluster *percpu_cluster;
- bool conflict;
- struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
- offset /= SWAPFILE_CLUSTER;
- conflict = !list_empty(&si->free_clusters) &&
- offset != first - si->cluster_info &&
- cluster_is_free(&si->cluster_info[offset]);
-
- if (!conflict)
- return false;
+static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start,
+ unsigned int nr_pages)
+{
+ unsigned char *p = si->swap_map + start;
+ unsigned char *end = p + nr_pages;
+
+ while (p < end)
+ if (*p++)
+ return false;
- percpu_cluster = this_cpu_ptr(si->percpu_cluster);
- percpu_cluster->next[order] = SWAP_NEXT_INVALID;
return true;
}
-static inline bool swap_range_empty(char *swap_map, unsigned int start,
- unsigned int nr_pages)
+
+static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
+ unsigned int start, unsigned char usage,
+ unsigned int order)
{
- unsigned int i;
+ unsigned int nr_pages = 1 << order;
- for (i = 0; i < nr_pages; i++) {
- if (swap_map[start + i])
- return false;
+ if (cluster_is_free(ci)) {
+ if (nr_pages < SWAPFILE_CLUSTER) {
+ list_move_tail(&ci->list, &si->nonfull_clusters[order]);
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ }
+ ci->order = order;
}
- return true;
+ memset(si->swap_map + start, usage, nr_pages);
+ swap_range_alloc(si, start, nr_pages);
+ ci->count += nr_pages;
+
+ if (ci->count == SWAPFILE_CLUSTER) {
+ VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL)));
+ list_del(&ci->list);
+ ci->flags = 0;
+ }
+}
+
+static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset, unsigned int *foundp,
+ unsigned int order, unsigned char usage)
+{
+ unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
+ unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
+ unsigned int nr_pages = 1 << order;
+ struct swap_cluster_info *ci;
+
+ if (end < nr_pages)
+ return SWAP_NEXT_INVALID;
+ end -= nr_pages;
+
+ ci = lock_cluster(si, offset);
+ if (ci->count + nr_pages > SWAPFILE_CLUSTER) {
+ offset = SWAP_NEXT_INVALID;
+ goto done;
+ }
+
+ while (offset <= end) {
+ if (cluster_scan_range(si, offset, nr_pages)) {
+ cluster_alloc_range(si, ci, offset, usage, order);
+ *foundp = offset;
+ if (ci->count == SWAPFILE_CLUSTER) {
+ offset = SWAP_NEXT_INVALID;
+ goto done;
+ }
+ offset += nr_pages;
+ break;
+ }
+ offset += nr_pages;
+ }
+ if (offset > end)
+ offset = SWAP_NEXT_INVALID;
+done:
+ unlock_cluster(ci);
+ return offset;
}
/*
@@ -548,71 +608,63 @@ static inline bool swap_range_empty(char *swap_map, unsigned int start,
* pool (a cluster). This might involve allocating a new cluster for current CPU
* too.
*/
-static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
- unsigned long *offset, unsigned long *scan_base, int order)
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order, unsigned char usage)
{
- unsigned int nr_pages = 1 << order;
struct percpu_cluster *cluster;
- struct swap_cluster_info *ci;
- unsigned int tmp, max;
+ struct swap_cluster_info *ci, *n;
+ unsigned int offset, found = 0;
new_cluster:
+ VM_BUG_ON(!spin_is_locked(&si->lock));
cluster = this_cpu_ptr(si->percpu_cluster);
- tmp = cluster->next[order];
- if (tmp == SWAP_NEXT_INVALID) {
- if (!list_empty(&si->free_clusters)) {
- ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
- list_del(&ci->list);
- spin_lock(&ci->lock);
- ci->order = order;
- ci->flags = 0;
- spin_unlock(&ci->lock);
- tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
- } else if (!list_empty(&si->nonfull_clusters[order])) {
- ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
- list_del(&ci->list);
- spin_lock(&ci->lock);
- ci->flags = 0;
- spin_unlock(&ci->lock);
- tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
- } else if (!list_empty(&si->discard_clusters)) {
- /*
- * we don't have free cluster but have some clusters in
- * discarding, do discard now and reclaim them, then
- * reread cluster_next_cpu since we dropped si->lock
- */
- swap_do_scheduled_discard(si);
- *scan_base = this_cpu_read(*si->cluster_next_cpu);
- *offset = *scan_base;
- goto new_cluster;
- } else
- return false;
+ offset = cluster->next[order];
+ if (offset) {
+ offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
+ if (found)
+ goto done;
}
- /*
- * Other CPUs can use our cluster if they can't find a free cluster,
- * check if there is still free entry in the cluster, maintaining
- * natural alignment.
- */
- max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
- if (tmp < max) {
- ci = lock_cluster(si, tmp);
- while (tmp < max) {
- if (swap_range_empty(si->swap_map, tmp, nr_pages))
- break;
- tmp += nr_pages;
+ list_for_each_entry_safe(ci, n, &si->free_clusters, list) {
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
+ if (found)
+ goto done;
+ VM_BUG_ON(1);
+ }
+
+ if (order < PMD_ORDER) {
+ list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
+ if (found)
+ goto done;
}
- unlock_cluster(ci);
}
- if (tmp >= max) {
- cluster->next[order] = SWAP_NEXT_INVALID;
+
+ if (!list_empty(&si->discard_clusters)) {
+ /*
+ * we don't have free cluster but have some clusters in
+ * discarding, do discard now and reclaim them, then
+ * reread cluster_next_cpu since we dropped si->lock
+ */
+ swap_do_scheduled_discard(si);
goto new_cluster;
}
- *offset = tmp;
- *scan_base = tmp;
- tmp += nr_pages;
- cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
- return true;
+
+ if (order)
+ goto done;
+
+ for (int o = order + 1; o < SWAP_NR_ORDERS; o++) {
+ struct swap_cluster_info *ci, *n;
+
+ list_for_each_entry_safe(ci, n, &si->nonfull_clusters[o], list) {
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
+ if (found)
+ goto done;
+ }
+ }
+
+done:
+ cluster->next[order] = offset;
+ return found;
}
static void __del_from_avail_list(struct swap_info_struct *p)
@@ -747,11 +799,29 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
return false;
}
+static int cluster_alloc_swap(struct swap_info_struct *si,
+ unsigned char usage, int nr,
+ swp_entry_t slots[], int order)
+{
+ int n_ret = 0;
+
+ VM_BUG_ON(!si->cluster_info);
+
+ while (n_ret < nr) {
+ unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
+
+ if (!offset)
+ break;
+ slots[n_ret++] = swp_entry(si->type, offset);
+ }
+
+ return n_ret;
+}
+
static int scan_swap_map_slots(struct swap_info_struct *si,
unsigned char usage, int nr,
swp_entry_t slots[], int order)
{
- struct swap_cluster_info *ci;
unsigned long offset;
unsigned long scan_base;
unsigned long last_in_cluster = 0;
@@ -790,26 +860,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
return 0;
}
+ if (si->cluster_info)
+ return cluster_alloc_swap(si, usage, nr, slots, order);
+
si->flags += SWP_SCANNING;
- /*
- * Use percpu scan base for SSD to reduce lock contention on
- * cluster and swap cache. For HDD, sequential access is more
- * important.
- */
- if (si->flags & SWP_SOLIDSTATE)
- scan_base = this_cpu_read(*si->cluster_next_cpu);
- else
- scan_base = si->cluster_next;
+
+ /* For HDD, sequential access is more important. */
+ scan_base = si->cluster_next;
offset = scan_base;
- /* SSD algorithm */
- if (si->cluster_info) {
- if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
- if (order > 0)
- goto no_page;
- goto scan;
- }
- } else if (unlikely(!si->cluster_nr--)) {
+ if (unlikely(!si->cluster_nr--)) {
if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
si->cluster_nr = SWAPFILE_CLUSTER - 1;
goto checks;
@@ -820,8 +880,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
/*
* If seek is expensive, start searching for new cluster from
* start of partition, to minimize the span of allocated swap.
- * If seek is cheap, that is the SWP_SOLIDSTATE si->cluster_info
- * case, just handled by scan_swap_map_try_ssd_cluster() above.
*/
scan_base = offset = si->lowest_bit;
last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
@@ -849,19 +907,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}
checks:
- if (si->cluster_info) {
- while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
- /* take a break if we already got some slots */
- if (n_ret)
- goto done;
- if (!scan_swap_map_try_ssd_cluster(si, &offset,
- &scan_base, order)) {
- if (order > 0)
- goto no_page;
- goto scan;
- }
- }
- }
if (!(si->flags & SWP_WRITEOK))
goto no_page;
if (!si->highest_bit)
@@ -869,11 +914,9 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
if (offset > si->highest_bit)
scan_base = offset = si->lowest_bit;
- ci = lock_cluster(si, offset);
/* reuse swap entry of cache-only swap if not busy. */
if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
int swap_was_freed;
- unlock_cluster(ci);
spin_unlock(&si->lock);
swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
spin_lock(&si->lock);
@@ -884,15 +927,12 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}
if (si->swap_map[offset]) {
- unlock_cluster(ci);
if (!n_ret)
goto scan;
else
goto done;
}
memset(si->swap_map + offset, usage, nr_pages);
- add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
- unlock_cluster(ci);
swap_range_alloc(si, offset, nr_pages);
slots[n_ret++] = swp_entry(si->type, offset);
@@ -913,13 +953,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
latency_ration = LATENCY_LIMIT;
}
- /* try to get more slots in cluster */
- if (si->cluster_info) {
- if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
- goto checks;
- if (order > 0)
- goto done;
- } else if (si->cluster_nr && !si->swap_map[++offset]) {
+ if (si->cluster_nr && !si->swap_map[++offset]) {
/* non-ssd case, still more slots in cluster? */
--si->cluster_nr;
goto checks;
@@ -988,8 +1022,6 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
ci->count = 0;
- ci->order = 0;
- ci->flags = 0;
free_cluster(si, ci);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
@@ -3001,8 +3033,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
ci = cluster_info + idx;
if (idx >= nr_clusters)
continue;
- if (ci->count)
+ if (ci->count) {
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ list_add_tail(&ci->list, &p->nonfull_clusters[0]);
continue;
+ }
ci->flags = CLUSTER_FLAG_FREE;
list_add_tail(&ci->list, &p->free_clusters);
}
--
2.45.2.803.g4e1b14247a-goog
^ permalink raw reply related [flat|nested] 43+ messages in thread
* Re: [PATCH v4 0/3] mm: swap: mTHP swap allocator based on swap cluster order
2024-07-11 7:29 [PATCH v4 0/3] mm: swap: mTHP swap allocator based on swap cluster order Chris Li
` (2 preceding siblings ...)
2024-07-11 7:29 ` [PATCH v4 3/3] RFC: mm: swap: separate SSD allocation from scan_swap_map_slots() Chris Li
@ 2024-07-11 10:02 ` Ryan Roberts
2024-07-11 14:08 ` Chris Li
2024-07-18 5:50 ` Huang, Ying
4 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-11 10:02 UTC (permalink / raw)
To: Chris Li, Andrew Morton
Cc: Kairui Song, Hugh Dickins, Huang, Ying, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On 11/07/2024 08:29, Chris Li wrote:
> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from the new empty cluster list. It has a fragmentation issue
> reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in the cluster that are
> not 100% free.
>
> Remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.
>
> The patch 3 of this series gives the swap SSD allocation
> a new separate code path from the HDD allocation. The new allocator
> use cluster list only and do not global scan swap_map[] without lock
> any more.
>
> This streamline the swap allocation for SSD. The code matches the execution
> flow much better.
>
> User impact: For users that allocate and free mix order mTHP swapping,
> It greatly improves the success rate of the mTHP swap allocation after the
> initial phase.
>
> It also performs faster when the swapfile is close to full, because the
> allocator can get the non full cluster from a list rather than scanning
> a lot of swap_map entries.
>
> This series still lacks the swap cache reclaim feature. The reclaim series
> of patches are under development and testing right now. Will post the
> mail list soon. For this reason, the patch 3 is consider RFC and not
> ready to merge.
>
> With Barry's mthp test program V2:
>
> Without:
> $ ./thp_swap_allocator_test -a
> Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
> Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
> ...
> Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
>
> $ ./thp_swap_allocator_test -a -s
> Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
> ..
> Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>
> $ ./thp_swap_allocator_test -s
> Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
> ..
> Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>
> $ ./thp_swap_allocator_test
> Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
> ..
> Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>
> With:
> $ ./thp_swap_allocator_test -a
> Iteration 1: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> ...
> Iteration 98: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 99: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 100: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>
> $ ./thp_swap_allocator_test -a -s
> Iteration 1: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 7: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 8: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 9: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 10: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 11: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 12: swpout inc: 232, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 13: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
> Iteration 15: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 16: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 17: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 18: swpout inc: 234, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 19: swpout inc: 220, swpout fallback inc: 6, Fallback percentage: 2.65%
> Iteration 20: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 21: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 22: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 23: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 24: swpout inc: 232, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 25: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 26: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 27: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 29: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 30: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 31: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 32: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 33: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 34: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 35: swpout inc: 230, swpout fallback inc: 3, Fallback percentage: 1.29%
> Iteration 36: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 37: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 38: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 39: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 40: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 41: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 42: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 43: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 44: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 45: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 46: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
> Iteration 47: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 48: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 49: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 50: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 51: swpout inc: 224, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 52: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 53: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 54: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 55: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 56: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 57: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 58: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 59: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 60: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 61: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 62: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 63: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 64: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 65: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 66: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 67: swpout inc: 220, swpout fallback inc: 2, Fallback percentage: 0.90%
> Iteration 68: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 69: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 70: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 71: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 72: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 73: swpout inc: 218, swpout fallback inc: 5, Fallback percentage: 2.24%
> Iteration 74: swpout inc: 223, swpout fallback inc: 5, Fallback percentage: 2.19%
> Iteration 75: swpout inc: 222, swpout fallback inc: 7, Fallback percentage: 3.06%
> Iteration 76: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 77: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 78: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 79: swpout inc: 223, swpout fallback inc: 2, Fallback percentage: 0.89%
> Iteration 80: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 81: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 82: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 83: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 84: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 85: swpout inc: 213, swpout fallback inc: 1, Fallback percentage: 0.47%
> Iteration 86: swpout inc: 215, swpout fallback inc: 8, Fallback percentage: 3.59%
> Iteration 87: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 88: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 89: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
> Iteration 90: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 91: swpout inc: 214, swpout fallback inc: 1, Fallback percentage: 0.47%
> Iteration 92: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 93: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 94: swpout inc: 223, swpout fallback inc: 2, Fallback percentage: 0.89%
> Iteration 95: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 96: swpout inc: 223, swpout fallback inc: 4, Fallback percentage: 1.76%
> Iteration 97: swpout inc: 223, swpout fallback inc: 7, Fallback percentage: 3.04%
> Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 99: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 100: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
Great results!
>
> $ ./thp_swap_allocator_test
> Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 134, swpout fallback inc: 98, Fallback percentage: 42.24%
> Iteration 3: swpout inc: 72, swpout fallback inc: 154, Fallback percentage: 68.14%
> Iteration 4: swpout inc: 40, swpout fallback inc: 183, Fallback percentage: 82.06%
> Iteration 5: swpout inc: 27, swpout fallback inc: 199, Fallback percentage: 88.05%
> Iteration 6: swpout inc: 22, swpout fallback inc: 202, Fallback percentage: 90.18%
> Iteration 7: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
> Iteration 8: swpout inc: 14, swpout fallback inc: 214, Fallback percentage: 93.86%
> Iteration 9: swpout inc: 5, swpout fallback inc: 221, Fallback percentage: 97.79%
> Iteration 10: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
> ...
> Iteration 97: swpout inc: 12, swpout fallback inc: 207, Fallback percentage: 94.52%
> Iteration 98: swpout inc: 8, swpout fallback inc: 219, Fallback percentage: 96.48%
> Iteration 99: swpout inc: 16, swpout fallback inc: 218, Fallback percentage: 93.16%
> Iteration 100: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
>
> $ ./thp_swap_allocator_test -s
> Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 84, swpout fallback inc: 148, Fallback percentage: 63.79%
> Iteration 3: swpout inc: 39, swpout fallback inc: 195, Fallback percentage: 83.33%
> Iteration 4: swpout inc: 16, swpout fallback inc: 217, Fallback percentage: 93.13%
> Iteration 5: swpout inc: 11, swpout fallback inc: 214, Fallback percentage: 95.11%
> Iteration 6: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
> ...
> Iteration 96: swpout inc: 5, swpout fallback inc: 225, Fallback percentage: 97.83%
> Iteration 97: swpout inc: 2, swpout fallback inc: 215, Fallback percentage: 99.08%
> Iteration 98: swpout inc: 2, swpout fallback inc: 220, Fallback percentage: 99.10%
> Iteration 99: swpout inc: 4, swpout fallback inc: 222, Fallback percentage: 98.23%
> Iteration 100: swpout inc: 3, swpout fallback inc: 221, Fallback percentage: 98.66%
>
> Kernel compile under tmpfs with cgroup memory.max = 2G.
> 12 core 24 hyperthreading, 32 jobs.
>
> HDD swap 3 runs average, 20G swap file:
>
> Without:
> user 4186.290
> system 421.743
> real 597.317
>
> With:
> user 4113.897
> system 413.123
> real 659.543
If I've understood this correctly, this test is taking ~10% longer in wall time?
But your changes shouldn't affect HDD swap path? So what's the reason for this?
I'm hoping to review this properly next week. It would be great to get this in
sooner rather than later IMHO.
Thanks,
Ryan
>
> SSD swap 10 runs average, 20G swap partition:
>
> Without:
> user 4736.810
> system 500.921
> real 250.243
>
> With:
> user 4729.478
> system 500.265
> real 249.633
>
> Two zram swap:
> zram0 1.4G zram1 20G.
> The idea is forcing the zram0 almost
> full then overflow to zram1:
>
> Two zram 10 runs average:
>
> Without:
> user 4600.693
> system 384.105
> real 238.735
>
> With:
> user 4604.502
> system 382.087
> real 239.063
>
> Reported-by: Barry Song <21cnbao@gmail.com>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Changes in v4:
> - Remove a warning in patch 2.
> - Allocating from the free cluster list before the nonfull list. Revert the v3 behavior.
> - Add cluster_index and cluster_offset function.
> - Patch 3 has a new allocating path for SSD.
> - HDD swap allocation does not need to consider clusters any more.
>
> Changes in v3:
> - Using V1 as base.
> - Rename "next" to "list" for the list field, suggested by Ying.
> - Update comment for the locking rules for cluster fields and list,
> suggested by Ying.
> - Allocate from the nonfull list before attempting free list, suggested
> by Kairui.
> - Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org
>
> Changes in v2:
> - Abandoned.
> - Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>
> ---
> Chris Li (3):
> mm: swap: swap cluster switch to double link list
> mm: swap: mTHP allocate swap entries from nonfull list
> RFC: mm: swap: seperate SSD allocation from scan_swap_map_slots()
>
> include/linux/swap.h | 30 ++--
> mm/swapfile.c | 490 +++++++++++++++++++++++----------------------------
> 2 files changed, 238 insertions(+), 282 deletions(-)
> ---
> base-commit: ff3a648ecb9409aff1448cf4f6aa41d78c69a3bc
> change-id: 20240523-swap-allocator-1534c480ece4
>
> Best regards,
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-11 10:02 ` [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order Ryan Roberts
@ 2024-07-11 14:08 ` Chris Li
2024-07-15 14:10 ` Ryan Roberts
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-11 14:08 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Jul 11, 2024 at 3:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> > Kernel compile under tmpfs with cgroup memory.max = 2G.
> > 12 core 24 hyperthreading, 32 jobs.
> >
> > HDD swap 3 runs average, 20G swap file:
> >
> > Without:
> > user 4186.290
> > system 421.743
> > real 597.317
> >
> > With:
> > user 4113.897
> > system 413.123
> > real 659.543
>
> If I've understood this correctly, this test is taking ~10% longer in wall time?
Most likely due to the high variance in measurement and the smaller
number of samples (3 vs 10). Most of that wall time is waiting for IO.
It is likely just noise.
> But your changes shouldn't affect HDD swap path? So what's the reason for this?
The change did affect the HDD swap path in the sense that it no longer
needs to check si->cluster_info. A small gain there.
The wall clock time is more than double that of the SSD or zram runs,
which means the system spends most of its time waiting for HDD IO to
complete (the wait is 98%), so there is much higher variance. At that
point the wall clock mostly measures the wait, not the actual work. The
system time is lower, which is good.
I now have a dedicated machine to run the HDD swap tests. The HDD is
very, very slow to swap. The point of the HDD test is to complete the
run without OOM. Because of the high HDD latency, there is more memory
pressure. It did catch some other bugs in my internal version of the
patch.
> I'm hoping to review this properly next week. It would be great to get this in
> sooner rather than later IMHO.
Thank you. This new code path is much easier to work with than the
previous SSD and HDD mixed allocation path. I am able to implement the
cluster reservation experiment in the new allocator much quicker.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-11 14:08 ` Chris Li
@ 2024-07-15 14:10 ` Ryan Roberts
2024-07-15 18:14 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-15 14:10 UTC (permalink / raw)
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On 11/07/2024 15:08, Chris Li wrote:
> On Thu, Jul 11, 2024 at 3:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>> Kernel compile under tmpfs with cgroup memory.max = 2G.
>>> 12 core 24 hyperthreading, 32 jobs.
>>>
>>> HDD swap 3 runs average, 20G swap file:
>>>
>>> Without:
>>> user 4186.290
>>> system 421.743
>>> real 597.317
>>>
>>> With:
>>> user 4113.897
>>> system 413.123
>>> real 659.543
>>
> >> If I've understood this correctly, this test is taking ~10% longer in wall time?
> >
> > Most likely due to the high variance in measurement and the smaller
> > number of samples (3 vs 10). Most of that wall time is waiting for IO.
> > It is likely just noise.
OK, that certainly makes sense, as long as you're sure it's noise. The other
(unlikely) possibility is that somehow the HDD placement decisions are
changing, which increases waiting due to increased seek times.
>
>> But your changes shouldn't affect HDD swap path? So what's the reason for this?
>
> The change did affect the HDD swap path in the sense that it no longer
> needs to check si->cluster_info. A small gain there.
>
> The wall clock time is more than double that of the SSD or zram runs,
> which means the system spends most of its time waiting for HDD IO to
> complete (the wait is 98%), so there is much higher variance. At that
> point the wall clock mostly measures the wait, not the actual work. The
> system time is lower, which is good.
>
> I now have a dedicated machine to run the HDD swap tests. The HDD is
> very, very slow to swap. The point of the HDD test is to complete the
> run without OOM. Because of the high HDD latency, there is more memory
> pressure. It did catch some other bugs in my internal version of the
> patch.
>
>> I'm hoping to review this properly next week. It would be great to get this in
>> sooner rather than later IMHO.
>
> Thank you. This new code path is much easier to work with than the
> previous SSD and HDD mixed allocation path. I am able to implement the
> cluster reservation experiment in the new allocator much quicker.
>
> Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 1/3] mm: swap: swap cluster switch to double link list
2024-07-11 7:29 ` [PATCH v4 1/3] mm: swap: swap cluster switch to double link list Chris Li
@ 2024-07-15 14:57 ` Ryan Roberts
2024-07-16 22:11 ` Chris Li
2024-07-18 6:26 ` Huang, Ying
1 sibling, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-15 14:57 UTC (permalink / raw)
To: Chris Li, Andrew Morton
Cc: Kairui Song, Hugh Dickins, Huang, Ying, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On 11/07/2024 08:29, Chris Li wrote:
> Previously, the swap cluster used a cluster index as a pointer
> to construct a custom single link list type "swap_cluster_list".
> The next cluster pointer is shared with the cluster->count.
> It prevents putting the non free cluster into a list.
>
> Change the cluster to use the standard double link list instead.
> This allows tracing the nonfull cluster in the follow up patch.
> That way, it is faster to get to the nonfull cluster of that order.
>
> Remove the cluster getter/setter for accessing the cluster
> struct member.
>
> The list operation is protected by the swap_info_struct->lock.
>
> Change cluster code to use "struct swap_cluster_info *" to
> reference the cluster rather than by using index. That is more
> consistent with the list manipulation. It avoids repeatedly
> adding the index to the cluster_info. The code is easier to understand.
>
> Remove the cluster next pointer is NULL flag, the double link
> list can handle the empty list pretty well.
>
> The "swap_cluster_info" struct is two pointer bigger, because
> 512 swap entries share one swap struct, it has very little impact
> on the average memory usage per swap entry. For 1TB swapfile, the
> swap cluster data structure increases from 8MB to 24MB.
>
> Other than the list conversion, there is no real function change
> in this patch.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> include/linux/swap.h | 26 +++---
> mm/swapfile.c | 225 ++++++++++++++-------------------------------------
> 2 files changed, 70 insertions(+), 181 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e473fe6cfb7a..e9be95468fc7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -243,22 +243,21 @@ enum {
> * free clusters are organized into a list. We fetch an entry from the list to
> * get a free cluster.
> *
> - * The data field stores next cluster if the cluster is free or cluster usage
> - * counter otherwise. The flags field determines if a cluster is free. This is
> - * protected by swap_info_struct.lock.
> + * The flags field determines if a cluster is free. This is
> + * protected by cluster lock.
> */
> struct swap_cluster_info {
> spinlock_t lock; /*
> * Protect swap_cluster_info fields
> - * and swap_info_struct->swap_map
> - * elements correspond to the swap
> - * cluster
> + * other than list, and swap_info_struct->swap_map
> + * elements correspond to the swap cluster.
nit: correspond -> corresponding
> */
> - unsigned int data:24;
> - unsigned int flags:8;
> + u16 count;
Just to make sure I've understood correctly; count can safely be 16 bit (down
from previous 24 bit) because the max it will ever be is the number of swap
entries in the cluster, and that's currently no bigger than 512, and in future
we wouldn't expect it to ever get bigger than 8192 (number of pages in PMD-order
for arm64 64K base pages). 8192 is represented in 14 bits.
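(For illustration, a compile-time guard capturing that assumption might look like
the sketch below; it is not part of the posted patch and only relies on the
existing SWAPFILE_CLUSTER, static_assert() and U16_MAX definitions.)

	static_assert(SWAPFILE_CLUSTER <= U16_MAX,
		      "swap_cluster_info.count is only 16 bits wide");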
> + u8 flags;
> + struct list_head list;
> };
> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> +
nit: why the blank line?
>
> /*
> * The first page in the swap file is the swap header, which is always marked
> @@ -283,11 +282,6 @@ struct percpu_cluster {
> unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> };
>
> -struct swap_cluster_list {
> - struct swap_cluster_info head;
> - struct swap_cluster_info tail;
> -};
> -
> /*
> * The in-memory structure used to track swap areas.
> */
> @@ -301,7 +295,7 @@ struct swap_info_struct {
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> - struct swap_cluster_list free_clusters; /* free clusters list */
> + struct list_head free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> unsigned int highest_bit; /* index of last free in swap_map */
> unsigned int pages; /* total of usable pages of swap */
> @@ -332,7 +326,7 @@ struct swap_info_struct {
> * list.
> */
> struct work_struct discard_work; /* discard worker */
> - struct swap_cluster_list discard_clusters; /* discard clusters list */
> + struct list_head discard_clusters; /* discard clusters list */
> struct plist_node avail_lists[]; /*
> * entries in swap_avail_heads, one
> * entry per node.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f7224bc1320c..f70d25005d2c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> #endif
> #define LATENCY_LIMIT 256
>
> -static inline void cluster_set_flag(struct swap_cluster_info *info,
> - unsigned int flag)
> -{
> - info->flags = flag;
> -}
> -
> -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_count(struct swap_cluster_info *info,
> - unsigned int c)
> -{
> - info->data = c;
> -}
> -
> -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> - unsigned int c, unsigned int f)
> -{
> - info->flags = f;
> - info->data = c;
> -}
> -
> -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_next(struct swap_cluster_info *info,
> - unsigned int n)
> -{
> - info->data = n;
> -}
> -
> -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> - unsigned int n, unsigned int f)
> -{
> - info->flags = f;
> - info->data = n;
> -}
> -
> static inline bool cluster_is_free(struct swap_cluster_info *info)
> {
> return info->flags & CLUSTER_FLAG_FREE;
> }
>
> -static inline bool cluster_is_null(struct swap_cluster_info *info)
> -{
> - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> -}
> -
> -static inline void cluster_set_null(struct swap_cluster_info *info)
> +static inline unsigned int cluster_index(struct swap_info_struct *si,
> + struct swap_cluster_info *ci)
> {
> - info->flags = CLUSTER_FLAG_NEXT_NULL;
> - info->data = 0;
> + return ci - si->cluster_info;
> }
>
> static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> @@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> spin_unlock(&si->lock);
> }
>
> -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> -{
> - return cluster_is_null(&list->head);
> -}
> -
> -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> -{
> - return cluster_next(&list->head);
> -}
> -
> -static void cluster_list_init(struct swap_cluster_list *list)
> -{
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> -}
> -
> -static void cluster_list_add_tail(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci,
> - unsigned int idx)
> -{
> - if (cluster_list_empty(list)) {
> - cluster_set_next_flag(&list->head, idx, 0);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - } else {
> - struct swap_cluster_info *ci_tail;
> - unsigned int tail = cluster_next(&list->tail);
> -
> - /*
> - * Nested cluster lock, but both cluster locks are
> - * only acquired when we held swap_info_struct->lock
> - */
> - ci_tail = ci + tail;
> - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
Just to confirm my understanding, there was never previously any list
manipulation where the si->lock wasn't held, so this ensuring that both clusters
were locked for list manipulation was unnecessary? I don't see any extra places
where you are taking si->lock, so I guess that must be the case (or your new
regime is unsafe...).
> - cluster_set_next(ci_tail, idx);
> - spin_unlock(&ci_tail->lock);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - }
> -}
> -
> -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci)
> -{
> - unsigned int idx;
> -
> - idx = cluster_next(&list->head);
> - if (cluster_next(&list->tail) == idx) {
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> - } else
> - cluster_set_next_flag(&list->head,
> - cluster_next(&ci[idx]), 0);
> -
> - return idx;
> -}
> -
> /* Add a cluster to discard list and schedule it to do discard */
> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> - unsigned int idx)
> + struct swap_cluster_info *ci)
> {
> + unsigned int idx = cluster_index(si, ci);
> /*
> * If scan_swap_map_slots() can't find a free cluster, it will check
> * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>
> - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> -
> + list_add_tail(&ci->list, &si->discard_clusters);
> schedule_work(&si->discard_work);
> }
>
> -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> -
> - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&si->free_clusters, ci, idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->list, &si->free_clusters);
> }
>
> /*
> @@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> static void swap_do_scheduled_discard(struct swap_info_struct *si)
> {
> - struct swap_cluster_info *info, *ci;
> + struct swap_cluster_info *ci;
> unsigned int idx;
>
> - info = si->cluster_info;
> -
> - while (!cluster_list_empty(&si->discard_clusters)) {
> - idx = cluster_list_del_first(&si->discard_clusters, info);
> + while (!list_empty(&si->discard_clusters)) {
> + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> + list_del(&ci->list);
> + idx = cluster_index(si, ci);
> spin_unlock(&si->lock);
>
> discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> SWAPFILE_CLUSTER);
>
> spin_lock(&si->lock);
> - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> - __free_cluster(si, idx);
> +
> + spin_lock(&ci->lock);
> + __free_cluster(si, ci);
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> 0, SWAPFILE_CLUSTER);
> - unlock_cluster(ci);
> + spin_unlock(&ci->lock);
> }
> }
>
> @@ -521,20 +418,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> complete(&si->comp);
> }
>
> -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
nit: long line; consider separating the variable declaration and assignment?
>
> - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> - cluster_list_del_first(&si->free_clusters, ci);
> - cluster_set_count_flag(ci + idx, 0, 0);
> + VM_BUG_ON(cluster_index(si, ci) != idx);
> + list_del(&ci->list);
> + ci->count = 0;
> + ci->flags = 0;
> + return ci;
> }
>
> -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info + idx;
> -
> - VM_BUG_ON(cluster_count(ci) != 0);
> + VM_BUG_ON(ci->count != 0);
> /*
> * If the swap is discardable, prepare discard the cluster
> * instead of free it immediately. The cluster will be freed
> @@ -542,11 +439,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> - swap_cluster_schedule_discard(si, idx);
> + swap_cluster_schedule_discard(si, ci);
> return;
> }
>
> - __free_cluster(si, idx);
> + __free_cluster(si, ci);
> }
>
> /*
> @@ -559,15 +456,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> unsigned long count)
> {
> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> + struct swap_cluster_info *ci = cluster_info + idx;
>
> if (!cluster_info)
> return;
> - if (cluster_is_free(&cluster_info[idx]))
> + if (cluster_is_free(ci))
> alloc_cluster(p, idx);
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) + count);
> + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> + ci->count += count;
> }
>
> /*
> @@ -581,24 +478,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> }
>
> /*
> - * The cluster corresponding to page_nr decreases one usage. If the usage
> - * counter becomes 0, which means no page in the cluster is in using, we can
> - * optionally discard the cluster and add it to free cluster list.
> + * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * which means no page in the cluster is in using, we can optionally discard
nit: "in using" -> "in use"
> + * the cluster and add it to free cluster list.
> */
> -static void dec_cluster_info_page(struct swap_info_struct *p,
> - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> {
> - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> -
> - if (!cluster_info)
> + if (!p->cluster_info)
> return;
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) - 1);
> + VM_BUG_ON(ci->count == 0);
> + ci->count--;
>
> - if (cluster_count(&cluster_info[idx]) == 0)
> - free_cluster(p, idx);
> + if (!ci->count)
> + free_cluster(p, ci);
> }
>
> /*
> @@ -611,10 +504,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> {
> struct percpu_cluster *percpu_cluster;
> bool conflict;
> -
> + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
nit: long line; consider splitting variable declaration and assignment.
nit: You've removed the blank line between variable declarations and statements;
did you run checkpatch.pl?
> offset /= SWAPFILE_CLUSTER;
> - conflict = !cluster_list_empty(&si->free_clusters) &&
> - offset != cluster_list_first(&si->free_clusters) &&
> + conflict = !list_empty(&si->free_clusters) &&
> + offset != first - si->cluster_info &&
> cluster_is_free(&si->cluster_info[offset]);
>
> if (!conflict)
> @@ -655,10 +548,10 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> cluster = this_cpu_ptr(si->percpu_cluster);
> tmp = cluster->next[order];
> if (tmp == SWAP_NEXT_INVALID) {
> - if (!cluster_list_empty(&si->free_clusters)) {
> - tmp = cluster_next(&si->free_clusters.head) *
> - SWAPFILE_CLUSTER;
> - } else if (!cluster_list_empty(&si->discard_clusters)) {
> + if (!list_empty(&si->free_clusters)) {
> + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> + } else if (!list_empty(&si->discard_clusters)) {
> /*
> * we don't have free cluster but have some clusters in
> * discarding, do discard now and reclaim them, then
> @@ -1070,8 +963,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>
> ci = lock_cluster(si, offset);
> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> - cluster_set_count_flag(ci, 0, 0);
> - free_cluster(si, idx);
> + ci->count = 0;
> + ci->flags = 0;
> + free_cluster(si, ci);
> unlock_cluster(ci);
> swap_range_free(si, offset, SWAPFILE_CLUSTER);
> }
> @@ -1344,7 +1238,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> count = p->swap_map[offset];
> VM_BUG_ON(count != SWAP_HAS_CACHE);
> p->swap_map[offset] = 0;
> - dec_cluster_info_page(p, p->cluster_info, offset);
> + dec_cluster_info_page(p, ci);
> unlock_cluster(ci);
>
> mem_cgroup_uncharge_swap(entry, 1);
> @@ -3022,8 +2916,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>
> nr_good_pages = maxpages - 1; /* omit header page */
>
> - cluster_list_init(&p->free_clusters);
> - cluster_list_init(&p->discard_clusters);
> + INIT_LIST_HEAD(&p->free_clusters);
> + INIT_LIST_HEAD(&p->discard_clusters);
>
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> @@ -3074,14 +2968,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> j = (k + col) % SWAP_CLUSTER_COLS;
> for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> + struct swap_cluster_info *ci;
> idx = i * SWAP_CLUSTER_COLS + j;
> + ci = cluster_info + idx;
> if (idx >= nr_clusters)
> continue;
> - if (cluster_count(&cluster_info[idx]))
> + if (ci->count)
> continue;
> - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&p->free_clusters, cluster_info,
> - idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->list, &p->free_clusters);
> }
> }
> return nr_extents;
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-11 7:29 ` [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
@ 2024-07-15 15:40 ` Ryan Roberts
2024-07-16 22:46 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-15 15:40 UTC (permalink / raw)
To: Chris Li, Andrew Morton
Cc: Kairui Song, Hugh Dickins, Huang, Ying, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On 11/07/2024 08:29, Chris Li wrote:
> Track the nonfull cluster as well as the empty cluster
> on lists. Each order has one nonfull cluster list.
>
> The cluster will remember which order it was used during
> new cluster allocation.
>
> When a cluster has a free entry, add it to the nonfull[order]
> list. When the free cluster list is empty, also allocate
> from the nonfull list of that order.
>
> This improves the mTHP swap allocation success rate.
>
> There are limitations if the distribution of mTHP orders changes
> a lot over time, e.g. a lot of nonfull clusters get assigned to
> order A, while later there are many order B allocations and very
> little allocation in order A. Currently a cluster used by order A
> will not be reused by order B unless the cluster is 100% empty.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> include/linux/swap.h | 4 ++++
> mm/swapfile.c | 34 +++++++++++++++++++++++++++++++---
> 2 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e9be95468fc7..db8d6000c116 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -254,9 +254,11 @@ struct swap_cluster_info {
> */
> u16 count;
> u8 flags;
> + u8 order;
> struct list_head list;
> };
> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>
>
> /*
> @@ -296,6 +298,8 @@ struct swap_info_struct {
> unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> struct list_head free_clusters; /* free clusters list */
> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> + /* list of cluster that contains at least one free slot */
> unsigned int lowest_bit; /* index of first free in swap_map */
> unsigned int highest_bit; /* index of last free in swap_map */
> unsigned int pages; /* total of usable pages of swap */
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f70d25005d2c..e13a33664cfa 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>
> - list_add_tail(&ci->list, &si->discard_clusters);
> + if (ci->flags)
I'm not sure this is future proof; what happens if a flag is added in future
that does not indicate that the cluster is on a list. Perhaps explicitly check
CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
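(For illustration, the more explicit variant would be something along these
lines -- a sketch only, not tested:)

	if (ci->flags & CLUSTER_FLAG_NONFULL)
		list_move_tail(&ci->list, &si->discard_clusters);
	else
		list_add_tail(&ci->list, &si->discard_clusters);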
> + list_move_tail(&ci->list, &si->discard_clusters);
> + else
> + list_add_tail(&ci->list, &si->discard_clusters);
> + ci->flags = 0;
Bug: (I think?) the cluster ends up on the discard_clusters list and
swap_do_scheduled_discard() calls __free_cluster() which will then call
list_add_tail() to put it on the free_clusters list. But since it is on the
discard_list at that point, shouldn't it call list_move_tail()?
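(For illustration, the list_move_tail() variant being asked about would be
roughly the sketch below; it is hypothetical and only makes sense if the
cluster really is still linked on the discard list at that point, which is
exactly the question.)

	static void __free_cluster(struct swap_info_struct *si,
				   struct swap_cluster_info *ci)
	{
		/* hypothetical: assumes ci is still linked on some list here */
		list_move_tail(&ci->list, &si->free_clusters);
		ci->flags = CLUSTER_FLAG_FREE;
	}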
> schedule_work(&si->discard_work);
> }
>
> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> + if (ci->flags & CLUSTER_FLAG_NONFULL)
> + list_move_tail(&ci->list, &si->free_clusters);
> + else
> + list_add_tail(&ci->list, &si->free_clusters);
> ci->flags = CLUSTER_FLAG_FREE;
> - list_add_tail(&ci->list, &si->free_clusters);
> }
>
> /*
> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> ci->count--;
>
> if (!ci->count)
> - free_cluster(p, ci);
> + return free_cluster(p, ci);
nit: I'm not sure what the kernel style guide says about this, but I'm not a
huge fan of returning void. I'd find it clearer if you just turn the below `if`
into an `else if`.
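i.e. something like this (sketch of the suggested shape, same statements as in
the hunk below):

	if (!ci->count)
		free_cluster(p, ci);
	else if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}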
> +
> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
I find the transitions when you add and remove a cluster from the
nonfull_clusters list a bit strange (if I've understood correctly): It is added
to the list whenever there is at least one free swap entry if not already on the
list. But you take it off the list when assigning it as the current cluster for
a cpu in scan_swap_map_try_ssd_cluster().
So you could have this situation:
- cpuA allocs cluster from free list (exclusive to that cpu)
- cpuA allocs 1 swap entry from current cluster
- swap entry is freed; cluster added to nonfull_clusters
- cpuB "allocs" cluster from nonfull_clusters
At this point both cpuA and cpuB share the same cluster as their current
cluster. So why not just put the cluster on the nonfull_clusters list at
allocation time (when removed from free_list) and only remove it from the
nonfull_clusters list when it is completely full (or at least definitely doesn't
have room for an `order` allocation)? Then you allow "stealing" always instead
of just sometimes. You would likely want to move the cluster to the end of the
nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
chances of multiple CPUs using the same cluster.
Another potential optimization (which was in my hacked version IIRC) is to only
add/remove from nonfull list when `total - count` crosses the (1 << order)
boundary rather than when becoming completely full. You definitely won't be able
to allocate order-2 if there are only 3 pages available, for example.
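(A rough sketch of that selection policy is below; pick_nonfull_cluster() is a
made-up helper name and the code is untested, it just illustrates keeping the
cluster on the list and rotating it to the tail when a CPU picks it.)

	static struct swap_cluster_info *
	pick_nonfull_cluster(struct swap_info_struct *si, int order)
	{
		struct swap_cluster_info *ci;

		lockdep_assert_held(&si->lock);
		if (list_empty(&si->nonfull_clusters[order]))
			return NULL;
		ci = list_first_entry(&si->nonfull_clusters[order],
				      struct swap_cluster_info, list);
		/* keep it on the nonfull list, just rotate it to the tail */
		list_move_tail(&ci->list, &si->nonfull_clusters[order]);
		return ci;
	}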
> + ci->flags |= CLUSTER_FLAG_NONFULL;
> + }
> }
>
> /*
> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> if (tmp == SWAP_NEXT_INVALID) {
> if (!list_empty(&si->free_clusters)) {
> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> + list_del(&ci->list);
> + spin_lock(&ci->lock);
> + ci->order = order;
> + ci->flags = 0;
> + spin_unlock(&ci->lock);
> + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> + } else if (!list_empty(&si->nonfull_clusters[order])) {
> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
> + list_del(&ci->list);
> + spin_lock(&ci->lock);
> + ci->flags = 0;
> + spin_unlock(&ci->lock);
> tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> } else if (!list_empty(&si->discard_clusters)) {
> /*
> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> ci = lock_cluster(si, offset);
> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> ci->count = 0;
> + ci->order = 0;
> ci->flags = 0;
Wonder if it would be better to put this in __free_cluster()?
Thanks,
Ryan
> free_cluster(si, ci);
> unlock_cluster(ci);
> @@ -2919,6 +2944,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> INIT_LIST_HEAD(&p->free_clusters);
> INIT_LIST_HEAD(&p->discard_clusters);
>
> + for (i = 0; i < SWAP_NR_ORDERS; i++)
> + INIT_LIST_HEAD(&p->nonfull_clusters[i]);
> +
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> if (page_nr == 0 || page_nr > swap_header->info.last_page)
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-15 14:10 ` Ryan Roberts
@ 2024-07-15 18:14 ` Chris Li
0 siblings, 0 replies; 43+ messages in thread
From: Chris Li @ 2024-07-15 18:14 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Mon, Jul 15, 2024 at 7:10 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/07/2024 15:08, Chris Li wrote:
> > On Thu, Jul 11, 2024 at 3:02 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >>> Kernel compile under tmpfs with cgroup memory.max = 2G.
> >>> 12 core 24 hyperthreading, 32 jobs.
> >>>
> >>> HDD swap 3 runs average, 20G swap file:
> >>>
> >>> Without:
> >>> user 4186.290
> >>> system 421.743
> >>> real 597.317
> >>>
> >>> With:
> >>> user 4113.897
> >>> system 413.123
> >>> real 659.543
> >>
> >> If I've understood this correctly, this test is taking ~10% longer in wall time?
> >
> > Most likely due to the high variance in measurement and the smaller
> > number of samples (3 vs 10). Most of that wall time is waiting for IO.
> > It is likely just noise.
>
> OK, that certainly makes sense, as long as you're sure it's noise. The other
> (unlikely) possibility is that somehow the HDD placement decisions are
> changing, which increases waiting due to increased seek times.
I certainly did not change the HDD placement; if the HDD allocation is
different from the previous code, that would be a bug.
I mostly removed the cluster code path from HDD swap entry allocation.
I did the HDD run mostly to make sure the HDD can still take some
swapping stress without crashing.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 1/3] mm: swap: swap cluster switch to double link list
2024-07-15 14:57 ` Ryan Roberts
@ 2024-07-16 22:11 ` Chris Li
0 siblings, 0 replies; 43+ messages in thread
From: Chris Li @ 2024-07-16 22:11 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Mon, Jul 15, 2024 at 7:57 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/07/2024 08:29, Chris Li wrote:
> > Previously, the swap cluster used a cluster index as a pointer
> > to construct a custom single link list type "swap_cluster_list".
> > The next cluster pointer is shared with the cluster->count.
> > It prevents putting the non free cluster into a list.
> >
> > Change the cluster to use the standard double link list instead.
> > This allows tracing the nonfull cluster in the follow up patch.
> > That way, it is faster to get to the nonfull cluster of that order.
> >
> > Remove the cluster getter/setter for accessing the cluster
> > struct member.
> >
> > The list operation is protected by the swap_info_struct->lock.
> >
> > Change cluster code to use "struct swap_cluster_info *" to
> > reference the cluster rather than by using index. That is more
> > consistent with the list manipulation. It avoids repeatedly
> > adding the index to the cluster_info. The code is easier to understand.
> >
> > Remove the cluster next pointer is NULL flag, the double link
> > list can handle the empty list pretty well.
> >
> > The "swap_cluster_info" struct is two pointer bigger, because
> > 512 swap entries share one swap struct, it has very little impact
> > on the average memory usage per swap entry. For 1TB swapfile, the
> > swap cluster data structure increases from 8MB to 24MB.
> >
> > Other than the list conversion, there is no real function change
> > in this patch.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > include/linux/swap.h | 26 +++---
> > mm/swapfile.c | 225 ++++++++++++++-------------------------------------
> > 2 files changed, 70 insertions(+), 181 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index e473fe6cfb7a..e9be95468fc7 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -243,22 +243,21 @@ enum {
> > * free clusters are organized into a list. We fetch an entry from the list to
> > * get a free cluster.
> > *
> > - * The data field stores next cluster if the cluster is free or cluster usage
> > - * counter otherwise. The flags field determines if a cluster is free. This is
> > - * protected by swap_info_struct.lock.
> > + * The flags field determines if a cluster is free. This is
> > + * protected by cluster lock.
> > */
> > struct swap_cluster_info {
> > spinlock_t lock; /*
> > * Protect swap_cluster_info fields
> > - * and swap_info_struct->swap_map
> > - * elements correspond to the swap
> > - * cluster
> > + * other than list, and swap_info_struct->swap_map
> > + * elements correspond to the swap cluster.
>
> nit: correspond -> corresponding
Done.
>
> > */
> > - unsigned int data:24;
> > - unsigned int flags:8;
> > + u16 count;
>
> Just to make sure I've understood correctly; count can safely be 16 bit (down
> from previous 24 bit) because the max it will ever be is the number of swap
Yes. The count does not need to point to a cluster index any more.
It just needs to hold the number of swap entries used in the cluster.
> entries in the cluster, and that's currently no bigger than 512, and in future
> we wouldn't expect it to ever get bigger than 8192 (number of pages in PMD-order
> for arm64 64K base pages). 8192 is represented in 14 bits.
We can make the cluster bigger than the PMD order if we want to; we
just need to make sure the count can hold the number of swap entries
in a cluster.
>
> > + u8 flags;
> > + struct list_head list;
> > };
> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> > +
>
> nit: why the blank line?
Good catch. Removed.
>
> >
> > /*
> > * The first page in the swap file is the swap header, which is always marked
> > @@ -283,11 +282,6 @@ struct percpu_cluster {
> > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > };
> >
> > -struct swap_cluster_list {
> > - struct swap_cluster_info head;
> > - struct swap_cluster_info tail;
> > -};
> > -
> > /*
> > * The in-memory structure used to track swap areas.
> > */
> > @@ -301,7 +295,7 @@ struct swap_info_struct {
> > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > - struct swap_cluster_list free_clusters; /* free clusters list */
> > + struct list_head free_clusters; /* free clusters list */
> > unsigned int lowest_bit; /* index of first free in swap_map */
> > unsigned int highest_bit; /* index of last free in swap_map */
> > unsigned int pages; /* total of usable pages of swap */
> > @@ -332,7 +326,7 @@ struct swap_info_struct {
> > * list.
> > */
> > struct work_struct discard_work; /* discard worker */
> > - struct swap_cluster_list discard_clusters; /* discard clusters list */
> > + struct list_head discard_clusters; /* discard clusters list */
> > struct plist_node avail_lists[]; /*
> > * entries in swap_avail_heads, one
> > * entry per node.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f7224bc1320c..f70d25005d2c 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> > #endif
> > #define LATENCY_LIMIT 256
> >
> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
> > - unsigned int flag)
> > -{
> > - info->flags = flag;
> > -}
> > -
> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_count(struct swap_cluster_info *info,
> > - unsigned int c)
> > -{
> > - info->data = c;
> > -}
> > -
> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> > - unsigned int c, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = c;
> > -}
> > -
> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_next(struct swap_cluster_info *info,
> > - unsigned int n)
> > -{
> > - info->data = n;
> > -}
> > -
> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> > - unsigned int n, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = n;
> > -}
> > -
> > static inline bool cluster_is_free(struct swap_cluster_info *info)
> > {
> > return info->flags & CLUSTER_FLAG_FREE;
> > }
> >
> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
> > -{
> > - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> > -}
> > -
> > -static inline void cluster_set_null(struct swap_cluster_info *info)
> > +static inline unsigned int cluster_index(struct swap_info_struct *si,
> > + struct swap_cluster_info *ci)
> > {
> > - info->flags = CLUSTER_FLAG_NEXT_NULL;
> > - info->data = 0;
> > + return ci - si->cluster_info;
> > }
> >
> > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> > @@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > spin_unlock(&si->lock);
> > }
> >
> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> > -{
> > - return cluster_is_null(&list->head);
> > -}
> > -
> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> > -{
> > - return cluster_next(&list->head);
> > -}
> > -
> > -static void cluster_list_init(struct swap_cluster_list *list)
> > -{
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > -}
> > -
> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci,
> > - unsigned int idx)
> > -{
> > - if (cluster_list_empty(list)) {
> > - cluster_set_next_flag(&list->head, idx, 0);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - } else {
> > - struct swap_cluster_info *ci_tail;
> > - unsigned int tail = cluster_next(&list->tail);
> > -
> > - /*
> > - * Nested cluster lock, but both cluster locks are
> > - * only acquired when we held swap_info_struct->lock
> > - */
> > - ci_tail = ci + tail;
> > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
>
> Just to confirm my understanding, there was never previously any list
> manipulation where the si->lock wasn't held, so this ensuring that both clusters
> were locked for list manipulation was unnecessary? I don't see any extra places
> where you are taking si->lock, so I guess that must be the case (or your new
> regime is unsafe...).
Yes, and some of my later patches add assertions that si->lock is
held on those code paths.
The reason for the change is that adding a cluster to a list head
needs to modify more than just the current cluster; it also touches
the clusters before and after it. So cluster->lock can no longer be
what protects the cluster->list field; the list is actually protected
by si->lock.
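Roughly, the rule looks like this (a sketch with a made-up helper name,
just to show which lock covers what):

	static void example_move_to_free(struct swap_info_struct *si,
					 struct swap_cluster_info *ci)
	{
		lockdep_assert_held(&si->lock);	/* list membership: si->lock */

		spin_lock(&ci->lock);		/* count/flags: cluster lock */
		ci->flags = CLUSTER_FLAG_FREE;
		spin_unlock(&ci->lock);

		list_add_tail(&ci->list, &si->free_clusters);
	}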
>
> > - cluster_set_next(ci_tail, idx);
> > - spin_unlock(&ci_tail->lock);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - }
> > -}
> > -
> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci)
> > -{
> > - unsigned int idx;
> > -
> > - idx = cluster_next(&list->head);
> > - if (cluster_next(&list->tail) == idx) {
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > - } else
> > - cluster_set_next_flag(&list->head,
> > - cluster_next(&ci[idx]), 0);
> > -
> > - return idx;
> > -}
> > -
> > /* Add a cluster to discard list and schedule it to do discard */
> > static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > - unsigned int idx)
> > + struct swap_cluster_info *ci)
> > {
> > + unsigned int idx = cluster_index(si, ci);
> > /*
> > * If scan_swap_map_slots() can't find a free cluster, it will check
> > * si->swap_map directly. To make sure the discarding cluster isn't
> > @@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> > -
> > + list_add_tail(&ci->list, &si->discard_clusters);
> > schedule_work(&si->discard_work);
> > }
> >
> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > -
> > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> > - cluster_list_add_tail(&si->free_clusters, ci, idx);
> > + ci->flags = CLUSTER_FLAG_FREE;
> > + list_add_tail(&ci->list, &si->free_clusters);
> > }
> >
> > /*
> > @@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > {
> > - struct swap_cluster_info *info, *ci;
> > + struct swap_cluster_info *ci;
> > unsigned int idx;
> >
> > - info = si->cluster_info;
> > -
> > - while (!cluster_list_empty(&si->discard_clusters)) {
> > - idx = cluster_list_del_first(&si->discard_clusters, info);
> > + while (!list_empty(&si->discard_clusters)) {
> > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > + list_del(&ci->list);
> > + idx = cluster_index(si, ci);
> > spin_unlock(&si->lock);
> >
> > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > SWAPFILE_CLUSTER);
> >
> > spin_lock(&si->lock);
> > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > - __free_cluster(si, idx);
> > +
> > + spin_lock(&ci->lock);
> > + __free_cluster(si, ci);
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > 0, SWAPFILE_CLUSTER);
> > - unlock_cluster(ci);
> > + spin_unlock(&ci->lock);
> > }
> > }
> >
> > @@ -521,20 +418,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> > complete(&si->comp);
> > }
> >
> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>
> nit: long line; consider separating the variable declaration and assignment?
Done.
>
> >
> > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> > - cluster_list_del_first(&si->free_clusters, ci);
> > - cluster_set_count_flag(ci + idx, 0, 0);
> > + VM_BUG_ON(cluster_index(si, ci) != idx);
> > + list_del(&ci->list);
> > + ci->count = 0;
> > + ci->flags = 0;
> > + return ci;
> > }
> >
> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info + idx;
> > -
> > - VM_BUG_ON(cluster_count(ci) != 0);
> > + VM_BUG_ON(ci->count != 0);
> > /*
> > * If the swap is discardable, prepare discard the cluster
> > * instead of free it immediately. The cluster will be freed
> > @@ -542,11 +439,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> > (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> > - swap_cluster_schedule_discard(si, idx);
> > + swap_cluster_schedule_discard(si, ci);
> > return;
> > }
> >
> > - __free_cluster(si, idx);
> > + __free_cluster(si, ci);
> > }
> >
> > /*
> > @@ -559,15 +456,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> > unsigned long count)
> > {
> > unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > + struct swap_cluster_info *ci = cluster_info + idx;
> >
> > if (!cluster_info)
> > return;
> > - if (cluster_is_free(&cluster_info[idx]))
> > + if (cluster_is_free(ci))
> > alloc_cluster(p, idx);
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) + count);
> > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> > + ci->count += count;
> > }
> >
> > /*
> > @@ -581,24 +478,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> > }
> >
> > /*
> > - * The cluster corresponding to page_nr decreases one usage. If the usage
> > - * counter becomes 0, which means no page in the cluster is in using, we can
> > - * optionally discard the cluster and add it to free cluster list.
> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * which means no page in the cluster is in using, we can optionally discard
>
> nit: "in using" -> "in use"
Done. BTW, the "in using" is from the original code before my patch :-)
>
> > + * the cluster and add it to free cluster list.
> > */
> > -static void dec_cluster_info_page(struct swap_info_struct *p,
> > - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> > {
> > - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > -
> > - if (!cluster_info)
> > + if (!p->cluster_info)
> > return;
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) - 1);
> > + VM_BUG_ON(ci->count == 0);
> > + ci->count--;
> >
> > - if (cluster_count(&cluster_info[idx]) == 0)
> > - free_cluster(p, idx);
> > + if (!ci->count)
> > + free_cluster(p, ci);
> > }
> >
> > /*
> > @@ -611,10 +504,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> > {
> > struct percpu_cluster *percpu_cluster;
> > bool conflict;
> > -
> > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>
> nit: long line; consider splitting variable declaration and assignment.
Done.
>
> nit: You're removed the blank line between variable declarations and statements;
> did you run checkpatch.pl?
Not yet. I will make sure I run it on the next revision. Thanks for
the reminder.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-15 15:40 ` Ryan Roberts
@ 2024-07-16 22:46 ` Chris Li
2024-07-17 10:14 ` Ryan Roberts
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-16 22:46 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 11/07/2024 08:29, Chris Li wrote:
> > Track the nonfull cluster as well as the empty cluster
> > on lists. Each order has one nonfull cluster list.
> >
> > The cluster will remember which order it was used during
> > new cluster allocation.
> >
> > When the cluster has free entry, add to the nonfull[order]
> > list. When the free cluster list is empty, also allocate
> > from the nonempty list of that order.
> >
> > This improves the mTHP swap allocation success rate.
> >
> > There are limitations if the distribution of the numbers of
> > different mTHP orders changes a lot, e.g. a lot of nonfull
> > clusters are assigned to order A, while later there are a lot
> > of order B allocations and very little allocation in order A.
> > Currently a cluster used by order A will not be reused by
> > order B unless the cluster is 100% empty.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > include/linux/swap.h | 4 ++++
> > mm/swapfile.c | 34 +++++++++++++++++++++++++++++++---
> > 2 files changed, 35 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index e9be95468fc7..db8d6000c116 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -254,9 +254,11 @@ struct swap_cluster_info {
> > */
> > u16 count;
> > u8 flags;
> > + u8 order;
> > struct list_head list;
> > };
> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >
> >
> > /*
> > @@ -296,6 +298,8 @@ struct swap_info_struct {
> > unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > struct list_head free_clusters; /* free clusters list */
> > + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> > + /* list of cluster that contains at least one free slot */
> > unsigned int lowest_bit; /* index of first free in swap_map */
> > unsigned int highest_bit; /* index of last free in swap_map */
> > unsigned int pages; /* total of usable pages of swap */
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f70d25005d2c..e13a33664cfa 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > - list_add_tail(&ci->list, &si->discard_clusters);
> > + if (ci->flags)
>
> I'm not sure this is future proof; what happens if a flag is added in future
> that does not indicate that the cluster is on a list. Perhaps explicitly check
> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
Currently flags are only used to track which list it is on. BTW, this
line was changed to explicitly check which list in patch 3, the big
rewrite. I can move that line change to patch 2 if you want.
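For reference, the explicit check being discussed would look roughly
like this (a hedged sketch using the flag from this patch; the exact
form patch 3 ends up with may differ):

        if (ci->flags & CLUSTER_FLAG_NONFULL)
                list_move_tail(&ci->list, &si->discard_clusters);
        else
                list_add_tail(&ci->list, &si->discard_clusters);
        ci->flags = 0;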
>
> > + list_move_tail(&ci->list, &si->discard_clusters);
> > + else
> > + list_add_tail(&ci->list, &si->discard_clusters);
> > + ci->flags = 0;
>
> Bug: (I think?) the cluster ends up on the discard_clusters list and
> swap_do_scheduled_discard() calls __free_cluster() which will then call
swap_do_scheduled_discard() deletes the entry from the discard list.
The flag does not track the discard list state.
> list_add_tail() to put it on the free_clusters list. But since it is on the
> discard_list at that point, shouldn't it call list_move_tail()?
See above. Calling list_move_tail() there would be a mistake.
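To spell out the flow from the quoted code: the cluster is unlinked
from discard_clusters before __free_cluster() runs, so the plain
list_add_tail() onto free_clusters there is correct. Condensed from the
quoted swap_do_scheduled_discard(), not new code:

        while (!list_empty(&si->discard_clusters)) {
                ci = list_first_entry(&si->discard_clusters,
                                      struct swap_cluster_info, list);
                list_del(&ci->list);    /* off the discard list now */
                /* discard the range with si->lock temporarily dropped */
                __free_cluster(si, ci); /* list_add_tail() onto free_clusters */
        }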
>
> > schedule_work(&si->discard_work);
> > }
> >
> > static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > + if (ci->flags & CLUSTER_FLAG_NONFULL)
> > + list_move_tail(&ci->list, &si->free_clusters);
> > + else
> > + list_add_tail(&ci->list, &si->free_clusters);
> > ci->flags = CLUSTER_FLAG_FREE;
> > - list_add_tail(&ci->list, &si->free_clusters);
> > }
> >
> > /*
> > @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> > ci->count--;
> >
> > if (!ci->count)
> > - free_cluster(p, ci);
> > + return free_cluster(p, ci);
>
> nit: I'm not sure what the kernel style guide says about this, but I'm not a
> huge fan of returning void. I'd find it clearer if you just turn the below `if`
> into an `else if`.
I try to avoid 'else if' if possible.
Changed to
if (!ci->count) {
free_cluster(p, ci);
return;
}
>
> > +
> > + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> > + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>
> I find the transitions when you add and remove a cluster from the
> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> to the list whenever there is at least one free swap entry if not already on the
> list. But you take it off the list when assigning it as the current cluster for
> a cpu in scan_swap_map_try_ssd_cluster().
>
> So you could have this situation:
>
> - cpuA allocs cluster from free list (exclusive to that cpu)
> - cpuA allocs 1 swap entry from current cluster
> - swap entry is freed; cluster added to nonfull_clusters
> - cpuB "allocs" cluster from nonfull_clusters
>
> At this point both cpuA and cpuB share the same cluster as their current
> cluster. So why not just put the cluster on the nonfull_clusters list at
> allocation time (when removed from free_list) and only remove it from the
The big rewrite on patch 3 does that, taking it off the free list and
moving it into nonfull.
I am only making the minimal change in this step so the big rewrite can land.
> nonfull_clusters list when it is completely full (or at least definitely doesn't
> have room for an `order` allocation)? Then you allow "stealing" always instead
> of just sometimes. You would likely want to move the cluster to the end of the
> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> chances of multiple CPUs using the same cluster.
For nonfull clusters it is less important to avoid multiple CPUs
sharing the cluster, because the cluster already has swap entries
allocated by a previous CPU. Those behaviors will be fine
tuned after the patch 3 big rewrite. Try to make this patch simple.
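As a rough sketch of the patch 3 direction mentioned above (not the
actual patch 3 code): a cluster leaving the free list is published on
the per-order nonfull list right away instead of staying CPU-private
until an entry is freed:

        ci = list_first_entry(&si->free_clusters,
                              struct swap_cluster_info, list);
        ci->order = order;
        ci->flags = CLUSTER_FLAG_NONFULL;
        list_move_tail(&ci->list, &si->nonfull_clusters[order]);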
> Another potential optimization (which was in my hacked version IIRC) is to only
> add/remove from nonfull list when `total - count` crosses the (1 << order)
> boundary rather than when becoming completely full. You definitely won't be able
> to allocate order-2 if there are only 3 pages available, for example.
That is in patch 3 as well. This patch is just doing the bare minimum
to introduce the nonfull list.
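As an illustration only, such a boundary check could be a hypothetical
helper like this (the name and exact condition are not from the series):

        static inline bool cluster_has_room(struct swap_cluster_info *ci,
                                            int order)
        {
                /* free slots left vs. the size of one order allocation */
                return SWAPFILE_CLUSTER - ci->count >= (1 << order);
        }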
>
> > + ci->flags |= CLUSTER_FLAG_NONFULL;
> > + }
> > }
> >
> > /*
> > @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> > if (tmp == SWAP_NEXT_INVALID) {
> > if (!list_empty(&si->free_clusters)) {
> > ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > + list_del(&ci->list);
> > + spin_lock(&ci->lock);
> > + ci->order = order;
> > + ci->flags = 0;
> > + spin_unlock(&ci->lock);
> > + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> > + } else if (!list_empty(&si->nonfull_clusters[order])) {
> > + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
> > + list_del(&ci->list);
> > + spin_lock(&ci->lock);
> > + ci->flags = 0;
> > + spin_unlock(&ci->lock);
> > tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> > } else if (!list_empty(&si->discard_clusters)) {
> > /*
> > @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > ci = lock_cluster(si, offset);
> > memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > ci->count = 0;
> > + ci->order = 0;
> > ci->flags = 0;
>
> Wonder if it would be better to put this in __free_cluster()?
Both flags and order were moved to __free_cluster() in patch 3 of this
series. The order is best assigned together with flags when the
cluster changes the list.
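A hedged sketch of what that could look like after patch 3 (details may
differ from the actual patch):

        static void __free_cluster(struct swap_info_struct *si,
                                   struct swap_cluster_info *ci)
        {
                if (ci->flags & CLUSTER_FLAG_NONFULL)
                        list_move_tail(&ci->list, &si->free_clusters);
                else
                        list_add_tail(&ci->list, &si->free_clusters);
                ci->order = 0;
                ci->flags = CLUSTER_FLAG_FREE;
        }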
Thanks for the review. The patch 3 big rewrite is the heavy lifting.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-16 22:46 ` Chris Li
@ 2024-07-17 10:14 ` Ryan Roberts
2024-07-17 15:41 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-17 10:14 UTC (permalink / raw)
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On 16/07/2024 23:46, Chris Li wrote:
> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 11/07/2024 08:29, Chris Li wrote:
>>> Track the nonfull cluster as well as the empty cluster
>>> on lists. Each order has one nonfull cluster list.
>>>
>>> The cluster will remember which order it was used during
>>> new cluster allocation.
>>>
>>> When the cluster has free entry, add to the nonfull[order]
>>> list. When the free cluster list is empty, also allocate
>>> from the nonempty list of that order.
>>>
>>> This improves the mTHP swap allocation success rate.
>>>
>>> There are limitations if the distribution of the numbers of
>>> different mTHP orders changes a lot, e.g. a lot of nonfull
>>> clusters are assigned to order A, while later there are a lot
>>> of order B allocations and very little allocation in order A.
>>> Currently a cluster used by order A will not be reused by
>>> order B unless the cluster is 100% empty.
>>>
>>> Signed-off-by: Chris Li <chrisl@kernel.org>
>>> ---
>>> include/linux/swap.h | 4 ++++
>>> mm/swapfile.c | 34 +++++++++++++++++++++++++++++++---
>>> 2 files changed, 35 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>> index e9be95468fc7..db8d6000c116 100644
>>> --- a/include/linux/swap.h
>>> +++ b/include/linux/swap.h
>>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>>> */
>>> u16 count;
>>> u8 flags;
>>> + u8 order;
>>> struct list_head list;
>>> };
>>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>>
>>>
>>> /*
>>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>>> unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
>>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>> struct list_head free_clusters; /* free clusters list */
>>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>>> + /* list of cluster that contains at least one free slot */
>>> unsigned int lowest_bit; /* index of first free in swap_map */
>>> unsigned int highest_bit; /* index of last free in swap_map */
>>> unsigned int pages; /* total of usable pages of swap */
>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>> index f70d25005d2c..e13a33664cfa 100644
>>> --- a/mm/swapfile.c
>>> +++ b/mm/swapfile.c
>>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>>>
>>> - list_add_tail(&ci->list, &si->discard_clusters);
>>> + if (ci->flags)
>>
>> I'm not sure this is future proof; what happens if a flag is added in future
>> that does not indicate that the cluster is on a list. Perhaps explicitly check
>> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
>
> Currently flags are only used to track which list it is on.
Yes, I get that it works correctly at the moment. I just don't think it's wise
for the code to assume that any flag being set means it's on a list; that feels
fragile for the future.
> BTW, this
> line was changed to explicitly check which list in patch 3, the big
> rewrite. I can move that line change to patch 2 if you want.
That would get my vote; let's make every patch as good as it can be.
>
>>
>>> + list_move_tail(&ci->list, &si->discard_clusters);
>>> + else
>>> + list_add_tail(&ci->list, &si->discard_clusters);
>>> + ci->flags = 0;
>>
>> Bug: (I think?) the cluster ends up on the discard_clusters list and
>> swap_do_scheduled_discard() calls __free_cluster() which will then call
>
> swap_do_scheduled_discard() deletes the entry from the discard list.
Ahh yes, my bad!
> The flag does not track the discard list state.
>
>> list_add_tail() to put it on the free_clusters list. But since it is on the
>> discard_list at that point, shouldn't it call list_move_tail()?
>
> See above. Calling list_move_tail() there would be a mistake.
>
>>
>>> schedule_work(&si->discard_work);
>>> }
>>>
>>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>> {
>>> + if (ci->flags & CLUSTER_FLAG_NONFULL)
>>> + list_move_tail(&ci->list, &si->free_clusters);
>>> + else
>>> + list_add_tail(&ci->list, &si->free_clusters);
>>> ci->flags = CLUSTER_FLAG_FREE;
>>> - list_add_tail(&ci->list, &si->free_clusters);
>>> }
>>>
>>> /*
>>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>> ci->count--;
>>>
>>> if (!ci->count)
>>> - free_cluster(p, ci);
>>> + return free_cluster(p, ci);
>>
>> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>> into an `else if`.
>
> I try to avoid 'else if' if possible.
> Changed to
> if (!ci->count) {
> free_cluster(p, ci);
> return;
> }
ok
>
>>
>>> +
>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>
>> I find the transitions when you add and remove a cluster from the
>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> to the list whenever there is at least one free swap entry if not already on the
>> list. But you take it off the list when assigning it as the current cluster for
>> a cpu in scan_swap_map_try_ssd_cluster().
>>
>> So you could have this situation:
>>
>> - cpuA allocs cluster from free list (exclusive to that cpu)
>> - cpuA allocs 1 swap entry from current cluster
>> - swap entry is freed; cluster added to nonfull_clusters
>> - cpuB "allocs" cluster from nonfull_clusters
>>
>> At this point both cpuA and cpuB share the same cluster as their current
>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> allocation time (when removed from free_list) and only remove it from the
>
> The big rewrite on patch 3 does that, taking it off the free list and
> moving it into nonfull.
Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
scan_swap_map_slots()" I assumed that was just a refactoring of the code to
separate the SSD and HDD code paths. Personally I'd prefer to see the
refactoring separated from behavioural changes.
Since the patch was titled RFC and I thought it was just refactoring, I was
deferring review. But sounds like it is actually required to realize the test
results quoted on the cover letter?
> I am only making the minimal change in this step so the big rewrite can land.
>
>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> of just sometimes. You would likely want to move the cluster to the end of the
>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> chances of multiple CPUs using the same cluster.
>
> For nonfull clusters it is less important to avoid multiple CPUs
> sharing the cluster, because the cluster already has swap entries
> allocated by a previous CPU.
But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
could be slightly ahead of cpuB so that cpuA allocates all the free pages and
cpuB just ends up scanning and finding nothing to allocate? I think we do want to
share the cluster when you really need to, but try to avoid it if there are
other options, and I think moving the cluster to the end of the list might be a
way to help that?
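The rotation being suggested could look roughly like this (hypothetical
sketch, not code from the series):

        ci = list_first_entry(&si->nonfull_clusters[order],
                              struct swap_cluster_info, list);
        /* rotate to the tail so the next CPU is steered elsewhere */
        list_move_tail(&ci->list, &si->nonfull_clusters[order]);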
> Those behaviors will be fine
> tuned after the patch 3 big rewrite. Try to make this patch simple.
>
>> Another potential optimization (which was in my hacked version IIRC) is to only
>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>> boundary rather than when becoming completely full. You definitely won't be able
>> to allocate order-2 if there are only 3 pages available, for example.
>
> That is in patch 3 as well. This patch is just doing the bare minimum
> to introduce the nonfull list.
>
>>
>>> + ci->flags |= CLUSTER_FLAG_NONFULL;
>>> + }
>>> }
>>>
>>> /*
>>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>> if (tmp == SWAP_NEXT_INVALID) {
>>> if (!list_empty(&si->free_clusters)) {
>>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>>> + list_del(&ci->list);
>>> + spin_lock(&ci->lock);
>>> + ci->order = order;
>>> + ci->flags = 0;
>>> + spin_unlock(&ci->lock);
>>> + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>> + } else if (!list_empty(&si->nonfull_clusters[order])) {
>>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>>> + list_del(&ci->list);
>>> + spin_lock(&ci->lock);
>>> + ci->flags = 0;
>>> + spin_unlock(&ci->lock);
>>> tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>> } else if (!list_empty(&si->discard_clusters)) {
>>> /*
>>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>> ci = lock_cluster(si, offset);
>>> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>> ci->count = 0;
>>> + ci->order = 0;
>>> ci->flags = 0;
>>
>> Wonder if it would be better to put this in __free_cluster()?
>
> Both flags and order were moved to __free_cluster() in patch 3 of this
> series. The order is best assigned together with flags when the
> cluster changes the list.
>
> Thanks for the review. The patch 3 big rewrite is the heavy lifting.
OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
the series? I'll try to take a look at it today.
>
> Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-17 10:14 ` Ryan Roberts
@ 2024-07-17 15:41 ` Chris Li
2024-07-18 7:53 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-17 15:41 UTC (permalink / raw)
To: Ryan Roberts
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 16/07/2024 23:46, Chris Li wrote:
> > On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 11/07/2024 08:29, Chris Li wrote:
> >>> Track the nonfull cluster as well as the empty cluster
> >>> on lists. Each order has one nonfull cluster list.
> >>>
> >>> The cluster will remember which order it was used during
> >>> new cluster allocation.
> >>>
> >>> When the cluster has free entry, add to the nonfull[order]
> >>> list. When the free cluster list is empty, also allocate
> >>> from the nonempty list of that order.
> >>>
> >>> This improves the mTHP swap allocation success rate.
> >>>
> >>> There are limitations if the distribution of the numbers of
> >>> different mTHP orders changes a lot, e.g. a lot of nonfull
> >>> clusters are assigned to order A, while later there are a lot
> >>> of order B allocations and very little allocation in order A.
> >>> Currently a cluster used by order A will not be reused by
> >>> order B unless the cluster is 100% empty.
> >>>
> >>> Signed-off-by: Chris Li <chrisl@kernel.org>
> >>> ---
> >>> include/linux/swap.h | 4 ++++
> >>> mm/swapfile.c | 34 +++++++++++++++++++++++++++++++---
> >>> 2 files changed, 35 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
> >>> index e9be95468fc7..db8d6000c116 100644
> >>> --- a/include/linux/swap.h
> >>> +++ b/include/linux/swap.h
> >>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
> >>> */
> >>> u16 count;
> >>> u8 flags;
> >>> + u8 order;
> >>> struct list_head list;
> >>> };
> >>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
> >>>
> >>>
> >>> /*
> >>> @@ -296,6 +298,8 @@ struct swap_info_struct {
> >>> unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> >>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> >>> struct list_head free_clusters; /* free clusters list */
> >>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
> >>> + /* list of cluster that contains at least one free slot */
> >>> unsigned int lowest_bit; /* index of first free in swap_map */
> >>> unsigned int highest_bit; /* index of last free in swap_map */
> >>> unsigned int pages; /* total of usable pages of swap */
> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
> >>> index f70d25005d2c..e13a33664cfa 100644
> >>> --- a/mm/swapfile.c
> >>> +++ b/mm/swapfile.c
> >>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> >>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> >>> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >>>
> >>> - list_add_tail(&ci->list, &si->discard_clusters);
> >>> + if (ci->flags)
> >>
> >> I'm not sure this is future proof; what happens if a flag is added in future
> >> that does not indicate that the cluster is on a list. Perhaps explicitly check
> >> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
> >
> > Currently flags are only used to track which list it is on.
>
> Yes, I get that it works correctly at the moment. I just don't think it's wise
> for the code to assume that any flag being set means it's on a list; that feels
> fragile for the future.
ACK.
>
> > BTW, this
> > line was changed to explicitly check which list in patch 3, the big
> > rewrite. I can move that line change to patch 2 if you want.
>
> That would get my vote; let's make every patch as good as it can be.
Done.
>
> >
> >>
> >>> + list_move_tail(&ci->list, &si->discard_clusters);
> >>> + else
> >>> + list_add_tail(&ci->list, &si->discard_clusters);
> >>> + ci->flags = 0;
> >>
> >> Bug: (I think?) the cluster ends up on the discard_clusters list and
> >> swap_do_scheduled_discard() calls __free_cluster() which will then call
> >
> > swap_do_scheduled_discard() deletes the entry from the discard list.
>
> Ahh yes, my bad!
>
> > The flag does not track the discard list state.
> >
> >> list_add_tail() to put it on the free_clusters list. But since it is on the
> >> discard_list at that point, shouldn't it call list_move_tail()?
> >
> > See above. Calling list_move_tail() there would be a mistake.
> >
> >>
> >>> schedule_work(&si->discard_work);
> >>> }
> >>>
> >>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> >>> {
> >>> + if (ci->flags & CLUSTER_FLAG_NONFULL)
> >>> + list_move_tail(&ci->list, &si->free_clusters);
> >>> + else
> >>> + list_add_tail(&ci->list, &si->free_clusters);
> >>> ci->flags = CLUSTER_FLAG_FREE;
> >>> - list_add_tail(&ci->list, &si->free_clusters);
> >>> }
> >>>
> >>> /*
> >>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
> >>> ci->count--;
> >>>
> >>> if (!ci->count)
> >>> - free_cluster(p, ci);
> >>> + return free_cluster(p, ci);
> >>
> >> nit: I'm not sure what the kernel style guide says about this, but I'm not a
> >> huge fan of returning void. I'd find it clearer if you just turn the below `if`
> >> into an `else if`.
> >
> > I try to avoid 'else if' if possible.
> > Changed to
> > if (!ci->count) {
> > free_cluster(p, ci);
> > return;
> > }
>
> ok
>
> >
> >>
> >>> +
> >>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >>
> >> I find the transitions when you add and remove a cluster from the
> >> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >> to the list whenever there is at least one free swap entry if not already on the
> >> list. But you take it off the list when assigning it as the current cluster for
> >> a cpu in scan_swap_map_try_ssd_cluster().
> >>
> >> So you could have this situation:
> >>
> >> - cpuA allocs cluster from free list (exclusive to that cpu)
> >> - cpuA allocs 1 swap entry from current cluster
> >> - swap entry is freed; cluster added to nonfull_clusters
> >> - cpuB "allocs" cluster from nonfull_clusters
> >>
> >> At this point both cpuA and cpuB share the same cluster as their current
> >> cluster. So why not just put the cluster on the nonfull_clusters list at
> >> allocation time (when removed from free_list) and only remove it from the
> >
> > The big rewrite on patch 3 does that, taking it off the free list and
> > moving it into nonfull.
>
> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> separate the SSD and HDD code paths. Personally I'd prefer to see the
> refactoring separated from behavioural changes.
It is not a refactoring. It is a big rewrite of the swap allocator
based on clusters. Behavior change is expected. The goal is to
completely remove the brute-force scanning of the swap_map[] array for
cluster-based swap allocation.
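To illustrate the direction (a hedged sketch, not the actual patch 3
code): finding a candidate cluster becomes a list lookup instead of a
swap_map[] scan, roughly:

        ci = list_first_entry_or_null(&si->free_clusters,
                                      struct swap_cluster_info, list);
        if (!ci)
                ci = list_first_entry_or_null(&si->nonfull_clusters[order],
                                              struct swap_cluster_info, list);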
>
> Since the patch was titled RFC and I thought it was just refactoring, I was
> deferring review. But sounds like it is actually required to realize the test
> results quoted on the cover letter?
Yes, it is required, because it handles the fallback case where
try_ssd() previously failed. This big rewrite has gone through a lot
of testing and bug fixing. It is pretty stable now. The only reason I
keep it as RFC is that it is not feature complete: it does not do swap
cache reclaim yet. The next version will have swap cache reclaim and
remove the RFC.
>
> > I am only making the minimal change in this step so the big rewrite can land.
> >
> >> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >> have room for an `order` allocation)? Then you allow "stealing" always instead
> >> of just sometimes. You would likely want to move the cluster to the end of the
> >> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >> chances of multiple CPUs using the same cluster.
> >
> > For nonfull clusters it is less important to avoid multiple CPUs
> > sharing the cluster, because the cluster already has swap entries
> > allocated by a previous CPU.
>
> But if 2 CPUs have the same cluster, isn't there a pathological case where cpuA
> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
That can already happen with the existing per-CPU next pointer. When
one CPU advances its next-cluster pointer, it can cross the other
CPU's next-cluster pointer.
> cpuB just ends up scanning and finding nothing to allocate? I think we do want to
> share the cluster when you really need to, but try to avoid it if there are
> other options, and I think moving the cluster to the end of the list might be a
> way to help that?
Simply moving the cluster to the end of the list can create a
possible endless loop once all clusters have been scanned and no
available swap range was found.
We have tried many different approaches, including moving to the end
of the list. It can cause more fragmentation, because each CPU then
allocates its swap slot cache (64 entries) from a different cluster.
> > Those behaviors will be fine
> > tuned after the patch 3 big rewrite. Try to make this patch simple.
Again, I want to keep it simple here so patch 3 can land.
> >> Another potential optimization (which was in my hacked version IIRC) is to only
> >> add/remove from nonfull list when `total - count` crosses the (1 << order)
> >> boundary rather than when becoming completely full. You definitely won't be able
> >> to allocate order-2 if there are only 3 pages available, for example.
> >
> > That is in patch 3 as well. This patch is just doing the bare minimum
> > to introduce the nonfull list.
> >
> >>
> >>> + ci->flags |= CLUSTER_FLAG_NONFULL;
> >>> + }
> >>> }
> >>>
> >>> /*
> >>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >>> if (tmp == SWAP_NEXT_INVALID) {
> >>> if (!list_empty(&si->free_clusters)) {
> >>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >>> + list_del(&ci->list);
> >>> + spin_lock(&ci->lock);
> >>> + ci->order = order;
> >>> + ci->flags = 0;
> >>> + spin_unlock(&ci->lock);
> >>> + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >>> + } else if (!list_empty(&si->nonfull_clusters[order])) {
> >>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
> >>> + list_del(&ci->list);
> >>> + spin_lock(&ci->lock);
> >>> + ci->flags = 0;
> >>> + spin_unlock(&ci->lock);
> >>> tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> >>> } else if (!list_empty(&si->discard_clusters)) {
> >>> /*
> >>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> >>> ci = lock_cluster(si, offset);
> >>> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> >>> ci->count = 0;
> >>> + ci->order = 0;
> >>> ci->flags = 0;
> >>
> >> Wonder if it would be better to put this in __free_cluster()?
> >
> > Both flags and order were moved to __free_cluster() in patch 3 of this
> > series. The order is best assigned together with flags when the
> > cluster changes the list.
> >
> > Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>
> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
> the series? I'll try to take a look at it today.
Yes, it is the cluster swap allocator big rewrite.
Thank you for taking a look.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-11 7:29 [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (3 preceding siblings ...)
2024-07-11 10:02 ` [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order Ryan Roberts
@ 2024-07-18 5:50 ` Huang, Ying
2024-07-26 5:51 ` Chris Li
4 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-18 5:50 UTC (permalink / raw)
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
>
> When commit 845982eb264bc "mm: swap: allow storage of all mTHP
> orders" is introduced, it only allocates the mTHP swap entries
> from the new empty cluster list. It has a fragmentation issue
> reported by Barry.
>
> https://lore.kernel.org/all/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
>
> The reason is that all the empty clusters have been exhausted while
> there are plenty of free swap entries in the cluster that are
> not 100% free.
>
> Remember the swap allocation order in the cluster.
> Keep track of the per order non full cluster list for later allocation.
>
> The patch 3 of this series gives the swap SSD allocation
> a new separate code path from the HDD allocation. The new allocator
> use cluster list only and do not global scan swap_map[] without lock
> any more.
>
> This streamline the swap allocation for SSD. The code matches the execution
> flow much better.
>
> User impact: For users that allocate and free mix order mTHP swapping,
> It greatly improves the success rate of the mTHP swap allocation after the
> initial phase.
>
> It also performs faster when the swapfile is close to full, because the
> allocator can get the non full cluster from a list rather than scanning
> a lot of swap_map entries.
>
> This series still lacks the swap cache reclaim feature. The reclaim series
> of patches are under development and testing right now. Will post the
> mail list soon. For this reason, the patch 3 is consider RFC and not
> ready to merge.
>
> With Barry's mthp test program V2:
>
> Without:
> $ ./thp_swap_allocator_test -a
> Iteration 1: swpout inc: 32, swpout fallback inc: 192, Fallback percentage: 85.71%
> Iteration 2: swpout inc: 0, swpout fallback inc: 231, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 227, Fallback percentage: 100.00%
> ...
> Iteration 98: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 215, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
>
> $ ./thp_swap_allocator_test -a -s
> Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
> ..
> Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>
> $ ./thp_swap_allocator_test -s
> Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
> ..
> Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>
> $ ./thp_swap_allocator_test
> Iteration 1: swpout inc: 0, swpout fallback inc: 224, Fallback percentage: 100.00%
> Iteration 2: swpout inc: 0, swpout fallback inc: 218, Fallback percentage: 100.00%
> Iteration 3: swpout inc: 0, swpout fallback inc: 222, Fallback percentage: 100.00%
> ..
> Iteration 98: swpout inc: 0, swpout fallback inc: 228, Fallback percentage: 100.00%
> Iteration 99: swpout inc: 0, swpout fallback inc: 230, Fallback percentage: 100.00%
> Iteration 100: swpout inc: 0, swpout fallback inc: 229, Fallback percentage: 100.00%
>
> With:
> $ ./thp_swap_allocator_test -a
> Iteration 1: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> ...
> Iteration 98: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 99: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 100: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
>
> $ ./thp_swap_allocator_test -a -s
> Iteration 1: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 3: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 4: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 5: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 6: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 7: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 8: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 9: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 10: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 11: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 12: swpout inc: 232, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 13: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 14: swpout inc: 223, swpout fallback inc: 3, Fallback percentage: 1.33%
> Iteration 15: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 16: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 17: swpout inc: 212, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 18: swpout inc: 234, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 19: swpout inc: 220, swpout fallback inc: 6, Fallback percentage: 2.65%
> Iteration 20: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 21: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 22: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 23: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 24: swpout inc: 232, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 25: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 26: swpout inc: 230, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 27: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 28: swpout inc: 225, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 29: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 30: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 31: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 32: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 33: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 34: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 35: swpout inc: 230, swpout fallback inc: 3, Fallback percentage: 1.29%
> Iteration 36: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 37: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 38: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 39: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 40: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 41: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 42: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 43: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 44: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 45: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 46: swpout inc: 221, swpout fallback inc: 2, Fallback percentage: 0.90%
> Iteration 47: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 48: swpout inc: 220, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 49: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 50: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 51: swpout inc: 224, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 52: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 53: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 54: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 55: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 56: swpout inc: 226, swpout fallback inc: 2, Fallback percentage: 0.88%
> Iteration 57: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 58: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 59: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 60: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 61: swpout inc: 217, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 62: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 63: swpout inc: 223, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 64: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 65: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 66: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 67: swpout inc: 220, swpout fallback inc: 2, Fallback percentage: 0.90%
> Iteration 68: swpout inc: 224, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 69: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 70: swpout inc: 219, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 71: swpout inc: 225, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 72: swpout inc: 231, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 73: swpout inc: 218, swpout fallback inc: 5, Fallback percentage: 2.24%
> Iteration 74: swpout inc: 223, swpout fallback inc: 5, Fallback percentage: 2.19%
> Iteration 75: swpout inc: 222, swpout fallback inc: 7, Fallback percentage: 3.06%
> Iteration 76: swpout inc: 226, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 77: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 78: swpout inc: 215, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 79: swpout inc: 223, swpout fallback inc: 2, Fallback percentage: 0.89%
> Iteration 80: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 81: swpout inc: 218, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 82: swpout inc: 228, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 83: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 84: swpout inc: 222, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 85: swpout inc: 213, swpout fallback inc: 1, Fallback percentage: 0.47%
> Iteration 86: swpout inc: 215, swpout fallback inc: 8, Fallback percentage: 3.59%
> Iteration 87: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 88: swpout inc: 227, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 89: swpout inc: 222, swpout fallback inc: 6, Fallback percentage: 2.63%
> Iteration 90: swpout inc: 224, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 91: swpout inc: 214, swpout fallback inc: 1, Fallback percentage: 0.47%
> Iteration 92: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 93: swpout inc: 221, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 94: swpout inc: 223, swpout fallback inc: 2, Fallback percentage: 0.89%
> Iteration 95: swpout inc: 222, swpout fallback inc: 1, Fallback percentage: 0.45%
> Iteration 96: swpout inc: 223, swpout fallback inc: 4, Fallback percentage: 1.76%
> Iteration 97: swpout inc: 223, swpout fallback inc: 7, Fallback percentage: 3.04%
> Iteration 98: swpout inc: 227, swpout fallback inc: 1, Fallback percentage: 0.44%
> Iteration 99: swpout inc: 229, swpout fallback inc: 1, Fallback percentage: 0.43%
> Iteration 100: swpout inc: 229, swpout fallback inc: 0, Fallback percentage: 0.00%
>
> $ ./thp_swap_allocator_test
> Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 134, swpout fallback inc: 98, Fallback percentage: 42.24%
> Iteration 3: swpout inc: 72, swpout fallback inc: 154, Fallback percentage: 68.14%
> Iteration 4: swpout inc: 40, swpout fallback inc: 183, Fallback percentage: 82.06%
> Iteration 5: swpout inc: 27, swpout fallback inc: 199, Fallback percentage: 88.05%
> Iteration 6: swpout inc: 22, swpout fallback inc: 202, Fallback percentage: 90.18%
> Iteration 7: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
> Iteration 8: swpout inc: 14, swpout fallback inc: 214, Fallback percentage: 93.86%
> Iteration 9: swpout inc: 5, swpout fallback inc: 221, Fallback percentage: 97.79%
> Iteration 10: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
> ...
> Iteration 97: swpout inc: 12, swpout fallback inc: 207, Fallback percentage: 94.52%
> Iteration 98: swpout inc: 8, swpout fallback inc: 219, Fallback percentage: 96.48%
> Iteration 99: swpout inc: 16, swpout fallback inc: 218, Fallback percentage: 93.16%
> Iteration 100: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
>
> $ ./thp_swap_allocator_test -s
> Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
> Iteration 2: swpout inc: 84, swpout fallback inc: 148, Fallback percentage: 63.79%
> Iteration 3: swpout inc: 39, swpout fallback inc: 195, Fallback percentage: 83.33%
> Iteration 4: swpout inc: 16, swpout fallback inc: 217, Fallback percentage: 93.13%
> Iteration 5: swpout inc: 11, swpout fallback inc: 214, Fallback percentage: 95.11%
> Iteration 6: swpout inc: 10, swpout fallback inc: 218, Fallback percentage: 95.61%
> ...
> Iteration 96: swpout inc: 5, swpout fallback inc: 225, Fallback percentage: 97.83%
> Iteration 97: swpout inc: 2, swpout fallback inc: 215, Fallback percentage: 99.08%
> Iteration 98: swpout inc: 2, swpout fallback inc: 220, Fallback percentage: 99.10%
> Iteration 99: swpout inc: 4, swpout fallback inc: 222, Fallback percentage: 98.23%
> Iteration 100: swpout inc: 3, swpout fallback inc: 221, Fallback percentage: 98.66%
>
> Kernel compile under tmpfs with cgroup memory.max = 2G.
> 12 core 24 hyperthreading, 32 jobs.
>
> HDD swap 3 runs average, 20G swap file:
>
> Without:
> user 4186.290
> system 421.743
> real 597.317
>
> With:
> user 4113.897
> system 413.123
> real 659.543
>
> SSD swap 10 runs average, 20G swap partition:
>
> Without:
> user 4736.810
> system 500.921
> real 250.243
>
> With:
> user 4729.478
> system 500.265
> real 249.633
>
> Two zram swap:
> zram0 1.4G zram1 20G.
> The idea is forcing the zram0 almost
> full then overflow to zram1:
>
> Two zram 10 runs average:
>
> Without:
> user 4600.693
> system 384.105
> real 238.735
>
> With:
> user 4604.502
> system 382.087
> real 239.063
>
> Reported-by: Barry Song <21cnbao@gmail.com>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> Changes in v4:
> - Remove a warning in patch 2.
> - Allocating from the free cluster list before the nonfull list. Revert the v3 behavior.
> - Add cluster_index and cluster_offset function.
> - Patch 3 has a new allocating path for SSD.
> - HDD swap allocation does not need to consider clusters any more.
It appears that my comments in the following emails have been ignored?
https://lore.kernel.org/linux-mm/87bk3pzr5p.fsf@yhuang6-desk2.ccr.corp.intel.com/
https://lore.kernel.org/linux-mm/874j9hzqr3.fsf@yhuang6-desk2.ccr.corp.intel.com/
> changes in v3:
> - Using V1 as base.
> - Rename "next" to "list" for the list field, suggested by Ying.
> - Update comment for the locking rules for cluster fields and list,
> suggested by Ying.
> - Allocate from the nonfull list before attempting free list, suggested
> by Kairui.
> - Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org
>
> Changes in v2:
> - Abandoned.
> - Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
>
> ---
> Chris Li (3):
> mm: swap: swap cluster switch to double link list
> mm: swap: mTHP allocate swap entries from nonfull list
> RFC: mm: swap: seperate SSD allocation from scan_swap_map_slots()
>
> include/linux/swap.h | 30 ++--
> mm/swapfile.c | 490 +++++++++++++++++++++++----------------------------
> 2 files changed, 238 insertions(+), 282 deletions(-)
> ---
> base-commit: ff3a648ecb9409aff1448cf4f6aa41d78c69a3bc
> change-id: 20240523-swap-allocator-1534c480ece4
>
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 1/3] mm: swap: swap cluster switch to double link list
2024-07-11 7:29 ` [PATCH v4 1/3] mm: swap: swap cluster switch to double link list Chris Li
2024-07-15 14:57 ` Ryan Roberts
@ 2024-07-18 6:26 ` Huang, Ying
2024-07-26 5:46 ` Chris Li
1 sibling, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-18 6:26 UTC (permalink / raw)
To: Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> Previously, the swap cluster used a cluster index as a pointer
> to construct a custom single link list type "swap_cluster_list".
> The next cluster pointer is shared with the cluster->count.
> It prevents putting a non-free cluster into a list.
>
> Change the cluster to use the standard double link list instead.
> This allows tracing the nonfull cluster in the follow up patch.
> That way, it is faster to get to the nonfull cluster of that order.
>
> Remove the cluster getter/setter for accessing the cluster
> struct member.
>
> The list operation is protected by the swap_info_struct->lock.
>
> Change cluster code to use "struct swap_cluster_info *" to
> reference the cluster rather than by using index. That is more
> consistent with the list manipulation. It avoids repeatedly
> adding the index to the cluster_info. The code is easier to understand.
>
> Remove the "cluster next pointer is NULL" flag; the double link
> list can handle the empty list pretty well.
>
> The "swap_cluster_info" struct is two pointer bigger, because
> 512 swap entries share one swap struct, it has very little impact
~~~~
swap_cluster_info ?
> on the average memory usage per swap entry. For 1TB swapfile, the
> swap cluster data structure increases from 8MB to 24MB.
>
> Other than the list conversion, there is no real function change
> in this patch.
>
> Signed-off-by: Chris Li <chrisl@kernel.org>
> ---
> include/linux/swap.h | 26 +++---
> mm/swapfile.c | 225 ++++++++++++++-------------------------------------
> 2 files changed, 70 insertions(+), 181 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e473fe6cfb7a..e9be95468fc7 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -243,22 +243,21 @@ enum {
> * free clusters are organized into a list. We fetch an entry from the list to
> * get a free cluster.
> *
> - * The data field stores next cluster if the cluster is free or cluster usage
> - * counter otherwise. The flags field determines if a cluster is free. This is
> - * protected by swap_info_struct.lock.
> + * The flags field determines if a cluster is free. This is
> + * protected by cluster lock.
> */
> struct swap_cluster_info {
> spinlock_t lock; /*
> * Protect swap_cluster_info fields
> - * and swap_info_struct->swap_map
> - * elements correspond to the swap
> - * cluster
> + * other than list, and swap_info_struct->swap_map
> + * elements correspond to the swap cluster.
> */
> - unsigned int data:24;
> - unsigned int flags:8;
> + u16 count;
> + u8 flags;
> + struct list_head list;
> };
> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> +
>
> /*
> * The first page in the swap file is the swap header, which is always marked
> @@ -283,11 +282,6 @@ struct percpu_cluster {
> unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> };
>
> -struct swap_cluster_list {
> - struct swap_cluster_info head;
> - struct swap_cluster_info tail;
> -};
> -
> /*
> * The in-memory structure used to track swap areas.
> */
> @@ -301,7 +295,7 @@ struct swap_info_struct {
> unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> - struct swap_cluster_list free_clusters; /* free clusters list */
> + struct list_head free_clusters; /* free clusters list */
> unsigned int lowest_bit; /* index of first free in swap_map */
> unsigned int highest_bit; /* index of last free in swap_map */
> unsigned int pages; /* total of usable pages of swap */
> @@ -332,7 +326,7 @@ struct swap_info_struct {
> * list.
> */
> struct work_struct discard_work; /* discard worker */
> - struct swap_cluster_list discard_clusters; /* discard clusters list */
> + struct list_head discard_clusters; /* discard clusters list */
> struct plist_node avail_lists[]; /*
> * entries in swap_avail_heads, one
> * entry per node.
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index f7224bc1320c..f70d25005d2c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> #endif
> #define LATENCY_LIMIT 256
>
> -static inline void cluster_set_flag(struct swap_cluster_info *info,
> - unsigned int flag)
> -{
> - info->flags = flag;
> -}
> -
> -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_count(struct swap_cluster_info *info,
> - unsigned int c)
> -{
> - info->data = c;
> -}
> -
> -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> - unsigned int c, unsigned int f)
> -{
> - info->flags = f;
> - info->data = c;
> -}
> -
> -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> -{
> - return info->data;
> -}
> -
> -static inline void cluster_set_next(struct swap_cluster_info *info,
> - unsigned int n)
> -{
> - info->data = n;
> -}
> -
> -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> - unsigned int n, unsigned int f)
> -{
> - info->flags = f;
> - info->data = n;
> -}
> -
> static inline bool cluster_is_free(struct swap_cluster_info *info)
> {
> return info->flags & CLUSTER_FLAG_FREE;
> }
>
> -static inline bool cluster_is_null(struct swap_cluster_info *info)
> -{
> - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> -}
> -
> -static inline void cluster_set_null(struct swap_cluster_info *info)
> +static inline unsigned int cluster_index(struct swap_info_struct *si,
> + struct swap_cluster_info *ci)
> {
> - info->flags = CLUSTER_FLAG_NEXT_NULL;
> - info->data = 0;
> + return ci - si->cluster_info;
> }
>
> static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> @@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> spin_unlock(&si->lock);
> }
>
> -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> -{
> - return cluster_is_null(&list->head);
> -}
> -
> -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> -{
> - return cluster_next(&list->head);
> -}
> -
> -static void cluster_list_init(struct swap_cluster_list *list)
> -{
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> -}
> -
> -static void cluster_list_add_tail(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci,
> - unsigned int idx)
> -{
> - if (cluster_list_empty(list)) {
> - cluster_set_next_flag(&list->head, idx, 0);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - } else {
> - struct swap_cluster_info *ci_tail;
> - unsigned int tail = cluster_next(&list->tail);
> -
> - /*
> - * Nested cluster lock, but both cluster locks are
> - * only acquired when we held swap_info_struct->lock
> - */
> - ci_tail = ci + tail;
> - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> - cluster_set_next(ci_tail, idx);
> - spin_unlock(&ci_tail->lock);
> - cluster_set_next_flag(&list->tail, idx, 0);
> - }
> -}
> -
> -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> - struct swap_cluster_info *ci)
> -{
> - unsigned int idx;
> -
> - idx = cluster_next(&list->head);
> - if (cluster_next(&list->tail) == idx) {
> - cluster_set_null(&list->head);
> - cluster_set_null(&list->tail);
> - } else
> - cluster_set_next_flag(&list->head,
> - cluster_next(&ci[idx]), 0);
> -
> - return idx;
> -}
> -
> /* Add a cluster to discard list and schedule it to do discard */
> static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> - unsigned int idx)
> + struct swap_cluster_info *ci)
> {
> + unsigned int idx = cluster_index(si, ci);
> /*
> * If scan_swap_map_slots() can't find a free cluster, it will check
> * si->swap_map directly. To make sure the discarding cluster isn't
> @@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>
> - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> -
> + list_add_tail(&ci->list, &si->discard_clusters);
> schedule_work(&si->discard_work);
> }
>
> -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> -
> - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&si->free_clusters, ci, idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->list, &si->free_clusters);
> }
>
> /*
> @@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> static void swap_do_scheduled_discard(struct swap_info_struct *si)
> {
> - struct swap_cluster_info *info, *ci;
> + struct swap_cluster_info *ci;
> unsigned int idx;
>
> - info = si->cluster_info;
> -
> - while (!cluster_list_empty(&si->discard_clusters)) {
> - idx = cluster_list_del_first(&si->discard_clusters, info);
> + while (!list_empty(&si->discard_clusters)) {
> + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> + list_del(&ci->list);
> + idx = cluster_index(si, ci);
> spin_unlock(&si->lock);
>
> discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> SWAPFILE_CLUSTER);
>
> spin_lock(&si->lock);
> - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> - __free_cluster(si, idx);
> +
> + spin_lock(&ci->lock);
> + __free_cluster(si, ci);
> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> 0, SWAPFILE_CLUSTER);
> - unlock_cluster(ci);
> + spin_unlock(&ci->lock);
> }
> }
>
> @@ -521,20 +418,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> complete(&si->comp);
> }
>
> -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> {
> - struct swap_cluster_info *ci = si->cluster_info;
> + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>
> - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> - cluster_list_del_first(&si->free_clusters, ci);
> - cluster_set_count_flag(ci + idx, 0, 0);
> + VM_BUG_ON(cluster_index(si, ci) != idx);
> + list_del(&ci->list);
> + ci->count = 0;
> + ci->flags = 0;
> + return ci;
> }
>
> -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> {
> - struct swap_cluster_info *ci = si->cluster_info + idx;
> -
> - VM_BUG_ON(cluster_count(ci) != 0);
> + VM_BUG_ON(ci->count != 0);
> /*
> * If the swap is discardable, prepare discard the cluster
> * instead of free it immediately. The cluster will be freed
> @@ -542,11 +439,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> */
> if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> - swap_cluster_schedule_discard(si, idx);
> + swap_cluster_schedule_discard(si, ci);
> return;
> }
>
> - __free_cluster(si, idx);
> + __free_cluster(si, ci);
> }
>
> /*
> @@ -559,15 +456,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> unsigned long count)
> {
> unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> + struct swap_cluster_info *ci = cluster_info + idx;
>
> if (!cluster_info)
> return;
> - if (cluster_is_free(&cluster_info[idx]))
> + if (cluster_is_free(ci))
> alloc_cluster(p, idx);
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) + count);
> + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> + ci->count += count;
> }
>
> /*
> @@ -581,24 +478,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> }
>
> /*
> - * The cluster corresponding to page_nr decreases one usage. If the usage
> - * counter becomes 0, which means no page in the cluster is in using, we can
> - * optionally discard the cluster and add it to free cluster list.
> + * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * which means no page in the cluster is in using, we can optionally discard
> + * the cluster and add it to free cluster list.
> */
> -static void dec_cluster_info_page(struct swap_info_struct *p,
> - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> {
> - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> -
> - if (!cluster_info)
> + if (!p->cluster_info)
> return;
>
> - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> - cluster_set_count(&cluster_info[idx],
> - cluster_count(&cluster_info[idx]) - 1);
> + VM_BUG_ON(ci->count == 0);
> + ci->count--;
>
> - if (cluster_count(&cluster_info[idx]) == 0)
> - free_cluster(p, idx);
> + if (!ci->count)
> + free_cluster(p, ci);
> }
>
> /*
> @@ -611,10 +504,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> {
> struct percpu_cluster *percpu_cluster;
> bool conflict;
> -
> + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> offset /= SWAPFILE_CLUSTER;
> - conflict = !cluster_list_empty(&si->free_clusters) &&
> - offset != cluster_list_first(&si->free_clusters) &&
> + conflict = !list_empty(&si->free_clusters) &&
> + offset != first - si->cluster_info &&
offset != cluster_index(si, first) ?
> cluster_is_free(&si->cluster_info[offset]);
>
> if (!conflict)
> @@ -655,10 +548,10 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> cluster = this_cpu_ptr(si->percpu_cluster);
> tmp = cluster->next[order];
> if (tmp == SWAP_NEXT_INVALID) {
> - if (!cluster_list_empty(&si->free_clusters)) {
> - tmp = cluster_next(&si->free_clusters.head) *
> - SWAPFILE_CLUSTER;
> - } else if (!cluster_list_empty(&si->discard_clusters)) {
> + if (!list_empty(&si->free_clusters)) {
> + ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
> + } else if (!list_empty(&si->discard_clusters)) {
> /*
> * we don't have free cluster but have some clusters in
> * discarding, do discard now and reclaim them, then
> @@ -1070,8 +963,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>
> ci = lock_cluster(si, offset);
> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> - cluster_set_count_flag(ci, 0, 0);
> - free_cluster(si, idx);
> + ci->count = 0;
> + ci->flags = 0;
> + free_cluster(si, ci);
> unlock_cluster(ci);
> swap_range_free(si, offset, SWAPFILE_CLUSTER);
> }
> @@ -1344,7 +1238,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> count = p->swap_map[offset];
> VM_BUG_ON(count != SWAP_HAS_CACHE);
> p->swap_map[offset] = 0;
> - dec_cluster_info_page(p, p->cluster_info, offset);
> + dec_cluster_info_page(p, ci);
> unlock_cluster(ci);
>
> mem_cgroup_uncharge_swap(entry, 1);
> @@ -3022,8 +2916,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
>
> nr_good_pages = maxpages - 1; /* omit header page */
>
> - cluster_list_init(&p->free_clusters);
> - cluster_list_init(&p->discard_clusters);
> + INIT_LIST_HEAD(&p->free_clusters);
> + INIT_LIST_HEAD(&p->discard_clusters);
>
> for (i = 0; i < swap_header->info.nr_badpages; i++) {
> unsigned int page_nr = swap_header->info.badpages[i];
> @@ -3074,14 +2968,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
> for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
> j = (k + col) % SWAP_CLUSTER_COLS;
> for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
> + struct swap_cluster_info *ci;
> idx = i * SWAP_CLUSTER_COLS + j;
> + ci = cluster_info + idx;
> if (idx >= nr_clusters)
> continue;
> - if (cluster_count(&cluster_info[idx]))
> + if (ci->count)
> continue;
> - cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
> - cluster_list_add_tail(&p->free_clusters, cluster_info,
> - idx);
> + ci->flags = CLUSTER_FLAG_FREE;
> + list_add_tail(&ci->list, &p->free_clusters);
> }
> }
> return nr_extents;
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-17 15:41 ` Chris Li
@ 2024-07-18 7:53 ` Huang, Ying
2024-07-19 10:30 ` Ryan Roberts
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-18 7:53 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 16/07/2024 23:46, Chris Li wrote:
>> > On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>
>> >> On 11/07/2024 08:29, Chris Li wrote:
>> >>> Track the nonfull cluster as well as the empty cluster
>> >>> on lists. Each order has one nonfull cluster list.
>> >>>
>> >>> The cluster will remember which order it was used during
>> >>> new cluster allocation.
>> >>>
>> >>> When the cluster has free entry, add to the nonfull[order]
>> >>> list. When the free cluster list is empty, also allocate
>> >>> from the nonempty list of that order.
>> >>>
>> >>> This improves the mTHP swap allocation success rate.
>> >>>
>> >>> There are limitations if the distribution of numbers of
>> >>> different orders of mTHP changes a lot. e.g. there are a lot
>> >>> of nonfull cluster assign to order A while later time there
>> >>> are a lot of order B allocation while very little allocation
>> >>> in order A. Currently the cluster used by order A will not
>> >>> reused by order B unless the cluster is 100% empty.
>> >>>
>> >>> Signed-off-by: Chris Li <chrisl@kernel.org>
>> >>> ---
>> >>> include/linux/swap.h | 4 ++++
>> >>> mm/swapfile.c | 34 +++++++++++++++++++++++++++++++---
>> >>> 2 files changed, 35 insertions(+), 3 deletions(-)
>> >>>
>> >>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> >>> index e9be95468fc7..db8d6000c116 100644
>> >>> --- a/include/linux/swap.h
>> >>> +++ b/include/linux/swap.h
>> >>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>> >>> */
>> >>> u16 count;
>> >>> u8 flags;
>> >>> + u8 order;
>> >>> struct list_head list;
>> >>> };
>> >>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>> >>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>> >>>
>> >>>
>> >>> /*
>> >>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>> >>> unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
>> >>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>> >>> struct list_head free_clusters; /* free clusters list */
>> >>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>> >>> + /* list of cluster that contains at least one free slot */
>> >>> unsigned int lowest_bit; /* index of first free in swap_map */
>> >>> unsigned int highest_bit; /* index of last free in swap_map */
>> >>> unsigned int pages; /* total of usable pages of swap */
>> >>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>> >>> index f70d25005d2c..e13a33664cfa 100644
>> >>> --- a/mm/swapfile.c
>> >>> +++ b/mm/swapfile.c
>> >>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>> >>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>> >>> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>> >>>
>> >>> - list_add_tail(&ci->list, &si->discard_clusters);
>> >>> + if (ci->flags)
>> >>
>> >> I'm not sure this is future proof; what happens if a flag is added in future
>> >> that does not indicate that the cluster is on a list. Perhaps explicitly check
>> >> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
>> >
>> > Currently flags are only used to track which list it is on.
>>
>> Yes, I get that it works correctly at the moment. I just don't think it's wise
>> for the code to assume that any flag being set means its on a list; that feels
>> fragile for future.
>
> ACK.
>
>>
>> > BTW, this
>> > line has changed to check for explicite which list in patch 3 the big
>> > rewrite. I can move that line change to patch 2 if you want.
>>
>> That would get my vote; let's make every patch as good as it can be.
>
> Done.
>
>>
>> >
>> >>
>> >>> + list_move_tail(&ci->list, &si->discard_clusters);
>> >>> + else
>> >>> + list_add_tail(&ci->list, &si->discard_clusters);
>> >>> + ci->flags = 0;
>> >>
>> >> Bug: (I think?) the cluster ends up on the discard_clusters list and
>> >> swap_do_scheduled_discard() calls __free_cluster() which will then call
>> >
>> > swap_do_scheduled_discard() delete the entry from discard list.
>>
>> Ahh yes, my bad!
>>
>> > The flag does not track the discard list state.
>> >
>> >> list_add_tail() to put it on the free_clusters list. But since it is on the
>> >> discard_list at that point, shouldn't it call list_move_tail()?
>> >
>> > See above. Call list_move_tail() would be a mistake.
>> >
>> >>
>> >>> schedule_work(&si->discard_work);
>> >>> }
>> >>>
>> >>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>> >>> {
>> >>> + if (ci->flags & CLUSTER_FLAG_NONFULL)
>> >>> + list_move_tail(&ci->list, &si->free_clusters);
>> >>> + else
>> >>> + list_add_tail(&ci->list, &si->free_clusters);
>> >>> ci->flags = CLUSTER_FLAG_FREE;
>> >>> - list_add_tail(&ci->list, &si->free_clusters);
>> >>> }
>> >>>
>> >>> /*
>> >>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>> >>> ci->count--;
>> >>>
>> >>> if (!ci->count)
>> >>> - free_cluster(p, ci);
>> >>> + return free_cluster(p, ci);
>> >>
>> >> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>> >> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>> >> into an `else if`.
>> >
>> > I try to avoid 'else if' if possible.
>> > Changed to
>> > if (!ci->count) {
>> > free_cluster(p, ci);
>> > return;
>> > }
>>
>> ok
>>
>> >
>> >>
>> >>> +
>> >>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>
>> >> I find the transitions when you add and remove a cluster from the
>> >> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >> to the list whenever there is at least one free swap entry if not already on the
>> >> list. But you take it off the list when assigning it as the current cluster for
>> >> a cpu in scan_swap_map_try_ssd_cluster().
>> >>
>> >> So you could have this situation:
>> >>
>> >> - cpuA allocs cluster from free list (exclusive to that cpu)
>> >> - cpuA allocs 1 swap entry from current cluster
>> >> - swap entry is freed; cluster added to nonfull_clusters
>> >> - cpuB "allocs" cluster from nonfull_clusters
>> >>
>> >> At this point both cpuA and cpuB share the same cluster as their current
>> >> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >> allocation time (when removed from free_list) and only remove it from the
>> >
>> > The big rewrite on patch 3 does that, taking it off the free list and
>> > moving it into nonfull.
>>
>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> refactoring separated from behavioural changes.
>
> It is not a refactoring. It is a big rewrite of the swap allocator
> using the cluster. Behavior change is expected. The goal is completely
> removing the brute force scanning of swap_map[] array for cluster swap
> allocation.
>
>>
>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> deferring review. But sounds like it is actually required to realize the test
>> results quoted on the cover letter?
>
> Yes, required because it handles the previous fall out case try_ssd()
> failed. This big rewrite has gone through a lot of testing and bug
> fix. It is pretty stable now. The only reason I keep it as RFC is
> because it is not feature complete. Currently it does not do swap
> cache reclaim. The next version will have swap cache reclaim and
> remove the RFC.
>
>>
>> > I am only making the minimal change in this step so the big rewrite can land.
>> >
>> >> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >> of just sometimes. You would likely want to move the cluster to the end of the
>> >> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >> chances of multiple CPUs using the same cluster.
>> >
>> > For nonfull clusters it is less important to avoid multiple CPU
>> > sharing the cluster. Because the cluster already has previous swap
>> > entries allocated from the previous CPU.
>>
>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>
> That happens to exist per cpu next pointer already. When the other CPU
> advances to the next cluster pointer, it can cross with the other
> CPU's next cluster pointer.
No. si->percpu_cluster[cpu].next will stay within the current per cpu
cluster only. If it doesn't do that, we should fix it.
I agree with Ryan that we should make the per cpu cluster handling correct. A
cluster that is in use as a per cpu cluster shouldn't be put on the nonfull
list. When we scan to the end of a per cpu cluster, we can put the cluster on
the nonfull list if necessary. And we should make it correct in this patch
instead of later in the series. I understand that you want to keep the patch
itself simple, but it's important to keep the code simple to understand too.
A consistent design choice will do that.
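To make the idea concrete, here is a rough sketch (illustrative only, not the
posted code; the function names below are made up):

/*
 * Sketch of the "exclusive while per-cpu" policy.  A cluster isolated as
 * some CPU's current cluster sits on no list at all; it only becomes
 * visible to other CPUs again when that CPU has scanned to its end.
 */
static struct swap_cluster_info *isolate_cluster(struct swap_info_struct *si,
						 int order)
{
	struct swap_cluster_info *ci = NULL;

	if (!list_empty(&si->free_clusters))
		ci = list_first_entry(&si->free_clusters,
				      struct swap_cluster_info, list);
	else if (!list_empty(&si->nonfull_clusters[order]))
		ci = list_first_entry(&si->nonfull_clusters[order],
				      struct swap_cluster_info, list);
	if (!ci)
		return NULL;

	/* Off every list: no other CPU can pick it as its current cluster. */
	list_del_init(&ci->list);
	ci->flags = 0;
	ci->order = order;
	return ci;
}

static void release_cluster(struct swap_info_struct *si,
			    struct swap_cluster_info *ci)
{
	/* The owning CPU is done with it; make it visible to others again. */
	if (ci->count < SWAPFILE_CLUSTER) {
		list_add_tail(&ci->list, &si->nonfull_clusters[ci->order]);
		ci->flags = CLUSTER_FLAG_NONFULL;
	}
}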
>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>> share the cluster when you really need to, but try to avoid it if there are
>> other options, and I think moving the cluster to the end of the list might be a
>> way to help that?
>
> Simply moving to the end of the list can create a possible deadloop
> when all clusters have been scanned and not available swap range
> found.
This is another reason that we should put the cluster in
nonfull_clusters[order--] if there are no free swap entries of "order" left
in the cluster. It makes the design more complex to keep it in
nonfull_clusters[order].
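Roughly (a sketch only, not the posted code; cluster_has_free_range() is a
hypothetical helper that checks for an aligned run of 1 << order free slots):

/*
 * When a cluster can no longer serve an order-N allocation, demote it to a
 * lower-order nonfull list, so the order-N scan never revisits it and
 * therefore cannot loop over it forever.
 */
static void requeue_nonfull(struct swap_info_struct *si,
			    struct swap_cluster_info *ci)
{
	while (ci->order > 0 &&
	       !cluster_has_free_range(si, ci, ci->order))
		ci->order--;

	list_add_tail(&ci->list, &si->nonfull_clusters[ci->order]);
	ci->flags = CLUSTER_FLAG_NONFULL;
}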
> We have tried many different approaches including moving to the end of
> the list. It can cause more fragmentation because each CPU allocates
> their swap slot cache (64 entries) from a different cluster.
>
>> > Those behaviors will be fine
>> > tuned after the patch 3 big rewrite. Try to make this patch simple.
>
> Again, I want to keep it simple here so patch 3 can land.
>
>> >> Another potential optimization (which was in my hacked version IIRC) is to only
>> >> add/remove from nonfull list when `total - count` crosses the (1 << order)
>> >> boundary rather than when becoming completely full. You definitely won't be able
>> >> to allocate order-2 if there are only 3 pages available, for example.
>> >
>> > That is in patch 3 as well. This patch is just doing the bare minimum
>> > to introduce the nonfull list.
>> >
>> >>
>> >>> + ci->flags |= CLUSTER_FLAG_NONFULL;
>> >>> + }
>> >>> }
>> >>>
>> >>> /*
>> >>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>> >>> if (tmp == SWAP_NEXT_INVALID) {
>> >>> if (!list_empty(&si->free_clusters)) {
>> >>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>> >>> + list_del(&ci->list);
>> >>> + spin_lock(&ci->lock);
>> >>> + ci->order = order;
>> >>> + ci->flags = 0;
>> >>> + spin_unlock(&ci->lock);
>> >>> + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> >>> + } else if (!list_empty(&si->nonfull_clusters[order])) {
>> >>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>> >>> + list_del(&ci->list);
>> >>> + spin_lock(&ci->lock);
>> >>> + ci->flags = 0;
>> >>> + spin_unlock(&ci->lock);
>> >>> tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>> >>> } else if (!list_empty(&si->discard_clusters)) {
>> >>> /*
>> >>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>> >>> ci = lock_cluster(si, offset);
>> >>> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>> >>> ci->count = 0;
>> >>> + ci->order = 0;
>> >>> ci->flags = 0;
>> >>
>> >> Wonder if it would be better to put this in __free_cluster()?
>> >
>> > Both flags and order were moved to __free_cluster() in patch 3 of this
>> > series. The order is best assigned together with flags when the
>> > cluster changes the list.
>> >
>> > Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>>
>> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
>> the series? I'll try to take a look at it today.
>
> Yes, it is the cluster swap allocator big rewrite.
>
> Thank you for taking a look.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-18 7:53 ` Huang, Ying
@ 2024-07-19 10:30 ` Ryan Roberts
2024-07-22 2:14 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-19 10:30 UTC (permalink / raw)
To: Huang, Ying, Chris Li
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On 18/07/2024 08:53, Huang, Ying wrote:
> Chris Li <chrisl@kernel.org> writes:
>
>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 16/07/2024 23:46, Chris Li wrote:
>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>>> Track the nonfull cluster as well as the empty cluster
>>>>>> on lists. Each order has one nonfull cluster list.
>>>>>>
>>>>>> The cluster will remember which order it was used during
>>>>>> new cluster allocation.
>>>>>>
>>>>>> When the cluster has free entry, add to the nonfull[order]
>>>>>> list. When the free cluster list is empty, also allocate
>>>>>> from the nonempty list of that order.
>>>>>>
>>>>>> This improves the mTHP swap allocation success rate.
>>>>>>
>>>>>> There are limitations if the distribution of numbers of
>>>>>> different orders of mTHP changes a lot. e.g. there are a lot
>>>>>> of nonfull cluster assign to order A while later time there
>>>>>> are a lot of order B allocation while very little allocation
>>>>>> in order A. Currently the cluster used by order A will not
>>>>>> reused by order B unless the cluster is 100% empty.
>>>>>>
>>>>>> Signed-off-by: Chris Li <chrisl@kernel.org>
>>>>>> ---
>>>>>> include/linux/swap.h | 4 ++++
>>>>>> mm/swapfile.c | 34 +++++++++++++++++++++++++++++++---
>>>>>> 2 files changed, 35 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>>>>>> index e9be95468fc7..db8d6000c116 100644
>>>>>> --- a/include/linux/swap.h
>>>>>> +++ b/include/linux/swap.h
>>>>>> @@ -254,9 +254,11 @@ struct swap_cluster_info {
>>>>>> */
>>>>>> u16 count;
>>>>>> u8 flags;
>>>>>> + u8 order;
>>>>>> struct list_head list;
>>>>>> };
>>>>>> #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
>>>>>> +#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
>>>>>>
>>>>>>
>>>>>> /*
>>>>>> @@ -296,6 +298,8 @@ struct swap_info_struct {
>>>>>> unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
>>>>>> struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
>>>>>> struct list_head free_clusters; /* free clusters list */
>>>>>> + struct list_head nonfull_clusters[SWAP_NR_ORDERS];
>>>>>> + /* list of cluster that contains at least one free slot */
>>>>>> unsigned int lowest_bit; /* index of first free in swap_map */
>>>>>> unsigned int highest_bit; /* index of last free in swap_map */
>>>>>> unsigned int pages; /* total of usable pages of swap */
>>>>>> diff --git a/mm/swapfile.c b/mm/swapfile.c
>>>>>> index f70d25005d2c..e13a33664cfa 100644
>>>>>> --- a/mm/swapfile.c
>>>>>> +++ b/mm/swapfile.c
>>>>>> @@ -361,14 +361,21 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
>>>>>> memset(si->swap_map + idx * SWAPFILE_CLUSTER,
>>>>>> SWAP_MAP_BAD, SWAPFILE_CLUSTER);
>>>>>>
>>>>>> - list_add_tail(&ci->list, &si->discard_clusters);
>>>>>> + if (ci->flags)
>>>>>
>>>>> I'm not sure this is future proof; what happens if a flag is added in future
>>>>> that does not indicate that the cluster is on a list. Perhaps explicitly check
>>>>> CLUSTER_FLAG_NONFULL? Or `if (!list_empty(&ci->list))`.
>>>>
>>>> Currently flags are only used to track which list it is on.
>>>
>>> Yes, I get that it works correctly at the moment. I just don't think it's wise
>>> for the code to assume that any flag being set means its on a list; that feels
>>> fragile for future.
>>
>> ACK.
>>
>>>
>>>> BTW, this
>>>> line has changed to check for explicite which list in patch 3 the big
>>>> rewrite. I can move that line change to patch 2 if you want.
>>>
>>> That would get my vote; let's make every patch as good as it can be.
>>
>> Done.
>>
>>>
>>>>
>>>>>
>>>>>> + list_move_tail(&ci->list, &si->discard_clusters);
>>>>>> + else
>>>>>> + list_add_tail(&ci->list, &si->discard_clusters);
>>>>>> + ci->flags = 0;
>>>>>
>>>>> Bug: (I think?) the cluster ends up on the discard_clusters list and
>>>>> swap_do_scheduled_discard() calls __free_cluster() which will then call
>>>>
>>>> swap_do_scheduled_discard() delete the entry from discard list.
>>>
>>> Ahh yes, my bad!
>>>
>>>> The flag does not track the discard list state.
>>>>
>>>>> list_add_tail() to put it on the free_clusters list. But since it is on the
>>>>> discard_list at that point, shouldn't it call list_move_tail()?
>>>>
>>>> See above. Call list_move_tail() would be a mistake.
>>>>
>>>>>
>>>>>> schedule_work(&si->discard_work);
>>>>>> }
>>>>>>
>>>>>> static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
>>>>>> {
>>>>>> + if (ci->flags & CLUSTER_FLAG_NONFULL)
>>>>>> + list_move_tail(&ci->list, &si->free_clusters);
>>>>>> + else
>>>>>> + list_add_tail(&ci->list, &si->free_clusters);
>>>>>> ci->flags = CLUSTER_FLAG_FREE;
>>>>>> - list_add_tail(&ci->list, &si->free_clusters);
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> @@ -491,7 +498,12 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
>>>>>> ci->count--;
>>>>>>
>>>>>> if (!ci->count)
>>>>>> - free_cluster(p, ci);
>>>>>> + return free_cluster(p, ci);
>>>>>
>>>>> nit: I'm not sure what the kernel style guide says about this, but I'm not a
>>>>> huge fan of returning void. I'd find it clearer if you just turn the below `if`
>>>>> into an `else if`.
>>>>
>>>> I try to avoid 'else if' if possible.
>>>> Changed to
>>>> if (!ci->count) {
>>>> free_cluster(p, ci);
>>>> return;
>>>> }
>>>
>>> ok
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>
>>>>> I find the transitions when you add and remove a cluster from the
>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>
>>>>> So you could have this situation:
>>>>>
>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>
>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>
>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>> moving it into nonfull.
>>>
>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>> refactoring separated from behavioural changes.
>>
>> It is not a refactoring. It is a big rewrite of the swap allocator
>> using the cluster. Behavior change is expected. The goal is completely
>> removing the brute force scanning of swap_map[] array for cluster swap
>> allocation.
>>
>>>
>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>> deferring review. But sounds like it is actually required to realize the test
>>> results quoted on the cover letter?
>>
>> Yes, required because it handles the previous fall out case try_ssd()
>> failed. This big rewrite has gone through a lot of testing and bug
>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> because it is not feature complete. Currently it does not do swap
>> cache reclaim. The next version will have swap cache reclaim and
>> remove the RFC.
>>
>>>
>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>
>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>> chances of multiple CPUs using the same cluster.
>>>>
>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>> sharing the cluster. Because the cluster already has previous swap
>>>> entries allocated from the previous CPU.
>>>
>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>
>> That happens to exist per cpu next pointer already. When the other CPU
>> advances to the next cluster pointer, it can cross with the other
>> CPU's next cluster pointer.
>
> No. si->percpu_cluster[cpu].next will keep in the current per cpu
> cluster only. If it doesn't do that, we should fix it.
>
> I agree with Ryan that we should make per cpu cluster correct. A
> cluster in per cpu cluster shouldn't be put in nonfull list. When we
> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> list if necessary. And, we should make it correct in this patch instead
> of later in series. I understand that you want to make the patch itself
> simple, but it's important to make code simple to be understood too.
> Consistent design choice will do that.
I think I'm actually arguing for the opposite of what you suggest here.
As I see it, there are 2 possible approaches; either a cluster is always
considered exclusive to a single cpu when it's set as a per-cpu cluster, so it
does not appear on the nonfull list. Or a cluster is considered sharable in this
case, in which case it should be added to the nonfull list.
The code at the moment sort of does both; when a cpu decides to use a cluster in
the nonfull list, it removes it from that list to make it exclusive. But as soon
as a single swap entry is freed from that cluster it is put back on the list.
This neither-one-policy-nor-the-other seems odd to me.
I think Huang, Ying is arguing to keep it always exclusive while installed as a
per-cpu cluster. I was arguing to make it always shared. Perhaps the best
approach is to implement the exclusive policy in this patch (you'd need a flag
to note if any pages were freed while in exclusive use, then when exclusive use
completes, put it back on the nonfull list if the flag was set). Then migrate to
the shared approach as part of the "big rewrite"?
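Something like the below is what I had in mind (just a sketch; the two extra
flags are invented for illustration, not proposed names):

#define CLUSTER_FLAG_EXCLUSIVE	4	/* invented: cluster is some CPU's current cluster */
#define CLUSTER_FLAG_DIRTY	8	/* invented: an entry was freed while exclusive */

static void cluster_entry_freed(struct swap_info_struct *si,
				struct swap_cluster_info *ci)
{
	if (ci->flags & CLUSTER_FLAG_EXCLUSIVE) {
		/* Defer the list move until the owning CPU drops the cluster. */
		ci->flags |= CLUSTER_FLAG_DIRTY;
		return;
	}
	if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &si->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}
}

static void cluster_drop_exclusive(struct swap_info_struct *si,
				   struct swap_cluster_info *ci)
{
	ci->flags &= ~CLUSTER_FLAG_EXCLUSIVE;
	/* Entries were freed while exclusive: put it back on the nonfull list. */
	if (ci->flags & CLUSTER_FLAG_DIRTY) {
		ci->flags &= ~CLUSTER_FLAG_DIRTY;
		list_add_tail(&ci->list, &si->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}
}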
>
>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>> share the cluster when you really need to, but try to avoid it if there are
>>> other options, and I think moving the cluster to the end of the list might be a
>>> way to help that?
>>
>> Simply moving to the end of the list can create a possible deadloop
>> when all clusters have been scanned and not available swap range
>> found.
>
> This is another reason that we should put the cluster in
> nonfull_clusters[order--] if there are no free swap entry with "order"
> in the cluster. It makes design complex to keep it in
> nonfull_clusters[order].
>
>> We have tried many different approaches including moving to the end of
>> the list. It can cause more fragmentation because each CPU allocates
>> their swap slot cache (64 entries) from a different cluster.
>>
>>>> Those behaviors will be fine
>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>
>> Again, I want to keep it simple here so patch 3 can land.
>>
>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>
>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>> to introduce the nonfull list.
>>>>
>>>>>
>>>>>> + ci->flags |= CLUSTER_FLAG_NONFULL;
>>>>>> + }
>>>>>> }
>>>>>>
>>>>>> /*
>>>>>> @@ -550,6 +562,18 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
>>>>>> if (tmp == SWAP_NEXT_INVALID) {
>>>>>> if (!list_empty(&si->free_clusters)) {
>>>>>> ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
>>>>>> + list_del(&ci->list);
>>>>>> + spin_lock(&ci->lock);
>>>>>> + ci->order = order;
>>>>>> + ci->flags = 0;
>>>>>> + spin_unlock(&ci->lock);
>>>>>> + tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>>>>> + } else if (!list_empty(&si->nonfull_clusters[order])) {
>>>>>> + ci = list_first_entry(&si->nonfull_clusters[order], struct swap_cluster_info, list);
>>>>>> + list_del(&ci->list);
>>>>>> + spin_lock(&ci->lock);
>>>>>> + ci->flags = 0;
>>>>>> + spin_unlock(&ci->lock);
>>>>>> tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
>>>>>> } else if (!list_empty(&si->discard_clusters)) {
>>>>>> /*
>>>>>> @@ -964,6 +988,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
>>>>>> ci = lock_cluster(si, offset);
>>>>>> memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
>>>>>> ci->count = 0;
>>>>>> + ci->order = 0;
>>>>>> ci->flags = 0;
>>>>>
>>>>> Wonder if it would be better to put this in __free_cluster()?
>>>>
>>>> Both flags and order were moved to __free_cluster() in patch 3 of this
>>>> series. The order is best assigned together with flags when the
>>>> cluster changes the list.
>>>>
>>>> Thanks for the review. The patch 3 big rewrite is the heavy lifting.
>>>
>>> OK, but sounds like patch 3 isn't really RFC after all, but a crucial part of
>>> the series? I'll try to take a look at it today.
>>
>> Yes, it is the cluster swap allocator big rewrite.
>>
>> Thank you for taking a look.
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-19 10:30 ` Ryan Roberts
@ 2024-07-22 2:14 ` Huang, Ying
2024-07-22 7:51 ` Ryan Roberts
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-22 2:14 UTC (permalink / raw)
To: Ryan Roberts
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
Ryan Roberts <ryan.roberts@arm.com> writes:
> On 18/07/2024 08:53, Huang, Ying wrote:
>> Chris Li <chrisl@kernel.org> writes:
>>
>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 11/07/2024 08:29, Chris Li wrote:
[snip]
>>>>>>> +
>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>
>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>
>>>>>> So you could have this situation:
>>>>>>
>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>
>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>
>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>> moving it into nonfull.
>>>>
>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>> refactoring separated from behavioural changes.
>>>
>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>> using the cluster. Behavior change is expected. The goal is completely
>>> removing the brute force scanning of swap_map[] array for cluster swap
>>> allocation.
>>>
>>>>
>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>> deferring review. But sounds like it is actually required to realize the test
>>>> results quoted on the cover letter?
>>>
>>> Yes, required because it handles the previous fall out case try_ssd()
>>> failed. This big rewrite has gone through a lot of testing and bug
>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>> because it is not feature complete. Currently it does not do swap
>>> cache reclaim. The next version will have swap cache reclaim and
>>> remove the RFC.
>>>
>>>>
>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>
>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>> chances of multiple CPUs using the same cluster.
>>>>>
>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>> entries allocated from the previous CPU.
>>>>
>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>
>>> That happens to exist per cpu next pointer already. When the other CPU
>>> advances to the next cluster pointer, it can cross with the other
>>> CPU's next cluster pointer.
>>
>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>> cluster only. If it doesn't do that, we should fix it.
>>
>> I agree with Ryan that we should make per cpu cluster correct. A
>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> list if necessary. And, we should make it correct in this patch instead
>> of later in series. I understand that you want to make the patch itself
>> simple, but it's important to make code simple to be understood too.
>> Consistent design choice will do that.
>
> I think I'm actually arguing for the opposite of what you suggest here.
Sorry, I misunderstood your words.
> As I see it, there are 2 possible approaches; either a cluster is always
> considered exclusive to a single cpu when it's set as a per-cpu cluster, so it
> does not appear on the nonfull list. Or a cluster is considered sharable in this
> case, in which case it should be added to the nonfull list.
>
> The code at the moment sort of does both; when a cpu decides to use a cluster in
> the nonfull list, it removes it from that list to make it exclusive. But as soon
> as a single swap entry is freed from that cluster it is put back on the list.
> This neither-one-policy-nor-the-other seems odd to me.
>
> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> per-cpu cluster.
Yes.
> I was arguing to make it always shared. Perhaps the best
> approach is to implement the exclusive policy in this patch (you'd need a flag
> to note if any pages were freed while in exclusive use, then when exclusive use
> completes, put it back on the nonfull list if the flag was set). Then migrate to
> the shared approach as part of the "big rewrite"?
>>
>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>> share the cluster when you really need to, but try to avoid it if there are
>>>> other options, and I think moving the cluster to the end of the list might be a
>>>> way to help that?
>>>
>>> Simply moving to the end of the list can create a possible deadloop
>>> when all clusters have been scanned and not available swap range
>>> found.
I also think that the shared approach has a dead loop issue.
>> This is another reason that we should put the cluster in
>> nonfull_clusters[order--] if there are no free swap entry with "order"
>> in the cluster. It makes design complex to keep it in
>> nonfull_clusters[order].
>>
>>> We have tried many different approaches including moving to the end of
>>> the list. It can cause more fragmentation because each CPU allocates
>>> their swap slot cache (64 entries) from a different cluster.
>>>
>>>>> Those behaviors will be fine
>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>
>>> Again, I want to keep it simple here so patch 3 can land.
>>>
>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>
>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>> to introduce the nonfull list.
>>>>>
[snip]
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-22 2:14 ` Huang, Ying
@ 2024-07-22 7:51 ` Ryan Roberts
2024-07-22 8:49 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-22 7:51 UTC (permalink / raw)
To: Huang, Ying
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On 22/07/2024 03:14, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> On 18/07/2024 08:53, Huang, Ying wrote:
>>> Chris Li <chrisl@kernel.org> writes:
>>>
>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>
>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>
> [snip]
>
>>>>>>>> +
>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>
>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>
>>>>>>> So you could have this situation:
>>>>>>>
>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>
>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>
>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>> moving it into nonfull.
>>>>>
>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>> refactoring separated from behavioural changes.
>>>>
>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>> using the cluster. Behavior change is expected. The goal is completely
>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>> allocation.
>>>>
>>>>>
>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>> results quoted on the cover letter?
>>>>
>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>> because it is not feature complete. Currently it does not do swap
>>>> cache reclaim. The next version will have swap cache reclaim and
>>>> remove the RFC.
>>>>
>>>>>
>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>
>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>
>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>> entries allocated from the previous CPU.
>>>>>
>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>
>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>> advances to the next cluster pointer, it can cross with the other
>>>> CPU's next cluster pointer.
>>>
>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>>> cluster only. If it doesn't do that, we should fix it.
>>>
>>> I agree with Ryan that we should make per cpu cluster correct. A
>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>> list if necessary. And, we should make it correct in this patch instead
>>> of later in series. I understand that you want to make the patch itself
>>> simple, but it's important to make code simple to be understood too.
>>> Consistent design choice will do that.
>>
>> I think I'm actually arguing for the opposite of what you suggest here.
>
> Sorry, I misunderstood your words.
>
>> As I see it, there are 2 possible approaches; either a cluster is always
>> considered exclusive to a single cpu when it's set as a per-cpu cluster, so it
>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> case, in which case it should be added to the nonfull list.
>>
>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> as a single swap entry is freed from that cluster it is put back on the list.
>> This neither-one-policy-nor-the-other seems odd to me.
>>
>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> per-cpu cluster.
>
> Yes.
>
>> I was arguing to make it always shared. Perhaps the best
>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> to note if any pages were freed while in exclusive use, then when exclusive use
>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> the shared approach as part of the "big rewrite"?
>>>
>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>> way to help that?
>>>>
>>>> Simply moving to the end of the list can create a possible deadloop
>>>> when all clusters have been scanned and not available swap range
>>>> found.
>
> I also think that the shared approach has a dead loop issue.
What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
won't know when to stop dequeing/requeuing clusters on the nonfull list and will
go forever? That's surely just an implementation issue to solve? It's not a
reason to avoid the design principle; if we agree that maintaining sharability
of the cluster is preferred then the code must be written to guard against the
dead loop problem. It could be done by remembering the first cluster you
dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
to it. (I think holding the si lock will protect against concurrently freeing
the cluster so it should definitely remain in the list?).
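For illustration, the guard could look roughly like the sketch below (cluster_has_room() is a made-up helper and the structure is an assumption based on this patch, not the actual code):

/*
 * Illustrative sketch only: requeue clusters to the tail and stop once we
 * cycle back to the first one we requeued. Assumes si->lock is held so
 * clusters cannot be freed under us, and that ci->list links the nonfull
 * list for this order.
 */
static struct swap_cluster_info *pick_nonfull_cluster(struct swap_info_struct *si,
						      int order)
{
	struct swap_cluster_info *first = NULL, *ci;

	while (!list_empty(&si->nonfull_clusters[order])) {
		ci = list_first_entry(&si->nonfull_clusters[order],
				      struct swap_cluster_info, list);
		if (ci == first)
			break;			/* scanned the whole list once */
		if (!first)
			first = ci;
		/* move to the tail so other CPUs prefer a different cluster */
		list_move_tail(&ci->list, &si->nonfull_clusters[order]);
		if (cluster_has_room(ci, order))	/* hypothetical helper */
			return ci;
	}
	return NULL;
}
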
Which actually makes me wonder; what is the mechanism that prevents the current
per-cpu cluster from being freed? Is that just handled by the conflict detection
thingy? Perhaps that would be better handled with a flag to mark it in use, or
raise the count when it's current. (If Chris has implemented that in the "big
rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>
>>> This is another reason that we should put the cluster in
>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>> in the cluster. It makes design complex to keep it in
>>> nonfull_clusters[order].
>>>
>>>> We have tried many different approaches including moving to the end of
>>>> the list. It can cause more fragmentation because each CPU allocates
>>>> their swap slot cache (64 entries) from a different cluster.
>>>>
>>>>>> Those behaviors will be fine
>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>
>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>
>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>
>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>> to introduce the nonfull list.
>>>>>>
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-22 7:51 ` Ryan Roberts
@ 2024-07-22 8:49 ` Huang, Ying
2024-07-22 9:54 ` Ryan Roberts
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-22 8:49 UTC (permalink / raw)
To: Ryan Roberts
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
Ryan Roberts <ryan.roberts@arm.com> writes:
> On 22/07/2024 03:14, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>> Chris Li <chrisl@kernel.org> writes:
>>>>
>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>
>> [snip]
>>
>>>>>>>>> +
>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>
>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>
>>>>>>>> So you could have this situation:
>>>>>>>>
>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>
>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>
>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>> moving it into nonfull.
>>>>>>
>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>> refactoring separated from behavioural changes.
>>>>>
>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>> allocation.
>>>>>
>>>>>>
>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>> results quoted on the cover letter?
>>>>>
>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>> because it is not feature complete. Currently it does not do swap
>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>> remove the RFC.
>>>>>
>>>>>>
>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>
>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>
>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>> entries allocated from the previous CPU.
>>>>>>
>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>
>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>> advances to the next cluster pointer, it can cross with the other
>>>>> CPU's next cluster pointer.
>>>>
>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>>>> cluster only. If it doesn't do that, we should fix it.
>>>>
>>>> I agree with Ryan that we should make per cpu cluster correct. A
>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>> list if necessary. And, we should make it correct in this patch instead
>>>> of later in series. I understand that you want to make the patch itself
>>>> simple, but it's important to make code simple to be understood too.
>>>> Consistent design choice will do that.
>>>
>>> I think I'm actually arguing for the opposite of what you suggest here.
>>
>> Sorry, I misunderstood your words.
>>
>>> As I see it, there are 2 possible approaches; either a cluster is always
>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>> case, in which case it should be added to the nonfull list.
>>>
>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>> as a single swap entry is freed from that cluster it is put back on the list.
>>> This neither-one-policy-nor-the-other seems odd to me.
>>>
>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>> per-cpu cluster.
>>
>> Yes.
>>
>>> I was arguing to make it always shared. Perhaps the best
>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>> the shared approach as part of the "big rewrite"?
>>>>
>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>> way to help that?
>>>>>
>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>> when all clusters have been scanned and not available swap range
>>>>> found.
>>
>> I also think that the shared approach has dead loop issue.
>
> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
> go forever? That's surely just an implementation issue to solve? It's not a
> reason to avoid the design principle; if we agree that maintaining sharability
> of the cluster is preferred then the code must be written to guard against the
> dead loop problem. It could be done by remembering the first cluster you
> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> to it. (I think holding the si lock will protect against concurrently freeing
> the cluster so it should definitely remain in the list?).
I believe that you can find some way to avoid the dead loop issue,
although your suggestion may hurt performance by looping over a long list
of nonfull clusters. And, I understand that in some situations it may
be better to share clusters among CPUs. So my suggestion is,
- Make swap_cluster_info->order more accurate: don't pretend that we
have free swap entries of that order after we are sure that we
don't.
My question is whether it is really so important to share the per-cpu cluster
among CPUs. I suggest starting with a simple design, that is, the per-CPU
cluster will not be shared among CPUs in most cases.
Another choice for sharing is that, when we run short of free swap space, we
disable the per-CPU cluster and allocate directly from the shared non-full
cluster list.
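For illustration only, that last option could look roughly like this (the 1/8 threshold and this_cpu_cluster() are made-up placeholders, not from any patch):

/*
 * Sketch only: bypass the exclusive per-CPU cluster when the device is
 * nearly full and fall back to the shared nonfull list directly.
 */
static bool swap_nearly_full(struct swap_info_struct *si)
{
	return si->inuse_pages > si->pages - (si->pages >> 3);
}

static struct swap_cluster_info *cluster_for_alloc(struct swap_info_struct *si,
						   int order)
{
	if (!swap_nearly_full(si))
		return this_cpu_cluster(si, order);	/* hypothetical helper */

	/* shared path: take whatever nonfull cluster is available */
	if (!list_empty(&si->nonfull_clusters[order]))
		return list_first_entry(&si->nonfull_clusters[order],
					struct swap_cluster_info, list);
	return NULL;
}
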
> Which actually makes me wonder; what is the mechanism that prevents the current
> per-cpu cluster from being freed? Is that just handled by the conflict detection
> thingy? Perhaps that would be better handled with a flag to mark it in use, or
> raise count when its current. (If Chris has implemented that in the "big
> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
Yes. We may need a flag for that.
>>
>>>> This is another reason that we should put the cluster in
>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>> in the cluster. It makes design complex to keep it in
>>>> nonfull_clusters[order].
>>>>
>>>>> We have tried many different approaches including moving to the end of
>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>
>>>>>>> Those behaviors will be fine
>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>
>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>
>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>
>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>> to introduce the nonfull list.
>>>>>>>
>>
>> [snip]
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-22 8:49 ` Huang, Ying
@ 2024-07-22 9:54 ` Ryan Roberts
2024-07-23 6:27 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Ryan Roberts @ 2024-07-22 9:54 UTC (permalink / raw)
To: Huang, Ying
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On 22/07/2024 09:49, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> On 22/07/2024 03:14, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>
>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>
>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>
>>> [snip]
>>>
>>>>>>>>>> +
>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>
>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>
>>>>>>>>> So you could have this situation:
>>>>>>>>>
>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>
>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>
>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>> moving it into nonfull.
>>>>>>>
>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>> refactoring separated from behavioural changes.
>>>>>>
>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>> allocation.
>>>>>>
>>>>>>>
>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>> results quoted on the cover letter?
>>>>>>
>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>> remove the RFC.
>>>>>>
>>>>>>>
>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>
>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>
>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>> entries allocated from the previous CPU.
>>>>>>>
>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>
>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>> CPU's next cluster pointer.
>>>>>
>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>> cluster only. If it doesn't do that, we should fix it.
>>>>>
>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>> list if necessary. And, we should make it correct in this patch instead
>>>>> of later in series. I understand that you want to make the patch itself
>>>>> simple, but it's important to make code simple to be understood too.
>>>>> Consistent design choice will do that.
>>>>
>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>
>>> Sorry, I misunderstood your words.
>>>
>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>> case, in which case it should be added to the nonfull list.
>>>>
>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>
>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>> per-cpu cluster.
>>>
>>> Yes.
>>>
>>>> I was arguing to make it always shared. Perhaps the best
>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>> the shared approach as part of the "big rewrite"?
>>>>>
>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>> way to help that?
>>>>>>
>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>> when all clusters have been scanned and not available swap range
>>>>>> found.
>>>
>>> I also think that the shared approach has dead loop issue.
>>
>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>> go forever? That's surely just an implementation issue to solve? It's not a
>> reason to avoid the design principle; if we agree that maintaining sharability
>> of the cluster is preferred then the code must be written to guard against the
>> dead loop problem. It could be done by remembering the first cluster you
>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> to it. (I think holding the si lock will protect against concurrently freeing
>> the cluster so it should definitely remain in the list?).
>
> I believe that you can find some way to avoid the dead loop issue,
> although your suggestion may kill the performance via looping a long list
> of nonfull clusters.
I don't agree; if the clusters are considered exclusive (i.e. removed from the
list when made current for a cpu), that only reduces the size of the list by a
maximum of the number of CPUs in the system, which I suspect is pretty small
compared to the number of nonfull clusters.
> And, I understand that in some situations it may
> be better to share clusters among CPUs. So my suggestion is,
>
> - Make swap_cluster_info->order more accurate, don't pretend that we
> have free swap entries with that order even after we are sure that we
> haven't.
Is this patch pretending that today? I don't think so? But I agree that a
cluster should only be on the per-order nonfull list if we know there are at
least enough free swap entries in that cluster to cover the order. Of course
that doesn't tell us for sure because they may not be contiguous.
>
> My question is whether it's so important to share the per-cpu cluster
> among CPUs?
My rationale for sharing is that the preference previously has been to favour
efficient use of swap space; we don't want to fail a request for allocation of a
given order if there are actually slots available just because they have been
reserved by another CPU. And I'm still asserting that it should be ~zero cost to
do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
actually help improve allocation success, then I'm happy to take the exclusive
approach.
> I suggest to start with simple design, that is, per-CPU
> cluster will not be shared among CPUs in most cases.
I'm all for starting simple; I think that's what I already proposed (exclusive
in this patch, then shared in the "big rewrite"). I'm just objecting to the
current half-and-half policy in this patch.
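For illustration, the exclusive policy with the "freed while exclusive" flag could look roughly like this (a sketch; CLUSTER_FLAG_EXCLUSIVE and CLUSTER_FLAG_FREED_WHILE_EXCL are invented names, only CLUSTER_FLAG_NONFULL exists in the patch):

/*
 * Sketch of the exclusive policy: while a cluster is a CPU's current
 * cluster it stays off the nonfull list; a free only records a flag, and
 * the cluster is requeued when the CPU drops it. si->lock held.
 */
static void note_free_in_cluster(struct swap_info_struct *si,
				 struct swap_cluster_info *ci)
{
	if (ci->flags & CLUSTER_FLAG_EXCLUSIVE) {
		ci->flags |= CLUSTER_FLAG_FREED_WHILE_EXCL;
	} else if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
		list_add_tail(&ci->list, &si->nonfull_clusters[ci->order]);
		ci->flags |= CLUSTER_FLAG_NONFULL;
	}
}

/* called when a CPU stops using ci as its current per-cpu cluster */
static void release_percpu_cluster(struct swap_info_struct *si,
				   struct swap_cluster_info *ci)
{
	ci->flags &= ~CLUSTER_FLAG_EXCLUSIVE;
	if (ci->flags & CLUSTER_FLAG_FREED_WHILE_EXCL) {
		ci->flags &= ~CLUSTER_FLAG_FREED_WHILE_EXCL;
		note_free_in_cluster(si, ci);
	}
}
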
>
> Another choice for sharing is when we run short of free swap space, we
> disable per-CPU cluster and allocate from the shared non-full cluster
> list directly.
>
>> Which actually makes me wonder; what is the mechanism that prevents the current
>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>> raise count when its current. (If Chris has implemented that in the "big
>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>
> Yes. We may need a flag for that.
>
>>>
>>>>> This is another reason that we should put the cluster in
>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>> in the cluster. It makes design complex to keep it in
>>>>> nonfull_clusters[order].
>>>>>
>>>>>> We have tried many different approaches including moving to the end of
>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>
>>>>>>>> Those behaviors will be fine
>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>
>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>
>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>
>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>> to introduce the nonfull list.
>>>>>>>>
>>>
>>> [snip]
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-22 9:54 ` Ryan Roberts
@ 2024-07-23 6:27 ` Huang, Ying
2024-07-24 8:33 ` Ryan Roberts
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-23 6:27 UTC (permalink / raw)
To: Ryan Roberts
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
Ryan Roberts <ryan.roberts@arm.com> writes:
> On 22/07/2024 09:49, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>>
>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>
>>>> [snip]
>>>>
>>>>>>>>>>> +
>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>
>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>
>>>>>>>>>> So you could have this situation:
>>>>>>>>>>
>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>
>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>
>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>>> moving it into nonfull.
>>>>>>>>
>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>
>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>>> allocation.
>>>>>>>
>>>>>>>>
>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>>> results quoted on the cover letter?
>>>>>>>
>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>> remove the RFC.
>>>>>>>
>>>>>>>>
>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>
>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>
>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>>> entries allocated from the previous CPU.
>>>>>>>>
>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>
>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>>> CPU's next cluster pointer.
>>>>>>
>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>>> cluster only. If it doesn't do that, we should fix it.
>>>>>>
>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>>> list if necessary. And, we should make it correct in this patch instead
>>>>>> of later in series. I understand that you want to make the patch itself
>>>>>> simple, but it's important to make code simple to be understood too.
>>>>>> Consistent design choice will do that.
>>>>>
>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>
>>>> Sorry, I misunderstood your words.
>>>>
>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>> case, in which case it should be added to the nonfull list.
>>>>>
>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>
>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>> per-cpu cluster.
>>>>
>>>> Yes.
>>>>
>>>>> I was arguing to make it always shared. Perhaps the best
>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>> the shared approach as part of the "big rewrite"?
>>>>>>
>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>> way to help that?
>>>>>>>
>>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>>> when all clusters have been scanned and not available swap range
>>>>>>> found.
>>>>
>>>> I also think that the shared approach has dead loop issue.
>>>
>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>>> go forever? That's surely just an implementation issue to solve? It's not a
>>> reason to avoid the design principle; if we agree that maintaining sharability
>>> of the cluster is preferred then the code must be written to guard against the
>>> dead loop problem. It could be done by remembering the first cluster you
>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>>> to it. (I think holding the si lock will protect against concurrently freeing
>>> the cluster so it should definitely remain in the list?).
>>
>> I believe that you can find some way to avoid the dead loop issue,
>> although your suggestion may kill the performance via looping a long list
>> of nonfull clusters.
>
> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> list when made current for a cpu), that only reduces the size of the list by a
> maximum of the number of CPUs in the system, which I suspect is pretty small
> compared to the number of nonfull clusters.
Anyway, this depends on details. If we cannot allocate an order-N swap
entry from the cluster, we should remove it from the nonfull list for
order-N (this is the behavior of this patch too). Your original
suggestion appears to be that you want to keep all clusters with order N
on the nonfull list for order-N always, unless the number of free swap
entries is less than 1<<N.
>> And, I understand that in some situations it may
>> be better to share clusters among CPUs. So my suggestion is,
>>
>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> have free swap entries with that order even after we are sure that we
>> haven't.
>
> Is this patch pretending that today? I don't think so?
IIUC, in this patch swap_cluster_info->order is still "N" even if we are
sure that there are no order-N free swap entries in the cluster.
> But I agree that a
> cluster should only be on the per-order nonfull list if we know there are at
> least enough free swap entries in that cluster to cover the order. Of course
> that doesn't tell us for sure because they may not be contiguous.
We can check that when freeing a swap entry, by checking the adjacent swap
entries. IMHO, the performance should be acceptable.
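For illustration, the check on the free path could look roughly like this (a sketch; it assumes the caller already holds the lock that serializes updates to si->swap_map):

/*
 * Sketch: after freeing "offset", check whether the naturally aligned
 * 1 << order block it belongs to is now entirely free, so the cluster
 * can be put back on (or kept off) the order-N nonfull list accordingly.
 */
static bool aligned_block_is_free(struct swap_info_struct *si,
				  unsigned long offset, int order)
{
	unsigned long start = round_down(offset, 1UL << order);
	unsigned long i;

	for (i = 0; i < (1UL << order); i++) {
		if (si->swap_map[start + i])
			return false;
	}
	return true;
}
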
>>
>> My question is whether it's so important to share the per-cpu cluster
>> among CPUs?
>
> My rationale for sharing is that the preference previously has been to favour
> efficient use of swap space; we don't want to fail a request for allocation of a
> given order if there are actually slots available just because they have been
> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> actually help improve allocation success, then I'm happy to take the exclusive
> approach.
>
>> I suggest to start with simple design, that is, per-CPU
>> cluster will not be shared among CPUs in most cases.
>
> I'm all for starting simple; I think that's what I already proposed (exclusive
> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> current half-and-half policy in this patch.
Sounds good to me. We can start with the exclusive solution and evaluate
whether the shared solution is worthwhile.
>>
>> Another choice for sharing is when we run short of free swap space, we
>> disable per-CPU cluster and allocate from the shared non-full cluster
>> list directly.
>>
>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>> raise count when its current. (If Chris has implemented that in the "big
>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>>
>> Yes. We may need a flag for that.
>>
>>>>
>>>>>> This is another reason that we should put the cluster in
>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>>> in the cluster. It makes design complex to keep it in
>>>>>> nonfull_clusters[order].
>>>>>>
>>>>>>> We have tried many different approaches including moving to the end of
>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>>
>>>>>>>>> Those behaviors will be fine
>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>
>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>
>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>
>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>
>>>>
>>>> [snip]
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-23 6:27 ` Huang, Ying
@ 2024-07-24 8:33 ` Ryan Roberts
2024-07-24 22:41 ` Chris Li
2024-07-25 6:53 ` Huang, Ying
0 siblings, 2 replies; 43+ messages in thread
From: Ryan Roberts @ 2024-07-24 8:33 UTC (permalink / raw)
To: Huang, Ying
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On 23/07/2024 07:27, Huang, Ying wrote:
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
>> On 22/07/2024 09:49, Huang, Ying wrote:
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>>>
>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>
>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>>
>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>>
>>>>>>>>>>> So you could have this situation:
>>>>>>>>>>>
>>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>>
>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>>
>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>>>> moving it into nonfull.
>>>>>>>>>
>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>>
>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>>>> allocation.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>>>> results quoted on the cover letter?
>>>>>>>>
>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>>> remove the RFC.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>>
>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>>
>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>>>> entries allocated from the previous CPU.
>>>>>>>>>
>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>>
>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>>>> CPU's next cluster pointer.
>>>>>>>
>>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>>>> cluster only. If it doesn't do that, we should fix it.
>>>>>>>
>>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>>>> list if necessary. And, we should make it correct in this patch instead
>>>>>>> of later in series. I understand that you want to make the patch itself
>>>>>>> simple, but it's important to make code simple to be understood too.
>>>>>>> Consistent design choice will do that.
>>>>>>
>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>>
>>>>> Sorry, I misunderstood your words.
>>>>>
>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>>> case, in which case it should be added to the nonfull list.
>>>>>>
>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>>
>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>>> per-cpu cluster.
>>>>>
>>>>> Yes.
>>>>>
>>>>>> I was arguing to make it always shared. Perhaps the best
>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>>> the shared approach as part of the "big rewrite"?
>>>>>>>
>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>>> way to help that?
>>>>>>>>
>>>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>>>> when all clusters have been scanned and not available swap range
>>>>>>>> found.
>>>>>
>>>>> I also think that the shared approach has dead loop issue.
>>>>
>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>>>> go forever? That's surely just an implementation issue to solve? It's not a
>>>> reason to avoid the design principle; if we agree that maintaining sharability
>>>> of the cluster is preferred then the code must be written to guard against the
>>>> dead loop problem. It could be done by remembering the first cluster you
>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>>>> to it. (I think holding the si lock will protect against concurrently freeing
>>>> the cluster so it should definitely remain in the list?).
>>>
>>> I believe that you can find some way to avoid the dead loop issue,
>>> although your suggestion may kill the performance via looping a long list
>>> of nonfull clusters.
>>
>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> list when made current for a cpu), that only reduces the size of the list by a
>> maximum of the number of CPUs in the system, which I suspect is pretty small
>> compared to the number of nonfull clusters.
>
> Anyway, this depends on details. If we cannot allocate a order-N swap
> entry from the cluster, we should remove it from the nonfull list for
> order-N (This is the behavior of this patch too).
Yes, that's a good point, and I concede it is more difficult to detect that
condition if the cluster is shared. I suspect that with a bit of thinking, we
could find a way though.
> Your original
> suggestion appears like that you want to keep all cluster with order-N
> on the nonfull list for order-N always unless the number of free swap
> entry is less than 1<<N.
Well, I think that's certainly one of the conditions for removing it. But I agree
that if a full scan of the cluster has been performed and no swap entries have
been freed since the scan started, then it should also be removed from the list.
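Put as a predicate, the removal condition could look roughly like this (a sketch; cluster_free_count() and the free_seq/scan_seq fields are invented ways to express "no frees since the scan started"):

/*
 * Sketch: drop ci from nonfull_clusters[order] when either condition
 * holds. free_seq would bump on every free in the cluster; scan_seq
 * records its value when a full scan of the cluster began.
 */
static bool should_leave_nonfull(struct swap_cluster_info *ci, int order)
{
	if (cluster_free_count(ci) < (1 << order))
		return true;
	/* a full scan found no room and nothing was freed meanwhile */
	return ci->scan_done && ci->scan_seq == ci->free_seq;
}
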
>
>>> And, I understand that in some situations it may
>>> be better to share clusters among CPUs. So my suggestion is,
>>>
>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>>> have free swap entries with that order even after we are sure that we
>>> haven't.
>>
>> Is this patch pretending that today? I don't think so?
>
> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> sure that there are no order-N free swap entry in the cluster.
Oh I see what you mean. I think you and Chris already discussed this? IIRC
Chris's point was that if you move that cluster to N-1, eventually all clusters
are for order-0 and you have no means of allocating high orders until a whole
cluster becomes free. That logic certainly makes sense to me, so I think it's
better for swap_cluster_info->order to remain static while the cluster is
allocated. (I only skimmed that conversation, so apologies if I got the
conclusion wrong!)
>
>> But I agree that a
>> cluster should only be on the per-order nonfull list if we know there are at
>> least enough free swap entries in that cluster to cover the order. Of course
>> that doesn't tell us for sure because they may not be contiguous.
>
> We can check that when free swap entry via checking adjacent swap
> entries. IMHO, the performance should be acceptable.
Would you then use the result of that scanning to "promote" a cluster's order?
e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
a separate change on top of what Chris is doing here. For high orders there
could be quite a bit of scanning required in the worst case for every page that
gets freed.
>
>>>
>>> My question is whether it's so important to share the per-cpu cluster
>>> among CPUs?
>>
>> My rationale for sharing is that the preference previously has been to favour
>> efficient use of swap space; we don't want to fail a request for allocation of a
>> given order if there are actually slots available just because they have been
>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> actually help improve allocation success, then I'm happy to take the exclusive
>> approach.
>>
>>> I suggest to start with simple design, that is, per-CPU
>>> cluster will not be shared among CPUs in most cases.
>>
>> I'm all for starting simple; I think that's what I already proposed (exclusive
>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> current half-and-half policy in this patch.
>
> Sounds good to me. We can start with exclusive solution and evaluate
> whether shared solution is good.
Yep. And also evaluate the dynamic order inc/dec idea too...
>
>>>
>>> Another choice for sharing is when we run short of free swap space, we
>>> disable per-CPU cluster and allocate from the shared non-full cluster
>>> list directly.
>>>
>>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>>> raise count when its current. (If Chris has implemented that in the "big
>>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>>>
>>> Yes. We may need a flag for that.
>>>
>>>>>
>>>>>>> This is another reason that we should put the cluster in
>>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>>>> in the cluster. It makes design complex to keep it in
>>>>>>> nonfull_clusters[order].
>>>>>>>
>>>>>>>> We have tried many different approaches including moving to the end of
>>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>>>
>>>>>>>>>> Those behaviors will be fine
>>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>>
>>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>>
>>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>>
>>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>>
>>>>>
>>>>> [snip]
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-24 8:33 ` Ryan Roberts
@ 2024-07-24 22:41 ` Chris Li
2024-07-25 6:43 ` Huang, Ying
2024-07-25 6:53 ` Huang, Ying
1 sibling, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-24 22:41 UTC (permalink / raw)
To: Ryan Roberts
Cc: Huang, Ying, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Hi Ryan and Ying,
Sorry I was busy. I am catching up on the emails now.
On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 23/07/2024 07:27, Huang, Ying wrote:
> > Ryan Roberts <ryan.roberts@arm.com> writes:
> >
> >> On 22/07/2024 09:49, Huang, Ying wrote:
> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>
> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>>>
> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
> >>>>>>> Chris Li <chrisl@kernel.org> writes:
> >>>>>>>
> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>
> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> >>>>>
> >>>>> [snip]
> >>>>>
> >>>>>>>>>>>> +
> >>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >>>>>>>>>>>
> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
> >>>>>>>>>>>
> >>>>>>>>>>> So you could have this situation:
> >>>>>>>>>>>
> >>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
> >>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
> >>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
> >>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
> >>>>>>>>>>>
> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
> >>>>>>>>>>
> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
> >>>>>>>>>> moving it into nonfull.
> >>>>>>>>>
> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
> >>>>>>>>> refactoring separated from behavioural changes.
> >>>>>>>>
> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
> >>>>>>>> allocation.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
> >>>>>>>>> results quoted on the cover letter?
> >>>>>>>>
> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
> >>>>>>>> because it is not feature complete. Currently it does not do swap
> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
> >>>>>>>> remove the RFC.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
> >>>>>>>>>>
> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
> >>>>>>>>>>
> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
> >>>>>>>>>> entries allocated from the previous CPU.
> >>>>>>>>>
> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
> >>>>>>>>
> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
> >>>>>>>> advances to the next cluster pointer, it can cross with the other
> >>>>>>>> CPU's next cluster pointer.
> >>>>>>>
> >>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
> >>>>>>> cluster only. If it doesn't do that, we should fix it.
> >>>>>>>
> >>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> >>>>>>> list if necessary. And, we should make it correct in this patch instead
> >>>>>>> of later in series. I understand that you want to make the patch itself
> >>>>>>> simple, but it's important to make code simple to be understood too.
> >>>>>>> Consistent design choice will do that.
> >>>>>>
> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
> >>>>>
> >>>>> Sorry, I misunderstood your words.
> >>>>>
> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
> >>>>>> case, in which case it should be added to the nonfull list.
> >>>>>>
> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
> >>>>>>
> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> >>>>>> per-cpu cluster.
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>>> I was arguing to make it always shared. Perhaps the best
> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
> >>>>>> the shared approach as part of the "big rewrite"?
> >>>>>>>
> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
> >>>>>>>>> way to help that?
> >>>>>>>>
> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
> >>>>>>>> when all clusters have been scanned and not available swap range
> >>>>>>>> found.
> >>>>>
> >>>>> I also think that the shared approach has dead loop issue.
> >>>>
> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
> >>>> go forever? That's surely just an implementation issue to solve? It's not a
> >>>> reason to avoid the design principle; if we agree that maintaining sharability
> >>>> of the cluster is preferred then the code must be written to guard against the
> >>>> dead loop problem. It could be done by remembering the first cluster you
> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> >>>> to it. (I think holding the si lock will protect against concurrently freeing
> >>>> the cluster so it should definitely remain in the list?).
> >>>
> >>> I believe that you can find some way to avoid the dead loop issue,
> >>> although your suggestion may kill the performance via looping a long list
> >>> of nonfull clusters.
> >>
> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> >> list when made current for a cpu), that only reduces the size of the list by a
> >> maximum of the number of CPUs in the system, which I suspect is pretty small
> >> compared to the number of nonfull clusters.
> >
> > Anyway, this depends on details. If we cannot allocate a order-N swap
> > entry from the cluster, we should remove it from the nonfull list for
> > order-N (This is the behavior of this patch too).
Yes, Kairui implements something like that in the reclaim part of the
patch series, which comes after patch 3. We are heavily testing the
performance and stability of the reclaim patches. May I post the
reclaim patches together with patch 3 for discussion? If you want, we
can discuss re-ordering the patches in a later iteration.
>
> Yes that's a good point, and I conceed it is more difficult to detect that
> condition if the cluster is shared. I suspect that with a bit of thinking, we
> could find a way though.
Kairui's patch series shows good performance numbers that beat the
current swap cache reclaim.
I want to make a point regarding the patch ordering before vs. after
patch 3 (aka the big rewrite).
Previously, scan_swap_map_try_ssd_cluster() only did partial
allocation; it does not successfully allocate a swap entry 100% of the
time. Patch 3 makes the cluster allocation function return a swap
entry 100% of the time. There are no more fallback retry loops
outside of the cluster allocation function. Also, the try_ssd function
does not do swap cache reclaim, while the cluster allocation function
will need to. These two have very different constraints.
Therefore, adding another cluster list head into
scan_swap_map_try_ssd_cluster() would be a wasted investment of
development time, in the sense that the function will need to be
rewritten anyway and the end result is very different.
That is why I want to make this change in a patch after patch 3. There
is also the long test cycle after any modification to make sure the swap
code path is stable. I am not resisting a change of patch order; it
is just that this change can't simply be moved to before patch 3, i.e.
before the big rewrite.
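To make the control-flow difference concrete, here is a minimal
userspace sketch (not the kernel code; toy_try_ssd_cluster() and
toy_cluster_alloc_entry() are hypothetical stand-ins for
scan_swap_map_try_ssd_cluster() and the patch 3 allocator):

#include <stdbool.h>
#include <stdio.h>

/* Pre-patch-3 shape: the try_ssd helper may fail, the caller falls back. */
static bool toy_try_ssd_cluster(long *offset)
{
        *offset = -1;
        return false;           /* no luck; caller must scan swap_map[] itself */
}

/* Post-patch-3 shape: one call that walks the cluster lists (and, later,
 * reclaims swap cache) until it can hand back an entry or a final failure. */
static long toy_cluster_alloc_entry(void)
{
        return 42;
}

int main(void)
{
        long offset;

        if (!toy_try_ssd_cluster(&offset))
                offset = 0;     /* fallback retry loop lives in the caller */

        offset = toy_cluster_alloc_entry();     /* no external retry loop */
        printf("allocated offset %ld\n", offset);
        return 0;
}

The point is only that the retry/fallback and reclaim logic moves inside
the allocator, so code written against the old helper would have to be
rewritten against the new constraints anyway.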
>
> > Your original
> > suggestion appears like that you want to keep all cluster with order-N
> > on the nonfull list for order-N always unless the number of free swap
> > entry is less than 1<<N.
>
> Well I think that's certainly one of the conditions for removing it. But agree
> that if a full scan of the cluster has been performed and no swap entries have
> been freed since the scan started then it should also be removed from the list.
Yes, in a later patch in the series, beyond patch 3, we have the "almost
full" state for a cluster that has been scanned and was not able to
satisfy an order-N allocation.
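Roughly, whether a cluster still belongs on nonfull_clusters[order]
could then be decided like this (a userspace toy with illustrative
names and fields, not the actual swap_cluster_info handling):

#include <stdbool.h>
#include <stdio.h>

struct toy_cluster {
        int count;              /* allocated entries in the cluster */
        int total;              /* entries per cluster, e.g. 512 */
        int order;              /* the order this cluster was assigned */
        bool scanned_no_fit;    /* a full scan found no order-sized hole */
};

static bool on_nonfull_list(const struct toy_cluster *ci)
{
        if (ci->total - ci->count < (1 << ci->order))
                return false;   /* cannot possibly hold an order-sized range */
        if (ci->scanned_no_fit)
                return false;   /* "almost full": parked on a separate list */
        return true;
}

int main(void)
{
        struct toy_cluster ci = { .count = 400, .total = 512, .order = 4 };

        printf("%d\n", on_nonfull_list(&ci));   /* 1: plenty of free slots */
        ci.scanned_no_fit = true;
        printf("%d\n", on_nonfull_list(&ci));   /* 0: treated as almost full */
        return 0;
}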
>
> >
> >>> And, I understand that in some situations it may
> >>> be better to share clusters among CPUs. So my suggestion is,
> >>>
> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >>> have free swap entries with that order even after we are sure that we
> >>> haven't.
> >>
> >> Is this patch pretending that today? I don't think so?
> >
> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> > sure that there are no order-N free swap entry in the cluster.
>
> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> Chris's point was that if you move that cluster to N-1, eventually all clusters
> are for order-0 and you have no means of allocating high orders until a whole
> cluster becomes free. That logic certainly makes sense to me, so think its
> better for swap_cluster_info->order to remain static while the cluster is
> allocated. (I only skimmed that conversation so appologies if I got the
> conclusion wrong!).
Yes, that is the original intent: keep the cluster's order for as long as possible.
>
> >
> >> But I agree that a
> >> cluster should only be on the per-order nonfull list if we know there are at
> >> least enough free swap entries in that cluster to cover the order. Of course
> >> that doesn't tell us for sure because they may not be contiguous.
> >
> > We can check that when free swap entry via checking adjacent swap
> > entries. IMHO, the performance should be acceptable.
>
> Would you then use the result of that scanning to "promote" a cluster's order?
> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> a separate change on top of what Chris is doing here. For high orders there
> could be quite a bit of scanning required in the worst case for every page that
> gets freed.
Right, I feel that is a different set of patches. Even this series is
hard enough to review. The order promotion and demotion is heading
toward a buddy system design. I want to point out that even a buddy
system is not able to handle the case where the swapfile is almost full
and the recently freed swap entries are not contiguous.
We can invest in the buddy system, which doesn't handle all the
fragmentation issues. Or, my preference, go directly to the discontiguous
swap entry. We pay a price for the indirect mapping of swap entries,
but it will solve the fragmentation issue 100%.
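A purely illustrative userspace sketch of that idea (not the actual
design being proposed): each subpage of an mTHP gets its own slot
through a small indirection table, so the slots no longer need to be
contiguous in the swapfile.

#include <stdio.h>

#define MTHP_NR 4       /* subpages of an order-2 mTHP, for example */

/* One swap slot per subpage instead of a single contiguous range. */
struct toy_swap_desc {
        unsigned long slot[MTHP_NR];
};

int main(void)
{
        struct toy_swap_desc d = { .slot = { 7, 130, 131, 513 } };

        /* The price of the indirection: one table lookup per subpage,
         * instead of computing base_slot + i for a contiguous range. */
        for (int i = 0; i < MTHP_NR; i++)
                printf("subpage %d -> swap slot %lu\n", i, d.slot[i]);
        return 0;
}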
>
> >
> >>>
> >>> My question is whether it's so important to share the per-cpu cluster
> >>> among CPUs?
> >>
> >> My rationale for sharing is that the preference previously has been to favour
> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> given order if there are actually slots available just because they have been
> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> approach.
> >>
> >>> I suggest to start with simple design, that is, per-CPU
> >>> cluster will not be shared among CPUs in most cases.
> >>
> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> current half-and-half policy in this patch.
> >
> > Sounds good to me. We can start with exclusive solution and evaluate
> > whether shared solution is good.
>
> Yep. And also evaluate the dynamic order inc/dec idea too...
It is not able to avoid fragmentation 100% of the time. I prefer the
discontiguous swap entry as the next step, which guarantees forward
progress; we will not be stuck in a situation where we are not able to
allocate swap entries due to fragmentation.
Chris
>
> >
> >>>
> >>> Another choice for sharing is when we run short of free swap space, we
> >>> disable per-CPU cluster and allocate from the shared non-full cluster
> >>> list directly.
> >>>
> >>>> Which actually makes me wonder; what is the mechanism that prevents the current
> >>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
> >>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
> >>>> raise count when its current. (If Chris has implemented that in the "big
> >>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
> >>>
> >>> Yes. We may need a flag for that.
> >>>
> >>>>>
> >>>>>>> This is another reason that we should put the cluster in
> >>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
> >>>>>>> in the cluster. It makes design complex to keep it in
> >>>>>>> nonfull_clusters[order].
> >>>>>>>
> >>>>>>>> We have tried many different approaches including moving to the end of
> >>>>>>>> the list. It can cause more fragmentation because each CPU allocates
> >>>>>>>> their swap slot cache (64 entries) from a different cluster.
> >>>>>>>>
> >>>>>>>>>> Those behaviors will be fine
> >>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
> >>>>>>>>
> >>>>>>>> Again, I want to keep it simple here so patch 3 can land.
> >>>>>>>>
> >>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
> >>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
> >>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
> >>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
> >>>>>>>>>>
> >>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
> >>>>>>>>>> to introduce the nonfull list.
> >>>>>>>>>>
> >>>>>
> >>>>> [snip]
> >
> > --
> > Best Regards,
> > Huang, Ying
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-24 22:41 ` Chris Li
@ 2024-07-25 6:43 ` Huang, Ying
2024-07-25 8:09 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-25 6:43 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> Hi Ryan and Ying,
>
> Sorry I was busy. I am catching up on the email now.
>
> On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 23/07/2024 07:27, Huang, Ying wrote:
>> > Ryan Roberts <ryan.roberts@arm.com> writes:
>> >
>> >> On 22/07/2024 09:49, Huang, Ying wrote:
>> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>
>> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>>>
>> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >>>>>>> Chris Li <chrisl@kernel.org> writes:
>> >>>>>>>
>> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>
>> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>>>
>> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >>>>>
>> >>>>> [snip]
>> >>>>>
>> >>>>>>>>>>>> +
>> >>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>>>>>>>>>>
>> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>> >>>>>>>>>>>
>> >>>>>>>>>>> So you could have this situation:
>> >>>>>>>>>>>
>> >>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>> >>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>> >>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>> >>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>> >>>>>>>>>>>
>> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >>>>>>>>>>
>> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >>>>>>>>>> moving it into nonfull.
>> >>>>>>>>>
>> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >>>>>>>>> refactoring separated from behavioural changes.
>> >>>>>>>>
>> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>> >>>>>>>> allocation.
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>> >>>>>>>>> results quoted on the cover letter?
>> >>>>>>>>
>> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> >>>>>>>> because it is not feature complete. Currently it does not do swap
>> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >>>>>>>> remove the RFC.
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >>>>>>>>>>
>> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
>> >>>>>>>>>>
>> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>> >>>>>>>>>> entries allocated from the previous CPU.
>> >>>>>>>>>
>> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>> >>>>>>>>
>> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>> >>>>>>>> advances to the next cluster pointer, it can cross with the other
>> >>>>>>>> CPU's next cluster pointer.
>> >>>>>>>
>> >>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>> >>>>>>> cluster only. If it doesn't do that, we should fix it.
>> >>>>>>>
>> >>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> >>>>>>> list if necessary. And, we should make it correct in this patch instead
>> >>>>>>> of later in series. I understand that you want to make the patch itself
>> >>>>>>> simple, but it's important to make code simple to be understood too.
>> >>>>>>> Consistent design choice will do that.
>> >>>>>>
>> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>> >>>>>
>> >>>>> Sorry, I misunderstood your words.
>> >>>>>
>> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> >>>>>> case, in which case it should be added to the nonfull list.
>> >>>>>>
>> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
>> >>>>>>
>> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> >>>>>> per-cpu cluster.
>> >>>>>
>> >>>>> Yes.
>> >>>>>
>> >>>>>> I was arguing to make it always shared. Perhaps the best
>> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> >>>>>> the shared approach as part of the "big rewrite"?
>> >>>>>>>
>> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>> >>>>>>>>> way to help that?
>> >>>>>>>>
>> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
>> >>>>>>>> when all clusters have been scanned and not available swap range
>> >>>>>>>> found.
>> >>>>>
>> >>>>> I also think that the shared approach has dead loop issue.
>> >>>>
>> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>> >>>> go forever? That's surely just an implementation issue to solve? It's not a
>> >>>> reason to avoid the design principle; if we agree that maintaining sharability
>> >>>> of the cluster is preferred then the code must be written to guard against the
>> >>>> dead loop problem. It could be done by remembering the first cluster you
>> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> >>>> to it. (I think holding the si lock will protect against concurrently freeing
>> >>>> the cluster so it should definitely remain in the list?).
>> >>>
>> >>> I believe that you can find some way to avoid the dead loop issue,
>> >>> although your suggestion may kill the performance via looping a long list
>> >>> of nonfull clusters.
>> >>
>> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> >> list when made current for a cpu), that only reduces the size of the list by a
>> >> maximum of the number of CPUs in the system, which I suspect is pretty small
>> >> compared to the number of nonfull clusters.
>> >
>> > Anyway, this depends on details. If we cannot allocate a order-N swap
>> > entry from the cluster, we should remove it from the nonfull list for
>> > order-N (This is the behavior of this patch too).
>
> Yes, Kairui implements something like that in the reclaim part of the
> patch series. It is after patch 3. We are heavily testing the
> performance and the stability of the reclaim patches. May I post the
> reclaim together with patch 3 for discussion. If you want we can
> discuss the re-order the patch in a later iteration.
>
>>
>> Yes that's a good point, and I conceed it is more difficult to detect that
>> condition if the cluster is shared. I suspect that with a bit of thinking, we
>> could find a way though.
>
> Kaiui has the patch series show a good performance number that beats
> the current swap cache reclaim.
>
> I want to make a point regarding the patch ordering before vs after
> patch 3 (aka the big rewrite).
> Previously, the "san_swap_map_try_ssd_cluster" only did partial
> allocation. It does not sucessfully allocate a swap entry 100% the
> time. The patch 3 makes the cluster allocation function return the
> swap entry 100% of the time. There are no more fallback retry loops
> outside of the cluster allocation function. Also the try_ssd function
> does not do swap cache reclaims while the cluster allocation function
> will need to. These two have very different constraints.
>
> There for, adding different cluster header into
> san_swap_map_try_ssd_cluste will be a lot of waste investment of
> development time in the sense that, that function will need to be
> rewrite any way, the end result is very different.
I am not a big fan of implementing the final solution directly.
Personally, I prefer to improve step by step.
> That is why I want to make this change patch after patch 3. There is
> also the long test cycle after the modification to make sure the swap
> code path is stable. I am not resisting a change of patch orders, it
> is that patch can't directly be removed before patch 3 before the big
> rewrite.
>
>
>>
>> > Your original
>> > suggestion appears like that you want to keep all cluster with order-N
>> > on the nonfull list for order-N always unless the number of free swap
>> > entry is less than 1<<N.
>>
>> Well I think that's certainly one of the conditions for removing it. But agree
>> that if a full scan of the cluster has been performed and no swap entries have
>> been freed since the scan started then it should also be removed from the list.
>
> Yes, in the later patch of patch, beyond patch 3, we have the almost
> full cluster that for the cluster has been scan and not able to
> allocate order N entry.
>
>>
>> >
>> >>> And, I understand that in some situations it may
>> >>> be better to share clusters among CPUs. So my suggestion is,
>> >>>
>> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >>> have free swap entries with that order even after we are sure that we
>> >>> haven't.
>> >>
>> >> Is this patch pretending that today? I don't think so?
>> >
>> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> > sure that there are no order-N free swap entry in the cluster.
>>
>> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> are for order-0 and you have no means of allocating high orders until a whole
>> cluster becomes free. That logic certainly makes sense to me, so think its
>> better for swap_cluster_info->order to remain static while the cluster is
>> allocated. (I only skimmed that conversation so appologies if I got the
>> conclusion wrong!).
>
> Yes, that is the original intent, keep the cluster order as much as possible.
>
>>
>> >
>> >> But I agree that a
>> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> that doesn't tell us for sure because they may not be contiguous.
>> >
>> > We can check that when free swap entry via checking adjacent swap
>> > entries. IMHO, the performance should be acceptable.
>>
>> Would you then use the result of that scanning to "promote" a cluster's order?
>> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> a separate change on top of what Chris is doing here. For high orders there
>> could be quite a bit of scanning required in the worst case for every page that
>> gets freed.
>
> Right, I feel that is a different set of patches. Even this series is
> hard enough for review. Those order promotion and demotion is heading
> towards a buddy system design. I want to point out that even the buddy
> system is not able to handle the case that swapfile is almost full and
> the recently freed swap entries are not contiguous.
>
> We can invest in the buddy system, which doesn't handle all the
> fragmentation issues. Or I prefer to go directly to the discontiguous
> swap entry. We pay a price for the indirect mapping of swap entries.
> But it will solve the fragmentation issue 100%.
It's good if we can solve the fragmentation issue 100%. Just need to
pay attention to the cost.
>>
>> >
>> >>>
>> >>> My question is whether it's so important to share the per-cpu cluster
>> >>> among CPUs?
>> >>
>> >> My rationale for sharing is that the preference previously has been to favour
>> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> given order if there are actually slots available just because they have been
>> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> approach.
>> >>
>> >>> I suggest to start with simple design, that is, per-CPU
>> >>> cluster will not be shared among CPUs in most cases.
>> >>
>> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> current half-and-half policy in this patch.
>> >
>> > Sounds good to me. We can start with exclusive solution and evaluate
>> > whether shared solution is good.
>>
>> Yep. And also evaluate the dynamic order inc/dec idea too...
>
> It is not able to avoid fragementation 100% of the time. I prefer the
> discontinued swap entry as the next step, which guarantees forward
> progress, we will not be stuck in a situation where we are not able to
> allocate swap entries due to fragmentation.
If my understanding is correct, the implementation complexity of
order promotion/demotion isn't at the same level as that of the
discontiguous swap entry.
--
Best Regards,
Huang, Ying
>
>>
>> >
>> >>>
>> >>> Another choice for sharing is when we run short of free swap space, we
>> >>> disable per-CPU cluster and allocate from the shared non-full cluster
>> >>> list directly.
>> >>>
>> >>>> Which actually makes me wonder; what is the mechanism that prevents the current
>> >>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>> >>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>> >>>> raise count when its current. (If Chris has implemented that in the "big
>> >>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>> >>>
>> >>> Yes. We may need a flag for that.
>> >>>
>> >>>>>
>> >>>>>>> This is another reason that we should put the cluster in
>> >>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>> >>>>>>> in the cluster. It makes design complex to keep it in
>> >>>>>>> nonfull_clusters[order].
>> >>>>>>>
>> >>>>>>>> We have tried many different approaches including moving to the end of
>> >>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>> >>>>>>>> their swap slot cache (64 entries) from a different cluster.
>> >>>>>>>>
>> >>>>>>>>>> Those behaviors will be fine
>> >>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>> >>>>>>>>
>> >>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>> >>>>>>>>
>> >>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>> >>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>> >>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>> >>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>> >>>>>>>>>>
>> >>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>> >>>>>>>>>> to introduce the nonfull list.
>> >>>>>>>>>>
>> >>>>>
>> >>>>> [snip]
>> >
>> > --
>> > Best Regards,
>> > Huang, Ying
>>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-24 8:33 ` Ryan Roberts
2024-07-24 22:41 ` Chris Li
@ 2024-07-25 6:53 ` Huang, Ying
2024-07-25 8:26 ` Chris Li
1 sibling, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-25 6:53 UTC (permalink / raw)
To: Ryan Roberts
Cc: Chris Li, Andrew Morton, Kairui Song, Hugh Dickins, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
Ryan Roberts <ryan.roberts@arm.com> writes:
> On 23/07/2024 07:27, Huang, Ying wrote:
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>>> On 22/07/2024 09:49, Huang, Ying wrote:
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>
>>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>>>>>>>> Chris Li <chrisl@kernel.org> writes:
>>>>>>>>
>>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>>>>>>
>>>>>> [snip]
>>>>>>
>>>>>>>>>>>>> +
>>>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>>>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>>>>>>>>>>>>
>>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>>>>>>>>>>>>
>>>>>>>>>>>> So you could have this situation:
>>>>>>>>>>>>
>>>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>>>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>>>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>>>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>>>>>>>>>>>>
>>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>>>>>>>>>>>
>>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>>>>>>>>>>> moving it into nonfull.
>>>>>>>>>>
>>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>>>>>>>>>> refactoring separated from behavioural changes.
>>>>>>>>>
>>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>>>>>>>>> allocation.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>>>>>>>>>> results quoted on the cover letter?
>>>>>>>>>
>>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>>>>>>>>> because it is not feature complete. Currently it does not do swap
>>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>>>>>>>>> remove the RFC.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>>>>>>>>>>>
>>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>>>>>>>>>>>> chances of multiple CPUs using the same cluster.
>>>>>>>>>>>
>>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>>>>>>>>>>> entries allocated from the previous CPU.
>>>>>>>>>>
>>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>>>>>>>>>
>>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>>>>>>>>> advances to the next cluster pointer, it can cross with the other
>>>>>>>>> CPU's next cluster pointer.
>>>>>>>>
>>>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>>>>>>>> cluster only. If it doesn't do that, we should fix it.
>>>>>>>>
>>>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>>>>>>>> list if necessary. And, we should make it correct in this patch instead
>>>>>>>> of later in series. I understand that you want to make the patch itself
>>>>>>>> simple, but it's important to make code simple to be understood too.
>>>>>>>> Consistent design choice will do that.
>>>>>>>
>>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>>>>>>
>>>>>> Sorry, I misunderstood your words.
>>>>>>
>>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>>>>>>> case, in which case it should be added to the nonfull list.
>>>>>>>
>>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>>>>>>> This neither-one-policy-nor-the-other seems odd to me.
>>>>>>>
>>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>>>>>>> per-cpu cluster.
>>>>>>
>>>>>> Yes.
>>>>>>
>>>>>>> I was arguing to make it always shared. Perhaps the best
>>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>>>>>>> the shared approach as part of the "big rewrite"?
>>>>>>>>
>>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>>>>>>>>>> way to help that?
>>>>>>>>>
>>>>>>>>> Simply moving to the end of the list can create a possible deadloop
>>>>>>>>> when all clusters have been scanned and not available swap range
>>>>>>>>> found.
>>>>>>
>>>>>> I also think that the shared approach has dead loop issue.
>>>>>
>>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>>>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>>>>> go forever? That's surely just an implementation issue to solve? It's not a
>>>>> reason to avoid the design principle; if we agree that maintaining sharability
>>>>> of the cluster is preferred then the code must be written to guard against the
>>>>> dead loop problem. It could be done by remembering the first cluster you
>>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>>>>> to it. (I think holding the si lock will protect against concurrently freeing
>>>>> the cluster so it should definitely remain in the list?).
>>>>
>>>> I believe that you can find some way to avoid the dead loop issue,
>>>> although your suggestion may kill the performance via looping a long list
>>>> of nonfull clusters.
>>>
>>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>>> list when made current for a cpu), that only reduces the size of the list by a
>>> maximum of the number of CPUs in the system, which I suspect is pretty small
>>> compared to the number of nonfull clusters.
>>
>> Anyway, this depends on details. If we cannot allocate a order-N swap
>> entry from the cluster, we should remove it from the nonfull list for
>> order-N (This is the behavior of this patch too).
>
> Yes that's a good point, and I conceed it is more difficult to detect that
> condition if the cluster is shared. I suspect that with a bit of thinking, we
> could find a way though.
>
>> Your original
>> suggestion appears like that you want to keep all cluster with order-N
>> on the nonfull list for order-N always unless the number of free swap
>> entry is less than 1<<N.
>
> Well I think that's certainly one of the conditions for removing it. But agree
> that if a full scan of the cluster has been performed and no swap entries have
> been freed since the scan started then it should also be removed from the list.
>
>>
>>>> And, I understand that in some situations it may
>>>> be better to share clusters among CPUs. So my suggestion is,
>>>>
>>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>>>> have free swap entries with that order even after we are sure that we
>>>> haven't.
>>>
>>> Is this patch pretending that today? I don't think so?
>>
>> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> sure that there are no order-N free swap entry in the cluster.
>
> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> Chris's point was that if you move that cluster to N-1, eventually all clusters
> are for order-0 and you have no means of allocating high orders until a whole
> cluster becomes free. That logic certainly makes sense to me, so think its
> better for swap_cluster_info->order to remain static while the cluster is
> allocated. (I only skimmed that conversation so appologies if I got the
> conclusion wrong!).
>
>>
>>> But I agree that a
>>> cluster should only be on the per-order nonfull list if we know there are at
>>> least enough free swap entries in that cluster to cover the order. Of course
>>> that doesn't tell us for sure because they may not be contiguous.
>>
>> We can check that when free swap entry via checking adjacent swap
>> entries. IMHO, the performance should be acceptable.
>
> Would you then use the result of that scanning to "promote" a cluster's order?
> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> a separate change on top of what Chris is doing here. For high orders there
> could be quite a bit of scanning required in the worst case for every page that
> gets freed.
We can try to optimize it to control overhead if necessary.
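For example, something like this toy check at free time (a userspace
sketch only; toy_swap_map[] stands in for swap_map[] and the function
name is made up) touches just the aligned neighbourhood of the freed
slot, so the per-free overhead stays bounded at 1 << order entries:

#include <stdbool.h>
#include <stdio.h>

#define CLUSTER_SIZE 512

static unsigned char toy_swap_map[CLUSTER_SIZE];        /* 0 means free */

/* After freeing 'offset', test whether the naturally aligned 1 << order
 * block containing it has become entirely free. */
static bool aligned_block_free(unsigned int offset, unsigned int order)
{
        unsigned int start = offset & ~((1u << order) - 1);

        for (unsigned int i = 0; i < (1u << order); i++)
                if (toy_swap_map[start + i])
                        return false;
        return true;
}

int main(void)
{
        toy_swap_map[5] = 1;                            /* slot 5 still in use */
        printf("%d\n", aligned_block_free(4, 2));       /* 0: 4..7 not all free */
        toy_swap_map[5] = 0;                            /* now free slot 5 */
        printf("%d\n", aligned_block_free(4, 2));       /* 1: order-2 hole exists */
        return 0;
}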
>>
>>>>
>>>> My question is whether it's so important to share the per-cpu cluster
>>>> among CPUs?
>>>
>>> My rationale for sharing is that the preference previously has been to favour
>>> efficient use of swap space; we don't want to fail a request for allocation of a
>>> given order if there are actually slots available just because they have been
>>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>>> actually help improve allocation success, then I'm happy to take the exclusive
>>> approach.
>>>
>>>> I suggest to start with simple design, that is, per-CPU
>>>> cluster will not be shared among CPUs in most cases.
>>>
>>> I'm all for starting simple; I think that's what I already proposed (exclusive
>>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>>> current half-and-half policy in this patch.
>>
>> Sounds good to me. We can start with exclusive solution and evaluate
>> whether shared solution is good.
>
> Yep. And also evaluate the dynamic order inc/dec idea too...
Dynamic order inc/dec tries to solve a more fundamental problem. For
example,
- Initially, almost only order-0 pages are swapped out, so most non-full
clusters are order-0.
- Later, quite a few order-0 swap entries are freed, so that quite a few
order-4 swap entries become available.
- Order-4 pages need to be swapped out, but not enough order-4 non-full
clusters are available.
So, we need a way to migrate non-full clusters among orders to adjust to
these various situations automatically.
But yes, data is needed for any performance-related change.
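A toy sketch of what that migration decision could look like
(userspace only; the map array and names are illustrative, not the
kernel's cluster bookkeeping): recompute the largest naturally aligned
free run a cluster still has, and that is the nonfull list it should
sit on.

#include <stdio.h>

#define CLUSTER_SIZE 512
#define MAX_ORDER 5

/* Largest order for which the cluster still has a fully free, naturally
 * aligned run; that is the nonfull_clusters[] list it would sit on. */
static int largest_free_order(const unsigned char *map)
{
        for (int order = MAX_ORDER - 1; order > 0; order--) {
                for (int start = 0; start < CLUSTER_SIZE; start += 1 << order) {
                        int free = 1;

                        for (int i = 0; i < (1 << order); i++) {
                                if (map[start + i]) {
                                        free = 0;
                                        break;
                                }
                        }
                        if (free)
                                return order;
                }
        }
        return 0;
}

int main(void)
{
        unsigned char map[CLUSTER_SIZE] = { 0 };

        for (int i = 0; i < CLUSTER_SIZE; i += 2)
                map[i] = 1;                     /* every other slot in use */
        printf("nonfull[%d]\n", largest_free_order(map));      /* order 0 */

        map[0] = map[2] = map[4] = map[6] = 0;  /* frees open an order-3 run */
        printf("nonfull[%d]\n", largest_free_order(map));      /* order 3 */
        return 0;
}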
--
Best Regards,
Huang, Ying
>>
>>>>
>>>> Another choice for sharing is when we run short of free swap space, we
>>>> disable per-CPU cluster and allocate from the shared non-full cluster
>>>> list directly.
>>>>
>>>>> Which actually makes me wonder; what is the mechanism that prevents the current
>>>>> per-cpu cluster from being freed? Is that just handled by the conflict detection
>>>>> thingy? Perhaps that would be better handled with a flag to mark it in use, or
>>>>> raise count when its current. (If Chris has implemented that in the "big
>>>>> rewrite" patch, sorry, I still haven't gotten around to looking at it :-| )
>>>>
>>>> Yes. We may need a flag for that.
>>>>
>>>>>>
>>>>>>>> This is another reason that we should put the cluster in
>>>>>>>> nonfull_clusters[order--] if there are no free swap entry with "order"
>>>>>>>> in the cluster. It makes design complex to keep it in
>>>>>>>> nonfull_clusters[order].
>>>>>>>>
>>>>>>>>> We have tried many different approaches including moving to the end of
>>>>>>>>> the list. It can cause more fragmentation because each CPU allocates
>>>>>>>>> their swap slot cache (64 entries) from a different cluster.
>>>>>>>>>
>>>>>>>>>>> Those behaviors will be fine
>>>>>>>>>>> tuned after the patch 3 big rewrite. Try to make this patch simple.
>>>>>>>>>
>>>>>>>>> Again, I want to keep it simple here so patch 3 can land.
>>>>>>>>>
>>>>>>>>>>>> Another potential optimization (which was in my hacked version IIRC) is to only
>>>>>>>>>>>> add/remove from nonfull list when `total - count` crosses the (1 << order)
>>>>>>>>>>>> boundary rather than when becoming completely full. You definitely won't be able
>>>>>>>>>>>> to allocate order-2 if there are only 3 pages available, for example.
>>>>>>>>>>>
>>>>>>>>>>> That is in patch 3 as well. This patch is just doing the bare minimum
>>>>>>>>>>> to introduce the nonfull list.
>>>>>>>>>>>
>>>>>>
>>>>>> [snip]
>>
>> --
>> Best Regards,
>> Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-25 6:43 ` Huang, Ying
@ 2024-07-25 8:09 ` Chris Li
2024-07-26 2:09 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-25 8:09 UTC (permalink / raw)
To: Huang, Ying
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Wed, Jul 24, 2024 at 11:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Hi Ryan and Ying,
> >
> > Sorry I was busy. I am catching up on the email now.
> >
> > On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 23/07/2024 07:27, Huang, Ying wrote:
> >> > Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >
> >> >> On 22/07/2024 09:49, Huang, Ying wrote:
> >> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >>>
> >> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
> >> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >> >>>>>
> >> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
> >> >>>>>>> Chris Li <chrisl@kernel.org> writes:
> >> >>>>>>>
> >> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> >>>>>>>>>
> >> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
> >> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> >> >>>>>
> >> >>>>> [snip]
> >> >>>>>
> >> >>>>>>>>>>>> +
> >> >>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >> >>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
> >> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
> >> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
> >> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> So you could have this situation:
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
> >> >>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
> >> >>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
> >> >>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
> >> >>>>>>>>>>>
> >> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
> >> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
> >> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
> >> >>>>>>>>>>
> >> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
> >> >>>>>>>>>> moving it into nonfull.
> >> >>>>>>>>>
> >> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> >> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> >> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
> >> >>>>>>>>> refactoring separated from behavioural changes.
> >> >>>>>>>>
> >> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
> >> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
> >> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
> >> >>>>>>>> allocation.
> >> >>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
> >> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
> >> >>>>>>>>> results quoted on the cover letter?
> >> >>>>>>>>
> >> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
> >> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
> >> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
> >> >>>>>>>> because it is not feature complete. Currently it does not do swap
> >> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
> >> >>>>>>>> remove the RFC.
> >> >>>>>>>>
> >> >>>>>>>>>
> >> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
> >> >>>>>>>>>>
> >> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
> >> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
> >> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
> >> >>>>>>>>>>
> >> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
> >> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
> >> >>>>>>>>>> entries allocated from the previous CPU.
> >> >>>>>>>>>
> >> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
> >> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
> >> >>>>>>>>
> >> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
> >> >>>>>>>> advances to the next cluster pointer, it can cross with the other
> >> >>>>>>>> CPU's next cluster pointer.
> >> >>>>>>>
> >> >>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
> >> >>>>>>> cluster only. If it doesn't do that, we should fix it.
> >> >>>>>>>
> >> >>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
> >> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
> >> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> >> >>>>>>> list if necessary. And, we should make it correct in this patch instead
> >> >>>>>>> of later in series. I understand that you want to make the patch itself
> >> >>>>>>> simple, but it's important to make code simple to be understood too.
> >> >>>>>>> Consistent design choice will do that.
> >> >>>>>>
> >> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
> >> >>>>>
> >> >>>>> Sorry, I misunderstood your words.
> >> >>>>>
> >> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
> >> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
> >> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
> >> >>>>>> case, in which case it should be added to the nonfull list.
> >> >>>>>>
> >> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
> >> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
> >> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
> >> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
> >> >>>>>>
> >> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> >> >>>>>> per-cpu cluster.
> >> >>>>>
> >> >>>>> Yes.
> >> >>>>>
> >> >>>>>> I was arguing to make it always shared. Perhaps the best
> >> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
> >> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
> >> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
> >> >>>>>> the shared approach as part of the "big rewrite"?
> >> >>>>>>>
> >> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
> >> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
> >> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
> >> >>>>>>>>> way to help that?
> >> >>>>>>>>
> >> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
> >> >>>>>>>> when all clusters have been scanned and not available swap range
> >> >>>>>>>> found.
> >> >>>>>
> >> >>>>> I also think that the shared approach has dead loop issue.
> >> >>>>
> >> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> >> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
> >> >>>> go forever? That's surely just an implementation issue to solve? It's not a
> >> >>>> reason to avoid the design principle; if we agree that maintaining sharability
> >> >>>> of the cluster is preferred then the code must be written to guard against the
> >> >>>> dead loop problem. It could be done by remembering the first cluster you
> >> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> >> >>>> to it. (I think holding the si lock will protect against concurrently freeing
> >> >>>> the cluster so it should definitely remain in the list?).
> >> >>>
> >> >>> I believe that you can find some way to avoid the dead loop issue,
> >> >>> although your suggestion may kill the performance via looping a long list
> >> >>> of nonfull clusters.
> >> >>
> >> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> >> >> list when made current for a cpu), that only reduces the size of the list by a
> >> >> maximum of the number of CPUs in the system, which I suspect is pretty small
> >> >> compared to the number of nonfull clusters.
> >> >
> >> > Anyway, this depends on details. If we cannot allocate a order-N swap
> >> > entry from the cluster, we should remove it from the nonfull list for
> >> > order-N (This is the behavior of this patch too).
> >
> > Yes, Kairui implements something like that in the reclaim part of the
> > patch series. It is after patch 3. We are heavily testing the
> > performance and the stability of the reclaim patches. May I post the
> > reclaim together with patch 3 for discussion. If you want we can
> > discuss the re-order the patch in a later iteration.
> >
> >>
> >> Yes that's a good point, and I conceed it is more difficult to detect that
> >> condition if the cluster is shared. I suspect that with a bit of thinking, we
> >> could find a way though.
> >
> > Kaiui has the patch series show a good performance number that beats
> > the current swap cache reclaim.
> >
> > I want to make a point regarding the patch ordering before vs after
> > patch 3 (aka the big rewrite).
> > Previously, the "san_swap_map_try_ssd_cluster" only did partial
> > allocation. It does not sucessfully allocate a swap entry 100% the
> > time. The patch 3 makes the cluster allocation function return the
> > swap entry 100% of the time. There are no more fallback retry loops
> > outside of the cluster allocation function. Also the try_ssd function
> > does not do swap cache reclaims while the cluster allocation function
> > will need to. These two have very different constraints.
> >
> > There for, adding different cluster header into
> > san_swap_map_try_ssd_cluste will be a lot of waste investment of
> > development time in the sense that, that function will need to be
> > rewrite any way, the end result is very different.
>
> I am not a big fan of implementing the final solution directly.
> Personally, I prefer to improve step by step.
The currently proposed order also improves things step by step. The only
disagreement here is in which patch we introduce yet another list
in addition to the nonfull one. I just feel that it does not make
sense to invest in new code if that new code is going to be
completely rewritten anyway in the next two patches.
Unless you mean we should not do the patch 3 big rewrite and should
continue the scan_swap_map_try_ssd_cluster() way of only doing half of
the allocation job, letting scan_swap_map_slots() do the complex retry
on top of try_ssd(). In that case, I feel the overall code is more
complex and less maintainable.
> > That is why I want to make this change patch after patch 3. There is
> > also the long test cycle after the modification to make sure the swap
> > code path is stable. I am not resisting a change of patch orders, it
> > is that patch can't directly be removed before patch 3 before the big
> > rewrite.
> >
> >
> >>
> >> > Your original
> >> > suggestion appears like that you want to keep all cluster with order-N
> >> > on the nonfull list for order-N always unless the number of free swap
> >> > entry is less than 1<<N.
> >>
> >> Well I think that's certainly one of the conditions for removing it. But agree
> >> that if a full scan of the cluster has been performed and no swap entries have
> >> been freed since the scan started then it should also be removed from the list.
> >
> > Yes, in the later patch of patch, beyond patch 3, we have the almost
> > full cluster that for the cluster has been scan and not able to
> > allocate order N entry.
> >
> >>
> >> >
> >> >>> And, I understand that in some situations it may
> >> >>> be better to share clusters among CPUs. So my suggestion is,
> >> >>>
> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >> >>> have free swap entries with that order even after we are sure that we
> >> >>> haven't.
> >> >>
> >> >> Is this patch pretending that today? I don't think so?
> >> >
> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> > sure that there are no order-N free swap entry in the cluster.
> >>
> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
> >> are for order-0 and you have no means of allocating high orders until a whole
> >> cluster becomes free. That logic certainly makes sense to me, so think its
> >> better for swap_cluster_info->order to remain static while the cluster is
> >> allocated. (I only skimmed that conversation so appologies if I got the
> >> conclusion wrong!).
> >
> > Yes, that is the original intent, keep the cluster order as much as possible.
> >
> >>
> >> >
> >> >> But I agree that a
> >> >> cluster should only be on the per-order nonfull list if we know there are at
> >> >> least enough free swap entries in that cluster to cover the order. Of course
> >> >> that doesn't tell us for sure because they may not be contiguous.
> >> >
> >> > We can check that when free swap entry via checking adjacent swap
> >> > entries. IMHO, the performance should be acceptable.
> >>
> >> Would you then use the result of that scanning to "promote" a cluster's order?
> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> >> a separate change on top of what Chris is doing here. For high orders there
> >> could be quite a bit of scanning required in the worst case for every page that
> >> gets freed.
> >
> > Right, I feel that is a different set of patches. Even this series is
> > hard enough for review. Those order promotion and demotion is heading
> > towards a buddy system design. I want to point out that even the buddy
> > system is not able to handle the case that swapfile is almost full and
> > the recently freed swap entries are not contiguous.
> >
> > We can invest in the buddy system, which doesn't handle all the
> > fragmentation issues. Or I prefer to go directly to the discontiguous
> > swap entry. We pay a price for the indirect mapping of swap entries.
> > But it will solve the fragmentation issue 100%.
>
> It's good if we can solve the fragmentation issue 100%. Just need to
> pay attention to the cost.
By cost, do you mean the development cost or the runtime cost (memory and CPU)?
>
> >>
> >> >
> >> >>>
> >> >>> My question is whether it's so important to share the per-cpu cluster
> >> >>> among CPUs?
> >> >>
> >> >> My rationale for sharing is that the preference previously has been to favour
> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> >> given order if there are actually slots available just because they have been
> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> >> approach.
> >> >>
> >> >>> I suggest to start with simple design, that is, per-CPU
> >> >>> cluster will not be shared among CPUs in most cases.
> >> >>
> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> >> current half-and-half policy in this patch.
> >> >
> >> > Sounds good to me. We can start with exclusive solution and evaluate
> >> > whether shared solution is good.
> >>
> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >
> > It is not able to avoid fragementation 100% of the time. I prefer the
> > discontinued swap entry as the next step, which guarantees forward
> > progress, we will not be stuck in a situation where we are not able to
> > allocate swap entries due to fragmentation.
>
> If my understanding were correct, the implementation complexity of the
> order promotion/demotion isn't at the same level of that of discontinued
> swap entry.
Discontiguous swap entries have higher complexity but a higher payout as
well. They can get us to places where cluster promotion/demotion can't.

I also feel that if we implement something heading towards a buddy system
allocator for swap, we should do a proper buddy allocator implementation
with the proper data structures.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-25 6:53 ` Huang, Ying
@ 2024-07-25 8:26 ` Chris Li
2024-07-26 2:04 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-25 8:26 UTC (permalink / raw)
To: Huang, Ying
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Wed, Jul 24, 2024 at 11:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> > On 23/07/2024 07:27, Huang, Ying wrote:
> >> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>
> >>> On 22/07/2024 09:49, Huang, Ying wrote:
> >>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>>
> >>>>> On 22/07/2024 03:14, Huang, Ying wrote:
> >>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
> >>>>>>
> >>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
> >>>>>>>> Chris Li <chrisl@kernel.org> writes:
> >>>>>>>>
> >>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
> >>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
> >>>>>>
> >>>>>> [snip]
> >>>>>>
> >>>>>>>>>>>>> +
> >>>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
> >>>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
> >>>>>>>>>>>>
> >>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
> >>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
> >>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
> >>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
> >>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
> >>>>>>>>>>>>
> >>>>>>>>>>>> So you could have this situation:
> >>>>>>>>>>>>
> >>>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
> >>>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
> >>>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
> >>>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
> >>>>>>>>>>>>
> >>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
> >>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
> >>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
> >>>>>>>>>>>
> >>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
> >>>>>>>>>>> moving it into nonfull.
> >>>>>>>>>>
> >>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
> >>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
> >>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
> >>>>>>>>>> refactoring separated from behavioural changes.
> >>>>>>>>>
> >>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
> >>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
> >>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
> >>>>>>>>> allocation.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
> >>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
> >>>>>>>>>> results quoted on the cover letter?
> >>>>>>>>>
> >>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
> >>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
> >>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
> >>>>>>>>> because it is not feature complete. Currently it does not do swap
> >>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
> >>>>>>>>> remove the RFC.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
> >>>>>>>>>>>
> >>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
> >>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
> >>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
> >>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
> >>>>>>>>>>>> chances of multiple CPUs using the same cluster.
> >>>>>>>>>>>
> >>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
> >>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
> >>>>>>>>>>> entries allocated from the previous CPU.
> >>>>>>>>>>
> >>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
> >>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
> >>>>>>>>>
> >>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
> >>>>>>>>> advances to the next cluster pointer, it can cross with the other
> >>>>>>>>> CPU's next cluster pointer.
> >>>>>>>>
> >>>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
> >>>>>>>> cluster only. If it doesn't do that, we should fix it.
> >>>>>>>>
> >>>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
> >>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
> >>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
> >>>>>>>> list if necessary. And, we should make it correct in this patch instead
> >>>>>>>> of later in series. I understand that you want to make the patch itself
> >>>>>>>> simple, but it's important to make code simple to be understood too.
> >>>>>>>> Consistent design choice will do that.
> >>>>>>>
> >>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
> >>>>>>
> >>>>>> Sorry, I misunderstood your words.
> >>>>>>
> >>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
> >>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
> >>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
> >>>>>>> case, in which case it should be added to the nonfull list.
> >>>>>>>
> >>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
> >>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
> >>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
> >>>>>>> This neither-one-policy-nor-the-other seems odd to me.
> >>>>>>>
> >>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
> >>>>>>> per-cpu cluster.
> >>>>>>
> >>>>>> Yes.
> >>>>>>
> >>>>>>> I was arguing to make it always shared. Perhaps the best
> >>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
> >>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
> >>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
> >>>>>>> the shared approach as part of the "big rewrite"?
> >>>>>>>>
> >>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
> >>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
> >>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
> >>>>>>>>>> way to help that?
> >>>>>>>>>
> >>>>>>>>> Simply moving to the end of the list can create a possible deadloop
> >>>>>>>>> when all clusters have been scanned and not available swap range
> >>>>>>>>> found.
> >>>>>>
> >>>>>> I also think that the shared approach has dead loop issue.
> >>>>>
> >>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
> >>>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
> >>>>> go forever? That's surely just an implementation issue to solve? It's not a
> >>>>> reason to avoid the design principle; if we agree that maintaining sharability
> >>>>> of the cluster is preferred then the code must be written to guard against the
> >>>>> dead loop problem. It could be done by remembering the first cluster you
> >>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
> >>>>> to it. (I think holding the si lock will protect against concurrently freeing
> >>>>> the cluster so it should definitely remain in the list?).
> >>>>
> >>>> I believe that you can find some way to avoid the dead loop issue,
> >>>> although your suggestion may kill the performance via looping a long list
> >>>> of nonfull clusters.
> >>>
> >>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
> >>> list when made current for a cpu), that only reduces the size of the list by a
> >>> maximum of the number of CPUs in the system, which I suspect is pretty small
> >>> compared to the number of nonfull clusters.
> >>
> >> Anyway, this depends on details. If we cannot allocate a order-N swap
> >> entry from the cluster, we should remove it from the nonfull list for
> >> order-N (This is the behavior of this patch too).
> >
> > Yes that's a good point, and I conceed it is more difficult to detect that
> > condition if the cluster is shared. I suspect that with a bit of thinking, we
> > could find a way though.
> >
> >> Your original
> >> suggestion appears like that you want to keep all cluster with order-N
> >> on the nonfull list for order-N always unless the number of free swap
> >> entry is less than 1<<N.
> >
> > Well I think that's certainly one of the conditions for removing it. But agree
> > that if a full scan of the cluster has been performed and no swap entries have
> > been freed since the scan started then it should also be removed from the list.
> >
> >>
> >>>> And, I understand that in some situations it may
> >>>> be better to share clusters among CPUs. So my suggestion is,
> >>>>
> >>>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >>>> have free swap entries with that order even after we are sure that we
> >>>> haven't.
> >>>
> >>> Is this patch pretending that today? I don't think so?
> >>
> >> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> sure that there are no order-N free swap entry in the cluster.
> >
> > Oh I see what you mean. I think you and Chris already discussed this? IIRC
> > Chris's point was that if you move that cluster to N-1, eventually all clusters
> > are for order-0 and you have no means of allocating high orders until a whole
> > cluster becomes free. That logic certainly makes sense to me, so think its
> > better for swap_cluster_info->order to remain static while the cluster is
> > allocated. (I only skimmed that conversation so appologies if I got the
> > conclusion wrong!).
> >
> >>
> >>> But I agree that a
> >>> cluster should only be on the per-order nonfull list if we know there are at
> >>> least enough free swap entries in that cluster to cover the order. Of course
> >>> that doesn't tell us for sure because they may not be contiguous.
> >>
> >> We can check that when free swap entry via checking adjacent swap
> >> entries. IMHO, the performance should be acceptable.
> >
> > Would you then use the result of that scanning to "promote" a cluster's order?
> > e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> > a separate change on top of what Chris is doing here. For high orders there
> > could be quite a bit of scanning required in the worst case for every page that
> > gets freed.
>
> We can try to optimize it to control overhead if necessary.
>
> >>
> >>>>
> >>>> My question is whether it's so important to share the per-cpu cluster
> >>>> among CPUs?
> >>>
> >>> My rationale for sharing is that the preference previously has been to favour
> >>> efficient use of swap space; we don't want to fail a request for allocation of a
> >>> given order if there are actually slots available just because they have been
> >>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >>> actually help improve allocation success, then I'm happy to take the exclusive
> >>> approach.
> >>>
> >>>> I suggest to start with simple design, that is, per-CPU
> >>>> cluster will not be shared among CPUs in most cases.
> >>>
> >>> I'm all for starting simple; I think that's what I already proposed (exclusive
> >>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >>> current half-and-half policy in this patch.
> >>
> >> Sounds good to me. We can start with exclusive solution and evaluate
> >> whether shared solution is good.
> >
> > Yep. And also evaluate the dynamic order inc/dec idea too...
>
> Dynamic order inc/dec tries solving a more fundamental problem. For
> example,
>
> - Initially, almost only order-0 pages are swapped out, most non-full
> clusters are order-0.
>
> - Later, quite some order-0 swap entries are freed so that there are
> quite some order-4 swap entries available.
If the freeing of swap entries follows a random distribution, you need 16
contiguous swap entries free at the same time, at a 16-aligned base
location. The total amount of order-4 allocatable swap space adds up to
much less than the order-0 allocatable swap space.
If one entry being free has 50% probability (swapfile half full), then the
probability of 16 contiguous entries all being free is 0.5^16 ~= 1.5e-5.
If the swapfile is 80% full, that number drops to 0.2^16 ~= 6.5e-12.
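To make that arithmetic concrete, here is a small userspace-only
calculation (my illustration, not part of this series). It assumes free
slots are independently and uniformly distributed, which real workloads
with spatial locality need not follow:

/* build with: gcc -O2 -o order4_odds order4_odds.c -lm */
#include <stdio.h>
#include <math.h>

int main(void)
{
        /* fraction of swap slots assumed free: 50% and 20% */
        double free_fraction[] = { 0.5, 0.2 };

        for (int i = 0; i < 2; i++) {
                /* an aligned order-4 range needs all 16 slots free at once */
                double p = pow(free_fraction[i], 16);

                printf("swapfile %2.0f%% full: P(16 aligned slots free) ~= %.2e\n",
                       (1.0 - free_fraction[i]) * 100.0, p);
        }
        return 0;
}

Under that independence assumption it prints ~1.53e-05 for a half-full
swapfile and ~6.55e-12 for one that is 80% full, matching the numbers
above.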
> - Order-4 pages need to be swapped out, but no enough order-4 non-full
> clusters available.
Exactly.
>
> So, we need a way to migrate non-full clusters among orders to adjust to
> the various situations automatically.
There is no easy way to migrate swap entries to different locations.
That is why I would like to have discontiguous swap entry allocation for
mTHP.
>
> But yes, data is needed for any performance related change.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-25 8:26 ` Chris Li
@ 2024-07-26 2:04 ` Huang, Ying
2024-07-26 4:50 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-26 2:04 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Wed, Jul 24, 2024 at 11:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>
>> > On 23/07/2024 07:27, Huang, Ying wrote:
>> >> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>
>> >>> On 22/07/2024 09:49, Huang, Ying wrote:
>> >>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>>
>> >>>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >>>>>>
>> >>>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >>>>>>>> Chris Li <chrisl@kernel.org> writes:
>> >>>>>>>>
>> >>>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >>>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >>>>>>
>> >>>>>> [snip]
>> >>>>>>
>> >>>>>>>>>>>>> +
>> >>>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >>>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >>>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >>>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>> >>>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>> >>>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> So you could have this situation:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>> >>>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>> >>>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>> >>>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >>>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >>>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >>>>>>>>>>>
>> >>>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >>>>>>>>>>> moving it into nonfull.
>> >>>>>>>>>>
>> >>>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >>>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> >>>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >>>>>>>>>> refactoring separated from behavioural changes.
>> >>>>>>>>>
>> >>>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >>>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >>>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>> >>>>>>>>> allocation.
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> >>>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>> >>>>>>>>>> results quoted on the cover letter?
>> >>>>>>>>>
>> >>>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>> >>>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>> >>>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> >>>>>>>>> because it is not feature complete. Currently it does not do swap
>> >>>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >>>>>>>>> remove the RFC.
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >>>>>>>>>>>
>> >>>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >>>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >>>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>> >>>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >>>>>>>>>>>> chances of multiple CPUs using the same cluster.
>> >>>>>>>>>>>
>> >>>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>> >>>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>> >>>>>>>>>>> entries allocated from the previous CPU.
>> >>>>>>>>>>
>> >>>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>> >>>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>> >>>>>>>>>
>> >>>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>> >>>>>>>>> advances to the next cluster pointer, it can cross with the other
>> >>>>>>>>> CPU's next cluster pointer.
>> >>>>>>>>
>> >>>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>> >>>>>>>> cluster only. If it doesn't do that, we should fix it.
>> >>>>>>>>
>> >>>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>> >>>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>> >>>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> >>>>>>>> list if necessary. And, we should make it correct in this patch instead
>> >>>>>>>> of later in series. I understand that you want to make the patch itself
>> >>>>>>>> simple, but it's important to make code simple to be understood too.
>> >>>>>>>> Consistent design choice will do that.
>> >>>>>>>
>> >>>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>> >>>>>>
>> >>>>>> Sorry, I misunderstood your words.
>> >>>>>>
>> >>>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>> >>>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>> >>>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> >>>>>>> case, in which case it should be added to the nonfull list.
>> >>>>>>>
>> >>>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> >>>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> >>>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>> >>>>>>> This neither-one-policy-nor-the-other seems odd to me.
>> >>>>>>>
>> >>>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> >>>>>>> per-cpu cluster.
>> >>>>>>
>> >>>>>> Yes.
>> >>>>>>
>> >>>>>>> I was arguing to make it always shared. Perhaps the best
>> >>>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> >>>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>> >>>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> >>>>>>> the shared approach as part of the "big rewrite"?
>> >>>>>>>>
>> >>>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>> >>>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>> >>>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>> >>>>>>>>>> way to help that?
>> >>>>>>>>>
>> >>>>>>>>> Simply moving to the end of the list can create a possible deadloop
>> >>>>>>>>> when all clusters have been scanned and not available swap range
>> >>>>>>>>> found.
>> >>>>>>
>> >>>>>> I also think that the shared approach has dead loop issue.
>> >>>>>
>> >>>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> >>>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>> >>>>> go forever? That's surely just an implementation issue to solve? It's not a
>> >>>>> reason to avoid the design principle; if we agree that maintaining sharability
>> >>>>> of the cluster is preferred then the code must be written to guard against the
>> >>>>> dead loop problem. It could be done by remembering the first cluster you
>> >>>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> >>>>> to it. (I think holding the si lock will protect against concurrently freeing
>> >>>>> the cluster so it should definitely remain in the list?).
>> >>>>
>> >>>> I believe that you can find some way to avoid the dead loop issue,
>> >>>> although your suggestion may kill the performance via looping a long list
>> >>>> of nonfull clusters.
>> >>>
>> >>> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> >>> list when made current for a cpu), that only reduces the size of the list by a
>> >>> maximum of the number of CPUs in the system, which I suspect is pretty small
>> >>> compared to the number of nonfull clusters.
>> >>
>> >> Anyway, this depends on details. If we cannot allocate a order-N swap
>> >> entry from the cluster, we should remove it from the nonfull list for
>> >> order-N (This is the behavior of this patch too).
>> >
>> > Yes that's a good point, and I conceed it is more difficult to detect that
>> > condition if the cluster is shared. I suspect that with a bit of thinking, we
>> > could find a way though.
>> >
>> >> Your original
>> >> suggestion appears like that you want to keep all cluster with order-N
>> >> on the nonfull list for order-N always unless the number of free swap
>> >> entry is less than 1<<N.
>> >
>> > Well I think that's certainly one of the conditions for removing it. But agree
>> > that if a full scan of the cluster has been performed and no swap entries have
>> > been freed since the scan started then it should also be removed from the list.
>> >
>> >>
>> >>>> And, I understand that in some situations it may
>> >>>> be better to share clusters among CPUs. So my suggestion is,
>> >>>>
>> >>>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >>>> have free swap entries with that order even after we are sure that we
>> >>>> haven't.
>> >>>
>> >>> Is this patch pretending that today? I don't think so?
>> >>
>> >> IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> sure that there are no order-N free swap entry in the cluster.
>> >
>> > Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> > Chris's point was that if you move that cluster to N-1, eventually all clusters
>> > are for order-0 and you have no means of allocating high orders until a whole
>> > cluster becomes free. That logic certainly makes sense to me, so think its
>> > better for swap_cluster_info->order to remain static while the cluster is
>> > allocated. (I only skimmed that conversation so appologies if I got the
>> > conclusion wrong!).
>> >
>> >>
>> >>> But I agree that a
>> >>> cluster should only be on the per-order nonfull list if we know there are at
>> >>> least enough free swap entries in that cluster to cover the order. Of course
>> >>> that doesn't tell us for sure because they may not be contiguous.
>> >>
>> >> We can check that when free swap entry via checking adjacent swap
>> >> entries. IMHO, the performance should be acceptable.
>> >
>> > Would you then use the result of that scanning to "promote" a cluster's order?
>> > e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> > a separate change on top of what Chris is doing here. For high orders there
>> > could be quite a bit of scanning required in the worst case for every page that
>> > gets freed.
>>
>> We can try to optimize it to control overhead if necessary.
>>
>> >>
>> >>>>
>> >>>> My question is whether it's so important to share the per-cpu cluster
>> >>>> among CPUs?
>> >>>
>> >>> My rationale for sharing is that the preference previously has been to favour
>> >>> efficient use of swap space; we don't want to fail a request for allocation of a
>> >>> given order if there are actually slots available just because they have been
>> >>> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >>> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >>> actually help improve allocation success, then I'm happy to take the exclusive
>> >>> approach.
>> >>>
>> >>>> I suggest to start with simple design, that is, per-CPU
>> >>>> cluster will not be shared among CPUs in most cases.
>> >>>
>> >>> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >>> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >>> current half-and-half policy in this patch.
>> >>
>> >> Sounds good to me. We can start with exclusive solution and evaluate
>> >> whether shared solution is good.
>> >
>> > Yep. And also evaluate the dynamic order inc/dec idea too...
>>
>> Dynamic order inc/dec tries solving a more fundamental problem. For
>> example,
>>
>> - Initially, almost only order-0 pages are swapped out, most non-full
>> clusters are order-0.
>>
>> - Later, quite some order-0 swap entries are freed so that there are
>> quite some order-4 swap entries available.
>
> If the freeing of swap entry is random distribution. You need 16
> continuous swap entries free at the same time at aligned 16 base
> locations. The total number of order 4 free swap space add up together
> is much lower than the order 0 allocatable swap space.
> If having one entry free is 50% probability(swapfile half full), then
> having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> If the swapfile is 80% full, that number drops to 6.5 E -12.
This depends on the workload. Quite a few workloads will show some degree
of spatial locality. For a workload with no spatial locality at all, as
above, mTHP may not be a good choice in the first place.
>> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>> clusters available.
>
> Exactly.
>
>>
>> So, we need a way to migrate non-full clusters among orders to adjust to
>> the various situations automatically.
>
> There is no easy way to migrate swap entries to different locations.
> That is why I like to have discontiguous swap entries allocation for
> mTHP.
We suggest migrating non-full swap clusters among the different per-order
lists, not migrating swap entries.
>>
>> But yes, data is needed for any performance related change.
BTW: I don't think "non-full cluster" is a good name. "Partial cluster" is
much better and follows the same convention as the partial slab list.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-25 8:09 ` Chris Li
@ 2024-07-26 2:09 ` Huang, Ying
2024-07-26 5:09 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-26 2:09 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Wed, Jul 24, 2024 at 11:46 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > Hi Ryan and Ying,
>> >
>> > Sorry I was busy. I am catching up on the email now.
>> >
>> > On Wed, Jul 24, 2024 at 1:33 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>
>> >> On 23/07/2024 07:27, Huang, Ying wrote:
>> >> > Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >
>> >> >> On 22/07/2024 09:49, Huang, Ying wrote:
>> >> >>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >>>
>> >> >>>> On 22/07/2024 03:14, Huang, Ying wrote:
>> >> >>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> >> >>>>>
>> >> >>>>>> On 18/07/2024 08:53, Huang, Ying wrote:
>> >> >>>>>>> Chris Li <chrisl@kernel.org> writes:
>> >> >>>>>>>
>> >> >>>>>>>> On Wed, Jul 17, 2024 at 3:14 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> On 16/07/2024 23:46, Chris Li wrote:
>> >> >>>>>>>>>> On Mon, Jul 15, 2024 at 8:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> On 11/07/2024 08:29, Chris Li wrote:
>> >> >>>>>
>> >> >>>>> [snip]
>> >> >>>>>
>> >> >>>>>>>>>>>> +
>> >> >>>>>>>>>>>> + if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
>> >> >>>>>>>>>>>> + list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> I find the transitions when you add and remove a cluster from the
>> >> >>>>>>>>>>> nonfull_clusters list a bit strange (if I've understood correctly): It is added
>> >> >>>>>>>>>>> to the list whenever there is at least one free swap entry if not already on the
>> >> >>>>>>>>>>> list. But you take it off the list when assigning it as the current cluster for
>> >> >>>>>>>>>>> a cpu in scan_swap_map_try_ssd_cluster().
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> So you could have this situation:
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> - cpuA allocs cluster from free list (exclusive to that cpu)
>> >> >>>>>>>>>>> - cpuA allocs 1 swap entry from current cluster
>> >> >>>>>>>>>>> - swap entry is freed; cluster added to nonfull_clusters
>> >> >>>>>>>>>>> - cpuB "allocs" cluster from nonfull_clusters
>> >> >>>>>>>>>>>
>> >> >>>>>>>>>>> At this point both cpuA and cpuB share the same cluster as their current
>> >> >>>>>>>>>>> cluster. So why not just put the cluster on the nonfull_clusters list at
>> >> >>>>>>>>>>> allocation time (when removed from free_list) and only remove it from the
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> The big rewrite on patch 3 does that, taking it off the free list and
>> >> >>>>>>>>>> moving it into nonfull.
>> >> >>>>>>>>>
>> >> >>>>>>>>> Oh, from the title, "RFC: mm: swap: seperate SSD allocation from
>> >> >>>>>>>>> scan_swap_map_slots()" I assumed that was just a refactoring of the code to
>> >> >>>>>>>>> separate the SSD and HDD code paths. Personally I'd prefer to see the
>> >> >>>>>>>>> refactoring separated from behavioural changes.
>> >> >>>>>>>>
>> >> >>>>>>>> It is not a refactoring. It is a big rewrite of the swap allocator
>> >> >>>>>>>> using the cluster. Behavior change is expected. The goal is completely
>> >> >>>>>>>> removing the brute force scanning of swap_map[] array for cluster swap
>> >> >>>>>>>> allocation.
>> >> >>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>> Since the patch was titled RFC and I thought it was just refactoring, I was
>> >> >>>>>>>>> deferring review. But sounds like it is actually required to realize the test
>> >> >>>>>>>>> results quoted on the cover letter?
>> >> >>>>>>>>
>> >> >>>>>>>> Yes, required because it handles the previous fall out case try_ssd()
>> >> >>>>>>>> failed. This big rewrite has gone through a lot of testing and bug
>> >> >>>>>>>> fix. It is pretty stable now. The only reason I keep it as RFC is
>> >> >>>>>>>> because it is not feature complete. Currently it does not do swap
>> >> >>>>>>>> cache reclaim. The next version will have swap cache reclaim and
>> >> >>>>>>>> remove the RFC.
>> >> >>>>>>>>
>> >> >>>>>>>>>
>> >> >>>>>>>>>> I am only making the minimal change in this step so the big rewrite can land.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>>> nonfull_clusters list when it is completely full (or at least definitely doesn't
>> >> >>>>>>>>>>> have room for an `order` allocation)? Then you allow "stealing" always instead
>> >> >>>>>>>>>>> of just sometimes. You would likely want to move the cluster to the end of the
>> >> >>>>>>>>>>> nonfull list when selecting it in scan_swap_map_try_ssd_cluster() to reduce the
>> >> >>>>>>>>>>> chances of multiple CPUs using the same cluster.
>> >> >>>>>>>>>>
>> >> >>>>>>>>>> For nonfull clusters it is less important to avoid multiple CPU
>> >> >>>>>>>>>> sharing the cluster. Because the cluster already has previous swap
>> >> >>>>>>>>>> entries allocated from the previous CPU.
>> >> >>>>>>>>>
>> >> >>>>>>>>> But if 2 CPUs have the same cluster, isn't there a pathalogical case where cpuA
>> >> >>>>>>>>> could be slightly ahead of cpuB so that cpuA allocates all the free pages and
>> >> >>>>>>>>
>> >> >>>>>>>> That happens to exist per cpu next pointer already. When the other CPU
>> >> >>>>>>>> advances to the next cluster pointer, it can cross with the other
>> >> >>>>>>>> CPU's next cluster pointer.
>> >> >>>>>>>
>> >> >>>>>>> No. si->percpu_cluster[cpu].next will keep in the current per cpu
>> >> >>>>>>> cluster only. If it doesn't do that, we should fix it.
>> >> >>>>>>>
>> >> >>>>>>> I agree with Ryan that we should make per cpu cluster correct. A
>> >> >>>>>>> cluster in per cpu cluster shouldn't be put in nonfull list. When we
>> >> >>>>>>> scan to the end of a per cpu cluster, we can put the cluster in nonfull
>> >> >>>>>>> list if necessary. And, we should make it correct in this patch instead
>> >> >>>>>>> of later in series. I understand that you want to make the patch itself
>> >> >>>>>>> simple, but it's important to make code simple to be understood too.
>> >> >>>>>>> Consistent design choice will do that.
>> >> >>>>>>
>> >> >>>>>> I think I'm actually arguing for the opposite of what you suggest here.
>> >> >>>>>
>> >> >>>>> Sorry, I misunderstood your words.
>> >> >>>>>
>> >> >>>>>> As I see it, there are 2 possible approaches; either a cluster is always
>> >> >>>>>> considered exclusive to a single cpu when its set as a per-cpu cluster, so it
>> >> >>>>>> does not appear on the nonfull list. Or a cluster is considered sharable in this
>> >> >>>>>> case, in which case it should be added to the nonfull list.
>> >> >>>>>>
>> >> >>>>>> The code at the moment sort of does both; when a cpu decides to use a cluster in
>> >> >>>>>> the nonfull list, it removes it from that list to make it exclusive. But as soon
>> >> >>>>>> as a single swap entry is freed from that cluster it is put back on the list.
>> >> >>>>>> This neither-one-policy-nor-the-other seems odd to me.
>> >> >>>>>>
>> >> >>>>>> I think Huang, Ying is arguing to keep it always exclusive while installed as a
>> >> >>>>>> per-cpu cluster.
>> >> >>>>>
>> >> >>>>> Yes.
>> >> >>>>>
>> >> >>>>>> I was arguing to make it always shared. Perhaps the best
>> >> >>>>>> approach is to implement the exclusive policy in this patch (you'd need a flag
>> >> >>>>>> to note if any pages were freed while in exclusive use, then when exclusive use
>> >> >>>>>> completes, put it back on the nonfull list if the flag was set). Then migrate to
>> >> >>>>>> the shared approach as part of the "big rewrite"?
>> >> >>>>>>>
>> >> >>>>>>>>> cpuB just ends up scanning and finding nothing to allocate. I think do want to
>> >> >>>>>>>>> share the cluster when you really need to, but try to avoid it if there are
>> >> >>>>>>>>> other options, and I think moving the cluster to the end of the list might be a
>> >> >>>>>>>>> way to help that?
>> >> >>>>>>>>
>> >> >>>>>>>> Simply moving to the end of the list can create a possible deadloop
>> >> >>>>>>>> when all clusters have been scanned and not available swap range
>> >> >>>>>>>> found.
>> >> >>>>>
>> >> >>>>> I also think that the shared approach has dead loop issue.
>> >> >>>>
>> >> >>>> What exactly do you mean by dead loop issue? Perhaps you are suggesting the code
>> >> >>>> won't know when to stop dequeing/requeuing clusters on the nonfull list and will
>> >> >>>> go forever? That's surely just an implementation issue to solve? It's not a
>> >> >>>> reason to avoid the design principle; if we agree that maintaining sharability
>> >> >>>> of the cluster is preferred then the code must be written to guard against the
>> >> >>>> dead loop problem. It could be done by remembering the first cluster you
>> >> >>>> dequeued/requeued in scan_swap_map_try_ssd_cluster() and stop when you get back
>> >> >>>> to it. (I think holding the si lock will protect against concurrently freeing
>> >> >>>> the cluster so it should definitely remain in the list?).
>> >> >>>
>> >> >>> I believe that you can find some way to avoid the dead loop issue,
>> >> >>> although your suggestion may kill the performance via looping a long list
>> >> >>> of nonfull clusters.
>> >> >>
>> >> >> I don't agree; If the clusters are considered exclusive (i.e. removed from the
>> >> >> list when made current for a cpu), that only reduces the size of the list by a
>> >> >> maximum of the number of CPUs in the system, which I suspect is pretty small
>> >> >> compared to the number of nonfull clusters.
>> >> >
>> >> > Anyway, this depends on details. If we cannot allocate a order-N swap
>> >> > entry from the cluster, we should remove it from the nonfull list for
>> >> > order-N (This is the behavior of this patch too).
>> >
>> > Yes, Kairui implements something like that in the reclaim part of the
>> > patch series. It is after patch 3. We are heavily testing the
>> > performance and the stability of the reclaim patches. May I post the
>> > reclaim together with patch 3 for discussion. If you want we can
>> > discuss the re-order the patch in a later iteration.
>> >
>> >>
>> >> Yes that's a good point, and I conceed it is more difficult to detect that
>> >> condition if the cluster is shared. I suspect that with a bit of thinking, we
>> >> could find a way though.
>> >
>> > Kaiui has the patch series show a good performance number that beats
>> > the current swap cache reclaim.
>> >
>> > I want to make a point regarding the patch ordering before vs after
>> > patch 3 (aka the big rewrite).
>> > Previously, the "san_swap_map_try_ssd_cluster" only did partial
>> > allocation. It does not sucessfully allocate a swap entry 100% the
>> > time. The patch 3 makes the cluster allocation function return the
>> > swap entry 100% of the time. There are no more fallback retry loops
>> > outside of the cluster allocation function. Also the try_ssd function
>> > does not do swap cache reclaims while the cluster allocation function
>> > will need to. These two have very different constraints.
>> >
>> > There for, adding different cluster header into
>> > san_swap_map_try_ssd_cluste will be a lot of waste investment of
>> > development time in the sense that, that function will need to be
>> > rewrite any way, the end result is very different.
>>
>> I am not a big fan of implementing the final solution directly.
>> Personally, I prefer to improve step by step.
>
> The current proposed order also improves things step by step. The only
> disagreement here is which patch order we introduce yet another list
> in addition to the nonfull one. I just feel that it does not make
> sense to invest into new code if that new code is going to be
> completely rewrite anyway in the next two patches.
>
> Unless you mean is we should not do the patch 3 big rewrite and should
> continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> the allocation job and let scan_swap_map_slots() do the complex retry
> on top of try_ssd(). I feel the overall code is more complex and less
> maintainable.
I haven't looked at [3/3] and will wait for your next version for that, so
I cannot say which order is better. Please consider reviewers' effort too.
Small-step patches are easier to understand and review.
>> > That is why I want to make this change patch after patch 3. There is
>> > also the long test cycle after the modification to make sure the swap
>> > code path is stable. I am not resisting a change of patch orders, it
>> > is that patch can't directly be removed before patch 3 before the big
>> > rewrite.
>> >
>> >
>> >>
>> >> > Your original
>> >> > suggestion appears like that you want to keep all cluster with order-N
>> >> > on the nonfull list for order-N always unless the number of free swap
>> >> > entry is less than 1<<N.
>> >>
>> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> been freed since the scan started then it should also be removed from the list.
>> >
>> > Yes, in the later patch of patch, beyond patch 3, we have the almost
>> > full cluster that for the cluster has been scan and not able to
>> > allocate order N entry.
>> >
>> >>
>> >> >
>> >> >>> And, I understand that in some situations it may
>> >> >>> be better to share clusters among CPUs. So my suggestion is,
>> >> >>>
>> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >>> have free swap entries with that order even after we are sure that we
>> >> >>> haven't.
>> >> >>
>> >> >> Is this patch pretending that today? I don't think so?
>> >> >
>> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> > sure that there are no order-N free swap entry in the cluster.
>> >>
>> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> cluster becomes free. That logic certainly makes sense to me, so think its
>> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> allocated. (I only skimmed that conversation so appologies if I got the
>> >> conclusion wrong!).
>> >
>> > Yes, that is the original intent, keep the cluster order as much as possible.
>> >
>> >>
>> >> >
>> >> >> But I agree that a
>> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >
>> >> > We can check that when free swap entry via checking adjacent swap
>> >> > entries. IMHO, the performance should be acceptable.
>> >>
>> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> a separate change on top of what Chris is doing here. For high orders there
>> >> could be quite a bit of scanning required in the worst case for every page that
>> >> gets freed.
>> >
>> > Right, I feel that is a different set of patches. Even this series is
>> > hard enough for review. Those order promotion and demotion is heading
>> > towards a buddy system design. I want to point out that even the buddy
>> > system is not able to handle the case that swapfile is almost full and
>> > the recently freed swap entries are not contiguous.
>> >
>> > We can invest in the buddy system, which doesn't handle all the
>> > fragmentation issues. Or I prefer to go directly to the discontiguous
>> > swap entry. We pay a price for the indirect mapping of swap entries.
>> > But it will solve the fragmentation issue 100%.
>>
>> It's good if we can solve the fragmentation issue 100%. Just need to
>> pay attention to the cost.
>
> The cost you mean the development cost or the run time cost (memory and cpu)?
I mean runtime cost.
>>
>> >>
>> >> >
>> >> >>>
>> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >>> among CPUs?
>> >> >>
>> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> given order if there are actually slots available just because they have been
>> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> approach.
>> >> >>
>> >> >>> I suggest to start with simple design, that is, per-CPU
>> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >>
>> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> current half-and-half policy in this patch.
>> >> >
>> >> > Sounds good to me. We can start with exclusive solution and evaluate
>> >> > whether shared solution is good.
>> >>
>> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >
>> > It is not able to avoid fragementation 100% of the time. I prefer the
>> > discontinued swap entry as the next step, which guarantees forward
>> > progress, we will not be stuck in a situation where we are not able to
>> > allocate swap entries due to fragmentation.
>>
>> If my understanding were correct, the implementation complexity of the
>> order promotion/demotion isn't at the same level of that of discontinued
>> swap entry.
>
> Discontinued swap entry has higher complexity but higher payout as
> well. It can get us to the place where cluster promotion/demotion
> can't.
>
> I also feel that if we implement something towards a buddy system
> allocator for swap, we should do a proper buddy allocator
> implementation of data structures.
I don't think that it's easy to implement a real buddy allocator for swap
entries. So, I avoid using the word "buddy" myself.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 2:04 ` Huang, Ying
@ 2024-07-26 4:50 ` Chris Li
2024-07-26 5:52 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-26 4:50 UTC (permalink / raw)
To: Huang, Ying
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
> > If the freeing of swap entry is random distribution. You need 16
> > continuous swap entries free at the same time at aligned 16 base
> > locations. The total number of order 4 free swap space add up together
> > is much lower than the order 0 allocatable swap space.
> > If having one entry free is 50% probability(swapfile half full), then
> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>
> This depends on workloads. Quite some workloads will show some degree
> of spatial locality. For a workload with no spatial locality at all as
> above, mTHP may be not a good choice at the first place.
The fragmentation comes from the order-0 entries, not from the mTHP. mTHPs
have their own valid use cases, and that should be kept separate from how
the order-0 entries are used. That is why I consider this kind of strategy
to only work in the lucky case. I would much prefer a strategy that is
guaranteed to work and does not depend on luck.
> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
> >> clusters available.
> >
> > Exactly.
> >
> >>
> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> the various situations automatically.
> >
> > There is no easy way to migrate swap entries to different locations.
> > That is why I like to have discontiguous swap entries allocation for
> > mTHP.
>
> We suggest to migrate non-full swap clsuters among different lists, not
> swap entries.
Then you have the downside of reducing the total number of high-order
clusters. Statistically it is much easier to fragment a cluster than to
anti-fragment one. The orders of clusters have a natural tendency to move
down rather than up, given a long enough period of random access. We will
likely run out of high-order clusters in the long run if we don't have any
separation of orders.
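As a toy illustration of that tendency (my own sketch, not from the
series, and assuming purely random order-0 churn on a 512-slot cluster),
the following standalone program counts how many aligned 16-slot windows
stay completely free; once the cluster is exposed to order-0 traffic they
vanish almost immediately and essentially never come back:

/* build with: gcc -O2 -o cluster_churn cluster_churn.c */
#include <stdio.h>
#include <stdlib.h>

#define SLOTS 512   /* one cluster */
#define WIN   16    /* slots per order-4 allocation */

static int used[SLOTS];

/* count aligned order-4 windows that are completely free */
static int free_windows(void)
{
        int n = 0;

        for (int base = 0; base < SLOTS; base += WIN) {
                int all_free = 1;

                for (int i = 0; i < WIN; i++)
                        if (used[base + i])
                                all_free = 0;
                n += all_free;
        }
        return n;
}

int main(void)
{
        srand(1);

        /* cluster starts empty: all 32 order-4 windows are free */
        for (int step = 0; step <= 100000; step++) {
                /* random order-0 churn, hovering around 50% occupancy */
                used[rand() % SLOTS] = rand() % 2;

                if (step % 20000 == 0)
                        printf("step %6d: free order-4 windows = %d\n",
                               step, free_windows());
        }
        return 0;
}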
> >> But yes, data is needed for any performance related change.
>
> BTW: I think non-full cluster isn't a good name. Partial cluster is
> much better and follows the same convention as partial slab.
I am not opposed to it. The only reason I am holding off on the rename is
that there are patches from Kairui I am testing that depend on it.
Let's finish up the V5 patch with the swap cache reclaim code path, then
do the renaming as one batch job. We actually have more than one list that
holds partially full clusters. That helps reduce the repeated scanning of
clusters that are not full but are also not able to satisfy an allocation
for this order. Naming just one of them "partial" would not be precise
either, because the other lists also hold partially full clusters. We'd
better give them precise names systematically.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 2:09 ` Huang, Ying
@ 2024-07-26 5:09 ` Chris Li
2024-07-26 6:02 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-26 5:09 UTC (permalink / raw)
To: Huang, Ying
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > The current proposed order also improves things step by step. The only
> > disagreement here is which patch order we introduce yet another list
> > in addition to the nonfull one. I just feel that it does not make
> > sense to invest into new code if that new code is going to be
> > completely rewrite anyway in the next two patches.
> >
> > Unless you mean is we should not do the patch 3 big rewrite and should
> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> > the allocation job and let scan_swap_map_slots() do the complex retry
> > on top of try_ssd(). I feel the overall code is more complex and less
> > maintainable.
>
> I haven't look at [3/3], will wait for your next version for that. So,
> I cannot say which order is better. Please consider reviewers' effort
> too. Small step patch is easier to be understood and reviewed.
That is exactly the reason I don't want to introduce too much new code
that depends on the scan_swap_map_slots() behavior, which will be
abandoned in the big rewrite. Their constraints are very different. I
want to make the big-rewrite patch 3 as small as possible, using
incremental follow-up patches to improve it.
>
> >> > That is why I want to make this change patch after patch 3. There is
> >> > also the long test cycle after the modification to make sure the swap
> >> > code path is stable. I am not resisting a change of patch orders, it
> >> > is that patch can't directly be removed before patch 3 before the big
> >> > rewrite.
> >> >
> >> >
> >> >>
> >> >> > Your original
> >> >> > suggestion appears like that you want to keep all cluster with order-N
> >> >> > on the nonfull list for order-N always unless the number of free swap
> >> >> > entry is less than 1<<N.
> >> >>
> >> >> Well I think that's certainly one of the conditions for removing it. But agree
> >> >> that if a full scan of the cluster has been performed and no swap entries have
> >> >> been freed since the scan started then it should also be removed from the list.
> >> >
> >> > Yes, in the later patch of patch, beyond patch 3, we have the almost
> >> > full cluster that for the cluster has been scan and not able to
> >> > allocate order N entry.
> >> >
> >> >>
> >> >> >
> >> >> >>> And, I understand that in some situations it may
> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
> >> >> >>>
> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >> >> >>> have free swap entries with that order even after we are sure that we
> >> >> >>> haven't.
> >> >> >>
> >> >> >> Is this patch pretending that today? I don't think so?
> >> >> >
> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> >> > sure that there are no order-N free swap entry in the cluster.
> >> >>
> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
> >> >> are for order-0 and you have no means of allocating high orders until a whole
> >> >> cluster becomes free. That logic certainly makes sense to me, so think its
> >> >> better for swap_cluster_info->order to remain static while the cluster is
> >> >> allocated. (I only skimmed that conversation so appologies if I got the
> >> >> conclusion wrong!).
> >> >
> >> > Yes, that is the original intent, keep the cluster order as much as possible.
> >> >
> >> >>
> >> >> >
> >> >> >> But I agree that a
> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
> >> >> >> that doesn't tell us for sure because they may not be contiguous.
> >> >> >
> >> >> > We can check that when free swap entry via checking adjacent swap
> >> >> > entries. IMHO, the performance should be acceptable.
> >> >>
> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> >> >> a separate change on top of what Chris is doing here. For high orders there
> >> >> could be quite a bit of scanning required in the worst case for every page that
> >> >> gets freed.
> >> >
> >> > Right, I feel that is a different set of patches. Even this series is
> >> > hard enough for review. Those order promotion and demotion is heading
> >> > towards a buddy system design. I want to point out that even the buddy
> >> > system is not able to handle the case that swapfile is almost full and
> >> > the recently freed swap entries are not contiguous.
> >> >
> >> > We can invest in the buddy system, which doesn't handle all the
> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
> >> > swap entry. We pay a price for the indirect mapping of swap entries.
> >> > But it will solve the fragmentation issue 100%.
> >>
> >> It's good if we can solve the fragmentation issue 100%. Just need to
> >> pay attention to the cost.
> >
> > The cost you mean the development cost or the run time cost (memory and cpu)?
>
> I mean runtime cost.
Thanks for the clarification. Agreed, we need to pay attention to the
run time cost. That is a given.
> >> >> >>> My question is whether it's so important to share the per-cpu cluster
> >> >> >>> among CPUs?
> >> >> >>
> >> >> >> My rationale for sharing is that the preference previously has been to favour
> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> >> >> given order if there are actually slots available just because they have been
> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> >> >> approach.
> >> >> >>
> >> >> >>> I suggest to start with simple design, that is, per-CPU
> >> >> >>> cluster will not be shared among CPUs in most cases.
> >> >> >>
> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> >> >> current half-and-half policy in this patch.
> >> >> >
> >> >> > Sounds good to me. We can start with exclusive solution and evaluate
> >> >> > whether shared solution is good.
> >> >>
> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >> >
> >> > It is not able to avoid fragementation 100% of the time. I prefer the
> >> > discontinued swap entry as the next step, which guarantees forward
> >> > progress, we will not be stuck in a situation where we are not able to
> >> > allocate swap entries due to fragmentation.
> >>
> >> If my understanding were correct, the implementation complexity of the
> >> order promotion/demotion isn't at the same level of that of discontinued
> >> swap entry.
> >
> > Discontinued swap entry has higher complexity but higher payout as
> > well. It can get us to the place where cluster promotion/demotion
> > can't.
> >
> > I also feel that if we implement something towards a buddy system
> > allocator for swap, we should do a proper buddy allocator
> > implementation of data structures.
>
> I don't think that it's easy to implement a real buddy allocator for
> swap entries. So, I avoid to use buddy in my words.
Then such a mix of cluster order promotion/demotion loses some of the
benefit of the buddy system, because it lacks the proper data structure
to support buddy allocation. The buddy allocator provides more general
migration between orders; cluster promotion/demotion only supports a
limited usage case (by luck). We need to evaluate whether it is worth
the additional complexity.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 1/3] mm: swap: swap cluster switch to double link list
2024-07-18 6:26 ` Huang, Ying
@ 2024-07-26 5:46 ` Chris Li
0 siblings, 0 replies; 43+ messages in thread
From: Chris Li @ 2024-07-26 5:46 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Wed, Jul 17, 2024 at 11:29 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > Previously, the swap cluster used a cluster index as a pointer
> > to construct a custom single link list type "swap_cluster_list".
> > The next cluster pointer is shared with the cluster->count.
> > It prevents putting the non-free cluster into a list.
> >
> > Change the cluster to use the standard double link list instead.
> > This allows tracing the nonfull cluster in the follow up patch.
> > That way, it is faster to get to the nonfull cluster of that order.
> >
> > Remove the cluster getter/setter for accessing the cluster
> > struct member.
> >
> > The list operation is protected by the swap_info_struct->lock.
> >
> > Change cluster code to use "struct swap_cluster_info *" to
> > reference the cluster rather than by using index. That is more
> > consistent with the list manipulation. It avoids repeatedly
> > adding the index to the cluster_info. The code is easier to understand.
> >
> > Remove the cluster next pointer is NULL flag, the double link
> > list can handle the empty list pretty well.
> >
> > The "swap_cluster_info" struct is two pointer bigger, because
> > 512 swap entries share one swap struct, it has very little impact
> ~~~~
> swap_cluster_info ?
Did not see this email earlier. Done.
>
> > on the average memory usage per swap entry. For 1TB swapfile, the
> > swap cluster data structure increases from 8MB to 24MB.
> >
> > Other than the list conversion, there is no real function change
> > in this patch.
> >
> > Signed-off-by: Chris Li <chrisl@kernel.org>
> > ---
> > include/linux/swap.h | 26 +++---
> > mm/swapfile.c | 225 ++++++++++++++-------------------------------------
> > 2 files changed, 70 insertions(+), 181 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index e473fe6cfb7a..e9be95468fc7 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -243,22 +243,21 @@ enum {
> > * free clusters are organized into a list. We fetch an entry from the list to
> > * get a free cluster.
> > *
> > - * The data field stores next cluster if the cluster is free or cluster usage
> > - * counter otherwise. The flags field determines if a cluster is free. This is
> > - * protected by swap_info_struct.lock.
> > + * The flags field determines if a cluster is free. This is
> > + * protected by cluster lock.
> > */
> > struct swap_cluster_info {
> > spinlock_t lock; /*
> > * Protect swap_cluster_info fields
> > - * and swap_info_struct->swap_map
> > - * elements correspond to the swap
> > - * cluster
> > + * other than list, and swap_info_struct->swap_map
> > + * elements correspond to the swap cluster.
> > */
> > - unsigned int data:24;
> > - unsigned int flags:8;
> > + u16 count;
> > + u8 flags;
> > + struct list_head list;
> > };
> > #define CLUSTER_FLAG_FREE 1 /* This cluster is free */
> > -#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
> > +
> >
> > /*
> > * The first page in the swap file is the swap header, which is always marked
> > @@ -283,11 +282,6 @@ struct percpu_cluster {
> > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > };
> >
> > -struct swap_cluster_list {
> > - struct swap_cluster_info head;
> > - struct swap_cluster_info tail;
> > -};
> > -
> > /*
> > * The in-memory structure used to track swap areas.
> > */
> > @@ -301,7 +295,7 @@ struct swap_info_struct {
> > unsigned char *swap_map; /* vmalloc'ed array of usage counts */
> > unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
> > struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
> > - struct swap_cluster_list free_clusters; /* free clusters list */
> > + struct list_head free_clusters; /* free clusters list */
> > unsigned int lowest_bit; /* index of first free in swap_map */
> > unsigned int highest_bit; /* index of last free in swap_map */
> > unsigned int pages; /* total of usable pages of swap */
> > @@ -332,7 +326,7 @@ struct swap_info_struct {
> > * list.
> > */
> > struct work_struct discard_work; /* discard worker */
> > - struct swap_cluster_list discard_clusters; /* discard clusters list */
> > + struct list_head discard_clusters; /* discard clusters list */
> > struct plist_node avail_lists[]; /*
> > * entries in swap_avail_heads, one
> > * entry per node.
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index f7224bc1320c..f70d25005d2c 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
> > #endif
> > #define LATENCY_LIMIT 256
> >
> > -static inline void cluster_set_flag(struct swap_cluster_info *info,
> > - unsigned int flag)
> > -{
> > - info->flags = flag;
> > -}
> > -
> > -static inline unsigned int cluster_count(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_count(struct swap_cluster_info *info,
> > - unsigned int c)
> > -{
> > - info->data = c;
> > -}
> > -
> > -static inline void cluster_set_count_flag(struct swap_cluster_info *info,
> > - unsigned int c, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = c;
> > -}
> > -
> > -static inline unsigned int cluster_next(struct swap_cluster_info *info)
> > -{
> > - return info->data;
> > -}
> > -
> > -static inline void cluster_set_next(struct swap_cluster_info *info,
> > - unsigned int n)
> > -{
> > - info->data = n;
> > -}
> > -
> > -static inline void cluster_set_next_flag(struct swap_cluster_info *info,
> > - unsigned int n, unsigned int f)
> > -{
> > - info->flags = f;
> > - info->data = n;
> > -}
> > -
> > static inline bool cluster_is_free(struct swap_cluster_info *info)
> > {
> > return info->flags & CLUSTER_FLAG_FREE;
> > }
> >
> > -static inline bool cluster_is_null(struct swap_cluster_info *info)
> > -{
> > - return info->flags & CLUSTER_FLAG_NEXT_NULL;
> > -}
> > -
> > -static inline void cluster_set_null(struct swap_cluster_info *info)
> > +static inline unsigned int cluster_index(struct swap_info_struct *si,
> > + struct swap_cluster_info *ci)
> > {
> > - info->flags = CLUSTER_FLAG_NEXT_NULL;
> > - info->data = 0;
> > + return ci - si->cluster_info;
> > }
> >
> > static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
> > @@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > spin_unlock(&si->lock);
> > }
> >
> > -static inline bool cluster_list_empty(struct swap_cluster_list *list)
> > -{
> > - return cluster_is_null(&list->head);
> > -}
> > -
> > -static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
> > -{
> > - return cluster_next(&list->head);
> > -}
> > -
> > -static void cluster_list_init(struct swap_cluster_list *list)
> > -{
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > -}
> > -
> > -static void cluster_list_add_tail(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci,
> > - unsigned int idx)
> > -{
> > - if (cluster_list_empty(list)) {
> > - cluster_set_next_flag(&list->head, idx, 0);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - } else {
> > - struct swap_cluster_info *ci_tail;
> > - unsigned int tail = cluster_next(&list->tail);
> > -
> > - /*
> > - * Nested cluster lock, but both cluster locks are
> > - * only acquired when we held swap_info_struct->lock
> > - */
> > - ci_tail = ci + tail;
> > - spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
> > - cluster_set_next(ci_tail, idx);
> > - spin_unlock(&ci_tail->lock);
> > - cluster_set_next_flag(&list->tail, idx, 0);
> > - }
> > -}
> > -
> > -static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
> > - struct swap_cluster_info *ci)
> > -{
> > - unsigned int idx;
> > -
> > - idx = cluster_next(&list->head);
> > - if (cluster_next(&list->tail) == idx) {
> > - cluster_set_null(&list->head);
> > - cluster_set_null(&list->tail);
> > - } else
> > - cluster_set_next_flag(&list->head,
> > - cluster_next(&ci[idx]), 0);
> > -
> > - return idx;
> > -}
> > -
> > /* Add a cluster to discard list and schedule it to do discard */
> > static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > - unsigned int idx)
> > + struct swap_cluster_info *ci)
> > {
> > + unsigned int idx = cluster_index(si, ci);
> > /*
> > * If scan_swap_map_slots() can't find a free cluster, it will check
> > * si->swap_map directly. To make sure the discarding cluster isn't
> > @@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > SWAP_MAP_BAD, SWAPFILE_CLUSTER);
> >
> > - cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
> > -
> > + list_add_tail(&ci->list, &si->discard_clusters);
> > schedule_work(&si->discard_work);
> > }
> >
> > -static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > -
> > - cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
> > - cluster_list_add_tail(&si->free_clusters, ci, idx);
> > + ci->flags = CLUSTER_FLAG_FREE;
> > + list_add_tail(&ci->list, &si->free_clusters);
> > }
> >
> > /*
> > @@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > {
> > - struct swap_cluster_info *info, *ci;
> > + struct swap_cluster_info *ci;
> > unsigned int idx;
> >
> > - info = si->cluster_info;
> > -
> > - while (!cluster_list_empty(&si->discard_clusters)) {
> > - idx = cluster_list_del_first(&si->discard_clusters, info);
> > + while (!list_empty(&si->discard_clusters)) {
> > + ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > + list_del(&ci->list);
> > + idx = cluster_index(si, ci);
> > spin_unlock(&si->lock);
> >
> > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > SWAPFILE_CLUSTER);
> >
> > spin_lock(&si->lock);
> > - ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
> > - __free_cluster(si, idx);
> > +
> > + spin_lock(&ci->lock);
> > + __free_cluster(si, ci);
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > 0, SWAPFILE_CLUSTER);
> > - unlock_cluster(ci);
> > + spin_unlock(&ci->lock);
> > }
> > }
> >
> > @@ -521,20 +418,20 @@ static void swap_users_ref_free(struct percpu_ref *ref)
> > complete(&si->comp);
> > }
> >
> > -static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info;
> > + struct swap_cluster_info *ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> >
> > - VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
> > - cluster_list_del_first(&si->free_clusters, ci);
> > - cluster_set_count_flag(ci + idx, 0, 0);
> > + VM_BUG_ON(cluster_index(si, ci) != idx);
> > + list_del(&ci->list);
> > + ci->count = 0;
> > + ci->flags = 0;
> > + return ci;
> > }
> >
> > -static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > +static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > {
> > - struct swap_cluster_info *ci = si->cluster_info + idx;
> > -
> > - VM_BUG_ON(cluster_count(ci) != 0);
> > + VM_BUG_ON(ci->count != 0);
> > /*
> > * If the swap is discardable, prepare discard the cluster
> > * instead of free it immediately. The cluster will be freed
> > @@ -542,11 +439,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
> > */
> > if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
> > (SWP_WRITEOK | SWP_PAGE_DISCARD)) {
> > - swap_cluster_schedule_discard(si, idx);
> > + swap_cluster_schedule_discard(si, ci);
> > return;
> > }
> >
> > - __free_cluster(si, idx);
> > + __free_cluster(si, ci);
> > }
> >
> > /*
> > @@ -559,15 +456,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
> > unsigned long count)
> > {
> > unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > + struct swap_cluster_info *ci = cluster_info + idx;
> >
> > if (!cluster_info)
> > return;
> > - if (cluster_is_free(&cluster_info[idx]))
> > + if (cluster_is_free(ci))
> > alloc_cluster(p, idx);
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) + count);
> > + VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
> > + ci->count += count;
> > }
> >
> > /*
> > @@ -581,24 +478,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> > }
> >
> > /*
> > - * The cluster corresponding to page_nr decreases one usage. If the usage
> > - * counter becomes 0, which means no page in the cluster is in using, we can
> > - * optionally discard the cluster and add it to free cluster list.
> > + * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * which means no page in the cluster is in using, we can optionally discard
> > + * the cluster and add it to free cluster list.
> > */
> > -static void dec_cluster_info_page(struct swap_info_struct *p,
> > - struct swap_cluster_info *cluster_info, unsigned long page_nr)
> > +static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> > {
> > - unsigned long idx = page_nr / SWAPFILE_CLUSTER;
> > -
> > - if (!cluster_info)
> > + if (!p->cluster_info)
> > return;
> >
> > - VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
> > - cluster_set_count(&cluster_info[idx],
> > - cluster_count(&cluster_info[idx]) - 1);
> > + VM_BUG_ON(ci->count == 0);
> > + ci->count--;
> >
> > - if (cluster_count(&cluster_info[idx]) == 0)
> > - free_cluster(p, idx);
> > + if (!ci->count)
> > + free_cluster(p, ci);
> > }
> >
> > /*
> > @@ -611,10 +504,10 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
> > {
> > struct percpu_cluster *percpu_cluster;
> > bool conflict;
> > -
> > + struct swap_cluster_info *first = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > offset /= SWAPFILE_CLUSTER;
Ah, here it changes the meaning of the offset variable. The offset now
holds a cluster index rather than a page offset.
That is why I missed it.
> > - conflict = !cluster_list_empty(&si->free_clusters) &&
> > - offset != cluster_list_first(&si->free_clusters) &&
> > + conflict = !list_empty(&si->free_clusters) &&
> > + offset != first - si->cluster_info &&
>
> offset != cluster_index(si, first) ?
Done.
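For reference, a minimal sketch of the comparison in its final form, using
only the helpers visible in the hunk above; the wrapper function name below
is made up for illustration and is not the real kernel function:

static inline bool offset_in_first_free_cluster(struct swap_info_struct *si,
						unsigned long offset)
{
	struct swap_cluster_info *first;

	if (list_empty(&si->free_clusters))
		return false;

	first = list_first_entry(&si->free_clusters,
				 struct swap_cluster_info, list);
	/* convert the page offset to a cluster index before comparing */
	return offset / SWAPFILE_CLUSTER == cluster_index(si, first);
}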
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-18 5:50 ` Huang, Ying
@ 2024-07-26 5:51 ` Chris Li
0 siblings, 0 replies; 43+ messages in thread
From: Chris Li @ 2024-07-26 5:51 UTC (permalink / raw)
To: Huang, Ying
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Wed, Jul 17, 2024 at 10:54 PM Huang, Ying <ying.huang@intel.com> wrote:
> > - HDD swap allocation does not need to consider clusters any more.
>
> It appears that my comments in the following emails are ignored?
>
Sorry, I missed some emails; catching up now.
> https://lore.kernel.org/linux-mm/87bk3pzr5p.fsf@yhuang6-desk2.ccr.corp.intel.com/
Will reply to that. BTW, V4 already reverted to the previous SSD
behavior and allocates a new cluster before trying a nonfull cluster.
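A rough sketch of that preference order, just to make the statement
concrete; the flow is simplified and the per-order nonfull list name below
is an assumption for illustration, not necessarily the field name used in
patch 2/3:

static struct swap_cluster_info *pick_cluster_sketch(struct swap_info_struct *si,
						     int order)
{
	struct swap_cluster_info *ci;

	/* V4 behavior as described: try a brand-new free cluster first... */
	ci = list_first_entry_or_null(&si->free_clusters,
				      struct swap_cluster_info, list);
	if (ci)
		return ci;

	/*
	 * ...then fall back to a nonfull cluster of the same order.
	 * "nonfull_clusters" is an assumed field name here.
	 */
	return list_first_entry_or_null(&si->nonfull_clusters[order],
					struct swap_cluster_info, list);
}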
> https://lore.kernel.org/linux-mm/874j9hzqr3.fsf@yhuang6-desk2.ccr.corp.intel.com/
Already replied about the renaming in another email.
Chris
>
> > changes in v3:
> > - Using V1 as base.
> > - Rename "next" to "list" for the list field, suggested by Ying.
> > - Update comment for the locking rules for cluster fields and list,
> > suggested by Ying.
> > - Allocate from the nonfull list before attempting free list, suggested
> > by Kairui.
> > - Link to v2: https://lore.kernel.org/r/20240614-swap-allocator-v2-0-2a513b4a7f2f@kernel.org
> >
> > Changes in v2:
> > - Abandoned.
> > - Link to v1: https://lore.kernel.org/r/20240524-swap-allocator-v1-0-47861b423b26@kernel.org
> >
> > ---
> > Chris Li (3):
> > mm: swap: swap cluster switch to double link list
> > mm: swap: mTHP allocate swap entries from nonfull list
> > RFC: mm: swap: seperate SSD allocation from scan_swap_map_slots()
> >
> > include/linux/swap.h | 30 ++--
> > mm/swapfile.c | 490 +++++++++++++++++++++++----------------------------
> > 2 files changed, 238 insertions(+), 282 deletions(-)
> > ---
> > base-commit: ff3a648ecb9409aff1448cf4f6aa41d78c69a3bc
> > change-id: 20240523-swap-allocator-1534c480ece4
> >
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 4:50 ` Chris Li
@ 2024-07-26 5:52 ` Huang, Ying
2024-07-26 7:10 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-26 5:52 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>> > If the freeing of swap entry is random distribution. You need 16
>> > continuous swap entries free at the same time at aligned 16 base
>> > locations. The total number of order 4 free swap space add up together
>> > is much lower than the order 0 allocatable swap space.
>> > If having one entry free is 50% probability(swapfile half full), then
>> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
>> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>>
>> This depends on workloads. Quite some workloads will show some degree
>> of spatial locality. For a workload with no spatial locality at all as
>> above, mTHP may be not a good choice at the first place.
>
> The fragmentation comes from the order 0 entry not from the mTHP. mTHP
> have their own valid usage case, and should be separate from how you
> use the order 0 entry. That is why I consider this kind of strategy
> only works on the lucky case. I would much prefer the strategy that
> can guarantee work not depend on luck.
It seems that you have some perfect solution. Will learn it when you
post it.
>> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>> >> clusters available.
>> >
>> > Exactly.
>> >
>> >>
>> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> the various situations automatically.
>> >
>> > There is no easy way to migrate swap entries to different locations.
>> > That is why I like to have discontiguous swap entries allocation for
>> > mTHP.
>>
>> We suggest to migrate non-full swap clsuters among different lists, not
>> swap entries.
>
> Then you have the down side of reducing the number of total high order
> clusters. By chance it is much easier to fragment the cluster than
> anti-fragment a cluster. The orders of clusters have a natural
> tendency to move down rather than move up, given long enough time of
> random access. It will likely run out of high order clusters in the
> long run if we don't have any separation of orders.
As in my example above, you may have almost 0 high-order clusters
forever. So, your solution only works for very specific use cases. It
is not a general solution.
>> >> But yes, data is needed for any performance related change.
>>
>> BTW: I think non-full cluster isn't a good name. Partial cluster is
>> much better and follows the same convention as partial slab.
>
> I am not opposed to it. The only reason I hold off on the rename is
> because there are patches from Kairui I am testing depending on it.
> Let's finish up the V5 patch with the swap cache reclaim code path
> then do the renaming as one batch job. We actually have more than one
> list that has the clusters partially full. It helps reduce the repeat
> scan of the cluster that is not full but also not able to allocate
> swap entries for this order. Just the name of one of them as
> "partial" is not precise either. Because the other lists are also
> partially full. We'd better give them precise meaning systematically.
I don't think that it's hard to do a search/replace before the next
version.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 5:09 ` Chris Li
@ 2024-07-26 6:02 ` Huang, Ying
2024-07-26 7:15 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-26 6:02 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >
>> > The current proposed order also improves things step by step. The only
>> > disagreement here is which patch order we introduce yet another list
>> > in addition to the nonfull one. I just feel that it does not make
>> > sense to invest into new code if that new code is going to be
>> > completely rewrite anyway in the next two patches.
>> >
>> > Unless you mean is we should not do the patch 3 big rewrite and should
>> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
>> > the allocation job and let scan_swap_map_slots() do the complex retry
>> > on top of try_ssd(). I feel the overall code is more complex and less
>> > maintainable.
>>
>> I haven't look at [3/3], will wait for your next version for that. So,
>> I cannot say which order is better. Please consider reviewers' effort
>> too. Small step patch is easier to be understood and reviewed.
>
> That is exactly the reason I don't want to introduce too much new code
> depending on the scan_swap_map_slots() behavior, which will be
> abandoned in the big rewrite. Their constraints are very different. I
> want to make the big rewrite patch 3 as small as possible. Using
> incremental follow up patches to improve it.
>
>>
>> >> > That is why I want to make this change patch after patch 3. There is
>> >> > also the long test cycle after the modification to make sure the swap
>> >> > code path is stable. I am not resisting a change of patch orders, it
>> >> > is that patch can't directly be removed before patch 3 before the big
>> >> > rewrite.
>> >> >
>> >> >
>> >> >>
>> >> >> > Your original
>> >> >> > suggestion appears like that you want to keep all cluster with order-N
>> >> >> > on the nonfull list for order-N always unless the number of free swap
>> >> >> > entry is less than 1<<N.
>> >> >>
>> >> >> Well I think that's certainly one of the conditions for removing it. But agree
>> >> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> >> been freed since the scan started then it should also be removed from the list.
>> >> >
>> >> > Yes, in the later patch of patch, beyond patch 3, we have the almost
>> >> > full cluster that for the cluster has been scan and not able to
>> >> > allocate order N entry.
>> >> >
>> >> >>
>> >> >> >
>> >> >> >>> And, I understand that in some situations it may
>> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
>> >> >> >>>
>> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >> >>> have free swap entries with that order even after we are sure that we
>> >> >> >>> haven't.
>> >> >> >>
>> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >
>> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> >> > sure that there are no order-N free swap entry in the cluster.
>> >> >>
>> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> >> cluster becomes free. That logic certainly makes sense to me, so think its
>> >> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> >> allocated. (I only skimmed that conversation so appologies if I got the
>> >> >> conclusion wrong!).
>> >> >
>> >> > Yes, that is the original intent, keep the cluster order as much as possible.
>> >> >
>> >> >>
>> >> >> >
>> >> >> >> But I agree that a
>> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >> >
>> >> >> > We can check that when free swap entry via checking adjacent swap
>> >> >> > entries. IMHO, the performance should be acceptable.
>> >> >>
>> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> >> a separate change on top of what Chris is doing here. For high orders there
>> >> >> could be quite a bit of scanning required in the worst case for every page that
>> >> >> gets freed.
>> >> >
>> >> > Right, I feel that is a different set of patches. Even this series is
>> >> > hard enough for review. Those order promotion and demotion is heading
>> >> > towards a buddy system design. I want to point out that even the buddy
>> >> > system is not able to handle the case that swapfile is almost full and
>> >> > the recently freed swap entries are not contiguous.
>> >> >
>> >> > We can invest in the buddy system, which doesn't handle all the
>> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
>> >> > swap entry. We pay a price for the indirect mapping of swap entries.
>> >> > But it will solve the fragmentation issue 100%.
>> >>
>> >> It's good if we can solve the fragmentation issue 100%. Just need to
>> >> pay attention to the cost.
>> >
>> > The cost you mean the development cost or the run time cost (memory and cpu)?
>>
>> I mean runtime cost.
>
> Thanks for the clarification. Agree that we need to pay attention to
> the run time cost. That is given.
>
>> >> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >> >>> among CPUs?
>> >> >> >>
>> >> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> >> given order if there are actually slots available just because they have been
>> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> >> approach.
>> >> >> >>
>> >> >> >>> I suggest to start with simple design, that is, per-CPU
>> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >>
>> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> >> current half-and-half policy in this patch.
>> >> >> >
>> >> >> > Sounds good to me. We can start with exclusive solution and evaluate
>> >> >> > whether shared solution is good.
>> >> >>
>> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >
>> >> > It is not able to avoid fragementation 100% of the time. I prefer the
>> >> > discontinued swap entry as the next step, which guarantees forward
>> >> > progress, we will not be stuck in a situation where we are not able to
>> >> > allocate swap entries due to fragmentation.
>> >>
>> >> If my understanding were correct, the implementation complexity of the
>> >> order promotion/demotion isn't at the same level of that of discontinued
>> >> swap entry.
>> >
>> > Discontinued swap entry has higher complexity but higher payout as
>> > well. It can get us to the place where cluster promotion/demotion
>> > can't.
>> >
>> > I also feel that if we implement something towards a buddy system
>> > allocator for swap, we should do a proper buddy allocator
>> > implementation of data structures.
>>
>> I don't think that it's easy to implement a real buddy allocator for
>> swap entries. So, I avoid to use buddy in my words.
>
> Then such a mix of cluster order promote/demote lose some benefit of
> the buddy system. Because it lacks the proper data structure to
> support buddy allocation. The buddy allocator provides more general
> migration between orders. For the limited usage case of cluster
> promotion/demotion is supported (by luck). We need to evaluate whether
> it is worth the additional complexity.
TBH, I believe that the complexity of order promotion/demotion is quite
low, both for development and runtime. A real buddy allocator may need
to increase the per-swap-entry memory footprint considerably.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 5:52 ` Huang, Ying
@ 2024-07-26 7:10 ` Chris Li
2024-07-26 7:18 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-26 7:10 UTC (permalink / raw)
To: Huang, Ying
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> > If the freeing of swap entry is random distribution. You need 16
> >> > continuous swap entries free at the same time at aligned 16 base
> >> > locations. The total number of order 4 free swap space add up together
> >> > is much lower than the order 0 allocatable swap space.
> >> > If having one entry free is 50% probability(swapfile half full), then
> >> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> >> > If the swapfile is 80% full, that number drops to 6.5 E -12.
> >>
> >> This depends on workloads. Quite some workloads will show some degree
> >> of spatial locality. For a workload with no spatial locality at all as
> >> above, mTHP may be not a good choice at the first place.
> >
> > The fragmentation comes from the order 0 entry not from the mTHP. mTHP
> > have their own valid usage case, and should be separate from how you
> > use the order 0 entry. That is why I consider this kind of strategy
> > only works on the lucky case. I would much prefer the strategy that
> > can guarantee work not depend on luck.
>
> It seems that you have some perfect solution. Will learn it when you
> post it.
No, I don't have a perfect solution. I see putting a limit on order 0
swap usage and writing out discontiguous swap entries from a folio as
more deterministic approaches that do not depend on luck. Both have
their price to pay as well.
>
> >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
> >> >> clusters available.
> >> >
> >> > Exactly.
> >> >
> >> >>
> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> >> the various situations automatically.
> >> >
> >> > There is no easy way to migrate swap entries to different locations.
> >> > That is why I like to have discontiguous swap entries allocation for
> >> > mTHP.
> >>
> >> We suggest to migrate non-full swap clsuters among different lists, not
> >> swap entries.
> >
> > Then you have the down side of reducing the number of total high order
> > clusters. By chance it is much easier to fragment the cluster than
> > anti-fragment a cluster. The orders of clusters have a natural
> > tendency to move down rather than move up, given long enough time of
> > random access. It will likely run out of high order clusters in the
> > long run if we don't have any separation of orders.
>
> As my example above, you may have almost 0 high-order clusters forever.
> So, your solution only works for very specific use cases. It's not a
> general solution.
One simple solution is an optional limit on order 0 swap usage. I
understand you don't like that option, but so far there is no other
easy solution that achieves the same effectiveness. If there is, I
would like to hear it.
>
> >> >> But yes, data is needed for any performance related change.
> >>
> >> BTW: I think non-full cluster isn't a good name. Partial cluster is
> >> much better and follows the same convention as partial slab.
> >
> > I am not opposed to it. The only reason I hold off on the rename is
> > because there are patches from Kairui I am testing depending on it.
> > Let's finish up the V5 patch with the swap cache reclaim code path
> > then do the renaming as one batch job. We actually have more than one
> > list that has the clusters partially full. It helps reduce the repeat
> > scan of the cluster that is not full but also not able to allocate
> > swap entries for this order. Just the name of one of them as
> > "partial" is not precise either. Because the other lists are also
> > partially full. We'd better give them precise meaning systematically.
>
> I don't think that it's hard to do a search/replace before the next
> version.
The overhead is on the other internal experimental patches. Again, I am
not opposed to renaming it. I just want to do it in one batch rather
than many times, including the other list names.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 6:02 ` Huang, Ying
@ 2024-07-26 7:15 ` Chris Li
2024-07-26 7:42 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-26 7:15 UTC (permalink / raw)
To: Huang, Ying
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Jul 25, 2024 at 11:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >
> >> > The current proposed order also improves things step by step. The only
> >> > disagreement here is which patch order we introduce yet another list
> >> > in addition to the nonfull one. I just feel that it does not make
> >> > sense to invest into new code if that new code is going to be
> >> > completely rewrite anyway in the next two patches.
> >> >
> >> > Unless you mean is we should not do the patch 3 big rewrite and should
> >> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
> >> > the allocation job and let scan_swap_map_slots() do the complex retry
> >> > on top of try_ssd(). I feel the overall code is more complex and less
> >> > maintainable.
> >>
> >> I haven't look at [3/3], will wait for your next version for that. So,
> >> I cannot say which order is better. Please consider reviewers' effort
> >> too. Small step patch is easier to be understood and reviewed.
> >
> > That is exactly the reason I don't want to introduce too much new code
> > depending on the scan_swap_map_slots() behavior, which will be
> > abandoned in the big rewrite. Their constraints are very different. I
> > want to make the big rewrite patch 3 as small as possible. Using
> > incremental follow up patches to improve it.
> >
> >>
> >> >> > That is why I want to make this change patch after patch 3. There is
> >> >> > also the long test cycle after the modification to make sure the swap
> >> >> > code path is stable. I am not resisting a change of patch orders, it
> >> >> > is that patch can't directly be removed before patch 3 before the big
> >> >> > rewrite.
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> > Your original
> >> >> >> > suggestion appears like that you want to keep all cluster with order-N
> >> >> >> > on the nonfull list for order-N always unless the number of free swap
> >> >> >> > entry is less than 1<<N.
> >> >> >>
> >> >> >> Well I think that's certainly one of the conditions for removing it. But agree
> >> >> >> that if a full scan of the cluster has been performed and no swap entries have
> >> >> >> been freed since the scan started then it should also be removed from the list.
> >> >> >
> >> >> > Yes, in the later patch of patch, beyond patch 3, we have the almost
> >> >> > full cluster that for the cluster has been scan and not able to
> >> >> > allocate order N entry.
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >>> And, I understand that in some situations it may
> >> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
> >> >> >> >>>
> >> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
> >> >> >> >>> have free swap entries with that order even after we are sure that we
> >> >> >> >>> haven't.
> >> >> >> >>
> >> >> >> >> Is this patch pretending that today? I don't think so?
> >> >> >> >
> >> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
> >> >> >> > sure that there are no order-N free swap entry in the cluster.
> >> >> >>
> >> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
> >> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
> >> >> >> are for order-0 and you have no means of allocating high orders until a whole
> >> >> >> cluster becomes free. That logic certainly makes sense to me, so think its
> >> >> >> better for swap_cluster_info->order to remain static while the cluster is
> >> >> >> allocated. (I only skimmed that conversation so appologies if I got the
> >> >> >> conclusion wrong!).
> >> >> >
> >> >> > Yes, that is the original intent, keep the cluster order as much as possible.
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> >> But I agree that a
> >> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
> >> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
> >> >> >> >> that doesn't tell us for sure because they may not be contiguous.
> >> >> >> >
> >> >> >> > We can check that when free swap entry via checking adjacent swap
> >> >> >> > entries. IMHO, the performance should be acceptable.
> >> >> >>
> >> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
> >> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
> >> >> >> a separate change on top of what Chris is doing here. For high orders there
> >> >> >> could be quite a bit of scanning required in the worst case for every page that
> >> >> >> gets freed.
> >> >> >
> >> >> > Right, I feel that is a different set of patches. Even this series is
> >> >> > hard enough for review. Those order promotion and demotion is heading
> >> >> > towards a buddy system design. I want to point out that even the buddy
> >> >> > system is not able to handle the case that swapfile is almost full and
> >> >> > the recently freed swap entries are not contiguous.
> >> >> >
> >> >> > We can invest in the buddy system, which doesn't handle all the
> >> >> > fragmentation issues. Or I prefer to go directly to the discontiguous
> >> >> > swap entry. We pay a price for the indirect mapping of swap entries.
> >> >> > But it will solve the fragmentation issue 100%.
> >> >>
> >> >> It's good if we can solve the fragmentation issue 100%. Just need to
> >> >> pay attention to the cost.
> >> >
> >> > The cost you mean the development cost or the run time cost (memory and cpu)?
> >>
> >> I mean runtime cost.
> >
> > Thanks for the clarification. Agree that we need to pay attention to
> > the run time cost. That is given.
> >
> >> >> >> >>> My question is whether it's so important to share the per-cpu cluster
> >> >> >> >>> among CPUs?
> >> >> >> >>
> >> >> >> >> My rationale for sharing is that the preference previously has been to favour
> >> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
> >> >> >> >> given order if there are actually slots available just because they have been
> >> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
> >> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
> >> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
> >> >> >> >> approach.
> >> >> >> >>
> >> >> >> >>> I suggest to start with simple design, that is, per-CPU
> >> >> >> >>> cluster will not be shared among CPUs in most cases.
> >> >> >> >>
> >> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
> >> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
> >> >> >> >> current half-and-half policy in this patch.
> >> >> >> >
> >> >> >> > Sounds good to me. We can start with exclusive solution and evaluate
> >> >> >> > whether shared solution is good.
> >> >> >>
> >> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
> >> >> >
> >> >> > It is not able to avoid fragementation 100% of the time. I prefer the
> >> >> > discontinued swap entry as the next step, which guarantees forward
> >> >> > progress, we will not be stuck in a situation where we are not able to
> >> >> > allocate swap entries due to fragmentation.
> >> >>
> >> >> If my understanding were correct, the implementation complexity of the
> >> >> order promotion/demotion isn't at the same level of that of discontinued
> >> >> swap entry.
> >> >
> >> > Discontinued swap entry has higher complexity but higher payout as
> >> > well. It can get us to the place where cluster promotion/demotion
> >> > can't.
> >> >
> >> > I also feel that if we implement something towards a buddy system
> >> > allocator for swap, we should do a proper buddy allocator
> >> > implementation of data structures.
> >>
> >> I don't think that it's easy to implement a real buddy allocator for
> >> swap entries. So, I avoid to use buddy in my words.
> >
> > Then such a mix of cluster order promote/demote lose some benefit of
> > the buddy system. Because it lacks the proper data structure to
> > support buddy allocation. The buddy allocator provides more general
> > migration between orders. For the limited usage case of cluster
> > promotion/demotion is supported (by luck). We need to evaluate whether
> > it is worth the additional complexity.
>
> TBH, I believe that the complexity of order promote/demote is quite low,
> both for development and runtime. A real buddy allocator may need to
> increase per-swap-entry memory footprint much.
My main concern is its effectiveness. Anyway, the series is already
complex enough with the big rewrite and the reclaim on swap cache.
Let me know if you think it needs to be done before the big rewrite.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 7:10 ` Chris Li
@ 2024-07-26 7:18 ` Huang, Ying
2024-07-26 7:26 ` Chris Li
0 siblings, 1 reply; 43+ messages in thread
From: Huang, Ying @ 2024-07-26 7:18 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> > If the freeing of swap entry is random distribution. You need 16
>> >> > continuous swap entries free at the same time at aligned 16 base
>> >> > locations. The total number of order 4 free swap space add up together
>> >> > is much lower than the order 0 allocatable swap space.
>> >> > If having one entry free is 50% probability(swapfile half full), then
>> >> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
>> >> > If the swapfile is 80% full, that number drops to 6.5 E -12.
>> >>
>> >> This depends on workloads. Quite some workloads will show some degree
>> >> of spatial locality. For a workload with no spatial locality at all as
>> >> above, mTHP may be not a good choice at the first place.
>> >
>> > The fragmentation comes from the order 0 entry not from the mTHP. mTHP
>> > have their own valid usage case, and should be separate from how you
>> > use the order 0 entry. That is why I consider this kind of strategy
>> > only works on the lucky case. I would much prefer the strategy that
>> > can guarantee work not depend on luck.
>>
>> It seems that you have some perfect solution. Will learn it when you
>> post it.
>
> No, I don't have perfect solutions. I see puting limit on order 0 swap
> usage and writing out discontinuous swap entries from a folio are more
> deterministic and not depend on luck. Both have their price to pay as
> well.
>
>>
>> >> >> - Order-4 pages need to be swapped out, but no enough order-4 non-full
>> >> >> clusters available.
>> >> >
>> >> > Exactly.
>> >> >
>> >> >>
>> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> >> the various situations automatically.
>> >> >
>> >> > There is no easy way to migrate swap entries to different locations.
>> >> > That is why I like to have discontiguous swap entries allocation for
>> >> > mTHP.
>> >>
>> >> We suggest to migrate non-full swap clsuters among different lists, not
>> >> swap entries.
>> >
>> > Then you have the down side of reducing the number of total high order
>> > clusters. By chance it is much easier to fragment the cluster than
>> > anti-fragment a cluster. The orders of clusters have a natural
>> > tendency to move down rather than move up, given long enough time of
>> > random access. It will likely run out of high order clusters in the
>> > long run if we don't have any separation of orders.
>>
>> As my example above, you may have almost 0 high-order clusters forever.
>> So, your solution only works for very specific use cases. It's not a
>> general solution.
>
> One simple solution is having an optional limitation of 0 order swap.
> I understand you don't like that option, but there is no other easy
> solution to achieve the same effectiveness, so far. If there is, I
> like to hear it.
Just as you said, it's optional, so it's not a general solution. It may
trigger OOM if applied as a general solution.
>>
>> >> >> But yes, data is needed for any performance related change.
>> >>
>> >> BTW: I think non-full cluster isn't a good name. Partial cluster is
>> >> much better and follows the same convention as partial slab.
>> >
>> > I am not opposed to it. The only reason I hold off on the rename is
>> > because there are patches from Kairui I am testing depending on it.
>> > Let's finish up the V5 patch with the swap cache reclaim code path
>> > then do the renaming as one batch job. We actually have more than one
>> > list that has the clusters partially full. It helps reduce the repeat
>> > scan of the cluster that is not full but also not able to allocate
>> > swap entries for this order. Just the name of one of them as
>> > "partial" is not precise either. Because the other lists are also
>> > partially full. We'd better give them precise meaning systematically.
>>
>> I don't think that it's hard to do a search/replace before the next
>> version.
>
> The overhead is on the other internal experimental patches. Again,
> I am not opposed to renaming it. Just want to do it at one batch not
> many times, including other list names.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 7:18 ` Huang, Ying
@ 2024-07-26 7:26 ` Chris Li
2024-07-26 7:37 ` Huang, Ying
0 siblings, 1 reply; 43+ messages in thread
From: Chris Li @ 2024-07-26 7:26 UTC (permalink / raw)
To: Huang, Ying
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Fri, Jul 26, 2024 at 12:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Chris Li <chrisl@kernel.org> writes:
>
> > On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Chris Li <chrisl@kernel.org> writes:
> >>
> >> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >> > If the freeing of swap entry is random distribution. You need 16
> >> >> > continuous swap entries free at the same time at aligned 16 base
> >> >> > locations. The total number of order 4 free swap space add up together
> >> >> > is much lower than the order 0 allocatable swap space.
> >> >> > If having one entry free is 50% probability(swapfile half full), then
> >> >> > having 16 swap entries is continually free is (0.5) EXP 16 = 1.5 E-5.
> >> >> > If the swapfile is 80% full, that number drops to 6.5 E -12.
> >> >>
> >> >> This depends on workloads. Quite some workloads will show some degree
> >> >> of spatial locality. For a workload with no spatial locality at all as
> >> >> above, mTHP may be not a good choice at the first place.
> >> >
> >> > The fragmentation comes from the order 0 entry not from the mTHP. mTHP
> >> > have their own valid usage case, and should be separate from how you
> >> > use the order 0 entry. That is why I consider this kind of strategy
> >> > only works on the lucky case. I would much prefer the strategy that
> >> > can guarantee work not depend on luck.
> >>
> >> It seems that you have some perfect solution. Will learn it when you
> >> post it.
> >
> > No, I don't have perfect solutions. I see puting limit on order 0 swap
> > usage and writing out discontinuous swap entries from a folio are more
> > deterministic and not depend on luck. Both have their price to pay as
> > well.
> >
> >>
> >> >> >> - Order-4 pages need to be swapped out, but there are not enough order-4
> >> >> >> non-full clusters available.
> >> >> >
> >> >> > Exactly.
> >> >> >
> >> >> >>
> >> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
> >> >> >> the various situations automatically.
> >> >> >
> >> >> > There is no easy way to migrate swap entries to different locations.
> >> >> > That is why I would like to have discontiguous swap entry allocation for
> >> >> > mTHP.
> >> >>
> >> >> We suggest migrating non-full swap clusters among different lists, not
> >> >> swap entries.
> >> >
> >> > Then you have the downside of reducing the total number of high-order
> >> > clusters. By chance, it is much easier to fragment a cluster than to
> >> > anti-fragment one. The order of a cluster has a natural tendency to
> >> > move down rather than up, given a long enough time of random access.
> >> > We will likely run out of high-order clusters in the long run if we
> >> > don't have any separation of orders.
> >>
> >> As in my example above, you may have almost 0 high-order clusters forever.
> >> So, your solution only works for very specific use cases. It's not a
> >> general solution.
> >
> >> > One simple solution is having an optional limit on order-0 swap usage.
> >> > I understand you don't like that option, but there is no other easy
> >> > solution to achieve the same effectiveness, so far. If there is, I
> >> > would like to hear it.
>
> Just as you said, it's optional, so it's not a general solution. Applied
> as a general solution, it may trigger OOM.
Agreed, it is not a general solution, but the option is simple and useful.
The more general solution is to just write out discontiguous swap entries.
Chris
^ permalink raw reply [flat|nested] 43+ messages in thread
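The back-of-the-envelope numbers quoted in the message above (1.5E-5 and
6.5E-12) can be reproduced with a short standalone sketch, assuming each swap
entry is free independently with the same probability; this is only an
illustration of the argument, not code from the series:

#include <math.h>
#include <stdio.h>

int main(void)
{
	/* Fraction of the swapfile in use: 50% and 80% full. */
	double fill[] = { 0.5, 0.8 };

	for (int i = 0; i < 2; i++) {
		double p_free = 1.0 - fill[i];
		/* Probability that one aligned run of 16 entries is all free. */
		printf("%2.0f%% full: P(16 contiguous free) = %.1e\n",
		       fill[i] * 100.0, pow(p_free, 16));
	}
	return 0;
}

This prints roughly 1.5e-05 and 6.6e-12, matching the figures above. Workloads
whose frees show spatial locality will do better than this independence
assumption suggests, which is exactly the point of contention in the thread.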
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 7:26 ` Chris Li
@ 2024-07-26 7:37 ` Huang, Ying
0 siblings, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2024-07-26 7:37 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Fri, Jul 26, 2024 at 12:21 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Thu, Jul 25, 2024 at 10:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Chris Li <chrisl@kernel.org> writes:
>> >>
>> >> > On Thu, Jul 25, 2024 at 7:07 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >> > If the freeing of swap entries follows a random distribution, you need
>> >> >> > 16 contiguous swap entries to be free at the same time at a 16-aligned
>> >> >> > base location. The total amount of order-4 allocatable free swap space
>> >> >> > is much lower than the order-0 allocatable swap space.
>> >> >> > If a single entry being free has 50% probability (swapfile half full),
>> >> >> > then the probability of 16 contiguous entries all being free is
>> >> >> > (0.5)^16 = 1.5E-5. If the swapfile is 80% full, that number drops to 6.5E-12.
>> >> >>
>> >> >> This depends on workloads. Quite some workloads will show some degree
>> >> >> of spatial locality. For a workload with no spatial locality at all as
>> >> >> above, mTHP may not be a good choice in the first place.
>> >> >
>> >> > The fragmentation comes from the order-0 entries, not from mTHP. mTHP
>> >> > has its own valid use case and should be kept separate from how you
>> >> > use order-0 entries. That is why I consider this kind of strategy to
>> >> > work only in the lucky case. I would much prefer a strategy that is
>> >> > guaranteed to work and does not depend on luck.
>> >>
>> >> It seems that you have some perfect solution. I will learn about it when
>> >> you post it.
>> >
>> > No, I don't have a perfect solution. I see putting a limit on order-0 swap
>> > usage and writing out discontiguous swap entries from a folio as more
>> > deterministic and not dependent on luck. Both have their price to pay as
>> > well.
>> >
>> >>
>> >> >> >> - Order-4 pages need to be swapped out, but there are not enough order-4
>> >> >> >> non-full clusters available.
>> >> >> >
>> >> >> > Exactly.
>> >> >> >
>> >> >> >>
>> >> >> >> So, we need a way to migrate non-full clusters among orders to adjust to
>> >> >> >> the various situations automatically.
>> >> >> >
>> >> >> > There is no easy way to migrate swap entries to different locations.
>> >> >> > That is why I would like to have discontiguous swap entry allocation for
>> >> >> > mTHP.
>> >> >>
>> >> >> We suggest migrating non-full swap clusters among different lists, not
>> >> >> swap entries.
>> >> >
>> >> > Then you have the downside of reducing the total number of high-order
>> >> > clusters. By chance, it is much easier to fragment a cluster than to
>> >> > anti-fragment one. The order of a cluster has a natural tendency to
>> >> > move down rather than up, given a long enough time of random access.
>> >> > We will likely run out of high-order clusters in the long run if we
>> >> > don't have any separation of orders.
>> >>
>> >> As in my example above, you may have almost 0 high-order clusters forever.
>> >> So, your solution only works for very specific use cases. It's not a
>> >> general solution.
>> >
>> > One simple solution is having an optional limit on order-0 swap usage.
>> > I understand you don't like that option, but there is no other easy
>> > solution to achieve the same effectiveness, so far. If there is, I
>> > would like to hear it.
>>
>> Just as you said, it's optional, so it's not a general solution. Applied
>> as a general solution, it may trigger OOM.
>
> Agreed, it is not a general solution, but the option is simple and useful.
> The more general solution is to just write out discontiguous swap entries.
I just don't know how to do that, for example, how to put the folio in
the swap cache. I will wait for you to show the implementation.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-26 7:15 ` Chris Li
@ 2024-07-26 7:42 ` Huang, Ying
0 siblings, 0 replies; 43+ messages in thread
From: Huang, Ying @ 2024-07-26 7:42 UTC (permalink / raw)
To: Chris Li
Cc: Ryan Roberts, Andrew Morton, Kairui Song, Hugh Dickins,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
Chris Li <chrisl@kernel.org> writes:
> On Thu, Jul 25, 2024 at 11:05 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Chris Li <chrisl@kernel.org> writes:
>>
>> > On Thu, Jul 25, 2024 at 7:13 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >> >
>> >> > The currently proposed order also improves things step by step. The only
>> >> > disagreement here is at which point in the patch order we introduce yet
>> >> > another list in addition to the nonfull one. I just feel that it does not
>> >> > make sense to invest in new code if that new code is going to be
>> >> > completely rewritten anyway in the next two patches.
>> >> >
>> >> > Unless you mean we should not do the patch 3 big rewrite and should
>> >> > continue the scan_swap_map_try_ssd_cluster() way of only doing half of
>> >> > the allocation job and letting scan_swap_map_slots() do the complex retry
>> >> > on top of try_ssd(). I feel that overall code is more complex and less
>> >> > maintainable.
>> >>
>> >> I haven't looked at [3/3]; I will wait for your next version for that. So,
>> >> I cannot say which order is better. Please consider reviewers' effort
>> >> too. Small-step patches are easier to understand and review.
>> >
>> > That is exactly the reason I don't want to introduce too much new code
>> > that depends on the scan_swap_map_slots() behavior, which will be
>> > abandoned in the big rewrite. Their constraints are very different. I
>> > want to make the big-rewrite patch 3 as small as possible and use
>> > incremental follow-up patches to improve it.
>> >
>> >>
>> >> >> > That is why I want to make this change in a patch after patch 3. There is
>> >> >> > also the long test cycle after the modification to make sure the swap
>> >> >> > code path is stable. I am not resisting a change of patch order; it
>> >> >> > is just that this patch can't simply be removed before patch 3, the big
>> >> >> > rewrite.
>> >> >> >
>> >> >> >
>> >> >> >>
>> >> >> >> > Your original
>> >> >> >> > suggestion appears to be that you want to keep every order-N cluster
>> >> >> >> > on the order-N nonfull list always, unless the number of free swap
>> >> >> >> > entries is less than 1<<N.
>> >> >> >>
>> >> >> >> Well I think that's certainly one of the conditions for removing it. But I agree
>> >> >> >> that if a full scan of the cluster has been performed and no swap entries have
>> >> >> >> been freed since the scan started, then it should also be removed from the list.
>> >> >> >
>> >> >> > Yes, in later patches, beyond patch 3, we have the almost-full
>> >> >> > cluster list for clusters that have been scanned and were not able to
>> >> >> > allocate an order-N entry.
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >>> And, I understand that in some situations it may
>> >> >> >> >>> be better to share clusters among CPUs. So my suggestion is,
>> >> >> >> >>>
>> >> >> >> >>> - Make swap_cluster_info->order more accurate, don't pretend that we
>> >> >> >> >>> have free swap entries with that order even after we are sure that we
>> >> >> >> >>> haven't.
>> >> >> >> >>
>> >> >> >> >> Is this patch pretending that today? I don't think so?
>> >> >> >> >
>> >> >> >> > IIUC, in this patch swap_cluster_info->order is still "N" even if we are
>> >> >> >> > sure that there are no order-N free swap entries in the cluster.
>> >> >> >>
>> >> >> >> Oh I see what you mean. I think you and Chris already discussed this? IIRC
>> >> >> >> Chris's point was that if you move that cluster to N-1, eventually all clusters
>> >> >> >> are for order-0 and you have no means of allocating high orders until a whole
>> >> >> >> cluster becomes free. That logic certainly makes sense to me, so I think it's
>> >> >> >> better for swap_cluster_info->order to remain static while the cluster is
>> >> >> >> allocated. (I only skimmed that conversation so apologies if I got the
>> >> >> >> conclusion wrong!).
>> >> >> >
>> >> >> > Yes, that is the original intent, keep the cluster order as much as possible.
>> >> >> >
>> >> >> >>
>> >> >> >> >
>> >> >> >> >> But I agree that a
>> >> >> >> >> cluster should only be on the per-order nonfull list if we know there are at
>> >> >> >> >> least enough free swap entries in that cluster to cover the order. Of course
>> >> >> >> >> that doesn't tell us for sure because they may not be contiguous.
>> >> >> >> >
>> >> >> >> > We can check that when freeing a swap entry by checking adjacent swap
>> >> >> >> > entries. IMHO, the performance should be acceptable.
>> >> >> >>
>> >> >> >> Would you then use the result of that scanning to "promote" a cluster's order?
>> >> >> >> e.g. swap_cluster_info->order = N+1? That would be neat. But this all feels like
>> >> >> >> a separate change on top of what Chris is doing here. For high orders there
>> >> >> >> could be quite a bit of scanning required in the worst case for every page that
>> >> >> >> gets freed.
>> >> >> >
>> >> >> > Right, I feel that is a different set of patches. Even this series is
>> >> >> > hard enough to review. Order promotion and demotion are heading
>> >> >> > towards a buddy system design. I want to point out that even a buddy
>> >> >> > system is not able to handle the case where the swapfile is almost full
>> >> >> > and the recently freed swap entries are not contiguous.
>> >> >> >
>> >> >> > We can invest in the buddy system, which doesn't handle all the
>> >> >> > fragmentation issues. Or, as I prefer, go directly to discontiguous
>> >> >> > swap entries. We pay a price for the indirect mapping of swap entries,
>> >> >> > but it will solve the fragmentation issue 100%.
>> >> >>
>> >> >> It's good if we can solve the fragmentation issue 100%. Just need to
>> >> >> pay attention to the cost.
>> >> >
>> >> > By cost, do you mean the development cost or the runtime cost (memory and CPU)?
>> >>
>> >> I mean runtime cost.
>> >
>> > Thanks for the clarification. Agreed that we need to pay attention to
>> > the runtime cost. That is a given.
>> >
>> >> >> >> >>> My question is whether it's so important to share the per-cpu cluster
>> >> >> >> >>> among CPUs?
>> >> >> >> >>
>> >> >> >> >> My rationale for sharing is that the preference previously has been to favour
>> >> >> >> >> efficient use of swap space; we don't want to fail a request for allocation of a
>> >> >> >> >> given order if there are actually slots available just because they have been
>> >> >> >> >> reserved by another CPU. And I'm still asserting that it should be ~zero cost to
>> >> >> >> >> do this. If I'm wrong about the zero cost, or in practice the sharing doesn't
>> >> >> >> >> actually help improve allocation success, then I'm happy to take the exclusive
>> >> >> >> >> approach.
>> >> >> >> >>
>> >> >> >> >>> I suggest starting with a simple design, that is, the per-CPU
>> >> >> >> >>> cluster will not be shared among CPUs in most cases.
>> >> >> >> >>
>> >> >> >> >> I'm all for starting simple; I think that's what I already proposed (exclusive
>> >> >> >> >> in this patch, then shared in the "big rewrite"). I'm just objecting to the
>> >> >> >> >> current half-and-half policy in this patch.
>> >> >> >> >
>> >> >> >> > Sounds good to me. We can start with the exclusive solution and evaluate
>> >> >> >> > whether the shared solution is good.
>> >> >> >>
>> >> >> >> Yep. And also evaluate the dynamic order inc/dec idea too...
>> >> >> >
>> >> >> > It is not able to avoid fragmentation 100% of the time. I prefer
>> >> >> > discontiguous swap entries as the next step, which guarantees forward
>> >> >> > progress; we will not be stuck in a situation where we are unable to
>> >> >> > allocate swap entries due to fragmentation.
>> >> >>
>> >> >> If my understanding is correct, the implementation complexity of
>> >> >> order promotion/demotion isn't at the same level as that of discontiguous
>> >> >> swap entries.
>> >> >
>> >> > Discontiguous swap entries have higher complexity but a higher payout as
>> >> > well. They can get us to places where cluster promotion/demotion
>> >> > can't.
>> >> >
>> >> > I also feel that if we implement something resembling a buddy system
>> >> > allocator for swap, we should do a proper buddy allocator
>> >> > implementation with the matching data structures.
>> >>
>> >> I don't think that it's easy to implement a real buddy allocator for
>> >> swap entries. So I avoid using the word "buddy".
>> >
>> > Then such a mix of cluster order promotion/demotion loses some benefit of
>> > the buddy system, because it lacks the proper data structures to
>> > support buddy allocation. The buddy allocator provides more general
>> > migration between orders; only the limited use case of cluster
>> > promotion/demotion is supported (by luck). We need to evaluate whether
>> > it is worth the additional complexity.
>>
>> TBH, I believe that the complexity of order promotion/demotion is quite low,
>> both for development and at runtime. A real buddy allocator may
>> increase the per-swap-entry memory footprint considerably.
>
> My main concern is its effectiveness. Anyway, the series is already
> complex enough with the big rewrite and reclaim on swap cache.
>
> Let me know if you think it needs to be done before the big rewrite.
I hope so. But I will not force you to do that if you don't buy into it.
--
Best Regards,
Huang, Ying
^ permalink raw reply [flat|nested] 43+ messages in thread
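For readers trying to picture the promotion idea debated above (checking
adjacent entries at free time so a cluster could be promoted to serve order
N+1), here is a minimal sketch. The simplified swap_map semantics (0 means
free) and all names are assumptions for illustration, not code from the
series or the kernel:

#include <stdbool.h>

#define SWAPFILE_CLUSTER 512			/* entries per cluster, assumed */

struct cluster_sketch {
	unsigned char swap_map[SWAPFILE_CLUSTER];	/* 0 == free, simplified */
	unsigned int order;				/* current order of this cluster */
};

/*
 * Called after the entry at @off has been freed: returns true if the
 * aligned window of 1 << (order + 1) entries containing @off is now
 * entirely free, i.e. the cluster could be promoted to serve order + 1.
 */
static bool can_promote(const struct cluster_sketch *ci, unsigned int off)
{
	unsigned int nr = 1u << (ci->order + 1);
	unsigned int base = off & ~(nr - 1);

	if (base + nr > SWAPFILE_CLUSTER)
		return false;
	for (unsigned int i = 0; i < nr; i++)
		if (ci->swap_map[base + i])
			return false;
	return true;
}

The scanning cost mentioned in the thread is visible in the loop: each free
in an order-N cluster may have to inspect up to 1 << (N + 1) map entries,
which is cheap for small orders but adds up for larger ones.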
end of thread, other threads:[~2024-07-26 7:46 UTC | newest]
Thread overview: 43+ messages
2024-07-11 7:29 [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-07-11 7:29 ` [PATCH v4 1/3] mm: swap: swap cluster switch to double link list Chris Li
2024-07-15 14:57 ` Ryan Roberts
2024-07-16 22:11 ` Chris Li
2024-07-18 6:26 ` Huang, Ying
2024-07-26 5:46 ` Chris Li
2024-07-11 7:29 ` [PATCH v4 2/3] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
2024-07-15 15:40 ` Ryan Roberts
2024-07-16 22:46 ` Chris Li
2024-07-17 10:14 ` Ryan Roberts
2024-07-17 15:41 ` Chris Li
2024-07-18 7:53 ` Huang, Ying
2024-07-19 10:30 ` Ryan Roberts
2024-07-22 2:14 ` Huang, Ying
2024-07-22 7:51 ` Ryan Roberts
2024-07-22 8:49 ` Huang, Ying
2024-07-22 9:54 ` Ryan Roberts
2024-07-23 6:27 ` Huang, Ying
2024-07-24 8:33 ` Ryan Roberts
2024-07-24 22:41 ` Chris Li
2024-07-25 6:43 ` Huang, Ying
2024-07-25 8:09 ` Chris Li
2024-07-26 2:09 ` Huang, Ying
2024-07-26 5:09 ` Chris Li
2024-07-26 6:02 ` Huang, Ying
2024-07-26 7:15 ` Chris Li
2024-07-26 7:42 ` Huang, Ying
2024-07-25 6:53 ` Huang, Ying
2024-07-25 8:26 ` Chris Li
2024-07-26 2:04 ` Huang, Ying
2024-07-26 4:50 ` Chris Li
2024-07-26 5:52 ` Huang, Ying
2024-07-26 7:10 ` Chris Li
2024-07-26 7:18 ` Huang, Ying
2024-07-26 7:26 ` Chris Li
2024-07-26 7:37 ` Huang, Ying
2024-07-11 7:29 ` [PATCH v4 3/3] RFC: mm: swap: seperate SSD allocation from scan_swap_map_slots() Chris Li
2024-07-11 10:02 ` [PATCH v4 0/3] mm: swap: mTHP swap allocator base on swap cluster order Ryan Roberts
2024-07-11 14:08 ` Chris Li
2024-07-15 14:10 ` Ryan Roberts
2024-07-15 18:14 ` Chris Li
2024-07-18 5:50 ` Huang, Ying
2024-07-26 5:51 ` Chris Li