* [PATCH v5 1/9] mm: swap: swap cluster switch to double link list
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
@ 2024-07-31 6:49 ` Chris Li
2024-07-31 6:49 ` [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
` (10 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Previously, the swap cluster used a cluster index as a pointer
to construct a custom singly linked list type "swap_cluster_list".
The next-cluster index shares its storage with cluster->count, which
prevents putting a non-free cluster onto any list.

Change the cluster to use the standard doubly linked list instead.
This allows tracking the nonfull clusters in the follow-up patch,
making it faster to get to a nonfull cluster of a given order.
Remove the cluster getter/setter helpers for accessing the cluster
struct members.
The list operations are protected by swap_info_struct->lock.

Change the cluster code to reference a cluster with
"struct swap_cluster_info *" rather than by index. That is more
consistent with the list manipulation and avoids repeatedly adding
an index to cluster_info. The code is easier to understand.

Remove the "next pointer is NULL" flag; the doubly linked list
handles the empty-list case just fine.

The "swap_cluster_info" struct becomes two pointers bigger. Because
512 swap entries share one swap_cluster_info struct, this has very
little impact on the average memory usage per swap entry. For a
1TB swapfile, the swap cluster data structure grows from 8MB
to 24MB.

Other than the list conversion, there is no real functional change
in this patch.
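As an illustration of the conversion (a standalone userspace sketch with
simplified names, not the kernel code; the actual patch uses <linux/list.h>
and the real swap_cluster_info), embedding a list node in each cluster lets
the free list be handled with ordinary doubly linked list operations instead
of chaining cluster indexes through the shared data field:

#include <assert.h>
#include <stdio.h>

/* Minimal doubly linked list, mirroring the kernel's list_head idiom. */
struct list_head { struct list_head *prev, *next; };

static void INIT_LIST_HEAD(struct list_head *h) { h->prev = h->next = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
        n->prev = h->prev; n->next = h;
        h->prev->next = n; h->prev = n;
}

static void list_del(struct list_head *n)
{
        n->prev->next = n->next;
        n->next->prev = n->prev;
}

static int list_empty(const struct list_head *h) { return h->next == h; }

#define CLUSTER_FLAG_FREE 1

struct cluster {                        /* stand-in for swap_cluster_info */
        unsigned short count;           /* used entries in this cluster */
        unsigned char flags;
        struct list_head list;          /* links into the free/discard lists */
};

int main(void)
{
        struct list_head free_clusters;
        struct cluster c = { .count = 0, .flags = 0 };

        INIT_LIST_HEAD(&free_clusters);

        /* free path: mark the cluster free and queue it on the list tail */
        c.flags = CLUSTER_FLAG_FREE;
        list_add_tail(&c.list, &free_clusters);
        assert(!list_empty(&free_clusters));

        /* alloc path: take the first cluster off the free list */
        list_del(free_clusters.next);
        c.flags = 0;

        printf("cluster reused, count=%u\n", (unsigned int)c.count);
        return 0;
}

Since count and flags are now plain fields rather than a shared bitfield,
they can be accessed directly, which is why the getter/setter helpers go
away.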
Signed-off-by: Chris Li <chrisl@kernel.org>
---
include/linux/swap.h | 25 ++----
mm/swapfile.c | 226 ++++++++++++++-------------------------------------
2 files changed, 71 insertions(+), 180 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e473fe6cfb7a..edafd52d7ac4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -243,22 +243,20 @@ enum {
* free clusters are organized into a list. We fetch an entry from the list to
* get a free cluster.
*
- * The data field stores next cluster if the cluster is free or cluster usage
- * counter otherwise. The flags field determines if a cluster is free. This is
- * protected by swap_info_struct.lock.
+ * The flags field determines if a cluster is free. This is
+ * protected by cluster lock.
*/
struct swap_cluster_info {
spinlock_t lock; /*
* Protect swap_cluster_info fields
- * and swap_info_struct->swap_map
- * elements correspond to the swap
- * cluster
+ * other than list, and swap_info_struct->swap_map
+ * elements corresponding to the swap cluster.
*/
- unsigned int data:24;
- unsigned int flags:8;
+ u16 count;
+ u8 flags;
+ struct list_head list;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
-#define CLUSTER_FLAG_NEXT_NULL 2 /* This cluster has no next cluster */
/*
* The first page in the swap file is the swap header, which is always marked
@@ -283,11 +281,6 @@ struct percpu_cluster {
unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
};
-struct swap_cluster_list {
- struct swap_cluster_info head;
- struct swap_cluster_info tail;
-};
-
/*
* The in-memory structure used to track swap areas.
*/
@@ -301,7 +294,7 @@ struct swap_info_struct {
unsigned char *swap_map; /* vmalloc'ed array of usage counts */
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
- struct swap_cluster_list free_clusters; /* free clusters list */
+ struct list_head free_clusters; /* free clusters list */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
@@ -332,7 +325,7 @@ struct swap_info_struct {
* list.
*/
struct work_struct discard_work; /* discard worker */
- struct swap_cluster_list discard_clusters; /* discard clusters list */
+ struct list_head discard_clusters; /* discard clusters list */
struct plist_node avail_lists[]; /*
* entries in swap_avail_heads, one
* entry per node.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index f7224bc1320c..bceead7f9e3c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -290,62 +290,15 @@ static void discard_swap_cluster(struct swap_info_struct *si,
#endif
#define LATENCY_LIMIT 256
-static inline void cluster_set_flag(struct swap_cluster_info *info,
- unsigned int flag)
-{
- info->flags = flag;
-}
-
-static inline unsigned int cluster_count(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_count(struct swap_cluster_info *info,
- unsigned int c)
-{
- info->data = c;
-}
-
-static inline void cluster_set_count_flag(struct swap_cluster_info *info,
- unsigned int c, unsigned int f)
-{
- info->flags = f;
- info->data = c;
-}
-
-static inline unsigned int cluster_next(struct swap_cluster_info *info)
-{
- return info->data;
-}
-
-static inline void cluster_set_next(struct swap_cluster_info *info,
- unsigned int n)
-{
- info->data = n;
-}
-
-static inline void cluster_set_next_flag(struct swap_cluster_info *info,
- unsigned int n, unsigned int f)
-{
- info->flags = f;
- info->data = n;
-}
-
static inline bool cluster_is_free(struct swap_cluster_info *info)
{
return info->flags & CLUSTER_FLAG_FREE;
}
-static inline bool cluster_is_null(struct swap_cluster_info *info)
-{
- return info->flags & CLUSTER_FLAG_NEXT_NULL;
-}
-
-static inline void cluster_set_null(struct swap_cluster_info *info)
+static inline unsigned int cluster_index(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
{
- info->flags = CLUSTER_FLAG_NEXT_NULL;
- info->data = 0;
+ return ci - si->cluster_info;
}
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
@@ -394,65 +347,11 @@ static inline void unlock_cluster_or_swap_info(struct swap_info_struct *si,
spin_unlock(&si->lock);
}
-static inline bool cluster_list_empty(struct swap_cluster_list *list)
-{
- return cluster_is_null(&list->head);
-}
-
-static inline unsigned int cluster_list_first(struct swap_cluster_list *list)
-{
- return cluster_next(&list->head);
-}
-
-static void cluster_list_init(struct swap_cluster_list *list)
-{
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
-}
-
-static void cluster_list_add_tail(struct swap_cluster_list *list,
- struct swap_cluster_info *ci,
- unsigned int idx)
-{
- if (cluster_list_empty(list)) {
- cluster_set_next_flag(&list->head, idx, 0);
- cluster_set_next_flag(&list->tail, idx, 0);
- } else {
- struct swap_cluster_info *ci_tail;
- unsigned int tail = cluster_next(&list->tail);
-
- /*
- * Nested cluster lock, but both cluster locks are
- * only acquired when we held swap_info_struct->lock
- */
- ci_tail = ci + tail;
- spin_lock_nested(&ci_tail->lock, SINGLE_DEPTH_NESTING);
- cluster_set_next(ci_tail, idx);
- spin_unlock(&ci_tail->lock);
- cluster_set_next_flag(&list->tail, idx, 0);
- }
-}
-
-static unsigned int cluster_list_del_first(struct swap_cluster_list *list,
- struct swap_cluster_info *ci)
-{
- unsigned int idx;
-
- idx = cluster_next(&list->head);
- if (cluster_next(&list->tail) == idx) {
- cluster_set_null(&list->head);
- cluster_set_null(&list->tail);
- } else
- cluster_set_next_flag(&list->head,
- cluster_next(&ci[idx]), 0);
-
- return idx;
-}
-
/* Add a cluster to discard list and schedule it to do discard */
static void swap_cluster_schedule_discard(struct swap_info_struct *si,
- unsigned int idx)
+ struct swap_cluster_info *ci)
{
+ unsigned int idx = cluster_index(si, ci);
/*
* If scan_swap_map_slots() can't find a free cluster, it will check
* si->swap_map directly. To make sure the discarding cluster isn't
@@ -462,17 +361,14 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
- cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);
-
+ list_add_tail(&ci->list, &si->discard_clusters);
schedule_work(&si->discard_work);
}
-static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info;
-
- cluster_set_flag(ci + idx, CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&si->free_clusters, ci, idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->list, &si->free_clusters);
}
/*
@@ -481,24 +377,25 @@ static void __free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
static void swap_do_scheduled_discard(struct swap_info_struct *si)
{
- struct swap_cluster_info *info, *ci;
+ struct swap_cluster_info *ci;
unsigned int idx;
- info = si->cluster_info;
-
- while (!cluster_list_empty(&si->discard_clusters)) {
- idx = cluster_list_del_first(&si->discard_clusters, info);
+ while (!list_empty(&si->discard_clusters)) {
+ ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
+ list_del(&ci->list);
+ idx = cluster_index(si, ci);
spin_unlock(&si->lock);
discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
SWAPFILE_CLUSTER);
spin_lock(&si->lock);
- ci = lock_cluster(si, idx * SWAPFILE_CLUSTER);
- __free_cluster(si, idx);
+
+ spin_lock(&ci->lock);
+ __free_cluster(si, ci);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
0, SWAPFILE_CLUSTER);
- unlock_cluster(ci);
+ spin_unlock(&ci->lock);
}
}
@@ -521,20 +418,21 @@ static void swap_users_ref_free(struct percpu_ref *ref)
complete(&si->comp);
}
-static void alloc_cluster(struct swap_info_struct *si, unsigned long idx)
+static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
{
- struct swap_cluster_info *ci = si->cluster_info;
+ struct swap_cluster_info *ci = list_first_entry(&si->free_clusters,
+ struct swap_cluster_info, list);
- VM_BUG_ON(cluster_list_first(&si->free_clusters) != idx);
- cluster_list_del_first(&si->free_clusters, ci);
- cluster_set_count_flag(ci + idx, 0, 0);
+ VM_BUG_ON(cluster_index(si, ci) != idx);
+ list_del(&ci->list);
+ ci->count = 0;
+ ci->flags = 0;
+ return ci;
}
-static void free_cluster(struct swap_info_struct *si, unsigned long idx)
+static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
- struct swap_cluster_info *ci = si->cluster_info + idx;
-
- VM_BUG_ON(cluster_count(ci) != 0);
+ VM_BUG_ON(ci->count != 0);
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -542,11 +440,11 @@ static void free_cluster(struct swap_info_struct *si, unsigned long idx)
*/
if ((si->flags & (SWP_WRITEOK | SWP_PAGE_DISCARD)) ==
(SWP_WRITEOK | SWP_PAGE_DISCARD)) {
- swap_cluster_schedule_discard(si, idx);
+ swap_cluster_schedule_discard(si, ci);
return;
}
- __free_cluster(si, idx);
+ __free_cluster(si, ci);
}
/*
@@ -559,15 +457,15 @@ static void add_cluster_info_page(struct swap_info_struct *p,
unsigned long count)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
+ struct swap_cluster_info *ci = cluster_info + idx;
if (!cluster_info)
return;
- if (cluster_is_free(&cluster_info[idx]))
+ if (cluster_is_free(ci))
alloc_cluster(p, idx);
- VM_BUG_ON(cluster_count(&cluster_info[idx]) + count > SWAPFILE_CLUSTER);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) + count);
+ VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
+ ci->count += count;
}
/*
@@ -581,24 +479,20 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
}
/*
- * The cluster corresponding to page_nr decreases one usage. If the usage
- * counter becomes 0, which means no page in the cluster is in using, we can
- * optionally discard the cluster and add it to free cluster list.
+ * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * which means no page in the cluster is in use, we can optionally discard
+ * the cluster and add it to free cluster list.
*/
-static void dec_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr)
+static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
{
- unsigned long idx = page_nr / SWAPFILE_CLUSTER;
-
- if (!cluster_info)
+ if (!p->cluster_info)
return;
- VM_BUG_ON(cluster_count(&cluster_info[idx]) == 0);
- cluster_set_count(&cluster_info[idx],
- cluster_count(&cluster_info[idx]) - 1);
+ VM_BUG_ON(ci->count == 0);
+ ci->count--;
- if (cluster_count(&cluster_info[idx]) == 0)
- free_cluster(p, idx);
+ if (!ci->count)
+ free_cluster(p, ci);
}
/*
@@ -611,10 +505,12 @@ scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
{
struct percpu_cluster *percpu_cluster;
bool conflict;
+ struct swap_cluster_info *first = list_first_entry(&si->free_clusters,
+ struct swap_cluster_info, list);
offset /= SWAPFILE_CLUSTER;
- conflict = !cluster_list_empty(&si->free_clusters) &&
- offset != cluster_list_first(&si->free_clusters) &&
+ conflict = !list_empty(&si->free_clusters) &&
+ offset != cluster_index(si, first) &&
cluster_is_free(&si->cluster_info[offset]);
if (!conflict)
@@ -655,10 +551,10 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
cluster = this_cpu_ptr(si->percpu_cluster);
tmp = cluster->next[order];
if (tmp == SWAP_NEXT_INVALID) {
- if (!cluster_list_empty(&si->free_clusters)) {
- tmp = cluster_next(&si->free_clusters.head) *
- SWAPFILE_CLUSTER;
- } else if (!cluster_list_empty(&si->discard_clusters)) {
+ if (!list_empty(&si->free_clusters)) {
+ ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->discard_clusters)) {
/*
* we don't have free cluster but have some clusters in
* discarding, do discard now and reclaim them, then
@@ -1070,8 +966,9 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
- cluster_set_count_flag(ci, 0, 0);
- free_cluster(si, idx);
+ ci->count = 0;
+ ci->flags = 0;
+ free_cluster(si, ci);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
}
@@ -1344,7 +1241,7 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
count = p->swap_map[offset];
VM_BUG_ON(count != SWAP_HAS_CACHE);
p->swap_map[offset] = 0;
- dec_cluster_info_page(p, p->cluster_info, offset);
+ dec_cluster_info_page(p, ci);
unlock_cluster(ci);
mem_cgroup_uncharge_swap(entry, 1);
@@ -3022,8 +2919,8 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
nr_good_pages = maxpages - 1; /* omit header page */
- cluster_list_init(&p->free_clusters);
- cluster_list_init(&p->discard_clusters);
+ INIT_LIST_HEAD(&p->free_clusters);
+ INIT_LIST_HEAD(&p->discard_clusters);
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
@@ -3074,14 +2971,15 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
for (k = 0; k < SWAP_CLUSTER_COLS; k++) {
j = (k + col) % SWAP_CLUSTER_COLS;
for (i = 0; i < DIV_ROUND_UP(nr_clusters, SWAP_CLUSTER_COLS); i++) {
+ struct swap_cluster_info *ci;
idx = i * SWAP_CLUSTER_COLS + j;
+ ci = cluster_info + idx;
if (idx >= nr_clusters)
continue;
- if (cluster_count(&cluster_info[idx]))
+ if (ci->count)
continue;
- cluster_set_flag(&cluster_info[idx], CLUSTER_FLAG_FREE);
- cluster_list_add_tail(&p->free_clusters, cluster_info,
- idx);
+ ci->flags = CLUSTER_FLAG_FREE;
+ list_add_tail(&ci->list, &p->free_clusters);
}
}
return nr_extents;
--
2.46.0.rc1.232.g9752f9e123-goog
* [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-07-31 6:49 ` [PATCH v5 1/9] mm: swap: swap cluster switch to double link list Chris Li
@ 2024-07-31 6:49 ` Chris Li
[not found] ` <87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com>
2024-07-31 6:49 ` [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots() Chris Li
` (9 subsequent siblings)
11 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Track the nonfull clusters as well as the empty clusters
on lists. Each order has its own nonfull cluster list.
A cluster remembers which order it was used for during
new cluster allocation.

When a cluster gains a free entry, add it to the nonfull[order]
list. When the free cluster list is empty, also allocate
from the nonfull list of the requested order.

This improves the mTHP swap allocation success rate.

There are limitations if the distribution of mTHP orders changes
a lot over time, e.g. many nonfull clusters get assigned to order A,
while later there are a lot of order B allocations and very few
order A allocations. Currently a cluster used by order A will not be
reused by order B unless the cluster is 100% empty.
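On the freeing side, a rough standalone model of the new bookkeeping
(assumed names; the real code is dec_cluster_info_page() operating on
swap_cluster_info under the cluster and si locks): when a cluster loses its
last used entry it goes back to the free list, otherwise it is parked on the
nonfull list of the order it remembers, so later allocations of that order
can find it:

#include <stdio.h>

#define CLUSTER_SIZE 512        /* entries per cluster, as in SWAPFILE_CLUSTER */

struct cluster_model {
        unsigned int count;     /* used entries */
        unsigned int order;     /* order this cluster was allocated for */
        int on_nonfull;         /* models CLUSTER_FLAG_NONFULL */
};

/* Dropping one entry either frees the cluster or marks it nonfull. */
static const char *drop_one_entry(struct cluster_model *ci)
{
        ci->count--;
        if (ci->count == 0)
                return "moved to the free cluster list";
        if (!ci->on_nonfull) {
                ci->on_nonfull = 1;
                return "added to nonfull_clusters[ci->order]";
        }
        return "already on nonfull_clusters[ci->order]";
}

int main(void)
{
        struct cluster_model ci = { .count = CLUSTER_SIZE, .order = 2 };

        printf("first free:  %s\n", drop_one_entry(&ci));   /* was full */
        printf("second free: %s\n", drop_one_entry(&ci));
        return 0;
}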
Signed-off-by: Chris Li <chrisl@kernel.org>
---
include/linux/swap.h | 4 ++++
mm/swapfile.c | 38 +++++++++++++++++++++++++++++++++++---
2 files changed, 39 insertions(+), 3 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index edafd52d7ac4..6716ef236766 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -254,9 +254,11 @@ struct swap_cluster_info {
*/
u16 count;
u8 flags;
+ u8 order;
struct list_head list;
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
+#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
/*
* The first page in the swap file is the swap header, which is always marked
@@ -295,6 +297,8 @@ struct swap_info_struct {
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
+ struct list_head nonfull_clusters[SWAP_NR_ORDERS];
+ /* list of cluster that contains at least one free slot */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index bceead7f9e3c..dcf09eb549db 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -361,14 +361,22 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
- list_add_tail(&ci->list, &si->discard_clusters);
+ VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
+ if (ci->flags & CLUSTER_FLAG_NONFULL)
+ list_move_tail(&ci->list, &si->discard_clusters);
+ else
+ list_add_tail(&ci->list, &si->discard_clusters);
+ ci->flags = 0;
schedule_work(&si->discard_work);
}
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ if (ci->flags & CLUSTER_FLAG_NONFULL)
+ list_move_tail(&ci->list, &si->free_clusters);
+ else
+ list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
- list_add_tail(&ci->list, &si->free_clusters);
}
/*
@@ -491,8 +499,15 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
VM_BUG_ON(ci->count == 0);
ci->count--;
- if (!ci->count)
+ if (!ci->count) {
free_cluster(p, ci);
+ return;
+ }
+
+ if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+ list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ ci->flags |= CLUSTER_FLAG_NONFULL;
+ }
}
/*
@@ -553,6 +568,19 @@ static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
if (tmp == SWAP_NEXT_INVALID) {
if (!list_empty(&si->free_clusters)) {
ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ list_del(&ci->list);
+ spin_lock(&ci->lock);
+ ci->order = order;
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
+ tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
+ } else if (!list_empty(&si->nonfull_clusters[order])) {
+ ci = list_first_entry(&si->nonfull_clusters[order],
+ struct swap_cluster_info, list);
+ list_del(&ci->list);
+ spin_lock(&ci->lock);
+ ci->flags = 0;
+ spin_unlock(&ci->lock);
tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
} else if (!list_empty(&si->discard_clusters)) {
/*
@@ -967,6 +995,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
ci->count = 0;
+ ci->order = 0;
ci->flags = 0;
free_cluster(si, ci);
unlock_cluster(ci);
@@ -2922,6 +2951,9 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
INIT_LIST_HEAD(&p->free_clusters);
INIT_LIST_HEAD(&p->discard_clusters);
+ for (i = 0; i < SWAP_NR_ORDERS; i++)
+ INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
if (page_nr == 0 || page_nr > swap_header->info.last_page)
--
2.46.0.rc1.232.g9752f9e123-goog
* [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots()
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
2024-07-31 6:49 ` [PATCH v5 1/9] mm: swap: swap cluster switch to double link list Chris Li
2024-07-31 6:49 ` [PATCH v5 2/9] mm: swap: mTHP allocate swap entries from nonfull list Chris Li
@ 2024-07-31 6:49 ` Chris Li
2024-07-31 6:49 ` [PATCH v5 4/9] mm: swap: clean up initialization helper chrisl
` (8 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: Chris Li @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
Previously the SSD and HDD cases shared the same swap_map scan loop
in scan_swap_map_slots(). This function is complex, and its
execution flow is hard to follow.

scan_swap_map_try_ssd_cluster() can already do most of the heavy
lifting to locate a candidate swap range in the cluster. However,
it needs to go back to scan_swap_map_slots() to check for conflicts
and then perform the allocation.

When scan_swap_map_try_ssd_cluster() fails, it still depends on
scan_swap_map_slots() to do a brute-force scan of the swap_map.
When the swapfile is large and almost full, it takes some CPU
time to go through the swap_map array.

Get rid of the cluster allocation dependency on the swap_map scan
loop in scan_swap_map_slots(). Streamline the cluster allocation
code path. No more conflict checks.

For order 0 swap entries, when the free and nonfull lists run out,
allocate from the higher-order nonfull cluster lists.

Users should see less CPU time spent searching for a free swap
slot when the swapfile is almost full.
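To make the new control flow concrete, here is a simplified standalone
model of the fallback order used by the cluster allocator (NR_ORDERS and
the struct are sketch-only assumptions; the real cluster_alloc_swap_entry()
walks the lists and scans inside each cluster under si->lock):

#include <stdbool.h>
#include <stdio.h>

#define NR_ORDERS 4     /* sketch value, not the kernel's SWAP_NR_ORDERS */

struct si_model {
        bool percpu_next_valid;         /* cluster->next[order] is still valid */
        bool has_free;                  /* !list_empty(&si->free_clusters) */
        bool has_nonfull[NR_ORDERS];    /* !list_empty(&si->nonfull_clusters[o]) */
        bool has_discard;               /* !list_empty(&si->discard_clusters) */
};

/* Simplified mirror of the allocator's fallback order. */
static const char *where_from(const struct si_model *si, int order)
{
        if (si->percpu_next_valid)
                return "per-CPU current cluster";
        if (si->has_free)
                return "free cluster list";
        if (si->has_nonfull[order])
                return "nonfull list of this order";
        if (si->has_discard)
                return "discard list (after running the pending discards)";
        if (order == 0)
                for (int o = 1; o < NR_ORDERS; o++)
                        if (si->has_nonfull[o])
                                return "higher-order nonfull list (order 0 only)";
        return "allocation fails";
}

int main(void)
{
        /* Only a partially used order-2 cluster is available. */
        struct si_model si = { .has_nonfull = { false, false, true, false } };

        printf("order 0: %s\n", where_from(&si, 0));
        printf("order 2: %s\n", where_from(&si, 2));
        printf("order 3: %s\n", where_from(&si, 3));
        return 0;
}

Only order-0 requests fall back to the higher-order nonfull lists; larger
orders give up instead, matching the behaviour described above.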
Signed-off-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 300 ++++++++++++++++++++++++++++++++--------------------------
1 file changed, 168 insertions(+), 132 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index dcf09eb549db..8a72c8a9aafd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,6 +53,8 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
+static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
+ unsigned int nr_entries);
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -301,6 +303,12 @@ static inline unsigned int cluster_index(struct swap_info_struct *si,
return ci - si->cluster_info;
}
+static inline unsigned int cluster_offset(struct swap_info_struct *si,
+ struct swap_cluster_info *ci)
+{
+ return cluster_index(si, ci) * SWAPFILE_CLUSTER;
+}
+
static inline struct swap_cluster_info *lock_cluster(struct swap_info_struct *si,
unsigned long offset)
{
@@ -372,11 +380,15 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
+ lockdep_assert_held(&si->lock);
+ lockdep_assert_held(&ci->lock);
+
if (ci->flags & CLUSTER_FLAG_NONFULL)
list_move_tail(&ci->list, &si->free_clusters);
else
list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
+ ci->order = 0;
}
/*
@@ -431,9 +443,11 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi
struct swap_cluster_info *ci = list_first_entry(&si->free_clusters,
struct swap_cluster_info, list);
+ lockdep_assert_held(&si->lock);
+ lockdep_assert_held(&ci->lock);
VM_BUG_ON(cluster_index(si, ci) != idx);
+ VM_BUG_ON(ci->count);
list_del(&ci->list);
- ci->count = 0;
ci->flags = 0;
return ci;
}
@@ -441,6 +455,8 @@ static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsi
static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
VM_BUG_ON(ci->count != 0);
+ lockdep_assert_held(&si->lock);
+ lockdep_assert_held(&ci->lock);
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -497,6 +513,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
return;
VM_BUG_ON(ci->count == 0);
+ VM_BUG_ON(cluster_is_free(ci));
+ lockdep_assert_held(&p->lock);
+ lockdep_assert_held(&ci->lock);
ci->count--;
if (!ci->count) {
@@ -505,48 +524,88 @@ static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluste
}
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
+ VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- ci->flags |= CLUSTER_FLAG_NONFULL;
+ ci->flags = CLUSTER_FLAG_NONFULL;
}
}
-/*
- * It's possible scan_swap_map_slots() uses a free cluster in the middle of free
- * cluster list. Avoiding such abuse to avoid list corruption.
- */
-static bool
-scan_swap_map_ssd_cluster_conflict(struct swap_info_struct *si,
- unsigned long offset, int order)
+static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start,
+ unsigned int nr_pages)
{
- struct percpu_cluster *percpu_cluster;
- bool conflict;
- struct swap_cluster_info *first = list_first_entry(&si->free_clusters,
- struct swap_cluster_info, list);
-
- offset /= SWAPFILE_CLUSTER;
- conflict = !list_empty(&si->free_clusters) &&
- offset != cluster_index(si, first) &&
- cluster_is_free(&si->cluster_info[offset]);
+ unsigned char *p = si->swap_map + start;
+ unsigned char *end = p + nr_pages;
- if (!conflict)
- return false;
+ while (p < end)
+ if (*p++)
+ return false;
- percpu_cluster = this_cpu_ptr(si->percpu_cluster);
- percpu_cluster->next[order] = SWAP_NEXT_INVALID;
return true;
}
-static inline bool swap_range_empty(char *swap_map, unsigned int start,
- unsigned int nr_pages)
+
+static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
+ unsigned int start, unsigned char usage,
+ unsigned int order)
{
- unsigned int i;
+ unsigned int nr_pages = 1 << order;
- for (i = 0; i < nr_pages; i++) {
- if (swap_map[start + i])
- return false;
+ if (cluster_is_free(ci)) {
+ if (nr_pages < SWAPFILE_CLUSTER) {
+ list_move_tail(&ci->list, &si->nonfull_clusters[order]);
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ }
+ ci->order = order;
}
- return true;
+ memset(si->swap_map + start, usage, nr_pages);
+ swap_range_alloc(si, start, nr_pages);
+ ci->count += nr_pages;
+
+ if (ci->count == SWAPFILE_CLUSTER) {
+ VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL)));
+ list_del(&ci->list);
+ ci->flags = 0;
+ }
+}
+
+static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset,
+ unsigned int *foundp, unsigned int order,
+ unsigned char usage)
+{
+ unsigned long start = offset & ~(SWAPFILE_CLUSTER - 1);
+ unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
+ unsigned int nr_pages = 1 << order;
+ struct swap_cluster_info *ci;
+
+ if (end < nr_pages)
+ return SWAP_NEXT_INVALID;
+ end -= nr_pages;
+
+ ci = lock_cluster(si, offset);
+ if (ci->count + nr_pages > SWAPFILE_CLUSTER) {
+ offset = SWAP_NEXT_INVALID;
+ goto done;
+ }
+
+ while (offset <= end) {
+ if (cluster_scan_range(si, offset, nr_pages)) {
+ cluster_alloc_range(si, ci, offset, usage, order);
+ *foundp = offset;
+ if (ci->count == SWAPFILE_CLUSTER) {
+ offset = SWAP_NEXT_INVALID;
+ goto done;
+ }
+ offset += nr_pages;
+ break;
+ }
+ offset += nr_pages;
+ }
+ if (offset > end)
+ offset = SWAP_NEXT_INVALID;
+done:
+ unlock_cluster(ci);
+ return offset;
}
/*
@@ -554,72 +613,66 @@ static inline bool swap_range_empty(char *swap_map, unsigned int start,
* pool (a cluster). This might involve allocating a new cluster for current CPU
* too.
*/
-static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
- unsigned long *offset, unsigned long *scan_base, int order)
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
+ unsigned char usage)
{
- unsigned int nr_pages = 1 << order;
struct percpu_cluster *cluster;
- struct swap_cluster_info *ci;
- unsigned int tmp, max;
+ struct swap_cluster_info *ci, *n;
+ unsigned int offset, found = 0;
new_cluster:
+ lockdep_assert_held(&si->lock);
cluster = this_cpu_ptr(si->percpu_cluster);
- tmp = cluster->next[order];
- if (tmp == SWAP_NEXT_INVALID) {
- if (!list_empty(&si->free_clusters)) {
- ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
- list_del(&ci->list);
- spin_lock(&ci->lock);
- ci->order = order;
- ci->flags = 0;
- spin_unlock(&ci->lock);
- tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
- } else if (!list_empty(&si->nonfull_clusters[order])) {
- ci = list_first_entry(&si->nonfull_clusters[order],
- struct swap_cluster_info, list);
- list_del(&ci->list);
- spin_lock(&ci->lock);
- ci->flags = 0;
- spin_unlock(&ci->lock);
- tmp = cluster_index(si, ci) * SWAPFILE_CLUSTER;
- } else if (!list_empty(&si->discard_clusters)) {
- /*
- * we don't have free cluster but have some clusters in
- * discarding, do discard now and reclaim them, then
- * reread cluster_next_cpu since we dropped si->lock
- */
- swap_do_scheduled_discard(si);
- *scan_base = this_cpu_read(*si->cluster_next_cpu);
- *offset = *scan_base;
- goto new_cluster;
- } else
- return false;
+ offset = cluster->next[order];
+ if (offset) {
+ offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
+ if (found)
+ goto done;
}
- /*
- * Other CPUs can use our cluster if they can't find a free cluster,
- * check if there is still free entry in the cluster, maintaining
- * natural alignment.
- */
- max = min_t(unsigned long, si->max, ALIGN(tmp + 1, SWAPFILE_CLUSTER));
- if (tmp < max) {
- ci = lock_cluster(si, tmp);
- while (tmp < max) {
- if (swap_range_empty(si->swap_map, tmp, nr_pages))
- break;
- tmp += nr_pages;
+ if (!list_empty(&si->free_clusters)) {
+ ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
+ VM_BUG_ON(!found);
+ goto done;
+ }
+
+ if (order < PMD_ORDER) {
+ list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, order, usage);
+ if (found)
+ goto done;
}
- unlock_cluster(ci);
}
- if (tmp >= max) {
- cluster->next[order] = SWAP_NEXT_INVALID;
+
+ if (!list_empty(&si->discard_clusters)) {
+ /*
+ * we don't have free cluster but have some clusters in
+ * discarding, do discard now and reclaim them, then
+ * reread cluster_next_cpu since we dropped si->lock
+ */
+ swap_do_scheduled_discard(si);
goto new_cluster;
}
- *offset = tmp;
- *scan_base = tmp;
- tmp += nr_pages;
- cluster->next[order] = tmp < max ? tmp : SWAP_NEXT_INVALID;
- return true;
+
+ if (order)
+ goto done;
+
+ for (int o = 1; o < PMD_ORDER; o++) {
+ if (!list_empty(&si->nonfull_clusters[o])) {
+ ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info,
+ list);
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, 0, usage);
+ VM_BUG_ON(!found);
+ goto done;
+ }
+ }
+
+done:
+ cluster->next[order] = offset;
+ return found;
}
static void __del_from_avail_list(struct swap_info_struct *p)
@@ -754,11 +807,29 @@ static bool swap_offset_available_and_locked(struct swap_info_struct *si,
return false;
}
+static int cluster_alloc_swap(struct swap_info_struct *si,
+ unsigned char usage, int nr,
+ swp_entry_t slots[], int order)
+{
+ int n_ret = 0;
+
+ VM_BUG_ON(!si->cluster_info);
+
+ while (n_ret < nr) {
+ unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
+
+ if (!offset)
+ break;
+ slots[n_ret++] = swp_entry(si->type, offset);
+ }
+
+ return n_ret;
+}
+
static int scan_swap_map_slots(struct swap_info_struct *si,
unsigned char usage, int nr,
swp_entry_t slots[], int order)
{
- struct swap_cluster_info *ci;
unsigned long offset;
unsigned long scan_base;
unsigned long last_in_cluster = 0;
@@ -797,26 +868,16 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
return 0;
}
+ if (si->cluster_info)
+ return cluster_alloc_swap(si, usage, nr, slots, order);
+
si->flags += SWP_SCANNING;
- /*
- * Use percpu scan base for SSD to reduce lock contention on
- * cluster and swap cache. For HDD, sequential access is more
- * important.
- */
- if (si->flags & SWP_SOLIDSTATE)
- scan_base = this_cpu_read(*si->cluster_next_cpu);
- else
- scan_base = si->cluster_next;
+
+ /* For HDD, sequential access is more important. */
+ scan_base = si->cluster_next;
offset = scan_base;
- /* SSD algorithm */
- if (si->cluster_info) {
- if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order)) {
- if (order > 0)
- goto no_page;
- goto scan;
- }
- } else if (unlikely(!si->cluster_nr--)) {
+ if (unlikely(!si->cluster_nr--)) {
if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) {
si->cluster_nr = SWAPFILE_CLUSTER - 1;
goto checks;
@@ -827,8 +888,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
/*
* If seek is expensive, start searching for new cluster from
* start of partition, to minimize the span of allocated swap.
- * If seek is cheap, that is the SWP_SOLIDSTATE si->cluster_info
- * case, just handled by scan_swap_map_try_ssd_cluster() above.
*/
scan_base = offset = si->lowest_bit;
last_in_cluster = offset + SWAPFILE_CLUSTER - 1;
@@ -856,19 +915,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}
checks:
- if (si->cluster_info) {
- while (scan_swap_map_ssd_cluster_conflict(si, offset, order)) {
- /* take a break if we already got some slots */
- if (n_ret)
- goto done;
- if (!scan_swap_map_try_ssd_cluster(si, &offset,
- &scan_base, order)) {
- if (order > 0)
- goto no_page;
- goto scan;
- }
- }
- }
if (!(si->flags & SWP_WRITEOK))
goto no_page;
if (!si->highest_bit)
@@ -876,11 +922,9 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
if (offset > si->highest_bit)
scan_base = offset = si->lowest_bit;
- ci = lock_cluster(si, offset);
/* reuse swap entry of cache-only swap if not busy. */
if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
int swap_was_freed;
- unlock_cluster(ci);
spin_unlock(&si->lock);
swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
spin_lock(&si->lock);
@@ -891,15 +935,12 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
}
if (si->swap_map[offset]) {
- unlock_cluster(ci);
if (!n_ret)
goto scan;
else
goto done;
}
memset(si->swap_map + offset, usage, nr_pages);
- add_cluster_info_page(si, si->cluster_info, offset, nr_pages);
- unlock_cluster(ci);
swap_range_alloc(si, offset, nr_pages);
slots[n_ret++] = swp_entry(si->type, offset);
@@ -920,13 +961,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
latency_ration = LATENCY_LIMIT;
}
- /* try to get more slots in cluster */
- if (si->cluster_info) {
- if (scan_swap_map_try_ssd_cluster(si, &offset, &scan_base, order))
- goto checks;
- if (order > 0)
- goto done;
- } else if (si->cluster_nr && !si->swap_map[++offset]) {
+ if (si->cluster_nr && !si->swap_map[++offset]) {
/* non-ssd case, still more slots in cluster? */
--si->cluster_nr;
goto checks;
@@ -995,8 +1030,6 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
ci = lock_cluster(si, offset);
memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
ci->count = 0;
- ci->order = 0;
- ci->flags = 0;
free_cluster(si, ci);
unlock_cluster(ci);
swap_range_free(si, offset, SWAPFILE_CLUSTER);
@@ -3008,8 +3041,11 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
ci = cluster_info + idx;
if (idx >= nr_clusters)
continue;
- if (ci->count)
+ if (ci->count) {
+ ci->flags = CLUSTER_FLAG_NONFULL;
+ list_add_tail(&ci->list, &p->nonfull_clusters[0]);
continue;
+ }
ci->flags = CLUSTER_FLAG_FREE;
list_add_tail(&ci->list, &p->free_clusters);
}
--
2.46.0.rc1.232.g9752f9e123-goog
* [PATCH v5 4/9] mm: swap: clean up initialization helper
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (2 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 3/9] mm: swap: separate SSD allocation from scan_swap_map_slots() Chris Li
@ 2024-07-31 6:49 ` chrisl
2024-07-31 6:49 ` [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP chrisl
` (7 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
At this point, alloc_cluster() is no longer called, and
inc_cluster_info_page() is only called during initialization, so a lot
of dead code can be dropped.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 44 ++++++++++----------------------------------
1 file changed, 10 insertions(+), 34 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8a72c8a9aafd..34e6ea13e8e4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -438,20 +438,6 @@ static void swap_users_ref_free(struct percpu_ref *ref)
complete(&si->comp);
}
-static struct swap_cluster_info *alloc_cluster(struct swap_info_struct *si, unsigned long idx)
-{
- struct swap_cluster_info *ci = list_first_entry(&si->free_clusters,
- struct swap_cluster_info, list);
-
- lockdep_assert_held(&si->lock);
- lockdep_assert_held(&ci->lock);
- VM_BUG_ON(cluster_index(si, ci) != idx);
- VM_BUG_ON(ci->count);
- list_del(&ci->list);
- ci->flags = 0;
- return ci;
-}
-
static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
{
VM_BUG_ON(ci->count != 0);
@@ -472,34 +458,24 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
}
/*
- * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased by
- * count.
+ * The cluster corresponding to page_nr will be used. The cluster will not be
+ * added to free cluster list and its usage counter will be increased by 1.
+ * Only used for initialization.
*/
-static void add_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr,
- unsigned long count)
+static void inc_cluster_info_page(struct swap_info_struct *p,
+ struct swap_cluster_info *cluster_info, unsigned long page_nr)
{
unsigned long idx = page_nr / SWAPFILE_CLUSTER;
- struct swap_cluster_info *ci = cluster_info + idx;
+ struct swap_cluster_info *ci;
if (!cluster_info)
return;
- if (cluster_is_free(ci))
- alloc_cluster(p, idx);
- VM_BUG_ON(ci->count + count > SWAPFILE_CLUSTER);
- ci->count += count;
-}
+ ci = cluster_info + idx;
+ ci->count++;
-/*
- * The cluster corresponding to page_nr will be used. The cluster will be
- * removed from free cluster list and its usage counter will be increased by 1.
- */
-static void inc_cluster_info_page(struct swap_info_struct *p,
- struct swap_cluster_info *cluster_info, unsigned long page_nr)
-{
- add_cluster_info_page(p, cluster_info, page_nr, 1);
+ VM_BUG_ON(ci->count > SWAPFILE_CLUSTER);
+ VM_BUG_ON(ci->flags);
}
/*
--
2.46.0.rc1.232.g9752f9e123-goog
* [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (3 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 4/9] mm: swap: clean up initialization helper chrisl
@ 2024-07-31 6:49 ` chrisl
2024-08-03 9:11 ` Barry Song
2024-07-31 6:49 ` [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache chrisl
` (6 subsequent siblings)
11 siblings, 1 reply; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Currently, when we free mTHP folios from the swap cache, we free
the entries one by one and put each entry into the swap slot cache.
The slot cache is designed to reduce overhead by batching the
freeing, but mTHP swap entries are already contiguous, so they can be
batch freed without it; going through the slot cache saves little
overhead, or even adds overhead for larger mTHP.

What's more, mTHP entries can stay in the swap cache for a while.
Contiguous swap entries are a rather rare resource, so releasing
them directly can help improve the mTHP allocation success rate
under pressure.
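For illustration, a minimal standalone model of the batched release
(assumed names and values, not the kernel code; the real path also holds
the cluster lock, calls swap_range_free() and uncharges the memcg): the
whole contiguous range is cleared and the cluster usage count is dropped
once, instead of bouncing every entry through the slot cache:

#include <assert.h>
#include <stdio.h>

#define SWAP_HAS_CACHE 0x40     /* illustrative value mirroring the kernel flag */

/* Clear nr contiguous swap_map entries and update the count once. */
static void swap_entry_range_free_sketch(unsigned char *swap_map,
                                         unsigned int *cluster_count,
                                         unsigned long offset, unsigned int nr)
{
        unsigned char *map = swap_map + offset;
        unsigned char *end = map + nr;

        while (map < end) {
                assert(*map == SWAP_HAS_CACHE); /* cache holds the last reference */
                *map++ = 0;
        }
        *cluster_count -= nr;                   /* one bookkeeping update per batch */
}

int main(void)
{
        unsigned char swap_map[16] = {
                [4] = SWAP_HAS_CACHE, [5] = SWAP_HAS_CACHE,
                [6] = SWAP_HAS_CACHE, [7] = SWAP_HAS_CACHE,
        };
        unsigned int count = 4;

        /* Free a 4-entry (order-2) mTHP worth of contiguous swap entries. */
        swap_entry_range_free_sketch(swap_map, &count, 4, 4);
        printf("cluster count after batch free: %u\n", count);
        return 0;
}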
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 59 ++++++++++++++++++++++++++---------------------------------
1 file changed, 26 insertions(+), 33 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 34e6ea13e8e4..9b63b2262cc2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
}
/*
- * The cluster ci decreases one usage. If the usage counter becomes 0,
+ * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
* which means no page in the cluster is in use, we can optionally discard
* the cluster and add it to free cluster list.
*/
-static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
+static void dec_cluster_info_page(struct swap_info_struct *p,
+ struct swap_cluster_info *ci, int nr_pages)
{
if (!p->cluster_info)
return;
- VM_BUG_ON(ci->count == 0);
+ VM_BUG_ON(ci->count < nr_pages);
VM_BUG_ON(cluster_is_free(ci));
lockdep_assert_held(&p->lock);
lockdep_assert_held(&ci->lock);
- ci->count--;
+ ci->count -= nr_pages;
if (!ci->count) {
free_cluster(p, ci);
@@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
return n_ret;
}
-static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
-{
- unsigned long offset = idx * SWAPFILE_CLUSTER;
- struct swap_cluster_info *ci;
-
- ci = lock_cluster(si, offset);
- memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
- ci->count = 0;
- free_cluster(si, ci);
- unlock_cluster(ci);
- swap_range_free(si, offset, SWAPFILE_CLUSTER);
-}
-
int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
{
int order = swap_entry_order(entry_order);
@@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
return usage;
}
-static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
+/*
+ * Drop the last HAS_CACHE flag of swap entries, caller have to
+ * ensure all entries belong to the same cgroup.
+ */
+static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
+ unsigned int nr_pages)
{
- struct swap_cluster_info *ci;
unsigned long offset = swp_offset(entry);
- unsigned char count;
+ unsigned char *map = p->swap_map + offset;
+ unsigned char *map_end = map + nr_pages;
+ struct swap_cluster_info *ci;
ci = lock_cluster(p, offset);
- count = p->swap_map[offset];
- VM_BUG_ON(count != SWAP_HAS_CACHE);
- p->swap_map[offset] = 0;
- dec_cluster_info_page(p, ci);
+ do {
+ VM_BUG_ON(*map != SWAP_HAS_CACHE);
+ *map = 0;
+ } while (++map < map_end);
+ dec_cluster_info_page(p, ci, nr_pages);
unlock_cluster(ci);
- mem_cgroup_uncharge_swap(entry, 1);
- swap_range_free(p, offset, 1);
+ mem_cgroup_uncharge_swap(entry, nr_pages);
+ swap_range_free(p, offset, nr_pages);
}
static void cluster_swap_free_nr(struct swap_info_struct *sis,
@@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
unsigned long offset = swp_offset(entry);
- unsigned long idx = offset / SWAPFILE_CLUSTER;
struct swap_cluster_info *ci;
struct swap_info_struct *si;
unsigned char *map;
@@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
return;
ci = lock_cluster_or_swap_info(si, offset);
- if (size == SWAPFILE_CLUSTER) {
+ if (size > 1) {
map = si->swap_map + offset;
- for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+ for (i = 0; i < size; i++) {
val = map[i];
VM_BUG_ON(!(val & SWAP_HAS_CACHE));
if (val == SWAP_HAS_CACHE)
free_entries++;
}
- if (free_entries == SWAPFILE_CLUSTER) {
+ if (free_entries == size) {
unlock_cluster_or_swap_info(si, ci);
spin_lock(&si->lock);
- mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
- swap_free_cluster(si, idx);
+ swap_entry_range_free(si, entry, size);
spin_unlock(&si->lock);
return;
}
@@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
for (i = 0; i < n; ++i) {
p = swap_info_get_cont(entries[i], prev);
if (p)
- swap_entry_free(p, entries[i]);
+ swap_entry_range_free(p, entries[i], 1);
prev = p;
}
if (p)
--
2.46.0.rc1.232.g9752f9e123-goog
* Re: [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
2024-07-31 6:49 ` [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP chrisl
@ 2024-08-03 9:11 ` Barry Song
2024-08-03 10:57 ` Barry Song
0 siblings, 1 reply; 32+ messages in thread
From: Barry Song @ 2024-08-03 9:11 UTC (permalink / raw)
To: chrisl
Cc: akpm, baohua, hughd, kaleshsingh, kasong, linux-kernel, linux-mm,
ryan.roberts, ying.huang, Barry Song
On Wed, Jul 31, 2024 at 6:49 PM <chrisl@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently when we are freeing mTHP folios from swap cache, we free
> then one by one and put each entry into swap slot cache. Slot
> cache is designed to reduce the overhead by batching the freeing,
> but mTHP swap entries are already continuous so they can be batch
> freed without it already, it saves litle overhead, or even increase
> overhead for larger mTHP.
>
> What's more, mTHP entries could stay in swap cache for a while.
> Contiguous swap entry is an rather rare resource so releasing them
> directly can help improve mTHP allocation success rate when under
> pressure.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Barry Song <baohua@kernel.org>
I believe this is the right direction to take. Currently, entries are released
one by one, even when they are contiguous in the swap file (those nr_pages
entries are definitely in the same cluster and the same si), leading to numerous
lock and unlock operations.
This approach provides batched support.

free_swap_and_cache_nr() has the same issue, so I drafted a patch based on
your code. I wonder if you can also help test and review it before I send it
officially:
From 4bed5c08bc0f7769ee2849812acdad70c4e32ead Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Sat, 3 Aug 2024 20:21:14 +1200
Subject: [PATCH RFC] mm: attempt to batch free swap entries for zap_pte_range()
Zhiguo reported that swap release could be a serious bottleneck
during process exits[1]. With mTHP, we have the opportunity to
batch free swaps.
Thanks to the work of Chris and Kairui[2], I was able to achieve
this optimization with minimal code changes by building on their
efforts.
[1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/
[2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/swapfile.c | 43 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ea023fc25d08..9def6dba8d26 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
return true;
}
+static bool swap_is_last_map(struct swap_info_struct *si,
+ unsigned long offset, int nr_pages,
+ bool *any_only_cache)
+{
+ unsigned char *map = si->swap_map + offset;
+ unsigned char *map_end = map + nr_pages;
+ bool cached = false;
+
+ do {
+ if ((*map & ~SWAP_HAS_CACHE) != 1)
+ return false;
+ if (*map & SWAP_HAS_CACHE)
+ cached = true;
+ } while (++map < map_end);
+
+ *any_only_cache = cached;
+ return true;
+}
+
/*
* returns number of pages in the folio that backs the swap entry. If positive,
* the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
@@ -1808,6 +1827,29 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
if (WARN_ON(end_offset > si->max))
goto out;
+ if (nr > 1) {
+ struct swap_cluster_info *ci;
+ bool batched_free;
+ int i;
+
+ ci = lock_cluster_or_swap_info(si, start_offset);
+ if ((batched_free = swap_is_last_map(si, start_offset, nr, &any_only_cache))) {
+ for (i = 0; i < nr; i++)
+ WRITE_ONCE(si->swap_map[start_offset + i], SWAP_HAS_CACHE);
+ }
+ unlock_cluster_or_swap_info(si, ci);
+
+ if (batched_free) {
+ spin_lock(&si->lock);
+ pr_err("%s offset:%lx nr:%lx\n", __func__,start_offset, nr);
+ swap_entry_range_free(si, entry, nr);
+ spin_unlock(&si->lock);
+ if (any_only_cache)
+ goto reclaim;
+ goto out;
+ }
+ }
+
/*
* First free all entries in the range.
*/
@@ -1828,6 +1870,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
if (!any_only_cache)
goto out;
+reclaim:
/*
* Now go back over the range trying to reclaim the swap cache. This is
* more efficient for large folios because we will only try to reclaim
--
2.34.1
> ---
> mm/swapfile.c | 59 ++++++++++++++++++++++++++---------------------------------
> 1 file changed, 26 insertions(+), 33 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 34e6ea13e8e4..9b63b2262cc2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> }
>
> /*
> - * The cluster ci decreases one usage. If the usage counter becomes 0,
> + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
> * which means no page in the cluster is in use, we can optionally discard
> * the cluster and add it to free cluster list.
> */
> -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> +static void dec_cluster_info_page(struct swap_info_struct *p,
> + struct swap_cluster_info *ci, int nr_pages)
> {
> if (!p->cluster_info)
> return;
>
> - VM_BUG_ON(ci->count == 0);
> + VM_BUG_ON(ci->count < nr_pages);
> VM_BUG_ON(cluster_is_free(ci));
> lockdep_assert_held(&p->lock);
> lockdep_assert_held(&ci->lock);
> - ci->count--;
> + ci->count -= nr_pages;
>
> if (!ci->count) {
> free_cluster(p, ci);
> @@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> return n_ret;
> }
>
> -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> -{
> - unsigned long offset = idx * SWAPFILE_CLUSTER;
> - struct swap_cluster_info *ci;
> -
> - ci = lock_cluster(si, offset);
> - memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> - ci->count = 0;
> - free_cluster(si, ci);
> - unlock_cluster(ci);
> - swap_range_free(si, offset, SWAPFILE_CLUSTER);
> -}
> -
> int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> {
> int order = swap_entry_order(entry_order);
> @@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
> return usage;
> }
>
> -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> +/*
> + * Drop the last HAS_CACHE flag of swap entries, caller have to
> + * ensure all entries belong to the same cgroup.
> + */
> +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
> + unsigned int nr_pages)
> {
> - struct swap_cluster_info *ci;
> unsigned long offset = swp_offset(entry);
> - unsigned char count;
> + unsigned char *map = p->swap_map + offset;
> + unsigned char *map_end = map + nr_pages;
> + struct swap_cluster_info *ci;
>
> ci = lock_cluster(p, offset);
> - count = p->swap_map[offset];
> - VM_BUG_ON(count != SWAP_HAS_CACHE);
> - p->swap_map[offset] = 0;
> - dec_cluster_info_page(p, ci);
> + do {
> + VM_BUG_ON(*map != SWAP_HAS_CACHE);
> + *map = 0;
> + } while (++map < map_end);
> + dec_cluster_info_page(p, ci, nr_pages);
> unlock_cluster(ci);
>
> - mem_cgroup_uncharge_swap(entry, 1);
> - swap_range_free(p, offset, 1);
> + mem_cgroup_uncharge_swap(entry, nr_pages);
> + swap_range_free(p, offset, nr_pages);
> }
>
> static void cluster_swap_free_nr(struct swap_info_struct *sis,
> @@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
> void put_swap_folio(struct folio *folio, swp_entry_t entry)
> {
> unsigned long offset = swp_offset(entry);
> - unsigned long idx = offset / SWAPFILE_CLUSTER;
> struct swap_cluster_info *ci;
> struct swap_info_struct *si;
> unsigned char *map;
> @@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> return;
>
> ci = lock_cluster_or_swap_info(si, offset);
> - if (size == SWAPFILE_CLUSTER) {
> + if (size > 1) {
> map = si->swap_map + offset;
> - for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> + for (i = 0; i < size; i++) {
> val = map[i];
> VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> if (val == SWAP_HAS_CACHE)
> free_entries++;
> }
> - if (free_entries == SWAPFILE_CLUSTER) {
> + if (free_entries == size) {
> unlock_cluster_or_swap_info(si, ci);
> spin_lock(&si->lock);
> - mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> - swap_free_cluster(si, idx);
> + swap_entry_range_free(si, entry, size);
> spin_unlock(&si->lock);
> return;
> }
> @@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> for (i = 0; i < n; ++i) {
> p = swap_info_get_cont(entries[i], prev);
> if (p)
> - swap_entry_free(p, entries[i]);
> + swap_entry_range_free(p, entries[i], 1);
> prev = p;
> }
> if (p)
>
> --
> 2.46.0.rc1.232.g9752f9e123-goog
>
Thanks
Barry
* Re: [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP
2024-08-03 9:11 ` Barry Song
@ 2024-08-03 10:57 ` Barry Song
0 siblings, 0 replies; 32+ messages in thread
From: Barry Song @ 2024-08-03 10:57 UTC (permalink / raw)
To: chrisl
Cc: akpm, hughd, kaleshsingh, kasong, linux-kernel, linux-mm,
ryan.roberts, ying.huang, Barry Song
On Sat, Aug 3, 2024 at 9:11 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Jul 31, 2024 at 6:49 PM <chrisl@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently when we are freeing mTHP folios from swap cache, we free
> > then one by one and put each entry into swap slot cache. Slot
> > cache is designed to reduce the overhead by batching the freeing,
> > but mTHP swap entries are already continuous so they can be batch
> > freed without it already, it saves litle overhead, or even increase
> > overhead for larger mTHP.
> >
> > What's more, mTHP entries could stay in swap cache for a while.
> > Contiguous swap entry is an rather rare resource so releasing them
> > directly can help improve mTHP allocation success rate when under
> > pressure.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Acked-by: Barry Song <baohua@kernel.org>
>
> I believe this is the right direction to take. Currently, entries are released
> one by one, even when they are contiguous in the swap file(those nr_pages
> entries are definitely in the same cluster and same si), leading to numerous
> lock and unlock operations.
> This approach provides batched support.
>
> free_swap_and_cache_nr() has the same issue, so I drafted a patch based on
> your code. I wonder if you can also help test and review before I send it
> officially:
>
> From 4bed5c08bc0f7769ee2849812acdad70c4e32ead Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@oppo.com>
> Date: Sat, 3 Aug 2024 20:21:14 +1200
> Subject: [PATCH RFC] mm: attempt to batch free swap entries for zap_pte_range()
>
> Zhiguo reported that swap release could be a serious bottleneck
> during process exits[1]. With mTHP, we have the opportunity to
> batch free swaps.
> Thanks to the work of Chris and Kairui[2], I was able to achieve
> this optimization with minimal code changes by building on their
> efforts.
>
> [1] https://lore.kernel.org/linux-mm/20240731133318.527-1-justinjiang@vivo.com/
> [2] https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/swapfile.c | 43 +++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 43 insertions(+)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index ea023fc25d08..9def6dba8d26 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -156,6 +156,25 @@ static bool swap_is_has_cache(struct swap_info_struct *si,
> return true;
> }
>
> +static bool swap_is_last_map(struct swap_info_struct *si,
> + unsigned long offset, int nr_pages,
> + bool *any_only_cache)
> +{
> + unsigned char *map = si->swap_map + offset;
> + unsigned char *map_end = map + nr_pages;
> + bool cached = false;
> +
> + do {
> + if ((*map & ~SWAP_HAS_CACHE) != 1)
> + return false;
> + if (*map & SWAP_HAS_CACHE)
> + cached = true;
> + } while (++map < map_end);
> +
> + *any_only_cache = cached;
> + return true;
> +}
> +
> /*
> * returns number of pages in the folio that backs the swap entry. If positive,
> * the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
> @@ -1808,6 +1827,29 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> if (WARN_ON(end_offset > si->max))
> goto out;
>
> + if (nr > 1) {
> + struct swap_cluster_info *ci;
> + bool batched_free;
> + int i;
> +
> + ci = lock_cluster_or_swap_info(si, start_offset);
> + if ((batched_free = swap_is_last_map(si, start_offset, nr, &any_only_cache))) {
> + for (i = 0; i < nr; i++)
> + WRITE_ONCE(si->swap_map[start_offset + i], SWAP_HAS_CACHE);
> + }
> + unlock_cluster_or_swap_info(si, ci);
> +
> + if (batched_free) {
> + spin_lock(&si->lock);
> + pr_err("%s offset:%lx nr:%lx\n", __func__,start_offset, nr);
> + swap_entry_range_free(si, entry, nr);
> + spin_unlock(&si->lock);
> + if (any_only_cache)
> + goto reclaim;
> + goto out;
> + }
Sorry, what I actually meant is that the two gotos above are in the wrong order; it should be:
        if (batched_free) {
                if (any_only_cache)
                        goto reclaim;
                spin_lock(&si->lock);
                swap_entry_range_free(si, entry, nr);
                spin_unlock(&si->lock);
                goto out;
        }
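Folding that correction into the draft, the batched part of free_swap_and_cache_nr()
would look roughly like the sketch below (declarations and the reclaim/out labels
are the ones from the draft above; this is only an illustration, not the final patch):

        if (nr > 1) {
                struct swap_cluster_info *ci;
                bool batched_free;
                int i;

                ci = lock_cluster_or_swap_info(si, start_offset);
                batched_free = swap_is_last_map(si, start_offset, nr, &any_only_cache);
                if (batched_free) {
                        /* Drop the last map reference of every entry at once. */
                        for (i = 0; i < nr; i++)
                                WRITE_ONCE(si->swap_map[start_offset + i], SWAP_HAS_CACHE);
                }
                unlock_cluster_or_swap_info(si, ci);

                if (batched_free) {
                        /*
                         * Entries still held by the swap cache are handled by
                         * the reclaim pass below instead of a direct free.
                         */
                        if (any_only_cache)
                                goto reclaim;
                        spin_lock(&si->lock);
                        swap_entry_range_free(si, entry, nr);
                        spin_unlock(&si->lock);
                        goto out;
                }
        }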
> + }
> +
> /*
> * First free all entries in the range.
> */
> @@ -1828,6 +1870,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> if (!any_only_cache)
> goto out;
>
> +reclaim:
> /*
> * Now go back over the range trying to reclaim the swap cache. This is
> * more efficient for large folios because we will only try to reclaim
> --
> 2.34.1
>
>
>
> > ---
> > mm/swapfile.c | 59 ++++++++++++++++++++++++++---------------------------------
> > 1 file changed, 26 insertions(+), 33 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 34e6ea13e8e4..9b63b2262cc2 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -479,20 +479,21 @@ static void inc_cluster_info_page(struct swap_info_struct *p,
> > }
> >
> > /*
> > - * The cluster ci decreases one usage. If the usage counter becomes 0,
> > + * The cluster ci decreases @nr_pages usage. If the usage counter becomes 0,
> > * which means no page in the cluster is in use, we can optionally discard
> > * the cluster and add it to free cluster list.
> > */
> > -static void dec_cluster_info_page(struct swap_info_struct *p, struct swap_cluster_info *ci)
> > +static void dec_cluster_info_page(struct swap_info_struct *p,
> > + struct swap_cluster_info *ci, int nr_pages)
> > {
> > if (!p->cluster_info)
> > return;
> >
> > - VM_BUG_ON(ci->count == 0);
> > + VM_BUG_ON(ci->count < nr_pages);
> > VM_BUG_ON(cluster_is_free(ci));
> > lockdep_assert_held(&p->lock);
> > lockdep_assert_held(&ci->lock);
> > - ci->count--;
> > + ci->count -= nr_pages;
> >
> > if (!ci->count) {
> > free_cluster(p, ci);
> > @@ -998,19 +999,6 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> > return n_ret;
> > }
> >
> > -static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
> > -{
> > - unsigned long offset = idx * SWAPFILE_CLUSTER;
> > - struct swap_cluster_info *ci;
> > -
> > - ci = lock_cluster(si, offset);
> > - memset(si->swap_map + offset, 0, SWAPFILE_CLUSTER);
> > - ci->count = 0;
> > - free_cluster(si, ci);
> > - unlock_cluster(ci);
> > - swap_range_free(si, offset, SWAPFILE_CLUSTER);
> > -}
> > -
> > int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
> > {
> > int order = swap_entry_order(entry_order);
> > @@ -1269,21 +1257,28 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
> > return usage;
> > }
> >
> > -static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
> > +/*
> > + * Drop the last HAS_CACHE flag of swap entries, caller have to
> > + * ensure all entries belong to the same cgroup.
> > + */
> > +static void swap_entry_range_free(struct swap_info_struct *p, swp_entry_t entry,
> > + unsigned int nr_pages)
> > {
> > - struct swap_cluster_info *ci;
> > unsigned long offset = swp_offset(entry);
> > - unsigned char count;
> > + unsigned char *map = p->swap_map + offset;
> > + unsigned char *map_end = map + nr_pages;
> > + struct swap_cluster_info *ci;
> >
> > ci = lock_cluster(p, offset);
> > - count = p->swap_map[offset];
> > - VM_BUG_ON(count != SWAP_HAS_CACHE);
> > - p->swap_map[offset] = 0;
> > - dec_cluster_info_page(p, ci);
> > + do {
> > + VM_BUG_ON(*map != SWAP_HAS_CACHE);
> > + *map = 0;
> > + } while (++map < map_end);
> > + dec_cluster_info_page(p, ci, nr_pages);
> > unlock_cluster(ci);
> >
> > - mem_cgroup_uncharge_swap(entry, 1);
> > - swap_range_free(p, offset, 1);
> > + mem_cgroup_uncharge_swap(entry, nr_pages);
> > + swap_range_free(p, offset, nr_pages);
> > }
> >
> > static void cluster_swap_free_nr(struct swap_info_struct *sis,
> > @@ -1343,7 +1338,6 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
> > void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > {
> > unsigned long offset = swp_offset(entry);
> > - unsigned long idx = offset / SWAPFILE_CLUSTER;
> > struct swap_cluster_info *ci;
> > struct swap_info_struct *si;
> > unsigned char *map;
> > @@ -1356,19 +1350,18 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> > return;
> >
> > ci = lock_cluster_or_swap_info(si, offset);
> > - if (size == SWAPFILE_CLUSTER) {
> > + if (size > 1) {
> > map = si->swap_map + offset;
> > - for (i = 0; i < SWAPFILE_CLUSTER; i++) {
> > + for (i = 0; i < size; i++) {
> > val = map[i];
> > VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> > if (val == SWAP_HAS_CACHE)
> > free_entries++;
> > }
> > - if (free_entries == SWAPFILE_CLUSTER) {
> > + if (free_entries == size) {
> > unlock_cluster_or_swap_info(si, ci);
> > spin_lock(&si->lock);
> > - mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
> > - swap_free_cluster(si, idx);
> > + swap_entry_range_free(si, entry, size);
> > spin_unlock(&si->lock);
> > return;
> > }
> > @@ -1413,7 +1406,7 @@ void swapcache_free_entries(swp_entry_t *entries, int n)
> > for (i = 0; i < n; ++i) {
> > p = swap_info_get_cont(entries[i], prev);
> > if (p)
> > - swap_entry_free(p, entries[i]);
> > + swap_entry_range_free(p, entries[i], 1);
> > prev = p;
> > }
> > if (p)
> >
> > --
> > 2.46.0.rc1.232.g9752f9e123-goog
> >
>
> Thanks
> Barry
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (4 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 5/9] mm: swap: skip slot cache on freeing for mTHP chrisl
@ 2024-07-31 6:49 ` chrisl
2024-08-03 10:38 ` Barry Song
2024-07-31 6:49 ` [PATCH v5 7/9] mm: swap: add a fragment cluster list chrisl
` (5 subsequent siblings)
11 siblings, 1 reply; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Currently we free the reclaimed slots through the slot cache even
if the slot is required to be empty immediately. As a result
the reclaim caller will see the slot still occupied even after a
successful reclaim, and needs to keep reclaiming until the slot cache
gets flushed. This causes ineffective or excessive reclaim when swap is
under stress.
So introduce a new flag allowing the slot to be emptied while bypassing
the slot cache.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
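A minimal caller-side sketch (mirroring the scan_swap_map_slots() hunk in the
diff below, nothing beyond it): a caller that needs the slot to be empty right
away passes TTRS_DIRECT together with the usual policy flags.

        /*
         * Sketch: TTRS_DIRECT makes __try_to_reclaim_swap() free the slot
         * immediately instead of routing the free through the slot cache,
         * so a positive return value means the slot is really empty now.
         */
        spin_unlock(&si->lock);
        swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
        spin_lock(&si->lock);
        if (swap_was_freed > 0) {
                /* The offset can be allocated again right away. */
        }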
mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
1 file changed, 109 insertions(+), 43 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9b63b2262cc2..4c0fc0409d3c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -53,8 +53,15 @@
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
+static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
+ unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
unsigned int nr_entries);
+static bool folio_swapcache_freeable(struct folio *folio);
+static struct swap_cluster_info *lock_cluster_or_swap_info(
+ struct swap_info_struct *si, unsigned long offset);
+static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
+ struct swap_cluster_info *ci);
static DEFINE_SPINLOCK(swap_lock);
static unsigned int nr_swapfiles;
@@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
* corresponding page
*/
#define TTRS_UNMAPPED 0x2
-/* Reclaim the swap entry if swap is getting full*/
+/* Reclaim the swap entry if swap is getting full */
#define TTRS_FULL 0x4
+/* Reclaim directly, bypass the slot cache and don't touch device lock */
+#define TTRS_DIRECT 0x8
+
+static bool swap_is_has_cache(struct swap_info_struct *si,
+ unsigned long offset, int nr_pages)
+{
+ unsigned char *map = si->swap_map + offset;
+ unsigned char *map_end = map + nr_pages;
+
+ do {
+ VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
+ if (*map != SWAP_HAS_CACHE)
+ return false;
+ } while (++map < map_end);
+
+ return true;
+}
/*
* returns number of pages in the folio that backs the swap entry. If positive,
@@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
unsigned long offset, unsigned long flags)
{
swp_entry_t entry = swp_entry(si->type, offset);
+ struct address_space *address_space = swap_address_space(entry);
+ struct swap_cluster_info *ci;
struct folio *folio;
- int ret = 0;
+ int ret, nr_pages;
+ bool need_reclaim;
- folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
+ folio = filemap_get_folio(address_space, swap_cache_index(entry));
if (IS_ERR(folio))
return 0;
+
+ /* offset could point to the middle of a large folio */
+ entry = folio->swap;
+ offset = swp_offset(entry);
+ nr_pages = folio_nr_pages(folio);
+ ret = -nr_pages;
+
/*
* When this function is called from scan_swap_map_slots() and it's
* called by vmscan.c at reclaiming folios. So we hold a folio lock
@@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* case and you should use folio_free_swap() with explicit folio_lock()
* in usual operations.
*/
- if (folio_trylock(folio)) {
- if ((flags & TTRS_ANYWAY) ||
- ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
- ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
- ret = folio_free_swap(folio);
- folio_unlock(folio);
+ if (!folio_trylock(folio))
+ goto out;
+
+ need_reclaim = ((flags & TTRS_ANYWAY) ||
+ ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
+ ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
+ if (!need_reclaim || !folio_swapcache_freeable(folio))
+ goto out_unlock;
+
+ /*
+ * It's safe to delete the folio from swap cache only if the folio's
+ * swap_map is HAS_CACHE only, which means the slots have no page table
+ * reference or pending writeback, and can't be allocated to others.
+ */
+ ci = lock_cluster_or_swap_info(si, offset);
+ need_reclaim = swap_is_has_cache(si, offset, nr_pages);
+ unlock_cluster_or_swap_info(si, ci);
+ if (!need_reclaim)
+ goto out_unlock;
+
+ if (!(flags & TTRS_DIRECT)) {
+ /* Free through slot cache */
+ delete_from_swap_cache(folio);
+ folio_set_dirty(folio);
+ ret = nr_pages;
+ goto out_unlock;
}
- ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
+
+ xa_lock_irq(&address_space->i_pages);
+ __delete_from_swap_cache(folio, entry, NULL);
+ xa_unlock_irq(&address_space->i_pages);
+ folio_ref_sub(folio, nr_pages);
+ folio_set_dirty(folio);
+
+ spin_lock(&si->lock);
+ /* Only sinple page folio can be backed by zswap */
+ if (!nr_pages)
+ zswap_invalidate(entry);
+ swap_entry_range_free(si, entry, nr_pages);
+ spin_unlock(&si->lock);
+ ret = nr_pages;
+out_unlock:
+ folio_unlock(folio);
+out:
folio_put(folio);
return ret;
}
@@ -903,7 +973,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
int swap_was_freed;
spin_unlock(&si->lock);
- swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+ swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
spin_lock(&si->lock);
/* entry was freed successfully, try to use this again */
if (swap_was_freed > 0)
@@ -1340,9 +1410,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
unsigned long offset = swp_offset(entry);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
- unsigned char *map;
- unsigned int i, free_entries = 0;
- unsigned char val;
int size = 1 << swap_entry_order(folio_order(folio));
si = _swap_info_get(entry);
@@ -1350,23 +1417,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
return;
ci = lock_cluster_or_swap_info(si, offset);
- if (size > 1) {
- map = si->swap_map + offset;
- for (i = 0; i < size; i++) {
- val = map[i];
- VM_BUG_ON(!(val & SWAP_HAS_CACHE));
- if (val == SWAP_HAS_CACHE)
- free_entries++;
- }
- if (free_entries == size) {
- unlock_cluster_or_swap_info(si, ci);
- spin_lock(&si->lock);
- swap_entry_range_free(si, entry, size);
- spin_unlock(&si->lock);
- return;
- }
+ if (size > 1 && swap_is_has_cache(si, offset, size)) {
+ unlock_cluster_or_swap_info(si, ci);
+ spin_lock(&si->lock);
+ swap_entry_range_free(si, entry, size);
+ spin_unlock(&si->lock);
+ return;
}
- for (i = 0; i < size; i++, entry.val++) {
+ for (int i = 0; i < size; i++, entry.val++) {
if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
unlock_cluster_or_swap_info(si, ci);
free_swap_slot(entry);
@@ -1526,16 +1584,7 @@ static bool folio_swapped(struct folio *folio)
return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
}
-/**
- * folio_free_swap() - Free the swap space used for this folio.
- * @folio: The folio to remove.
- *
- * If swap is getting full, or if there are no more mappings of this folio,
- * then call folio_free_swap to free its swap space.
- *
- * Return: true if we were able to release the swap space.
- */
-bool folio_free_swap(struct folio *folio)
+static bool folio_swapcache_freeable(struct folio *folio)
{
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
@@ -1543,8 +1592,6 @@ bool folio_free_swap(struct folio *folio)
return false;
if (folio_test_writeback(folio))
return false;
- if (folio_swapped(folio))
- return false;
/*
* Once hibernation has begun to create its image of memory,
@@ -1564,6 +1611,25 @@ bool folio_free_swap(struct folio *folio)
if (pm_suspended_storage())
return false;
+ return true;
+}
+
+/**
+ * folio_free_swap() - Free the swap space used for this folio.
+ * @folio: The folio to remove.
+ *
+ * If swap is getting full, or if there are no more mappings of this folio,
+ * then call folio_free_swap to free its swap space.
+ *
+ * Return: true if we were able to release the swap space.
+ */
+bool folio_free_swap(struct folio *folio)
+{
+ if (!folio_swapcache_freeable(folio))
+ return false;
+ if (folio_swapped(folio))
+ return false;
+
delete_from_swap_cache(folio);
folio_set_dirty(folio);
return true;
@@ -1640,7 +1706,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
* to the next boundary.
*/
nr = __try_to_reclaim_swap(si, offset,
- TTRS_UNMAPPED | TTRS_FULL);
+ TTRS_UNMAPPED | TTRS_FULL);
if (nr == 0)
nr = 1;
else if (nr < 0)
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-07-31 6:49 ` [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache chrisl
@ 2024-08-03 10:38 ` Barry Song
2024-08-03 12:18 ` Kairui Song
0 siblings, 1 reply; 32+ messages in thread
From: Barry Song @ 2024-08-03 10:38 UTC (permalink / raw)
To: chrisl
Cc: Andrew Morton, Kairui Song, Hugh Dickins, Ryan Roberts,
Huang, Ying, Kalesh Singh, linux-kernel, linux-mm
On Wed, Jul 31, 2024 at 2:49 PM <chrisl@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Currently we free the reclaimed slots through slot cache even
> if the slot is required to be empty immediately. As a result
> the reclaim caller will see the slot still occupied even after a
> successful reclaim, and need to keep reclaiming until slot cache
> get flushed. This caused ineffective or over reclaim when SWAP is
> under stress.
>
> So introduce a new flag allowing the slot to be emptied bypassing
> the slot cache.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 109 insertions(+), 43 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 9b63b2262cc2..4c0fc0409d3c 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -53,8 +53,15 @@
> static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
> unsigned char);
> static void free_swap_count_continuations(struct swap_info_struct *);
> +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
> + unsigned int nr_pages);
> static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> unsigned int nr_entries);
> +static bool folio_swapcache_freeable(struct folio *folio);
> +static struct swap_cluster_info *lock_cluster_or_swap_info(
> + struct swap_info_struct *si, unsigned long offset);
> +static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> + struct swap_cluster_info *ci);
>
> static DEFINE_SPINLOCK(swap_lock);
> static unsigned int nr_swapfiles;
> @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
> * corresponding page
> */
> #define TTRS_UNMAPPED 0x2
> -/* Reclaim the swap entry if swap is getting full*/
> +/* Reclaim the swap entry if swap is getting full */
> #define TTRS_FULL 0x4
> +/* Reclaim directly, bypass the slot cache and don't touch device lock */
> +#define TTRS_DIRECT 0x8
> +
> +static bool swap_is_has_cache(struct swap_info_struct *si,
> + unsigned long offset, int nr_pages)
> +{
> + unsigned char *map = si->swap_map + offset;
> + unsigned char *map_end = map + nr_pages;
> +
> + do {
> + VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> + if (*map != SWAP_HAS_CACHE)
> + return false;
> + } while (++map < map_end);
> +
> + return true;
> +}
>
> /*
> * returns number of pages in the folio that backs the swap entry. If positive,
> @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> unsigned long offset, unsigned long flags)
> {
> swp_entry_t entry = swp_entry(si->type, offset);
> + struct address_space *address_space = swap_address_space(entry);
> + struct swap_cluster_info *ci;
> struct folio *folio;
> - int ret = 0;
> + int ret, nr_pages;
> + bool need_reclaim;
>
> - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> + folio = filemap_get_folio(address_space, swap_cache_index(entry));
> if (IS_ERR(folio))
> return 0;
> +
> + /* offset could point to the middle of a large folio */
> + entry = folio->swap;
> + offset = swp_offset(entry);
> + nr_pages = folio_nr_pages(folio);
> + ret = -nr_pages;
> +
> /*
> * When this function is called from scan_swap_map_slots() and it's
> * called by vmscan.c at reclaiming folios. So we hold a folio lock
> @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> * case and you should use folio_free_swap() with explicit folio_lock()
> * in usual operations.
> */
> - if (folio_trylock(folio)) {
> - if ((flags & TTRS_ANYWAY) ||
> - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
> - ret = folio_free_swap(folio);
> - folio_unlock(folio);
> + if (!folio_trylock(folio))
> + goto out;
> +
> + need_reclaim = ((flags & TTRS_ANYWAY) ||
> + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
> + if (!need_reclaim || !folio_swapcache_freeable(folio))
> + goto out_unlock;
> +
> + /*
> + * It's safe to delete the folio from swap cache only if the folio's
> + * swap_map is HAS_CACHE only, which means the slots have no page table
> + * reference or pending writeback, and can't be allocated to others.
> + */
> + ci = lock_cluster_or_swap_info(si, offset);
> + need_reclaim = swap_is_has_cache(si, offset, nr_pages);
> + unlock_cluster_or_swap_info(si, ci);
> + if (!need_reclaim)
> + goto out_unlock;
> +
> + if (!(flags & TTRS_DIRECT)) {
> + /* Free through slot cache */
> + delete_from_swap_cache(folio);
> + folio_set_dirty(folio);
> + ret = nr_pages;
> + goto out_unlock;
> }
> - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
> +
> + xa_lock_irq(&address_space->i_pages);
> + __delete_from_swap_cache(folio, entry, NULL);
> + xa_unlock_irq(&address_space->i_pages);
> + folio_ref_sub(folio, nr_pages);
> + folio_set_dirty(folio);
> +
> + spin_lock(&si->lock);
> + /* Only sinple page folio can be backed by zswap */
> + if (!nr_pages)
> + zswap_invalidate(entry);
I am trying to figure out if I am mad :-) Does nr_pages == 0 mean single
page folio?
> + swap_entry_range_free(si, entry, nr_pages);
> + spin_unlock(&si->lock);
> + ret = nr_pages;
> +out_unlock:
> + folio_unlock(folio);
> +out:
> folio_put(folio);
> return ret;
> }
> @@ -903,7 +973,7 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
> if (vm_swap_full() && si->swap_map[offset] == SWAP_HAS_CACHE) {
> int swap_was_freed;
> spin_unlock(&si->lock);
> - swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> + swap_was_freed = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT);
> spin_lock(&si->lock);
> /* entry was freed successfully, try to use this again */
> if (swap_was_freed > 0)
> @@ -1340,9 +1410,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> unsigned long offset = swp_offset(entry);
> struct swap_cluster_info *ci;
> struct swap_info_struct *si;
> - unsigned char *map;
> - unsigned int i, free_entries = 0;
> - unsigned char val;
> int size = 1 << swap_entry_order(folio_order(folio));
>
> si = _swap_info_get(entry);
> @@ -1350,23 +1417,14 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
> return;
>
> ci = lock_cluster_or_swap_info(si, offset);
> - if (size > 1) {
> - map = si->swap_map + offset;
> - for (i = 0; i < size; i++) {
> - val = map[i];
> - VM_BUG_ON(!(val & SWAP_HAS_CACHE));
> - if (val == SWAP_HAS_CACHE)
> - free_entries++;
> - }
> - if (free_entries == size) {
> - unlock_cluster_or_swap_info(si, ci);
> - spin_lock(&si->lock);
> - swap_entry_range_free(si, entry, size);
> - spin_unlock(&si->lock);
> - return;
> - }
> + if (size > 1 && swap_is_has_cache(si, offset, size)) {
> + unlock_cluster_or_swap_info(si, ci);
> + spin_lock(&si->lock);
> + swap_entry_range_free(si, entry, size);
> + spin_unlock(&si->lock);
> + return;
> }
> - for (i = 0; i < size; i++, entry.val++) {
> + for (int i = 0; i < size; i++, entry.val++) {
> if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
> unlock_cluster_or_swap_info(si, ci);
> free_swap_slot(entry);
> @@ -1526,16 +1584,7 @@ static bool folio_swapped(struct folio *folio)
> return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
> }
>
> -/**
> - * folio_free_swap() - Free the swap space used for this folio.
> - * @folio: The folio to remove.
> - *
> - * If swap is getting full, or if there are no more mappings of this folio,
> - * then call folio_free_swap to free its swap space.
> - *
> - * Return: true if we were able to release the swap space.
> - */
> -bool folio_free_swap(struct folio *folio)
> +static bool folio_swapcache_freeable(struct folio *folio)
> {
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>
> @@ -1543,8 +1592,6 @@ bool folio_free_swap(struct folio *folio)
> return false;
> if (folio_test_writeback(folio))
> return false;
> - if (folio_swapped(folio))
> - return false;
>
> /*
> * Once hibernation has begun to create its image of memory,
> @@ -1564,6 +1611,25 @@ bool folio_free_swap(struct folio *folio)
> if (pm_suspended_storage())
> return false;
>
> + return true;
> +}
> +
> +/**
> + * folio_free_swap() - Free the swap space used for this folio.
> + * @folio: The folio to remove.
> + *
> + * If swap is getting full, or if there are no more mappings of this folio,
> + * then call folio_free_swap to free its swap space.
> + *
> + * Return: true if we were able to release the swap space.
> + */
> +bool folio_free_swap(struct folio *folio)
> +{
> + if (!folio_swapcache_freeable(folio))
> + return false;
> + if (folio_swapped(folio))
> + return false;
> +
> delete_from_swap_cache(folio);
> folio_set_dirty(folio);
> return true;
> @@ -1640,7 +1706,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
> * to the next boundary.
> */
> nr = __try_to_reclaim_swap(si, offset,
> - TTRS_UNMAPPED | TTRS_FULL);
> + TTRS_UNMAPPED | TTRS_FULL);
> if (nr == 0)
> nr = 1;
> else if (nr < 0)
>
> --
> 2.46.0.rc1.232.g9752f9e123-goog
>
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-08-03 10:38 ` Barry Song
@ 2024-08-03 12:18 ` Kairui Song
2024-08-04 18:06 ` Chris Li
0 siblings, 1 reply; 32+ messages in thread
From: Kairui Song @ 2024-08-03 12:18 UTC (permalink / raw)
To: Barry Song
Cc: chrisl, Andrew Morton, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm
On Sat, Aug 3, 2024 at 6:39 PM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Jul 31, 2024 at 2:49 PM <chrisl@kernel.org> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Currently we free the reclaimed slots through slot cache even
> > if the slot is required to be empty immediately. As a result
> > the reclaim caller will see the slot still occupied even after a
> > successful reclaim, and need to keep reclaiming until slot cache
> > get flushed. This caused ineffective or over reclaim when SWAP is
> > under stress.
> >
> > So introduce a new flag allowing the slot to be emptied bypassing
> > the slot cache.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
> > 1 file changed, 109 insertions(+), 43 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 9b63b2262cc2..4c0fc0409d3c 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -53,8 +53,15 @@
> > static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
> > unsigned char);
> > static void free_swap_count_continuations(struct swap_info_struct *);
> > +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
> > + unsigned int nr_pages);
> > static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> > unsigned int nr_entries);
> > +static bool folio_swapcache_freeable(struct folio *folio);
> > +static struct swap_cluster_info *lock_cluster_or_swap_info(
> > + struct swap_info_struct *si, unsigned long offset);
> > +static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > + struct swap_cluster_info *ci);
> >
> > static DEFINE_SPINLOCK(swap_lock);
> > static unsigned int nr_swapfiles;
> > @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
> > * corresponding page
> > */
> > #define TTRS_UNMAPPED 0x2
> > -/* Reclaim the swap entry if swap is getting full*/
> > +/* Reclaim the swap entry if swap is getting full */
> > #define TTRS_FULL 0x4
> > +/* Reclaim directly, bypass the slot cache and don't touch device lock */
> > +#define TTRS_DIRECT 0x8
> > +
> > +static bool swap_is_has_cache(struct swap_info_struct *si,
> > + unsigned long offset, int nr_pages)
> > +{
> > + unsigned char *map = si->swap_map + offset;
> > + unsigned char *map_end = map + nr_pages;
> > +
> > + do {
> > + VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> > + if (*map != SWAP_HAS_CACHE)
> > + return false;
> > + } while (++map < map_end);
> > +
> > + return true;
> > +}
> >
> > /*
> > * returns number of pages in the folio that backs the swap entry. If positive,
> > @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > unsigned long offset, unsigned long flags)
> > {
> > swp_entry_t entry = swp_entry(si->type, offset);
> > + struct address_space *address_space = swap_address_space(entry);
> > + struct swap_cluster_info *ci;
> > struct folio *folio;
> > - int ret = 0;
> > + int ret, nr_pages;
> > + bool need_reclaim;
> >
> > - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > + folio = filemap_get_folio(address_space, swap_cache_index(entry));
> > if (IS_ERR(folio))
> > return 0;
> > +
> > + /* offset could point to the middle of a large folio */
> > + entry = folio->swap;
> > + offset = swp_offset(entry);
> > + nr_pages = folio_nr_pages(folio);
> > + ret = -nr_pages;
> > +
> > /*
> > * When this function is called from scan_swap_map_slots() and it's
> > * called by vmscan.c at reclaiming folios. So we hold a folio lock
> > @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > * case and you should use folio_free_swap() with explicit folio_lock()
> > * in usual operations.
> > */
> > - if (folio_trylock(folio)) {
> > - if ((flags & TTRS_ANYWAY) ||
> > - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
> > - ret = folio_free_swap(folio);
> > - folio_unlock(folio);
> > + if (!folio_trylock(folio))
> > + goto out;
> > +
> > + need_reclaim = ((flags & TTRS_ANYWAY) ||
> > + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
> > + if (!need_reclaim || !folio_swapcache_freeable(folio))
> > + goto out_unlock;
> > +
> > + /*
> > + * It's safe to delete the folio from swap cache only if the folio's
> > + * swap_map is HAS_CACHE only, which means the slots have no page table
> > + * reference or pending writeback, and can't be allocated to others.
> > + */
> > + ci = lock_cluster_or_swap_info(si, offset);
> > + need_reclaim = swap_is_has_cache(si, offset, nr_pages);
> > + unlock_cluster_or_swap_info(si, ci);
> > + if (!need_reclaim)
> > + goto out_unlock;
> > +
> > + if (!(flags & TTRS_DIRECT)) {
> > + /* Free through slot cache */
> > + delete_from_swap_cache(folio);
> > + folio_set_dirty(folio);
> > + ret = nr_pages;
> > + goto out_unlock;
> > }
> > - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
> > +
> > + xa_lock_irq(&address_space->i_pages);
> > + __delete_from_swap_cache(folio, entry, NULL);
> > + xa_unlock_irq(&address_space->i_pages);
> > + folio_ref_sub(folio, nr_pages);
> > + folio_set_dirty(folio);
> > +
> > + spin_lock(&si->lock);
> > + /* Only sinple page folio can be backed by zswap */
> > + if (!nr_pages)
> > + zswap_invalidate(entry);
>
> I am trying to figure out if I am mad :-) Does nr_pages == 0 means single
> page folio?
>
Hi Barry
I'm sorry, this should be nr_pages == 1; I messed up order and nr, as
zswap only works for single-page folios.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-08-03 12:18 ` Kairui Song
@ 2024-08-04 18:06 ` Chris Li
2024-08-05 1:53 ` Barry Song
0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-08-04 18:06 UTC (permalink / raw)
To: Kairui Song
Cc: Barry Song, Andrew Morton, Hugh Dickins, Ryan Roberts,
Huang, Ying, Kalesh Singh, linux-kernel, linux-mm
On Sat, Aug 3, 2024 at 5:19 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Aug 3, 2024 at 6:39 PM Barry Song <baohua@kernel.org> wrote:
> >
> > On Wed, Jul 31, 2024 at 2:49 PM <chrisl@kernel.org> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Currently we free the reclaimed slots through slot cache even
> > > if the slot is required to be empty immediately. As a result
> > > the reclaim caller will see the slot still occupied even after a
> > > successful reclaim, and need to keep reclaiming until slot cache
> > > get flushed. This caused ineffective or over reclaim when SWAP is
> > > under stress.
> > >
> > > So introduce a new flag allowing the slot to be emptied bypassing
> > > the slot cache.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/swapfile.c | 152 +++++++++++++++++++++++++++++++++++++++++-----------------
> > > 1 file changed, 109 insertions(+), 43 deletions(-)
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 9b63b2262cc2..4c0fc0409d3c 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -53,8 +53,15 @@
> > > static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
> > > unsigned char);
> > > static void free_swap_count_continuations(struct swap_info_struct *);
> > > +static void swap_entry_range_free(struct swap_info_struct *si, swp_entry_t entry,
> > > + unsigned int nr_pages);
> > > static void swap_range_alloc(struct swap_info_struct *si, unsigned long offset,
> > > unsigned int nr_entries);
> > > +static bool folio_swapcache_freeable(struct folio *folio);
> > > +static struct swap_cluster_info *lock_cluster_or_swap_info(
> > > + struct swap_info_struct *si, unsigned long offset);
> > > +static void unlock_cluster_or_swap_info(struct swap_info_struct *si,
> > > + struct swap_cluster_info *ci);
> > >
> > > static DEFINE_SPINLOCK(swap_lock);
> > > static unsigned int nr_swapfiles;
> > > @@ -129,8 +136,25 @@ static inline unsigned char swap_count(unsigned char ent)
> > > * corresponding page
> > > */
> > > #define TTRS_UNMAPPED 0x2
> > > -/* Reclaim the swap entry if swap is getting full*/
> > > +/* Reclaim the swap entry if swap is getting full */
> > > #define TTRS_FULL 0x4
> > > +/* Reclaim directly, bypass the slot cache and don't touch device lock */
> > > +#define TTRS_DIRECT 0x8
> > > +
> > > +static bool swap_is_has_cache(struct swap_info_struct *si,
> > > + unsigned long offset, int nr_pages)
> > > +{
> > > + unsigned char *map = si->swap_map + offset;
> > > + unsigned char *map_end = map + nr_pages;
> > > +
> > > + do {
> > > + VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
> > > + if (*map != SWAP_HAS_CACHE)
> > > + return false;
> > > + } while (++map < map_end);
> > > +
> > > + return true;
> > > +}
> > >
> > > /*
> > > * returns number of pages in the folio that backs the swap entry. If positive,
> > > @@ -141,12 +165,22 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > > unsigned long offset, unsigned long flags)
> > > {
> > > swp_entry_t entry = swp_entry(si->type, offset);
> > > + struct address_space *address_space = swap_address_space(entry);
> > > + struct swap_cluster_info *ci;
> > > struct folio *folio;
> > > - int ret = 0;
> > > + int ret, nr_pages;
> > > + bool need_reclaim;
> > >
> > > - folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
> > > + folio = filemap_get_folio(address_space, swap_cache_index(entry));
> > > if (IS_ERR(folio))
> > > return 0;
> > > +
> > > + /* offset could point to the middle of a large folio */
> > > + entry = folio->swap;
> > > + offset = swp_offset(entry);
> > > + nr_pages = folio_nr_pages(folio);
> > > + ret = -nr_pages;
> > > +
> > > /*
> > > * When this function is called from scan_swap_map_slots() and it's
> > > * called by vmscan.c at reclaiming folios. So we hold a folio lock
> > > @@ -154,14 +188,50 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
> > > * case and you should use folio_free_swap() with explicit folio_lock()
> > > * in usual operations.
> > > */
> > > - if (folio_trylock(folio)) {
> > > - if ((flags & TTRS_ANYWAY) ||
> > > - ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > > - ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)))
> > > - ret = folio_free_swap(folio);
> > > - folio_unlock(folio);
> > > + if (!folio_trylock(folio))
> > > + goto out;
> > > +
> > > + need_reclaim = ((flags & TTRS_ANYWAY) ||
> > > + ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) ||
> > > + ((flags & TTRS_FULL) && mem_cgroup_swap_full(folio)));
> > > + if (!need_reclaim || !folio_swapcache_freeable(folio))
> > > + goto out_unlock;
> > > +
> > > + /*
> > > + * It's safe to delete the folio from swap cache only if the folio's
> > > + * swap_map is HAS_CACHE only, which means the slots have no page table
> > > + * reference or pending writeback, and can't be allocated to others.
> > > + */
> > > + ci = lock_cluster_or_swap_info(si, offset);
> > > + need_reclaim = swap_is_has_cache(si, offset, nr_pages);
> > > + unlock_cluster_or_swap_info(si, ci);
> > > + if (!need_reclaim)
> > > + goto out_unlock;
> > > +
> > > + if (!(flags & TTRS_DIRECT)) {
> > > + /* Free through slot cache */
> > > + delete_from_swap_cache(folio);
> > > + folio_set_dirty(folio);
> > > + ret = nr_pages;
> > > + goto out_unlock;
> > > }
> > > - ret = ret ? folio_nr_pages(folio) : -folio_nr_pages(folio);
> > > +
> > > + xa_lock_irq(&address_space->i_pages);
> > > + __delete_from_swap_cache(folio, entry, NULL);
> > > + xa_unlock_irq(&address_space->i_pages);
> > > + folio_ref_sub(folio, nr_pages);
> > > + folio_set_dirty(folio);
> > > +
> > > + spin_lock(&si->lock);
> > > + /* Only sinple page folio can be backed by zswap */
> > > + if (!nr_pages)
> > > + zswap_invalidate(entry);
> >
> > I am trying to figure out if I am mad :-) Does nr_pages == 0 means single
> > page folio?
> >
>
> Hi Barry
>
> I'm sorry, this should be nr_pages == 1, I messed up order and nr, as
> zswap only works for single page folios.
Ack. Should be nr_pages == 1.
Barry, thanks for catching that.
Chris
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache
2024-08-04 18:06 ` Chris Li
@ 2024-08-05 1:53 ` Barry Song
0 siblings, 0 replies; 32+ messages in thread
From: Barry Song @ 2024-08-05 1:53 UTC (permalink / raw)
To: chrisl
Cc: akpm, baohua, hughd, kaleshsingh, linux-kernel, linux-mm,
ryan.roberts, ryncsn, ying.huang
On Mon, Aug 5, 2024 at 6:07 AM Chris Li <chrisl@kernel.org> wrote:
> > > > +
> > > > + spin_lock(&si->lock);
> > > > + /* Only sinple page folio can be backed by zswap */
> > > > + if (!nr_pages)
> > > > + zswap_invalidate(entry);
> > >
> > > I am trying to figure out if I am mad :-) Does nr_pages == 0 means single
> > > page folio?
> > >
> >
> > Hi Barry
> >
> > I'm sorry, this should be nr_pages == 1, I messed up order and nr, as
> > zswap only works for single page folios.
> Ack. Should be nr_pages == 1.
>
Yes. Andrew, can you please help squash the fix below? Small folios
should have nr_pages == 1, not nr_pages == 0:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ea023fc25d08..6de12d712c7e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -224,7 +224,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
spin_lock(&si->lock);
/* Only sinple page folio can be backed by zswap */
- if (!nr_pages)
+ if (nr_pages == 1)
zswap_invalidate(entry);
swap_entry_range_free(si, entry, nr_pages);
spin_unlock(&si->lock);
> Barry, thanks for catching that.
>
> Chris
Thanks
Barry
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 7/9] mm: swap: add a fragment cluster list
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (5 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 6/9] mm: swap: allow cache reclaim to skip slot cache chrisl
@ 2024-07-31 6:49 ` chrisl
2024-07-31 6:49 ` [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned chrisl
` (4 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Now that the swap cluster allocator arranges the clusters in LRU style, the
"cold" clusters staying at the head of the nonfull lists are the ones that
were used for allocation a long time ago and are still partially occupied.
So if the allocator can't find enough contiguous slots to satisfy a high
order allocation, it's unlikely that slots will be freed on them to
satisfy the allocation, at least not in a short period.
As a result, nonfull cluster scanning will waste time repeatedly scanning
the unusable head of the list.
Also, multiple CPUs could contend on the same head cluster of the nonfull
list. Unlike free clusters, which are removed from the list when any
CPU starts using them, a nonfull cluster stays at the head.
So introduce a new list, the frag list: all scanned nonfull clusters will be
moved to this list, both to avoid repeated scanning and to avoid contention.
The frag list is still used as a fallback for allocations, so if one CPU
fails to allocate one order of slots, it can still steal other CPUs'
clusters. And order 0 will favor the fragmented clusters to better
protect nonfull clusters.
If any slot of a fragment cluster is freed, move the cluster back to the
nonfull list, indicating it is worth another scan.
Compared to scanning upon freeing a slot, this keeps the scanning lazy and
saves some CPU if there are still other clusters to use.
It may seem unnecessary to keep the fragmented clusters on a list at all
if they can't be used for a specific order allocation. But this will
start to make sense once reclaim during scanning is ready.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
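A condensed sketch of the per-order fallback in cluster_alloc_swap_entry()
after this change (stitched together from the hunks below; locking, the
free-cluster path and the per-CPU next-offset handling are left out):

        if (order < PMD_ORDER) {
                /* Scan nonfull clusters, parking each tried one on a local list. */
                list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
                        list_move_tail(&ci->list, &fraged);
                        ci->flags = CLUSTER_FLAG_FRAG;
                        offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
                                                         &found, order, usage);
                        if (found)
                                break;
                }
                /* Fall back to clusters that were already fragmented or contended. */
                if (!found) {
                        list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) {
                                offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
                                                                 &found, order, usage);
                                if (found)
                                        break;
                        }
                }
                /* Everything scanned in this round ends up on the frag list. */
                list_splice_tail(&fraged, &si->frag_clusters[order]);
        }
        /* Order 0 can additionally steal from higher-order frag/nonfull lists. */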
include/linux/swap.h | 3 +++
mm/swapfile.c | 41 +++++++++++++++++++++++++++++++++++++----
2 files changed, 40 insertions(+), 4 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6716ef236766..5a14b6c65949 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -259,6 +259,7 @@ struct swap_cluster_info {
};
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
+#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */
/*
* The first page in the swap file is the swap header, which is always marked
@@ -299,6 +300,8 @@ struct swap_info_struct {
struct list_head free_clusters; /* free clusters list */
struct list_head nonfull_clusters[SWAP_NR_ORDERS];
/* list of cluster that contains at least one free slot */
+ struct list_head frag_clusters[SWAP_NR_ORDERS];
+ /* list of cluster that are fragmented or contented */
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 4c0fc0409d3c..eb3e387e86b2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -572,7 +572,10 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ if (ci->flags & CLUSTER_FLAG_FRAG)
+ list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ else
+ list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
ci->flags = CLUSTER_FLAG_NONFULL;
}
}
@@ -610,7 +613,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_
ci->count += nr_pages;
if (ci->count == SWAPFILE_CLUSTER) {
- VM_BUG_ON(!(ci->flags & (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL)));
+ VM_BUG_ON(!(ci->flags &
+ (CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
list_del(&ci->list);
ci->flags = 0;
}
@@ -666,6 +670,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
struct percpu_cluster *cluster;
struct swap_cluster_info *ci, *n;
unsigned int offset, found = 0;
+ LIST_HEAD(fraged);
new_cluster:
lockdep_assert_held(&si->lock);
@@ -686,13 +691,29 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (order < PMD_ORDER) {
list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
+ list_move_tail(&ci->list, &fraged);
+ ci->flags = CLUSTER_FLAG_FRAG;
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, order, usage);
if (found)
- goto done;
+ break;
}
+
+ if (!found) {
+ list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) {
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, order, usage);
+ if (found)
+ break;
+ }
+ }
+
+ list_splice_tail(&fraged, &si->frag_clusters[order]);
}
+ if (found)
+ goto done;
+
if (!list_empty(&si->discard_clusters)) {
/*
* we don't have free cluster but have some clusters in
@@ -706,7 +727,17 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (order)
goto done;
+ /* Order 0 stealing from higher order */
for (int o = 1; o < PMD_ORDER; o++) {
+ if (!list_empty(&si->frag_clusters[o])) {
+ ci = list_first_entry(&si->frag_clusters[o],
+ struct swap_cluster_info, list);
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
+ 0, usage);
+ VM_BUG_ON(!found);
+ goto done;
+ }
+
if (!list_empty(&si->nonfull_clusters[o])) {
ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info,
list);
@@ -3019,8 +3050,10 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
INIT_LIST_HEAD(&p->free_clusters);
INIT_LIST_HEAD(&p->discard_clusters);
- for (i = 0; i < SWAP_NR_ORDERS; i++)
+ for (i = 0; i < SWAP_NR_ORDERS; i++) {
INIT_LIST_HEAD(&p->nonfull_clusters[i]);
+ INIT_LIST_HEAD(&p->frag_clusters[i]);
+ }
for (i = 0; i < swap_header->info.nr_badpages; i++) {
unsigned int page_nr = swap_header->info.badpages[i];
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (6 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 7/9] mm: swap: add a fragment cluster list chrisl
@ 2024-07-31 6:49 ` chrisl
2024-07-31 6:49 ` [PATCH v5 9/9] mm: swap: add an adaptive full cluster cache reclaim chrisl
` (3 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
This commit implements reclaim during scan for the cluster allocator.
Cluster scanning was unable to reuse SWAP_HAS_CACHE slots, which
could result in a low allocation success rate or early OOM.
So to ensure the maximum allocation success rate, integrate reclaiming
with scanning. If a range of suitable swap slots is found but is
fragmented due to HAS_CACHE, just try to reclaim those slots.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
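A condensed view of the scan side (this is essentially cluster_scan_range()
from the diff below; locking is omitted): a slot that is SWAP_HAS_CACHE-only
no longer fails the scan outright. If swap is nearly full, the whole range is
handed to cluster_reclaim_range() and rechecked after the reclaim attempt.

        for (offset = start; offset < end; offset++) {
                switch (READ_ONCE(map[offset])) {
                case 0:
                        continue;               /* already free */
                case SWAP_HAS_CACHE:
                        if (!vm_swap_full())
                                return false;   /* not worth reclaiming yet */
                        need_reclaim = true;
                        continue;
                default:
                        return false;           /* slot has real users */
                }
        }
        if (need_reclaim)
                return cluster_reclaim_range(si, ci, start, end);
        return true;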
include/linux/swap.h | 1 +
mm/swapfile.c | 140 +++++++++++++++++++++++++++++++++++++++------------
2 files changed, 110 insertions(+), 31 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 5a14b6c65949..9eb740563d63 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -302,6 +302,7 @@ struct swap_info_struct {
/* list of cluster that contains at least one free slot */
struct list_head frag_clusters[SWAP_NR_ORDERS];
/* list of cluster that are fragmented or contented */
+ unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
unsigned int lowest_bit; /* index of first free in swap_map */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index eb3e387e86b2..50e7f600a9a1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -513,6 +513,10 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
VM_BUG_ON(ci->count != 0);
lockdep_assert_held(&si->lock);
lockdep_assert_held(&ci->lock);
+
+ if (ci->flags & CLUSTER_FLAG_FRAG)
+ si->frag_cluster_nr[ci->order]--;
+
/*
* If the swap is discardable, prepare discard the cluster
* instead of free it immediately. The cluster will be freed
@@ -572,31 +576,84 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- if (ci->flags & CLUSTER_FLAG_FRAG)
+ if (ci->flags & CLUSTER_FLAG_FRAG) {
+ p->frag_cluster_nr[ci->order]--;
list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- else
+ } else {
list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
+ }
ci->flags = CLUSTER_FLAG_NONFULL;
}
}
-static inline bool cluster_scan_range(struct swap_info_struct *si, unsigned int start,
- unsigned int nr_pages)
+static bool cluster_reclaim_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long start, unsigned long end)
{
- unsigned char *p = si->swap_map + start;
- unsigned char *end = p + nr_pages;
+ unsigned char *map = si->swap_map;
+ unsigned long offset;
+
+ spin_unlock(&ci->lock);
+ spin_unlock(&si->lock);
+
+ for (offset = start; offset < end; offset++) {
+ switch (READ_ONCE(map[offset])) {
+ case 0:
+ continue;
+ case SWAP_HAS_CACHE:
+ if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT) > 0)
+ continue;
+ goto out;
+ default:
+ goto out;
+ }
+ }
+out:
+ spin_lock(&si->lock);
+ spin_lock(&ci->lock);
- while (p < end)
- if (*p++)
+ /*
+ * Recheck the range no matter reclaim succeeded or not, the slot
+ * could have been be freed while we are not holding the lock.
+ */
+ for (offset = start; offset < end; offset++)
+ if (READ_ONCE(map[offset]))
return false;
return true;
}
+static bool cluster_scan_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long start, unsigned int nr_pages)
+{
+ unsigned long offset, end = start + nr_pages;
+ unsigned char *map = si->swap_map;
+ bool need_reclaim = false;
-static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
- unsigned int start, unsigned char usage,
- unsigned int order)
+ for (offset = start; offset < end; offset++) {
+ switch (READ_ONCE(map[offset])) {
+ case 0:
+ continue;
+ case SWAP_HAS_CACHE:
+ if (!vm_swap_full())
+ return false;
+ need_reclaim = true;
+ continue;
+ default:
+ return false;
+ }
+ }
+
+ if (need_reclaim)
+ return cluster_reclaim_range(si, ci, start, end);
+
+ return true;
+}
+
+static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
+ unsigned int start, unsigned char usage,
+ unsigned int order)
{
unsigned int nr_pages = 1 << order;
@@ -615,6 +672,8 @@ static inline void cluster_alloc_range(struct swap_info_struct *si, struct swap_
if (ci->count == SWAPFILE_CLUSTER) {
VM_BUG_ON(!(ci->flags &
(CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
+ if (ci->flags & CLUSTER_FLAG_FRAG)
+ si->frag_cluster_nr[ci->order]--;
list_del(&ci->list);
ci->flags = 0;
}
@@ -640,7 +699,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
}
while (offset <= end) {
- if (cluster_scan_range(si, offset, nr_pages)) {
+ if (cluster_scan_range(si, ci, offset, nr_pages)) {
cluster_alloc_range(si, ci, offset, usage, order);
*foundp = offset;
if (ci->count == SWAPFILE_CLUSTER) {
@@ -668,9 +727,8 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
unsigned char usage)
{
struct percpu_cluster *cluster;
- struct swap_cluster_info *ci, *n;
+ struct swap_cluster_info *ci;
unsigned int offset, found = 0;
- LIST_HEAD(fraged);
new_cluster:
lockdep_assert_held(&si->lock);
@@ -690,25 +748,42 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
}
if (order < PMD_ORDER) {
- list_for_each_entry_safe(ci, n, &si->nonfull_clusters[order], list) {
- list_move_tail(&ci->list, &fraged);
+ unsigned int frags = 0;
+
+ while (!list_empty(&si->nonfull_clusters[order])) {
+ ci = list_first_entry(&si->nonfull_clusters[order],
+ struct swap_cluster_info, list);
+ list_move_tail(&ci->list, &si->frag_clusters[order]);
ci->flags = CLUSTER_FLAG_FRAG;
+ si->frag_cluster_nr[order]++;
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, order, usage);
+ frags++;
if (found)
break;
}
if (!found) {
- list_for_each_entry_safe(ci, n, &si->frag_clusters[order], list) {
+ /*
+ * Nonfull clusters are moved to frag tail if we reached
+ * here, count them too, don't over scan the frag list.
+ */
+ while (frags < si->frag_cluster_nr[order]) {
+ ci = list_first_entry(&si->frag_clusters[order],
+ struct swap_cluster_info, list);
+ /*
+ * Rotate the frag list to iterate, they were all failing
+ * high order allocation or moved here due to per-CPU usage,
+ * this help keeping usable cluster ahead.
+ */
+ list_move_tail(&ci->list, &si->frag_clusters[order]);
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, order, usage);
+ frags++;
if (found)
break;
}
}
-
- list_splice_tail(&fraged, &si->frag_clusters[order]);
}
if (found)
@@ -729,25 +804,28 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
/* Order 0 stealing from higher order */
for (int o = 1; o < PMD_ORDER; o++) {
- if (!list_empty(&si->frag_clusters[o])) {
+ /*
+ * Clusters here have at least one usable slots and can't fail order 0
+ * allocation, but reclaim may drop si->lock and race with another user.
+ */
+ while (!list_empty(&si->frag_clusters[o])) {
ci = list_first_entry(&si->frag_clusters[o],
struct swap_cluster_info, list);
- offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found,
- 0, usage);
- VM_BUG_ON(!found);
- goto done;
+ offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
+ &found, 0, usage);
+ if (found)
+ goto done;
}
- if (!list_empty(&si->nonfull_clusters[o])) {
- ci = list_first_entry(&si->nonfull_clusters[o], struct swap_cluster_info,
- list);
+ while (!list_empty(&si->nonfull_clusters[o])) {
+ ci = list_first_entry(&si->nonfull_clusters[o],
+ struct swap_cluster_info, list);
offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
&found, 0, usage);
- VM_BUG_ON(!found);
- goto done;
+ if (found)
+ goto done;
}
}
-
done:
cluster->next[order] = offset;
return found;
@@ -3053,6 +3131,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
for (i = 0; i < SWAP_NR_ORDERS; i++) {
INIT_LIST_HEAD(&p->nonfull_clusters[i]);
INIT_LIST_HEAD(&p->frag_clusters[i]);
+ p->frag_cluster_nr[i] = 0;
}
for (i = 0; i < swap_header->info.nr_badpages; i++) {
@@ -3096,7 +3175,6 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
if (!cluster_info)
return nr_extents;
-
/*
* Reduce false cache line sharing between cluster_info and
* sharing same address space.
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v5 9/9] mm: swap: add an adaptive full cluster cache reclaim
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (7 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 8/9] mm: swap: reclaim the cached parts that got scanned chrisl
@ 2024-07-31 6:49 ` chrisl
2024-08-01 9:14 ` [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order David Hildenbrand
` (2 subsequent siblings)
11 siblings, 0 replies; 32+ messages in thread
From: chrisl @ 2024-07-31 6:49 UTC (permalink / raw)
To: Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Chris Li, Barry Song
From: Kairui Song <kasong@tencent.com>
Link all full clusters on one full list, and reclaim from it when the
allocation has run out of all usable clusters.
There are many reasons a folio can end up in the swap cache while
holding no swap count reference, so the best way to search for such
slots is still to iterate over the swap clusters.
With the list used as an LRU, iterating from the oldest cluster and
keeping the list rotating is a clean and practical way to free up
clusters that are potentially no longer in use.
On any allocation failure, try to reclaim and rotate only one cluster.
This is adaptive: high order allocations can tolerate a fallback, so
this avoids latency and gives the full cluster list a fair chance to
get reclaimed. It relieves the pressure on the fallback order 0
allocation and on following high order allocations.
If the swap device is getting very full, reclaim more aggressively to
ensure no OOM will happen. This way an order 0 heavy workload won't go
OOM, as order 0 allocation won't fail as long as any cluster still has
free space.
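Condensed, the reclaim pass added below behaves roughly as follows; this is
only a sketch of swap_reclaim_full_clusters() with the per-slot reclaim
elided, not the literal code:
	struct swap_cluster_info *ci;
	int total_reclaimed = 0;
	long to_scan = 1;	/* default budget: rotate and scan one cluster */
	/* Device nearly out of swap pages: allow scanning every in-use cluster. */
	if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER)
		to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
	while (!list_empty(&si->full_clusters)) {
		/* Oldest full cluster first, then rotate it to the tail (LRU). */
		ci = list_first_entry(&si->full_clusters,
				      struct swap_cluster_info, list);
		list_move_tail(&ci->list, &si->full_clusters);
		/* ... reclaim swap-cache-only slots of this cluster here ... */
		if (--to_scan <= 0 || total_reclaimed)
			break;
	}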
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 68 +++++++++++++++++++++++++++++++++++++++++-----------
2 files changed, 55 insertions(+), 14 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9eb740563d63..145e796dab84 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -298,6 +298,7 @@ struct swap_info_struct {
unsigned long *zeromap; /* vmalloc'ed bitmap to track zero pages */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct list_head free_clusters; /* free clusters list */
+ struct list_head full_clusters; /* full clusters list */
struct list_head nonfull_clusters[SWAP_NR_ORDERS];
/* list of cluster that contains at least one free slot */
struct list_head frag_clusters[SWAP_NR_ORDERS];
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 50e7f600a9a1..9872e0dbfc72 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -440,10 +440,7 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
SWAP_MAP_BAD, SWAPFILE_CLUSTER);
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- if (ci->flags & CLUSTER_FLAG_NONFULL)
- list_move_tail(&ci->list, &si->discard_clusters);
- else
- list_add_tail(&ci->list, &si->discard_clusters);
+ list_move_tail(&ci->list, &si->discard_clusters);
ci->flags = 0;
schedule_work(&si->discard_work);
}
@@ -453,10 +450,7 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
lockdep_assert_held(&si->lock);
lockdep_assert_held(&ci->lock);
- if (ci->flags & CLUSTER_FLAG_NONFULL)
- list_move_tail(&ci->list, &si->free_clusters);
- else
- list_add_tail(&ci->list, &si->free_clusters);
+ list_move_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
ci->order = 0;
}
@@ -576,12 +570,9 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
if (!(ci->flags & CLUSTER_FLAG_NONFULL)) {
VM_BUG_ON(ci->flags & CLUSTER_FLAG_FREE);
- if (ci->flags & CLUSTER_FLAG_FRAG) {
+ if (ci->flags & CLUSTER_FLAG_FRAG)
p->frag_cluster_nr[ci->order]--;
- list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- } else {
- list_add_tail(&ci->list, &p->nonfull_clusters[ci->order]);
- }
+ list_move_tail(&ci->list, &p->nonfull_clusters[ci->order]);
ci->flags = CLUSTER_FLAG_NONFULL;
}
}
@@ -674,7 +665,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
(CLUSTER_FLAG_FREE | CLUSTER_FLAG_NONFULL | CLUSTER_FLAG_FRAG)));
if (ci->flags & CLUSTER_FLAG_FRAG)
si->frag_cluster_nr[ci->order]--;
- list_del(&ci->list);
+ list_move_tail(&ci->list, &si->full_clusters);
ci->flags = 0;
}
}
@@ -718,6 +709,46 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne
return offset;
}
+static void swap_reclaim_full_clusters(struct swap_info_struct *si)
+{
+ long to_scan = 1;
+ unsigned long offset, end;
+ struct swap_cluster_info *ci;
+ unsigned char *map = si->swap_map;
+ int nr_reclaim, total_reclaimed = 0;
+
+ if (atomic_long_read(&nr_swap_pages) <= SWAPFILE_CLUSTER)
+ to_scan = si->inuse_pages / SWAPFILE_CLUSTER;
+
+ while (!list_empty(&si->full_clusters)) {
+ ci = list_first_entry(&si->full_clusters, struct swap_cluster_info, list);
+ list_move_tail(&ci->list, &si->full_clusters);
+ offset = cluster_offset(si, ci);
+ end = min(si->max, offset + SWAPFILE_CLUSTER);
+ to_scan--;
+
+ while (offset < end) {
+ if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+ spin_unlock(&si->lock);
+ nr_reclaim = __try_to_reclaim_swap(si, offset,
+ TTRS_ANYWAY | TTRS_DIRECT);
+ spin_lock(&si->lock);
+ if (nr_reclaim > 0) {
+ offset += nr_reclaim;
+ total_reclaimed += nr_reclaim;
+ continue;
+ } else if (nr_reclaim < 0) {
+ offset += -nr_reclaim;
+ continue;
+ }
+ }
+ offset++;
+ }
+ if (to_scan <= 0 || total_reclaimed)
+ break;
+ }
+}
+
/*
* Try to get swap entries with specified order from current cpu's swap entry
* pool (a cluster). This might involve allocating a new cluster for current CPU
@@ -826,7 +857,15 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
goto done;
}
}
+
done:
+ /* Try reclaiming from full clusters if the device is nearly full */
+ if (vm_swap_full() && (!found || (si->pages - si->inuse_pages) < SWAPFILE_CLUSTER)) {
+ swap_reclaim_full_clusters(si);
+ if (!found && !order && si->pages != si->inuse_pages)
+ goto new_cluster;
+ }
+
cluster->next[order] = offset;
return found;
}
@@ -3126,6 +3165,7 @@ static int setup_swap_map_and_extents(struct swap_info_struct *p,
nr_good_pages = maxpages - 1; /* omit header page */
INIT_LIST_HEAD(&p->free_clusters);
+ INIT_LIST_HEAD(&p->full_clusters);
INIT_LIST_HEAD(&p->discard_clusters);
for (i = 0; i < SWAP_NR_ORDERS; i++) {
--
2.46.0.rc1.232.g9752f9e123-goog
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (8 preceding siblings ...)
2024-07-31 6:49 ` [PATCH v5 9/9] mm: swap: add an adaptive full cluster cache reclaim chrisl
@ 2024-08-01 9:14 ` David Hildenbrand
2024-08-01 9:59 ` Kairui Song
[not found] ` <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>
2024-09-02 1:20 ` Andrew Morton
11 siblings, 1 reply; 32+ messages in thread
From: David Hildenbrand @ 2024-08-01 9:14 UTC (permalink / raw)
To: Chris Li, Andrew Morton
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On 31.07.24 08:49, Chris Li wrote:
> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
>
Running the cow.c selftest on mm/mm-unstable, I get:
# [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
[ 51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
[ 51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
[ 51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[ 51.871298] Workqueue: events swap_discard_work
[ 51.872211] RIP: 0010:__free_cluster+0x27/0x90
[ 51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
[ 51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
[ 51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
[ 51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
[ 51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
[ 51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
[ 51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
[ 51.884827] FS: 0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
[ 51.886412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
[ 51.888931] PKRU: 55555554
[ 51.889471] Call Trace:
[ 51.889964] <TASK>
[ 51.890391] ? __die_body.cold+0x19/0x27
[ 51.891174] ? die_addr+0x3c/0x60
[ 51.891824] ? exc_general_protection+0x14f/0x430
[ 51.892754] ? asm_exc_general_protection+0x26/0x30
[ 51.893717] ? __free_cluster+0x27/0x90
[ 51.894483] ? __free_cluster+0x7e/0x90
[ 51.895245] swap_do_scheduled_discard+0x142/0x1b0
[ 51.896189] swap_discard_work+0x26/0x30
[ 51.896958] process_one_work+0x211/0x5a0
[ 51.897750] ? srso_alias_return_thunk+0x5/0xfbef5
[ 51.898693] worker_thread+0x1c9/0x3c0
[ 51.899438] ? __pfx_worker_thread+0x10/0x10
[ 51.900287] kthread+0xe3/0x110
[ 51.900913] ? __pfx_kthread+0x10/0x10
[ 51.901656] ret_from_fork+0x34/0x50
[ 51.902377] ? __pfx_kthread+0x10/0x10
[ 51.903114] ret_from_fork_asm+0x1a/0x30
[ 51.903896] </TASK>
Maybe related to this series?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-01 9:14 ` [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order David Hildenbrand
@ 2024-08-01 9:59 ` Kairui Song
2024-08-01 10:06 ` Kairui Song
[not found] ` <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>
0 siblings, 2 replies; 32+ messages in thread
From: Kairui Song @ 2024-08-01 9:59 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Chris Li
Cc: Hugh Dickins, Ryan Roberts, Huang, Ying, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
[-- Attachment #1: Type: text/plain, Size: 2993 bytes --]
On Thu, Aug 1, 2024 at 5:15 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.07.24 08:49, Chris Li wrote:
> > This is the short term solutions "swap cluster order" listed
> > in my "Swap Abstraction" discussion slice 8 in the recent
> > LSF/MM conference.
> >
>
> Running the cow.c selftest on mm/mm-unstable, I get:
Hi David, thanks very much for the test and report!
>
> # [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
> [ 51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
> [ 51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
> [ 51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> [ 51.871298] Workqueue: events swap_discard_work
> [ 51.872211] RIP: 0010:__free_cluster+0x27/0x90
> [ 51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
> [ 51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
> [ 51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
> [ 51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
> [ 51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
> [ 51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
> [ 51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
> [ 51.884827] FS: 0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
> [ 51.886412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
> [ 51.888931] PKRU: 55555554
> [ 51.889471] Call Trace:
> [ 51.889964] <TASK>
> [ 51.890391] ? __die_body.cold+0x19/0x27
> [ 51.891174] ? die_addr+0x3c/0x60
> [ 51.891824] ? exc_general_protection+0x14f/0x430
> [ 51.892754] ? asm_exc_general_protection+0x26/0x30
> [ 51.893717] ? __free_cluster+0x27/0x90
> [ 51.894483] ? __free_cluster+0x7e/0x90
> [ 51.895245] swap_do_scheduled_discard+0x142/0x1b0
> [ 51.896189] swap_discard_work+0x26/0x30
> [ 51.896958] process_one_work+0x211/0x5a0
> [ 51.897750] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 51.898693] worker_thread+0x1c9/0x3c0
> [ 51.899438] ? __pfx_worker_thread+0x10/0x10
> [ 51.900287] kthread+0xe3/0x110
> [ 51.900913] ? __pfx_kthread+0x10/0x10
> [ 51.901656] ret_from_fork+0x34/0x50
> [ 51.902377] ? __pfx_kthread+0x10/0x10
> [ 51.903114] ret_from_fork_asm+0x1a/0x30
> [ 51.903896] </TASK>
>
>
> Maybe related to this series?
Right, I can reproduce your problem and I believe this patch can fix
it, see the attachment.
Hi Andrew, can you pick this patch too?
[-- Attachment #2: 0001-SQUASH-Fix-discard-of-full-cluster.patch --]
[-- Type: application/octet-stream, Size: 1968 bytes --]
From 31365f8d8fa5d7a30920166b8c3a60130c24412e Mon Sep 17 00:00:00 2001
From: Kairui Song <kasong@tencent.com>
Date: Thu, 1 Aug 2024 17:29:48 +0800
Subject: [PATCH] SQUASH: Fix discard of full cluster
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 +
mm/swapfile.c | 8 +++++---
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 145e796dab84..3543ffaf982e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -260,6 +260,7 @@ struct swap_cluster_info {
#define CLUSTER_FLAG_FREE 1 /* This cluster is free */
#define CLUSTER_FLAG_NONFULL 2 /* This cluster is on nonfull list */
#define CLUSTER_FLAG_FRAG 4 /* This cluster is on nonfull list */
+#define CLUSTER_FLAG_FULL 8 /* This cluster is on full list */
/*
* The first page in the swap file is the swap header, which is always marked
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9872e0dbfc72..5ab06a45565a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
lockdep_assert_held(&si->lock);
lockdep_assert_held(&ci->lock);
- list_move_tail(&ci->list, &si->free_clusters);
+ if (ci->flags)
+ list_move_tail(&ci->list, &si->free_clusters);
+ else
+ list_add_tail(&ci->list, &si->free_clusters);
ci->flags = CLUSTER_FLAG_FREE;
ci->order = 0;
}
@@ -474,7 +477,6 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
SWAPFILE_CLUSTER);
spin_lock(&si->lock);
-
spin_lock(&ci->lock);
__free_cluster(si, ci);
memset(si->swap_map + idx * SWAPFILE_CLUSTER,
@@ -666,7 +668,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
if (ci->flags & CLUSTER_FLAG_FRAG)
si->frag_cluster_nr[ci->order]--;
list_move_tail(&ci->list, &si->full_clusters);
- ci->flags = 0;
+ ci->flags = CLUSTER_FLAG_FULL;
}
}
--
2.45.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-01 9:59 ` Kairui Song
@ 2024-08-01 10:06 ` Kairui Song
[not found] ` <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>
1 sibling, 0 replies; 32+ messages in thread
From: Kairui Song @ 2024-08-01 10:06 UTC (permalink / raw)
To: David Hildenbrand, Andrew Morton, Chris Li
Cc: Hugh Dickins, Ryan Roberts, Huang, Ying, Kalesh Singh,
linux-kernel, linux-mm, Barry Song
On Thu, Aug 1, 2024 at 5:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Aug 1, 2024 at 5:15 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 31.07.24 08:49, Chris Li wrote:
> > > This is the short term solutions "swap cluster order" listed
> > > in my "Swap Abstraction" discussion slice 8 in the recent
> > > LSF/MM conference.
> > >
> >
> > Running the cow.c selftest on mm/mm-unstable, I get:
>
> Hi David, thanks very much for the test and report!
>
> >
> > # [RUN] Basic COW after fork() with mprotect() optimization ... with swapped-out, PTE-mapped THP (1024 kB)
> > [ 51.865309] Oops: general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP NOPTI
> > [ 51.867738] CPU: 21 UID: 0 PID: 282 Comm: kworker/21:1 Not tainted 6.11.0-rc1+ #11
> > [ 51.869566] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
> > [ 51.871298] Workqueue: events swap_discard_work
> > [ 51.872211] RIP: 0010:__free_cluster+0x27/0x90
> > [ 51.873101] Code: 90 90 90 0f 1f 44 00 00 8b 0d 8d 95 96 01 55 48 89 fd 53 48 89 f3 85 c9 75 3a 48 8b 43 50 48 8b 4b 48 48 8d 53 48 48 83 c5 60 <48> 89 41 08 48 89 08 48 8b 45 08 48 89 55 08 48 89 43 50 48 89 6b
> > [ 51.876720] RSP: 0018:ffffa3dcc0aafdc8 EFLAGS: 00010286
> > [ 51.877752] RAX: dead000000000122 RBX: ffff8e7ed9686e00 RCX: dead000000000100
> > [ 51.879186] RDX: ffff8e7ed9686e48 RSI: ffff8e7ed9686e18 RDI: ffff8e7ec37831c0
> > [ 51.880577] RBP: ffff8e7ec5d10860 R08: 0000000000000001 R09: 0000000000000028
> > [ 51.881972] R10: 0000000000000200 R11: 00000000000004cb R12: ffff8e7ed9686e00
> > [ 51.883393] R13: 0000000000028200 R14: 0000000000028000 R15: 0000000000000000
> > [ 51.884827] FS: 0000000000000000(0000) GS:ffff8e822f480000(0000) knlGS:0000000000000000
> > [ 51.886412] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 51.887532] CR2: 00007f37d7e17840 CR3: 0000000335a3a001 CR4: 0000000000770ef0
> > [ 51.888931] PKRU: 55555554
> > [ 51.889471] Call Trace:
> > [ 51.889964] <TASK>
> > [ 51.890391] ? __die_body.cold+0x19/0x27
> > [ 51.891174] ? die_addr+0x3c/0x60
> > [ 51.891824] ? exc_general_protection+0x14f/0x430
> > [ 51.892754] ? asm_exc_general_protection+0x26/0x30
> > [ 51.893717] ? __free_cluster+0x27/0x90
> > [ 51.894483] ? __free_cluster+0x7e/0x90
> > [ 51.895245] swap_do_scheduled_discard+0x142/0x1b0
> > [ 51.896189] swap_discard_work+0x26/0x30
> > [ 51.896958] process_one_work+0x211/0x5a0
> > [ 51.897750] ? srso_alias_return_thunk+0x5/0xfbef5
> > [ 51.898693] worker_thread+0x1c9/0x3c0
> > [ 51.899438] ? __pfx_worker_thread+0x10/0x10
> > [ 51.900287] kthread+0xe3/0x110
> > [ 51.900913] ? __pfx_kthread+0x10/0x10
> > [ 51.901656] ret_from_fork+0x34/0x50
> > [ 51.902377] ? __pfx_kthread+0x10/0x10
> > [ 51.903114] ret_from_fork_asm+0x1a/0x30
> > [ 51.903896] </TASK>
> >
> >
> > Maybe related to this series?
>
> Right, I can reproduce your problem and I believe this patch can fix
> it, see the attachment.
>
> Hi Andrew, can you pick this patch too?
This issue is caused by my PATCH 9/9, and the attached patch can be
squashed into it. Very sorry for the problem.
^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>]
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
[not found] ` <87le17z9zr.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2024-08-16 7:36 ` Chris Li
2024-08-17 17:47 ` Kairui Song
0 siblings, 1 reply; 32+ messages in thread
From: Chris Li @ 2024-08-16 7:36 UTC (permalink / raw)
To: Huang, Ying
Cc: Kairui Song, David Hildenbrand, Andrew Morton, Hugh Dickins,
Ryan Roberts, Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Thu, Aug 8, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> [snip]
>
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> > lockdep_assert_held(&si->lock);
> > lockdep_assert_held(&ci->lock);
> >
> > - list_move_tail(&ci->list, &si->free_clusters);
> > + if (ci->flags)
> > + list_move_tail(&ci->list, &si->free_clusters);
> > + else
> > + list_add_tail(&ci->list, &si->free_clusters);
>
> If we use list_del_init() to delete the cluster, we can always use
> list_move_tail()? If so, the logic can be simplified.
Thanks for the suggestion.
I feel that list_del_init() generates more instructions than necessary.
It is my bad that I left the discard list without a list flag bit of
its own.
I do want to clean this up. While we are at it: because a cluster can
only belong to one list at a time, we can use an integer list indicator
rather than a bit mask. If we give the discard list the same treatment,
that should also remove the special case when adding a cluster to
another list.
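Roughly what I have in mind, with purely hypothetical names just to
illustrate the idea (nothing like this is implemented yet):
/* Hypothetical: record which list a cluster is on as one value, not bit flags. */
enum swap_cluster_list_id {
	CLUSTER_LIST_NONE = 0,	/* isolated, e.g. while discard work runs */
	CLUSTER_LIST_FREE,
	CLUSTER_LIST_NONFULL,
	CLUSTER_LIST_FRAG,
	CLUSTER_LIST_FULL,
	CLUSTER_LIST_DISCARD,
};
/*
 * A cluster is on at most one list at a time, so every transition can go
 * through one helper and there is no "is it already on a list?" special
 * case, as long as ci->list is always a valid list head.
 */
static void cluster_move(struct swap_info_struct *si,
			 struct swap_cluster_info *ci,
			 struct list_head *list,
			 enum swap_cluster_list_id id)
{
	lockdep_assert_held(&si->lock);
	list_move_tail(&ci->list, list);
	ci->list_id = id;	/* hypothetical field replacing the flag bits */
}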
Chris
> > ci->flags = CLUSTER_FLAG_FREE;
> > ci->order = 0;
> > }
> > @@ -474,7 +477,6 @@ static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > SWAPFILE_CLUSTER);
> >
> > spin_lock(&si->lock);
> > -
> > spin_lock(&ci->lock);
> > __free_cluster(si, ci);
> > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > @@ -666,7 +668,7 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> > if (ci->flags & CLUSTER_FLAG_FRAG)
> > si->frag_cluster_nr[ci->order]--;
> > list_move_tail(&ci->list, &si->full_clusters);
> > - ci->flags = 0;
> > + ci->flags = CLUSTER_FLAG_FULL;
> > }
> > }
>
> --
> Best Regards,
> Huang, Ying
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-08-16 7:36 ` Chris Li
@ 2024-08-17 17:47 ` Kairui Song
0 siblings, 0 replies; 32+ messages in thread
From: Kairui Song @ 2024-08-17 17:47 UTC (permalink / raw)
To: Chris Li
Cc: Huang, Ying, David Hildenbrand, Andrew Morton, Hugh Dickins,
Ryan Roberts, Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Fri, Aug 16, 2024 at 3:37 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Thu, Aug 8, 2024 at 1:40 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Kairui Song <ryncsn@gmail.com> writes:
> >
> > [snip]
> >
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -450,7 +450,10 @@ static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info
> > > lockdep_assert_held(&si->lock);
> > > lockdep_assert_held(&ci->lock);
> > >
> > > - list_move_tail(&ci->list, &si->free_clusters);
> > > + if (ci->flags)
> > > + list_move_tail(&ci->list, &si->free_clusters);
> > > + else
> > > + list_add_tail(&ci->list, &si->free_clusters);
> >
> > If we use list_del_init() to delete the cluster, we can always use
> > list_move_tail()? If so, the logic can be simplified.
>
> Thanks for the suggestion.
Hi All, thanks for the review and discussion.
> I feel that list_del_init() generates more instruction than necessary.
> It is my bad that I leave the discard list without not a list flag bit
> for it.
Right, list_del_init is indeed a little noisier than list_del.
But consider that after this patch, all non-discard clusters are already
always on a list (free/nonfull/full), and a cluster will be dangling
only while it is removed from the discard list (the discard work needs
to unlock si->lock, so the cluster has to be hidden from other racers).
I think it's good to use list_del_init when deleting from the discard
list in this patch; then list_move_tail can always be used when
changing the list of a cluster.
Discard should be a much less common operation, so this should
be OK.
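Roughly the pattern I mean, as a sketch against this series rather than
the exact final code:
	/*
	 * Discard path: take the cluster off the discard list, but keep
	 * ci->list a valid (empty) list head while si->lock is dropped.
	 */
	ci = list_first_entry(&si->discard_clusters,
			      struct swap_cluster_info, list);
	list_del_init(&ci->list);
	/* ... unlock si->lock, issue the discard, relock si->lock ... */
	/*
	 * Putting it back on the free list can now be an unconditional
	 * list_move_tail(): the isolated cluster still points at itself,
	 * so no list_add_tail() special case is needed in __free_cluster().
	 */
	list_move_tail(&ci->list, &si->free_clusters);
	ci->flags = CLUSTER_FLAG_FREE;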
^ permalink raw reply [flat|nested] 32+ messages in thread
[parent not found: <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>]
* Re: [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order
2024-07-31 6:49 [PATCH v5 0/9] mm: swap: mTHP swap allocator base on swap cluster order Chris Li
` (10 preceding siblings ...)
[not found] ` <87h6bw3gxl.fsf@yhuang6-desk2.ccr.corp.intel.com>
@ 2024-09-02 1:20 ` Andrew Morton
11 siblings, 0 replies; 32+ messages in thread
From: Andrew Morton @ 2024-09-02 1:20 UTC (permalink / raw)
To: Chris Li
Cc: Kairui Song, Hugh Dickins, Ryan Roberts, Huang, Ying,
Kalesh Singh, linux-kernel, linux-mm, Barry Song
On Tue, 30 Jul 2024 23:49:12 -0700 Chris Li <chrisl@kernel.org> wrote:
> This is the short term solutions "swap cluster order" listed
> in my "Swap Abstraction" discussion slice 8 in the recent
> LSF/MM conference.
Well this patchset hasn't needed a -fix patch since Aug 5, which is
hopeful.
How do people feel about moving this series into mm-stable for the next
merge window?
A few notes I've made are
https://lkml.kernel.org/r/CACePvbX1EWQk03YcC47s7+vn40kEFb_3wp3D_GmJV-8Fn2j+=Q@mail.gmail.com
https://lkml.kernel.org/r/87bk23250r.fsf@yhuang6-desk2.ccr.corp.intel.com
https://lkml.kernel.org/r/CACePvbU9NUgfD-QC-hdYEmeMbLc5whOKeW4hb2ijcPonZ4i6mg@mail.gmail.com
Have these all been resolved?
Thanks.
^ permalink raw reply [flat|nested] 32+ messages in thread