* + mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer.patch added to mm-new branch
@ 2025-12-20 22:05 Andrew Morton
From: Andrew Morton @ 2025-12-20 22:05 UTC (permalink / raw)
To: mm-commits, yosry.ahmed, rafael, nphamcs, chrisl, bhe,
baolin.wang, baohua, kasong, akpm
The patch titled
Subject: mm, swap: use swap cache as the swap in synchronize layer
has been added to the -mm mm-new branch. Its filename is
mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer.patch
This patch will shortly appear at
https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer.patch
This patch will later appear in the mm-new branch at
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Note, mm-new is a provisional staging ground for work-in-progress
patches, and acceptance into mm-new is a notification for others to take
notice and to finish up reviews. Please do not hesitate to respond to
review feedback and post updated versions to replace or incrementally
fixup patches in mm-new.
The mm-new branch of mm.git is not included in linux-next.
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/process/submit-checklist.rst when testing your code ***
The -mm tree is included into linux-next via various
branches at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
and is updated there most days
------------------------------------------------------
From: Kairui Song <kasong@tencent.com>
Subject: mm, swap: use swap cache as the swap in synchronize layer
Date: Sat, 20 Dec 2025 03:43:41 +0800
Swap-in synchronization currently relies mostly on the swap_map's
SWAP_HAS_CACHE bit: whoever sets the bit first does the actual work of
swapping in a folio. This has caused many issues because it is just a
poor implementation of a bit lock. Racing users have no idea what is
pinning a slot, so they have to loop with
schedule_timeout_uninterruptible(1), which is ugly and causes long-tail
latency and other performance issues. Besides, the abuse of
SWAP_HAS_CACHE has caused many other synchronization and maintenance
problems.
This is the first step to remove this bit completely.
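To illustrate why the bit behaves like a poor lock, here is a minimal
userspace sketch. It is not kernel code: struct slot_model, the model_*
helpers and the MODEL_HAS_CACHE value are all hypothetical stand-ins. The
point is that a loser sees the pin bit but has no object to sleep on, so
all it can do is poll until the folio shows up.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for SWAP_HAS_CACHE; the value here is only illustrative. */
#define MODEL_HAS_CACHE 0x40u

/* Hypothetical userspace model of one swap slot under the old scheme. */
struct slot_model {
	atomic_uint map;      /* models the swap_map byte, including the pin bit */
	void *_Atomic folio;  /* models the swap cache entry, filled in later */
};

/* Returns true only for the caller that set the pin bit first. */
static bool model_try_pin(struct slot_model *s)
{
	unsigned int old = atomic_fetch_or(&s->map, MODEL_HAS_CACHE);

	return !(old & MODEL_HAS_CACHE);
}

/* The winner publishes its folio some time after pinning the slot. */
static void model_publish(struct slot_model *s, void *folio)
{
	assert(folio != NULL);
	atomic_store(&s->folio, folio);
}

/*
 * A loser sees the bit set but may find no folio yet. It has nothing to
 * sleep on, so it polls; the kernel code slept one tick per iteration.
 * Returns the folio, or NULL if it is still absent after max_polls.
 */
static void *model_loser_poll(struct slot_model *s, int max_polls, int *polls)
{
	void *folio = NULL;

	for (*polls = 0; *polls < max_polls; (*polls)++) {
		folio = atomic_load(&s->folio);
		if (folio)
			break;
		/* kernel: schedule_timeout_uninterruptible(1) */
	}
	return folio;
}
```

In this model the window between model_try_pin() and model_publish() is
exactly where a racing caller burns polls, which corresponds to the
one-tick sleeps in the old loop.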
All swap-in paths now go through the swap cache, and both the swap cache
and the swap map are protected by the cluster lock. So swap-in
synchronization can be resolved directly in the swap cache layer, using
the cluster lock and the folio lock. Whoever inserts a folio into the
swap cache first does the swap-in work. And because folios are locked
during swap operations, other racing swap operations simply wait on the
folio lock.
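The insert-first-wins rule can be sketched in userspace as follows. This
is only a simplified model under stated assumptions: struct folio_model,
struct cluster_model and the model_* helpers are hypothetical, an
atomic_flag stands in for the cluster spinlock, and the real code handles
more cases (large folios, a folio vanishing before lookup, memcg charging).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

#define MODEL_SLOTS 8

/* Hypothetical stand-in for struct folio; only the lock state is modelled. */
struct folio_model {
	int locked;
};

/* Hypothetical stand-in for a swap cluster: one lock guarding one table. */
struct cluster_model {
	atomic_flag lock;                        /* models the cluster spinlock */
	struct folio_model *table[MODEL_SLOTS];  /* models the swap table */
};

static void model_cluster_lock(struct cluster_model *ci)
{
	while (atomic_flag_test_and_set_explicit(&ci->lock, memory_order_acquire))
		;
}

static void model_cluster_unlock(struct cluster_model *ci)
{
	atomic_flag_clear_explicit(&ci->lock, memory_order_release);
}

/* Insert a locked folio; fails with -EEXIST if the slot already has one. */
static int model_cache_add(struct cluster_model *ci, unsigned int off,
			   struct folio_model *folio)
{
	int err = 0;

	assert(off < MODEL_SLOTS && folio->locked);
	model_cluster_lock(ci);
	if (ci->table[off])
		err = -EEXIST;
	else
		ci->table[off] = folio;
	model_cluster_unlock(ci);
	return err;
}

/*
 * Returns the folio the caller should use. If it is @mine, the caller won
 * and does the read. Otherwise the caller lost and would wait on the
 * returned folio's lock instead of polling.
 */
static struct folio_model *model_swapin(struct cluster_model *ci,
					unsigned int off,
					struct folio_model *mine)
{
	struct folio_model *found;

	mine->locked = 1;
	if (!model_cache_add(ci, off, mine))
		return mine;

	mine->locked = 0;
	model_cluster_lock(ci);
	found = ci->table[off];
	model_cluster_unlock(ci);
	return found;
}
```

Unlike the bit-based scheme, a loser here always ends up holding a pointer
to the winner's folio, so it has a concrete lock to wait on.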
SWAP_HAS_CACHE will be removed in a later commit. For now, we still set
it for some remaining users, but the bit is now set in the same critical
section that adds the folio to the swap cache, after the swap cache is
ready. No one has to spin on the SWAP_HAS_CACHE bit anymore.
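As a rough userspace sketch of "set the bit and add the folio in the same
critical section", the following model checks every slot of a range first
and only then writes, loosely mirroring the two loops in
swap_cache_add_folio() in the diff below. The names struct range_model,
model_range_add() and MODEL_HAS_CACHE are hypothetical, the caller is
assumed to already hold the lock protecting both arrays, and the -EINVAL
bounds check is an addition of this sketch, not something the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_SLOTS     8
#define MODEL_HAS_CACHE 0x40u  /* stand-in for SWAP_HAS_CACHE; illustrative */

/* Hypothetical model of a cluster: per-slot map bytes and table entries. */
struct range_model {
	unsigned char map[MODEL_SLOTS];  /* swap count plus the pin bit */
	void *table[MODEL_SLOTS];        /* cached folio, or NULL */
};

/*
 * Insert @folio over slots [start, start + nr). The caller is assumed to
 * hold the lock that protects both arrays (the cluster lock in the patch).
 * Pass 1 only checks; pass 2 only writes, so a failure leaves no trace.
 */
static int model_range_add(struct range_model *ci, unsigned int start,
			   unsigned int nr, void *folio)
{
	unsigned int i;

	assert(folio != NULL);
	if (start + nr > MODEL_SLOTS)
		return -EINVAL;

	for (i = start; i < start + nr; i++) {
		if (ci->table[i])
			return -EEXIST;  /* someone else inserted first */
		if (!(ci->map[i] & ~MODEL_HAS_CACHE))
			return -ENOENT;  /* no swap count left on this slot */
	}

	for (i = start; i < start + nr; i++) {
		ci->map[i] |= MODEL_HAS_CACHE;  /* pin the slot and ... */
		ci->table[i] = folio;           /* ... publish the folio together */
	}
	return 0;
}
```

Because nothing is written until every slot has passed the checks, no
other lock holder can observe a pinned slot without its folio, which is
why nobody needs to spin on the bit in this model.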
This both simplifies the logic and should improve performance. It
eliminates issues like the one solved in commit 01626a1823024 ("mm: avoid
unconditional one-tick sleep when swapcache_prepare fails"), and removes
the need for the "skip_if_exists" workaround from commit a65b0e7607ccb
("zswap: make shrinking memcg-aware"), which will be removed very soon.
Link: https://lkml.kernel.org/r/20251220-swap-table-p2-v5-12-8862a265a033@tencent.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Cc: Yosry Ahmed <yosry.ahmed@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/swap.h | 6 --
mm/swap.h | 15 +++++
mm/swap_state.c | 105 ++++++++++++++++++++++-------------------
mm/swapfile.c | 39 +++++++++------
mm/vmscan.c | 1
5 files changed, 96 insertions(+), 70 deletions(-)
--- a/include/linux/swap.h~mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer
+++ a/include/linux/swap.h
@@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio,
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
@@ -516,11 +515,6 @@ static inline int swap_duplicate_nr(swp_
{
return 0;
}
-
-static inline int swapcache_prepare(swp_entry_t swp, int nr)
-{
- return 0;
-}
static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
--- a/mm/swapfile.c~mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer
+++ a/mm/swapfile.c
@@ -1476,7 +1476,11 @@ again:
if (!entry.val)
return -ENOMEM;
- swap_cache_add_folio(folio, entry, NULL);
+ /*
+ * Allocator has pinned the slots with SWAP_HAS_CACHE
+ * so it should never fail
+ */
+ WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
return 0;
@@ -1582,9 +1586,8 @@ static unsigned char swap_entry_put_lock
* do_swap_page()
* ... swapoff+swapon
* swap_cache_alloc_folio()
- * swapcache_prepare()
- * __swap_duplicate()
- * // check swap_map
+ * swap_cache_add_folio()
+ * // check swap_map
* // verify PTE not changed
*
* In __swap_duplicate(), the swap_map need to be checked before
@@ -3769,17 +3772,25 @@ int swap_duplicate_nr(swp_entry_t entry,
return err;
}
-/*
- * @entry: first swap entry from which we allocate nr swap cache.
- *
- * Called when allocating swap cache for existing swap entries,
- * This can return error codes. Returns 0 at success.
- * -EEXIST means there is a swap cache.
- * Note: return code is different from swap_duplicate().
- */
-int swapcache_prepare(swp_entry_t entry, int nr)
+/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */
+void __swapcache_set_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry)
+{
+ WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1));
+}
+
+/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock */
+void __swapcache_clear_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int nr)
{
- return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
+ if (swap_only_has_cache(si, swp_offset(entry), nr)) {
+ swap_entries_free(si, ci, entry, nr);
+ } else {
+ for (int i = 0; i < nr; i++, entry.val++)
+ swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ }
}
/*
--- a/mm/swap.h~mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer
+++ a/mm/swap.h
@@ -234,6 +234,14 @@ static inline bool folio_matches_swap_en
return folio_entry.val == round_down(entry.val, nr_pages);
}
+/* Temporary internal helpers */
+void __swapcache_set_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry);
+void __swapcache_clear_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int nr);
+
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stabilize the device by any of the following ways:
@@ -247,7 +255,8 @@ static inline bool folio_matches_swap_en
*/
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadow, bool alloc);
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx,
@@ -413,8 +422,10 @@ static inline void *swap_cache_get_shado
return NULL;
}
-static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadow, bool alloc)
{
+ return -ENOENT;
}
static inline void swap_cache_del_folio(struct folio *folio)
--- a/mm/swap_state.c~mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer
+++ a/mm/swap_state.c
@@ -128,34 +128,64 @@ void *swap_cache_get_shadow(swp_entry_t
* @entry: The swap entry corresponding to the folio.
* @gfp: gfp_mask for XArray node allocation.
* @shadowp: If a shadow is found, return the shadow.
+ * @alloc: If it's the allocator that is trying to insert a folio. Allocator
+ * sets SWAP_HAS_CACHE to pin slots before insert so skip map update.
*
* Context: Caller must ensure @entry is valid and protect the swap device
* with reference count or locks.
- * The caller also needs to update the corresponding swap_map slots with
- * SWAP_HAS_CACHE bit to avoid race or conflict.
*/
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadowp, bool alloc)
{
+ int err;
void *shadow = NULL;
+ struct swap_info_struct *si;
unsigned long old_tb, new_tb;
struct swap_cluster_info *ci;
- unsigned int ci_start, ci_off, ci_end;
+ unsigned int ci_start, ci_off, ci_end, offset;
unsigned long nr_pages = folio_nr_pages(folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+ si = __swap_entry_to_info(entry);
new_tb = folio_to_swp_tb(folio);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
ci_off = ci_start;
- ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+ offset = swp_offset(entry);
+ ci = swap_cluster_lock(si, swp_offset(entry));
+ if (unlikely(!ci->table)) {
+ err = -ENOENT;
+ goto failed;
+ }
do {
- old_tb = __swap_table_xchg(ci, ci_off, new_tb);
- WARN_ON_ONCE(swp_tb_is_folio(old_tb));
+ old_tb = __swap_table_get(ci, ci_off);
+ if (unlikely(swp_tb_is_folio(old_tb))) {
+ err = -EEXIST;
+ goto failed;
+ }
+ if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
+ err = -ENOENT;
+ goto failed;
+ }
if (swp_tb_is_shadow(old_tb))
shadow = swp_tb_to_shadow(old_tb);
+ offset++;
+ } while (++ci_off < ci_end);
+
+ ci_off = ci_start;
+ offset = swp_offset(entry);
+ do {
+ /*
+ * Still need to pin the slots with SWAP_HAS_CACHE since
+ * swap allocator depends on that.
+ */
+ if (!alloc)
+ __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
+ __swap_table_set(ci, ci_off, new_tb);
+ offset++;
} while (++ci_off < ci_end);
folio_ref_add(folio, nr_pages);
@@ -168,6 +198,11 @@ void swap_cache_add_folio(struct folio *
if (shadowp)
*shadowp = shadow;
+ return 0;
+
+failed:
+ swap_cluster_unlock(ci);
+ return err;
}
/**
@@ -186,6 +221,7 @@ void swap_cache_add_folio(struct folio *
void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
swp_entry_t entry, void *shadow)
{
+ struct swap_info_struct *si;
unsigned long old_tb, new_tb;
unsigned int ci_start, ci_off, ci_end;
unsigned long nr_pages = folio_nr_pages(folio);
@@ -195,6 +231,7 @@ void __swap_cache_del_folio(struct swap_
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+ si = __swap_entry_to_info(entry);
new_tb = shadow_swp_to_tb(shadow);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
@@ -210,6 +247,7 @@ void __swap_cache_del_folio(struct swap_
folio_clear_swapcache(folio);
node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
+ __swapcache_clear_cached(si, ci, entry, nr_pages);
}
/**
@@ -231,7 +269,6 @@ void swap_cache_del_folio(struct folio *
__swap_cache_del_folio(ci, folio, entry, NULL);
swap_cluster_unlock(ci);
- put_swap_folio(folio, entry);
folio_ref_sub(folio, folio_nr_pages(folio));
}
@@ -423,67 +460,37 @@ static struct folio *__swap_cache_prepar
gfp_t gfp, bool charged,
bool skip_if_exists)
{
- struct folio *swapcache;
+ struct folio *swapcache = NULL;
void *shadow;
int ret;
- /*
- * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
- * into the swap cache. Loop with a schedule delay if raced with
- * another process setting SWAP_HAS_CACHE. This hackish loop will
- * be fixed very soon.
- */
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
for (;;) {
- ret = swapcache_prepare(entry, folio_nr_pages(folio));
+ ret = swap_cache_add_folio(folio, entry, &shadow, false);
if (!ret)
break;
/*
- * The skip_if_exists is for protecting against a recursive
- * call to this helper on the same entry waiting forever
- * here because SWAP_HAS_CACHE is set but the folio is not
- * in the swap cache yet. This can happen today if
- * mem_cgroup_swapin_charge_folio() below triggers reclaim
- * through zswap, which may call this helper again in the
- * writeback path.
- *
- * Large order allocation also needs special handling on
+ * Large order allocation needs special handling on
* race: if a smaller folio exists in cache, swapin needs
* to fallback to order 0, and doing a swap cache lookup
* might return a folio that is irrelevant to the faulting
* entry because @entry is aligned down. Just return NULL.
*/
if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
- return NULL;
+ goto failed;
- /*
- * Check the swap cache again, we can only arrive
- * here because swapcache_prepare returns -EEXIST.
- */
swapcache = swap_cache_get_folio(entry);
if (swapcache)
- return swapcache;
-
- /*
- * We might race against __swap_cache_del_folio(), and
- * stumble across a swap_map entry whose SWAP_HAS_CACHE
- * has not yet been cleared. Or race against another
- * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
- * in swap_map, but not yet added its folio to swap cache.
- */
- schedule_timeout_uninterruptible(1);
+ goto failed;
}
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
- put_swap_folio(folio, entry);
- folio_unlock(folio);
- return NULL;
+ swap_cache_del_folio(folio);
+ goto failed;
}
- swap_cache_add_folio(folio, entry, &shadow);
memcg1_swapin(entry, folio_nr_pages(folio));
if (shadow)
workingset_refault(folio, shadow);
@@ -491,6 +498,10 @@ static struct folio *__swap_cache_prepar
/* Caller will initiate read into locked folio */
folio_add_lru(folio);
return folio;
+
+failed:
+ folio_unlock(folio);
+ return swapcache;
}
/**
--- a/mm/vmscan.c~mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer
+++ a/mm/vmscan.c
@@ -761,7 +761,6 @@ static int __remove_mapping(struct addre
__swap_cache_del_folio(ci, folio, swap, shadow);
memcg1_swapout(folio, swap);
swap_cluster_unlock_irq(ci);
- put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
_
Patches currently in -mm which might be from kasong@tencent.com are
mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio.patch
mm-swap-split-swap-cache-preparation-loop-into-a-standalone-helper.patch
mm-swap-never-bypass-the-swap-cache-even-for-swp_synchronous_io.patch
mm-swap-always-try-to-free-swap-cache-for-swp_synchronous_io-devices.patch
mm-swap-simplify-the-code-and-reduce-indention.patch
mm-swap-free-the-swap-cache-after-folio-is-mapped.patch
mm-shmem-never-bypass-the-swap-cache-for-swp_synchronous_io.patch
mm-swap-swap-entry-of-a-bad-slot-should-not-be-considered-as-swapped-out.patch
mm-swap-consolidate-cluster-reclaim-and-usability-check.patch
mm-swap-split-locked-entry-duplicating-into-a-standalone-helper.patch
mm-swap-use-swap-cache-as-the-swap-in-synchronize-layer.patch
mm-swap-remove-workaround-for-unsynchronized-swap-map-cache-state.patch
mm-swap-cleanup-swap-entry-management-workflow.patch
mm-swap-add-folio-to-swap-cache-directly-on-allocation.patch
mm-swap-check-swap-table-directly-for-checking-cache.patch
mm-swap-clean-up-and-improve-swap-entries-freeing.patch
mm-swap-drop-the-swap_has_cache-flag.patch
mm-swap-remove-no-longer-needed-_swap_info_get.patch