* [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-30 22:53 ` Yosry Ahmed
2025-10-29 15:58 ` [PATCH 02/19] mm, swap: split swap cache preparation loop into a standalone helper Kairui Song
` (19 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
__read_swap_cache_async is widely used to allocate a folio and ensure it is
in the swap cache, or to return the folio if one is already there.
It's not async, and it's not doing any read. Rename it to better reflect its
usage, and prepare for it to be reworked as part of the new swap cache APIs.
Also add some comments for the function. Worth noting that the
skip_if_exists argument is a long-existing workaround that will be
dropped soon.
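For illustration, a minimal sketch of the typical caller pattern after the
rename, condensed from the callers updated in the hunks below (folio is a
struct folio * and page_allocated a bool, exactly as in those callers):
	/* Returns the cached folio, or a newly allocated one bound to @entry. */
	folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
				       &page_allocated, false);
	if (folio && page_allocated)
		swap_read_folio(folio, NULL);	/* caller still issues the read */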
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 6 +++---
mm/swap_state.c | 49 ++++++++++++++++++++++++++++++++-----------------
mm/swapfile.c | 2 +-
mm/zswap.c | 4 ++--
4 files changed, 38 insertions(+), 23 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index d034c13d8dd2..0fff92e42cfe 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -249,6 +249,9 @@ struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
void swap_cache_del_folio(struct folio *folio);
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool *alloced, bool skip_if_exists);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_del_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry, void *shadow);
@@ -261,9 +264,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
struct swap_iocb **plug);
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists);
struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b13e9c4baa90..7765b9474632 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,9 +402,28 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
}
}
-struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
- struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
- bool skip_if_exists)
+/**
+ * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
+ * @entry: the swapped out swap entry to be binded to the folio.
+ * @gfp_mask: memory allocation flags
+ * @mpol: NUMA memory allocation policy to be applied
+ * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
+ * @new_page_allocated: sets true if allocation happened, false otherwise
+ * @skip_if_exists: if the slot is a partially cached state, return NULL.
+ * This is a workaround that would be removed shortly.
+ *
+ * Allocate a folio in the swap cache for one swap slot, typically before
+ * doing IO (swap in or swap out). The swap slot indicated by @entry must
+ * have a non-zero swap count (swapped out). Currently only supports order 0.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the existing folio if @entry is cached already. Returns
+ * NULL if failed due to -ENOMEM or @entry have a swap count < 1.
+ */
+struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
+ struct mempolicy *mpol, pgoff_t ilx,
+ bool *new_page_allocated,
+ bool skip_if_exists)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
@@ -452,12 +471,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
goto put_and_return;
/*
- * Protect against a recursive call to __read_swap_cache_async()
+ * Protect against a recursive call to swap_cache_alloc_folio()
* on the same entry waiting forever here because SWAP_HAS_CACHE
* is set but the folio is not the swap cache yet. This can
* happen today if mem_cgroup_swapin_charge_folio() below
* triggers reclaim through zswap, which may call
- * __read_swap_cache_async() in the writeback path.
+ * swap_cache_alloc_folio() in the writeback path.
*/
if (skip_if_exists)
goto put_and_return;
@@ -466,7 +485,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* We might race against __swap_cache_del_folio(), and
* stumble across a swap_map entry whose SWAP_HAS_CACHE
* has not yet been cleared. Or race against another
- * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
+ * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
* in swap_map, but not yet added its folio to swap cache.
*/
schedule_timeout_uninterruptible(1);
@@ -509,10 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* and reading the disk if it is not already cached.
* A failure return means that either the page allocation failed or that
* the swap entry is no longer in use.
- *
- * get/put_swap_device() aren't needed to call this function, because
- * __read_swap_cache_async() call them and swap_read_folio() holds the
- * swap cache folio lock.
*/
struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
@@ -529,7 +544,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
return NULL;
mpol = get_vma_policy(vma, addr, 0, &ilx);
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
mpol_cond_put(mpol);
@@ -647,9 +662,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
blk_start_plug(&plug);
for (offset = start_offset; offset <= end_offset ; offset++) {
/* Ok, do the async read-ahead now */
- folio = __read_swap_cache_async(
- swp_entry(swp_type(entry), offset),
- gfp_mask, mpol, ilx, &page_allocated, false);
+ folio = swap_cache_alloc_folio(
+ swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
+ &page_allocated, false);
if (!folio)
continue;
if (page_allocated) {
@@ -666,7 +681,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
lru_add_drain(); /* Push any new pages onto the LRU now */
skip:
/* The page was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
@@ -761,7 +776,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
continue;
pte_unmap(pte);
pte = NULL;
- folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
+ folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
&page_allocated, false);
if (!folio)
continue;
@@ -781,7 +796,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
lru_add_drain();
skip:
/* The folio was likely read above, so no need for plugging here */
- folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
+ folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
&page_allocated, false);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index c35bb8593f50..849be32377d9 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1573,7 +1573,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* CPU1 CPU2
* do_swap_page()
* ... swapoff+swapon
- * __read_swap_cache_async()
+ * swap_cache_alloc_folio()
* swapcache_prepare()
* __swap_duplicate()
* // check swap_map
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..a7a2443912f4 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
return -EEXIST;
mpol = get_task_policy(current);
- folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+ folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
+ NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
put_swap_device(si);
if (!folio)
return -ENOMEM;
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
2025-10-29 15:58 ` [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
@ 2025-10-30 22:53 ` Yosry Ahmed
[not found] ` <CAGsJ_4x1P0ypm70De7qDcDxqvY93GEPW6X2sBS_xfSUem5_S2w@mail.gmail.com>
0 siblings, 1 reply; 50+ messages in thread
From: Yosry Ahmed @ 2025-10-30 22:53 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:27PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
> __read_swap_cache_async is widely used to allocate and ensure a folio is
> in swapcache, or get the folio if a folio is already there.
>
> It's not async, and it's not doing any read. Rename it to better present
> its usage, and prepare to be reworked as part of new swap cache APIs.
>
> Also, add some comments for the function. Worth noting that the
> skip_if_exists argument is an long existing workaround that will be
> dropped soon.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap.h | 6 +++---
> mm/swap_state.c | 49 ++++++++++++++++++++++++++++++++-----------------
> mm/swapfile.c | 2 +-
> mm/zswap.c | 4 ++--
> 4 files changed, 38 insertions(+), 23 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index d034c13d8dd2..0fff92e42cfe 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -249,6 +249,9 @@ struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
> void swap_cache_del_folio(struct folio *folio);
> +struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> + struct mempolicy *mpol, pgoff_t ilx,
> + bool *alloced, bool skip_if_exists);
> /* Below helpers require the caller to lock and pass in the swap cluster. */
> void __swap_cache_del_folio(struct swap_cluster_info *ci,
> struct folio *folio, swp_entry_t entry, void *shadow);
> @@ -261,9 +264,6 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr);
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> struct swap_iocb **plug);
> -struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_flags,
> - struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
> - bool skip_if_exists);
> struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
> struct mempolicy *mpol, pgoff_t ilx);
> struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index b13e9c4baa90..7765b9474632 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -402,9 +402,28 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
> }
> }
>
> -struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> - struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
> - bool skip_if_exists)
> +/**
> + * swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
> + * @entry: the swapped out swap entry to be binded to the folio.
> + * @gfp_mask: memory allocation flags
> + * @mpol: NUMA memory allocation policy to be applied
> + * @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
> + * @new_page_allocated: sets true if allocation happened, false otherwise
> + * @skip_if_exists: if the slot is a partially cached state, return NULL.
> + * This is a workaround that would be removed shortly.
> + *
> + * Allocate a folio in the swap cache for one swap slot, typically before
> + * doing IO (swap in or swap out). The swap slot indicated by @entry must
> + * have a non-zero swap count (swapped out). Currently only supports order 0.
Is it used for swap in? That's confusing because the next sentence
mentions that it needs to be already swapped out.
I suspect you're referring to the zswap writeback use case, but in this
case we're still "swapping-in" the folio from zswap to swap it out to
disk. I'd avoid mentioning swap in here because it's confusing.
Otherwise LGTM:
Reviewed-by: Yosry Ahmed <yosry.ahmed@linux.dev>
> + *
> + * Context: Caller must protect the swap device with reference count or locks.
> + * Return: Returns the existing folio if @entry is cached already. Returns
> + * NULL if failed due to -ENOMEM or @entry have a swap count < 1.
> + */
> +struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
> + struct mempolicy *mpol, pgoff_t ilx,
> + bool *new_page_allocated,
> + bool skip_if_exists)
> {
> struct swap_info_struct *si = __swap_entry_to_info(entry);
> struct folio *folio;
> @@ -452,12 +471,12 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> goto put_and_return;
>
> /*
> - * Protect against a recursive call to __read_swap_cache_async()
> + * Protect against a recursive call to swap_cache_alloc_folio()
> * on the same entry waiting forever here because SWAP_HAS_CACHE
> * is set but the folio is not the swap cache yet. This can
> * happen today if mem_cgroup_swapin_charge_folio() below
> * triggers reclaim through zswap, which may call
> - * __read_swap_cache_async() in the writeback path.
> + * swap_cache_alloc_folio() in the writeback path.
> */
> if (skip_if_exists)
> goto put_and_return;
> @@ -466,7 +485,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * We might race against __swap_cache_del_folio(), and
> * stumble across a swap_map entry whose SWAP_HAS_CACHE
> * has not yet been cleared. Or race against another
> - * __read_swap_cache_async(), which has set SWAP_HAS_CACHE
> + * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
> * in swap_map, but not yet added its folio to swap cache.
> */
> schedule_timeout_uninterruptible(1);
> @@ -509,10 +528,6 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * and reading the disk if it is not already cached.
> * A failure return means that either the page allocation failed or that
> * the swap entry is no longer in use.
> - *
> - * get/put_swap_device() aren't needed to call this function, because
> - * __read_swap_cache_async() call them and swap_read_folio() holds the
> - * swap cache folio lock.
> */
> struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> @@ -529,7 +544,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> return NULL;
>
> mpol = get_vma_policy(vma, addr, 0, &ilx);
> - folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> + folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
> &page_allocated, false);
> mpol_cond_put(mpol);
>
> @@ -647,9 +662,9 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> blk_start_plug(&plug);
> for (offset = start_offset; offset <= end_offset ; offset++) {
> /* Ok, do the async read-ahead now */
> - folio = __read_swap_cache_async(
> - swp_entry(swp_type(entry), offset),
> - gfp_mask, mpol, ilx, &page_allocated, false);
> + folio = swap_cache_alloc_folio(
> + swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
> + &page_allocated, false);
> if (!folio)
> continue;
> if (page_allocated) {
> @@ -666,7 +681,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> lru_add_drain(); /* Push any new pages onto the LRU now */
> skip:
> /* The page was likely read above, so no need for plugging here */
> - folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> + folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
> &page_allocated, false);
> if (unlikely(page_allocated))
> swap_read_folio(folio, NULL);
> @@ -761,7 +776,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> continue;
> pte_unmap(pte);
> pte = NULL;
> - folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> + folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
> &page_allocated, false);
> if (!folio)
> continue;
> @@ -781,7 +796,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
> lru_add_drain();
> skip:
> /* The folio was likely read above, so no need for plugging here */
> - folio = __read_swap_cache_async(targ_entry, gfp_mask, mpol, targ_ilx,
> + folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
> &page_allocated, false);
> if (unlikely(page_allocated))
> swap_read_folio(folio, NULL);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index c35bb8593f50..849be32377d9 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1573,7 +1573,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
> * CPU1 CPU2
> * do_swap_page()
> * ... swapoff+swapon
> - * __read_swap_cache_async()
> + * swap_cache_alloc_folio()
> * swapcache_prepare()
> * __swap_duplicate()
> * // check swap_map
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 5d0f8b13a958..a7a2443912f4 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1014,8 +1014,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> return -EEXIST;
>
> mpol = get_task_policy(current);
> - folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> - NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> + folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
> + NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> put_swap_device(si);
> if (!folio)
> return -ENOMEM;
>
> --
> 2.51.1
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 02/19] mm, swap: split swap cache preparation loop into a standalone helper
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
2025-10-29 15:58 ` [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
` (18 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
To prepare for the removal of swap-cache-bypassing swapin, introduce a new
helper that accepts a freshly allocated and charged folio, prepares the
folio and the swap map, and then adds the folio to the swap cache.
This doesn't change how the swap cache works yet; we still depend on
SWAP_HAS_CACHE in the swap map for synchronization. But all the
synchronization hacks are now contained in this single helper.
No feature change.
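For reference, a condensed sketch of what swap_cache_alloc_folio() becomes
after this split, taken directly from the mm/swap_state.c hunk below; all
the SWAP_HAS_CACHE pinning and the retry loop now live in the new helper:
	/* Allocate a new folio to be added into the swap cache. */
	folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
	if (!folio)
		return NULL;
	/* Try to add it; returns an existing folio or NULL on failure. */
	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
					      false, skip_if_exists);
	if (result == folio)
		*new_page_allocated = true;
	else
		folio_put(folio);
	return result;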
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap_state.c | 197 +++++++++++++++++++++++++++++++-------------------------
1 file changed, 109 insertions(+), 88 deletions(-)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 7765b9474632..d18ca765c04f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,6 +402,97 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
}
}
+/**
+ * __swap_cache_prepare_and_add - Prepare the folio and add it to swap cache.
+ * @entry: swap entry to be bound to the folio.
+ * @folio: folio to be added.
+ * @gfp: memory allocation flags for charge, can be 0 if @charged if true.
+ * @charged: if the folio is already charged.
+ * @skip_if_exists: if the slot is in a cached state, return NULL.
+ * This is an old workaround that will be removed shortly.
+ *
+ * Update the swap_map and add folio as swap cache, typically before swapin.
+ * All swap slots covered by the folio must have a non-zero swap count.
+ *
+ * Context: Caller must protect the swap device with reference count or locks.
+ * Return: Returns the folio being added on success. Returns the existing
+ * folio if @entry is cached. Returns NULL if raced with swapin or swapoff.
+ */
+static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
+ struct folio *folio,
+ gfp_t gfp, bool charged,
+ bool skip_if_exists)
+{
+ struct folio *swapcache;
+ void *shadow;
+ int ret;
+
+ /*
+ * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
+ * into the swap cache. Loop with a schedule delay if raced with
+ * another process setting SWAP_HAS_CACHE. This hackish loop will
+ * be fixed very soon.
+ */
+ for (;;) {
+ ret = swapcache_prepare(entry, folio_nr_pages(folio));
+ if (!ret)
+ break;
+
+ /*
+ * The skip_if_exists is for protecting against a recursive
+ * call to this helper on the same entry waiting forever
+ * here because SWAP_HAS_CACHE is set but the folio is not
+ * in the swap cache yet. This can happen today if
+ * mem_cgroup_swapin_charge_folio() below triggers reclaim
+ * through zswap, which may call this helper again in the
+ * writeback path.
+ *
+ * Large order allocation also needs special handling on
+ * race: if a smaller folio exists in cache, swapin needs
+ * to fallback to order 0, and doing a swap cache lookup
+ * might return a folio that is irrelevant to the faulting
+ * entry because @entry is aligned down. Just return NULL.
+ */
+ if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+ return NULL;
+
+ /*
+ * Check the swap cache again, we can only arrive
+ * here because swapcache_prepare returns -EEXIST.
+ */
+ swapcache = swap_cache_get_folio(entry);
+ if (swapcache)
+ return swapcache;
+
+ /*
+ * We might race against __swap_cache_del_folio(), and
+ * stumble across a swap_map entry whose SWAP_HAS_CACHE
+ * has not yet been cleared. Or race against another
+ * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
+ * in swap_map, but not yet added its folio to swap cache.
+ */
+ schedule_timeout_uninterruptible(1);
+ }
+
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
+
+ if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
+ put_swap_folio(folio, entry);
+ folio_unlock(folio);
+ return NULL;
+ }
+
+ swap_cache_add_folio(folio, entry, &shadow);
+ memcg1_swapin(entry, folio_nr_pages(folio));
+ if (shadow)
+ workingset_refault(folio, shadow);
+
+ /* Caller will initiate read into locked folio */
+ folio_add_lru(folio);
+ return folio;
+}
+
/**
* swap_cache_alloc_folio - Allocate folio for swapped out slot in swap cache.
* @entry: the swapped out swap entry to be binded to the folio.
@@ -427,99 +518,29 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
- struct folio *new_folio = NULL;
struct folio *result = NULL;
- void *shadow = NULL;
*new_page_allocated = false;
- for (;;) {
- int err;
-
- /*
- * Check the swap cache first, if a cached folio is found,
- * return it unlocked. The caller will lock and check it.
- */
- folio = swap_cache_get_folio(entry);
- if (folio)
- goto got_folio;
-
- /*
- * Just skip read ahead for unused swap slot.
- */
- if (!swap_entry_swapped(si, entry))
- goto put_and_return;
-
- /*
- * Get a new folio to read into from swap. Allocate it now if
- * new_folio not exist, before marking swap_map SWAP_HAS_CACHE,
- * when -EEXIST will cause any racers to loop around until we
- * add it to cache.
- */
- if (!new_folio) {
- new_folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
- if (!new_folio)
- goto put_and_return;
- }
-
- /*
- * Swap entry may have been freed since our caller observed it.
- */
- err = swapcache_prepare(entry, 1);
- if (!err)
- break;
- else if (err != -EEXIST)
- goto put_and_return;
-
- /*
- * Protect against a recursive call to swap_cache_alloc_folio()
- * on the same entry waiting forever here because SWAP_HAS_CACHE
- * is set but the folio is not the swap cache yet. This can
- * happen today if mem_cgroup_swapin_charge_folio() below
- * triggers reclaim through zswap, which may call
- * swap_cache_alloc_folio() in the writeback path.
- */
- if (skip_if_exists)
- goto put_and_return;
+ /* Check the swap cache again for readahead path. */
+ folio = swap_cache_get_folio(entry);
+ if (folio)
+ return folio;
- /*
- * We might race against __swap_cache_del_folio(), and
- * stumble across a swap_map entry whose SWAP_HAS_CACHE
- * has not yet been cleared. Or race against another
- * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
- * in swap_map, but not yet added its folio to swap cache.
- */
- schedule_timeout_uninterruptible(1);
- }
-
- /*
- * The swap entry is ours to swap in. Prepare the new folio.
- */
- __folio_set_locked(new_folio);
- __folio_set_swapbacked(new_folio);
-
- if (mem_cgroup_swapin_charge_folio(new_folio, NULL, gfp_mask, entry))
- goto fail_unlock;
-
- swap_cache_add_folio(new_folio, entry, &shadow);
- memcg1_swapin(entry, 1);
+ /* Skip allocation for unused swap slot for readahead path. */
+ if (!swap_entry_swapped(si, entry))
+ return NULL;
- if (shadow)
- workingset_refault(new_folio, shadow);
-
- /* Caller will initiate read into locked new_folio */
- folio_add_lru(new_folio);
- *new_page_allocated = true;
- folio = new_folio;
-got_folio:
- result = folio;
- goto put_and_return;
-
-fail_unlock:
- put_swap_folio(new_folio, entry);
- folio_unlock(new_folio);
-put_and_return:
- if (!(*new_page_allocated) && new_folio)
- folio_put(new_folio);
+ /* Allocate a new folio to be added into the swap cache. */
+ folio = folio_alloc_mpol(gfp_mask, 0, mpol, ilx, numa_node_id());
+ if (!folio)
+ return NULL;
+ /* Try add the new folio, returns existing folio or NULL on failure. */
+ result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
+ false, skip_if_exists);
+ if (result == folio)
+ *new_page_allocated = true;
+ else
+ folio_put(folio);
return result;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
2025-10-29 15:58 ` [PATCH 01/19] mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio Kairui Song
2025-10-29 15:58 ` [PATCH 02/19] mm, swap: split swap cache preparation loop into a standalone helper Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-04 3:47 ` Barry Song
2025-10-29 15:58 ` [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
` (17 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the overhead of the swap cache is trivial, bypassing the swap
cache is no longer a valid optimization, so unify the swapin path to use
the swap cache. This changes the swap-in behavior in multiple ways:
We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
the indicator to bypass both the swap cache and readahead. The swap
count check is not a good indicator for readahead. It existed because
the previous swap design strictly coupled readahead with swap cache
bypassing. We actually want to always bypass readahead for
SWP_SYNCHRONOUS_IO devices even when the swap count is > 1, but
bypassing the swap cache would cause redundant IO.
Now that limitation is gone: with the newly introduced helpers and design,
we always use the swap cache, so this check can be simplified to check
SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
SWP_SYNCHRONOUS_IO cases. This is a huge win for many workloads.
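In do_swap_page(), the swapin decision then collapses to roughly the
following (condensed from the mm/memory.c hunk below, no new names
introduced):
	if (!folio) {
		if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
			/* No readahead: allocate (possibly large) and swap in via the cache. */
			folio = alloc_swap_folio(vmf);
			if (folio) {
				swapcache = swapin_folio(entry, folio);
				if (swapcache != folio)
					folio_put(folio);
				folio = swapcache;
			}
		} else {
			folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
		}
	}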
The second change is that this enables large folio swap-in for all swap
entries on SWP_SYNCHRONOUS_IO devices. Previously, large swap-in was also
coupled with swap cache bypassing, so the count-check side effect also
made large swap-in less effective. Now this is fixed as well: large
swap-in is always supported for all SWP_SYNCHRONOUS_IO cases.
To catch potential issues with large swap-in, especially around page
exclusiveness and the swap cache, more debug sanity checks and comments
are added. But overall, the code is simpler. The new helper and routines
will be used by other components in later commits too, and it's now
possible to rely on the swap cache layer for resolving synchronization
issues, which will also be done in a later commit.
Worth mentioning that for a large folio workload, this may cause more
serious thrashing. That isn't a problem with this commit, but a generic
large folio issue. For a 4K workload, this commit improves performance.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 136 +++++++++++++++++++++-----------------------------------
mm/swap.h | 6 +++
mm/swap_state.c | 27 +++++++++++
3 files changed, 84 insertions(+), 85 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 4c3a7e09a159..9a43d4811781 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
+/* Sanity check that a folio is fully exclusive */
+static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
+ unsigned int nr_pages)
+{
+ do {
+ VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
+ entry.val++;
+ } while (--nr_pages);
+}
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
@@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct folio *swapcache, *folio = NULL;
- DECLARE_WAITQUEUE(wait, current);
+ struct folio *swapcache = NULL, *folio;
struct page *page;
struct swap_info_struct *si = NULL;
rmap_t rmap_flags = RMAP_NONE;
- bool need_clear_cache = false;
bool exclusive = false;
swp_entry_t entry;
pte_t pte;
vm_fault_t ret = 0;
- void *shadow = NULL;
int nr_pages;
unsigned long page_idx;
unsigned long address;
@@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio = swap_cache_get_folio(entry);
if (folio)
swap_update_readahead(folio, vma, vmf->address);
- swapcache = folio;
-
if (!folio) {
- if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
- __swap_count(entry) == 1) {
- /* skip swapcache */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO)) {
folio = alloc_swap_folio(vmf);
if (folio) {
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
- nr_pages = folio_nr_pages(folio);
- if (folio_test_large(folio))
- entry.val = ALIGN_DOWN(entry.val, nr_pages);
/*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread
- * may finish swapin first, free the entry, and
- * swapout reusing the same entry. It's
- * undetectable as pte_same() returns true due
- * to entry reuse.
+ * folio is charged, so swapin can only fail due
+ * to raced swapin and return NULL.
*/
- if (swapcache_prepare(entry, nr_pages)) {
- /*
- * Relax a bit to prevent rapid
- * repeated page faults.
- */
- add_wait_queue(&swapcache_wq, &wait);
- schedule_timeout_uninterruptible(1);
- remove_wait_queue(&swapcache_wq, &wait);
- goto out_page;
- }
- need_clear_cache = true;
-
- memcg1_swapin(entry, nr_pages);
-
- shadow = swap_cache_get_shadow(entry);
- if (shadow)
- workingset_refault(folio, shadow);
-
- folio_add_lru(folio);
-
- /* To provide entry to swap_read_folio() */
- folio->swap = entry;
- swap_read_folio(folio, NULL);
- folio->private = NULL;
+ swapcache = swapin_folio(entry, folio);
+ if (swapcache != folio)
+ folio_put(folio);
+ folio = swapcache;
}
} else {
- folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
- vmf);
- swapcache = folio;
+ folio = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vmf);
}
if (!folio) {
@@ -4779,6 +4748,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
}
+ swapcache = folio;
ret |= folio_lock_or_retry(folio, vmf);
if (ret & VM_FAULT_RETRY)
goto out_release;
@@ -4848,24 +4818,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}
- /* allocated large folios for SWP_SYNCHRONOUS_IO */
- if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
- unsigned long nr = folio_nr_pages(folio);
- unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
- unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
- pte_t *folio_ptep = vmf->pte - idx;
- pte_t folio_pte = ptep_get(folio_ptep);
-
- if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
- swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
- goto out_nomap;
-
- page_idx = idx;
- address = folio_start;
- ptep = folio_ptep;
- goto check_folio;
- }
-
nr_pages = 1;
page_idx = 0;
address = vmf->address;
@@ -4909,12 +4861,37 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
BUG_ON(!folio_test_anon(folio) && folio_test_mappedtodisk(folio));
BUG_ON(folio_test_anon(folio) && PageAnonExclusive(page));
+ /*
+ * If a large folio already belongs to anon mapping, then we
+ * can just go on and map it partially.
+ * If not, with the large swapin check above failing, the page table
+ * have changed, so sub pages might got charged to the wrong cgroup,
+ * or even should be shmem. So we have to free it and fallback.
+ * Nothing should have touched it, both anon and shmem checks if a
+ * large folio is fully appliable before use.
+ *
+ * This will be removed once we unify folio allocation in the swap cache
+ * layer, where allocation of a folio stabilizes the swap entries.
+ */
+ if (!folio_test_anon(folio) && folio_test_large(folio) &&
+ nr_pages != folio_nr_pages(folio)) {
+ if (!WARN_ON_ONCE(folio_test_dirty(folio)))
+ swap_cache_del_folio(folio);
+ goto out_nomap;
+ }
+
/*
* Check under PT lock (to protect against concurrent fork() sharing
* the swap entry concurrently) for certainly exclusive pages.
*/
if (!folio_test_ksm(folio)) {
+ /*
+ * The can_swapin_thp check above ensures all PTE have
+ * same exclusivenss, only check one PTE is fine.
+ */
exclusive = pte_swp_exclusive(vmf->orig_pte);
+ if (exclusive)
+ check_swap_exclusive(folio, entry, nr_pages);
if (folio != swapcache) {
/*
* We have a fresh page that is not exposed to the
@@ -4992,18 +4969,16 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->orig_pte = pte_advance_pfn(pte, page_idx);
/* ksm created a completely new copy */
- if (unlikely(folio != swapcache && swapcache)) {
+ if (unlikely(folio != swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
/*
- * We currently only expect small !anon folios which are either
- * fully exclusive or fully shared, or new allocated large
- * folios which are fully exclusive. If we ever get large
- * folios within swapcache here, we have to be careful.
+ * We currently only expect !anon folios that are fully
+ * mappable. See the comment after can_swapin_thp above.
*/
- VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
- VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
@@ -5043,12 +5018,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out:
- /* Clear the swap cache pin for direct swapin after PTL unlock */
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
@@ -5056,6 +5025,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
+ if (folio_test_swapcache(folio))
+ folio_free_swap(folio);
folio_unlock(folio);
out_release:
folio_put(folio);
@@ -5063,11 +5034,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(swapcache);
folio_put(swapcache);
}
- if (need_clear_cache) {
- swapcache_clear(si, entry, nr_pages);
- if (waitqueue_active(&swapcache_wq))
- wake_up(&swapcache_wq);
- }
if (si)
put_swap_device(si);
return ret;
diff --git a/mm/swap.h b/mm/swap.h
index 0fff92e42cfe..214e7d041030 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -268,6 +268,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct mempolicy *mpol, pgoff_t ilx);
struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
+struct folio *swapin_folio(swp_entry_t entry, struct folio *folio);
void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
unsigned long addr);
@@ -386,6 +387,11 @@ static inline struct folio *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask,
return NULL;
}
+static inline struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+{
+ return NULL;
+}
+
static inline void swap_update_readahead(struct folio *folio,
struct vm_area_struct *vma, unsigned long addr)
{
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d18ca765c04f..b3737c60aad9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -544,6 +544,33 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
return result;
}
+/**
+ * swapin_folio - swap-in one or multiple entries skipping readahead.
+ * @entry: starting swap entry to swap in
+ * @folio: a new allocated and charged folio
+ *
+ * Reads @entry into @folio, @folio will be added to the swap cache.
+ * If @folio is a large folio, the @entry will be rounded down to align
+ * with the folio size.
+ *
+ * Return: returns pointer to @folio on success. If folio is a large folio
+ * and this raced with another swapin, NULL will be returned. Else, if
+ * another folio was already added to the swap cache, return that swap
+ * cache folio instead.
+ */
+struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
+{
+ struct folio *swapcache;
+ pgoff_t offset = swp_offset(entry);
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
+ swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
+ if (swapcache == folio)
+ swap_read_folio(folio, NULL);
+ return swapcache;
+}
+
/*
* Locate a page of swap in physical memory, reserving swap cache space
* and reading the disk if it is not already cached.
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
2025-10-29 15:58 ` [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-11-04 3:47 ` Barry Song
2025-11-04 10:44 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 3:47 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now the overhead of the swap cache is trivial, bypassing the swap
> cache is no longer a valid optimization. So unify the swapin path using
> the swap cache. This changes the swap in behavior in multiple ways:
>
> We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
> the indicator to bypass both the swap cache and readahead. The swap
> count check is not a good indicator for readahead. It existed because
> the previously swap design made readahead strictly coupled with swap
> cache bypassing. We actually want to always bypass readahead for
> SWP_SYNCHRONOUS_IO devices even if swap count > 1, But bypassing the
> swap cache will cause redundant IO.
I suppose it's not only redundant I/O; it also causes additional memory
copies, since each swap-in allocates a new folio. Using the swapcache
allows the folio to be shared instead?
>
> Now that limitation is gone, with the new introduced helpers and design,
> we will always swap cache, so this check can be simplified to check
> SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
> SWP_SYNCHRONOUS_IO cases, this is a huge win for many workloads.
>
> The second thing here is that this enabled a large swap for all swap
> entries on SWP_SYNCHRONOUS_IO devices. Previously, the large swap in is
> also coupled with swap cache bypassing, and so the count checking side
> effect also makes large swap in less effective. Now this is also fixed.
> We will always have a large swap in support for all SWP_SYNCHRONOUS_IO
> cases.
>
In your cover letter, you mentioned: “it’s especially better for workloads
with swap count > 1 on SYNC_IO devices, about ~20% gain in the above test.”
Is this improvement mainly from mTHP swap-in?
> And to catch potential issues with large swap in, especially with page
> exclusiveness and swap cache, more debug sanity checks and comments are
> added. But overall, the code is simpler. And new helper and routines
> will be used by other components in later commits too. And now it's
> possible to rely on the swap cache layer for resolving synchronization
> issues, which will also be done by a later commit.
>
> Worth mentioning that for a large folio workload, this may cause more
> serious thrashing. This isn't a problem with this commit, but a generic
> large folio issue. For a 4K workload, this commit increases the
> performance.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 136 +++++++++++++++++++++-----------------------------------
> mm/swap.h | 6 +++
> mm/swap_state.c | 27 +++++++++++
> 3 files changed, 84 insertions(+), 85 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 4c3a7e09a159..9a43d4811781 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> +/* Sanity check that a folio is fully exclusive */
> +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> + unsigned int nr_pages)
> +{
> + do {
> + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
> + entry.val++;
> + } while (--nr_pages);
> +}
>
> /*
> * We enter with non-exclusive mmap_lock (to exclude vma changes,
> @@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> vm_fault_t do_swap_page(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> - struct folio *swapcache, *folio = NULL;
> - DECLARE_WAITQUEUE(wait, current);
> + struct folio *swapcache = NULL, *folio;
> struct page *page;
> struct swap_info_struct *si = NULL;
> rmap_t rmap_flags = RMAP_NONE;
> - bool need_clear_cache = false;
> bool exclusive = false;
> swp_entry_t entry;
> pte_t pte;
> vm_fault_t ret = 0;
> - void *shadow = NULL;
> int nr_pages;
> unsigned long page_idx;
> unsigned long address;
> @@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> folio = swap_cache_get_folio(entry);
> if (folio)
> swap_update_readahead(folio, vma, vmf->address);
> - swapcache = folio;
> -
I wonder if we should move swap_update_readahead() elsewhere. Since for
sync IO you’ve completely dropped readahead, why do we still need to call
update_readahead()?
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread* Re: [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
2025-11-04 3:47 ` Barry Song
@ 2025-11-04 10:44 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-11-04 10:44 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 11:47 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Now the overhead of the swap cache is trivial, bypassing the swap
> > cache is no longer a valid optimization. So unify the swapin path using
> > the swap cache. This changes the swap in behavior in multiple ways:
> >
> > We used to rely on `SWP_SYNCHRONOUS_IO && __swap_count(entry) == 1` as
> > the indicator to bypass both the swap cache and readahead. The swap
> > count check is not a good indicator for readahead. It existed because
> > the previously swap design made readahead strictly coupled with swap
> > cache bypassing. We actually want to always bypass readahead for
> > SWP_SYNCHRONOUS_IO devices even if swap count > 1, But bypassing the
> > swap cache will cause redundant IO.
>
> I suppose it’s not only redundant I/O, but also causes additional memory
> copies, as each swap-in allocates a new folio. Using swapcache allows the
> folio to be shared instead?
Thanks for the review!
Right, one thing I forgot to mention is that after this change, workloads
involving mTHP swapin are less likely to OOM; that's related.
>
> >
> > Now that limitation is gone, with the new introduced helpers and design,
> > we will always swap cache, so this check can be simplified to check
> > SWP_SYNCHRONOUS_IO only, effectively disabling readahead for all
> > SWP_SYNCHRONOUS_IO cases, this is a huge win for many workloads.
> >
> > The second thing here is that this enabled a large swap for all swap
> > entries on SWP_SYNCHRONOUS_IO devices. Previously, the large swap in is
> > also coupled with swap cache bypassing, and so the count checking side
> > effect also makes large swap in less effective. Now this is also fixed.
> > We will always have a large swap in support for all SWP_SYNCHRONOUS_IO
> > cases.
> >
>
> In your cover letter, you mentioned: “it’s especially better for workloads
> with swap count > 1 on SYNC_IO devices, about ~20% gain in the above test.”
> Is this improvement mainly from mTHP swap-in?
Mainly from bypassing readahead I think. mTHP swap-in might also help though.
> > And to catch potential issues with large swap in, especially with page
> > exclusiveness and swap cache, more debug sanity checks and comments are
> > added. But overall, the code is simpler. And new helper and routines
> > will be used by other components in later commits too. And now it's
> > possible to rely on the swap cache layer for resolving synchronization
> > issues, which will also be done by a later commit.
> >
> > Worth mentioning that for a large folio workload, this may cause more
> > serious thrashing. This isn't a problem with this commit, but a generic
> > large folio issue. For a 4K workload, this commit increases the
> > performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 136 +++++++++++++++++++++-----------------------------------
> > mm/swap.h | 6 +++
> > mm/swap_state.c | 27 +++++++++++
> > 3 files changed, 84 insertions(+), 85 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 4c3a7e09a159..9a43d4811781 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4613,7 +4613,15 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > }
> > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > -static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > +/* Sanity check that a folio is fully exclusive */
> > +static void check_swap_exclusive(struct folio *folio, swp_entry_t entry,
> > + unsigned int nr_pages)
> > +{
> > + do {
> > + VM_WARN_ON_ONCE_FOLIO(__swap_count(entry) != 1, folio);
> > + entry.val++;
> > + } while (--nr_pages);
> > +}
> >
> > /*
> > * We enter with non-exclusive mmap_lock (to exclude vma changes,
> > @@ -4626,17 +4634,14 @@ static DECLARE_WAIT_QUEUE_HEAD(swapcache_wq);
> > vm_fault_t do_swap_page(struct vm_fault *vmf)
> > {
> > struct vm_area_struct *vma = vmf->vma;
> > - struct folio *swapcache, *folio = NULL;
> > - DECLARE_WAITQUEUE(wait, current);
> > + struct folio *swapcache = NULL, *folio;
> > struct page *page;
> > struct swap_info_struct *si = NULL;
> > rmap_t rmap_flags = RMAP_NONE;
> > - bool need_clear_cache = false;
> > bool exclusive = false;
> > swp_entry_t entry;
> > pte_t pte;
> > vm_fault_t ret = 0;
> > - void *shadow = NULL;
> > int nr_pages;
> > unsigned long page_idx;
> > unsigned long address;
> > @@ -4707,57 +4712,21 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > folio = swap_cache_get_folio(entry);
> > if (folio)
> > swap_update_readahead(folio, vma, vmf->address);
> > - swapcache = folio;
> > -
>
> I wonder if we should move swap_update_readahead() elsewhere. Since for
> sync IO you’ve completely dropped readahead, why do we still need to call
> update_readahead()?
That's a very good suggestion; the overhead will be smaller too.
I'm not sure whether the code will get messy if we move this right now; let
me try, or maybe this optimization can be done later.
I do plan to defer the swap cache lookup inside swapin_readahead /
swapin_folio. We can do that now because swapin_folio requires the
caller to alloc a folio for THP swapin, so doing swap cache lookup
early helps to reduce memory overhead.
Once we unify swapin folio allocation for shmem / anon and always do
folio allocation with swap_cache_alloc_folio, everything will be
arranged in a nice way I think.
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (2 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 03/19] mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-04 4:19 ` Barry Song
2025-10-29 15:58 ` [PATCH 05/19] mm, swap: simplify the code and reduce indention Kairui Song
` (16 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now SWP_SYNCHRONOUS_IO devices also use the swap cache. One side
effect is that a folio may stay in the swap cache for a longer time due to
lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
are being swapped out again very frequently right after swapin, hence
improving performance. But the long pinning of swap slots also increases
the fragmentation rate of the swap device significantly, and currently
all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
causes the backing memory to be pinned, increasing memory pressure.
So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
after swapin finishes. By then the swap cache has served its role as a
synchronization layer preventing parallel swapins from wasting
CPU or memory allocations, and the redundant IO is not a major concern
for SWP_SYNCHRONOUS_IO devices.
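Concretely, this is just an early return added to should_try_to_free_swap(),
as in the mm/memory.c hunk below:
	if (!folio_test_swapcache(folio))
		return false;
	/* Always drop the swap cache right after swapin for SWP_SYNCHRONOUS_IO. */
	if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
		return true;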
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 9a43d4811781..78457347ae60 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
return 0;
}
-static inline bool should_try_to_free_swap(struct folio *folio,
+static inline bool should_try_to_free_swap(struct swap_info_struct *si,
+ struct folio *folio,
struct vm_area_struct *vma,
unsigned int fault_flags)
{
if (!folio_test_swapcache(folio))
return false;
+ /*
+ * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
+ * Redundant IO is unlikely to be an issue for them, but a
+ * slot being pinned by swap cache may cause more fragmentation
+ * and delayed freeing of swap metadata.
+ */
+ if (data_race(si->flags & SWP_SYNCHRONOUS_IO))
+ return true;
if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
folio_test_mlocked(folio))
return true;
@@ -4935,7 +4944,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* yet.
*/
swap_free_nr(entry, nr_pages);
- if (should_try_to_free_swap(folio, vma, vmf->flags))
+ if (should_try_to_free_swap(si, folio, vma, vmf->flags))
folio_free_swap(folio);
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-10-29 15:58 ` [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
@ 2025-11-04 4:19 ` Barry Song
2025-11-04 8:26 ` Barry Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 4:19 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> effect is that a folio may stay in swap cache for a longer time due to
> lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> are being swapped out very frequently right after swapin, hence improving
> the performance. But the long pinning of swap slots also increases the
> fragmentation rate of the swap device significantly, and currently,
> all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> causes the backing memory to be pinned, increasing the memory pressure.
>
> So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> after swapin finishes. Swap cache has served its role as a
> synchronization layer to prevent any parallel swapin from wasting
> CPU or memory allocation, and the redundant IO is not a major concern
> for SWP_SYNCHRONOUS_IO devices.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 9a43d4811781..78457347ae60 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> return 0;
> }
>
> -static inline bool should_try_to_free_swap(struct folio *folio,
> +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> + struct folio *folio,
> struct vm_area_struct *vma,
> unsigned int fault_flags)
> {
> if (!folio_test_swapcache(folio))
> return false;
> + /*
> + * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> + * Redundant IO is unlikely to be an issue for them, but a
> + * slot being pinned by swap cache may cause more fragmentation
> + * and delayed freeing of swap metadata.
> + */
I don’t like the claim about “redundant I/O” — it sounds misleading. Those
I/Os are not redundant; they are simply saved by the swapcache, which
prevents some swap-out I/O when a recently swapped-in folio is swapped out
again.
So, could we make it a bit more specific in both the comment and the commit
message?
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-11-04 4:19 ` Barry Song
@ 2025-11-04 8:26 ` Barry Song
2025-11-04 10:55 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 8:26 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Tue, Nov 4, 2025 at 12:19 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> > effect is that a folio may stay in swap cache for a longer time due to
> > lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> > are being swapped out very frequently right after swapin, hence improving
> > the performance. But the long pinning of swap slots also increases the
> > fragmentation rate of the swap device significantly, and currently,
> > all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> > causes the backing memory to be pinned, increasing the memory pressure.
> >
> > So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> > after swapin finishes. Swap cache has served its role as a
> > synchronization layer to prevent any parallel swapin from wasting
> > CPU or memory allocation, and the redundant IO is not a major concern
> > for SWP_SYNCHRONOUS_IO devices.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 13 +++++++++++--
> > 1 file changed, 11 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 9a43d4811781..78457347ae60 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > return 0;
> > }
> >
> > -static inline bool should_try_to_free_swap(struct folio *folio,
> > +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > + struct folio *folio,
> > struct vm_area_struct *vma,
> > unsigned int fault_flags)
> > {
> > if (!folio_test_swapcache(folio))
> > return false;
> > + /*
> > + * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> > + * Redundant IO is unlikely to be an issue for them, but a
> > + * slot being pinned by swap cache may cause more fragmentation
> > + * and delayed freeing of swap metadata.
> > + */
>
> I don’t like the claim about “redundant I/O” — it sounds misleading. Those
> I/Os are not redundant; they are simply saved by swapcache, which prevents
> some swap-out I/O when a recently swap-in folio is swapped out again.
>
> So, could we make it a bit more specific in both the comment and the commit
> message?
Sorry, on second thought—consider a case where process A mmaps 100 MB and writes
to it to populate memory, then forks process B. If that 100 MB gets swapped out,
and A and B later swap it in separately for reading, with this change it seems
they would each get their own 100 MB copy (total 2 × 100 MB), whereas previously
they could share the same 100 MB?
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
2025-11-04 8:26 ` Barry Song
@ 2025-11-04 10:55 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-11-04 10:55 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 4:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Nov 4, 2025 at 12:19 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > Now SWP_SYNCHRONOUS_IO devices are also using swap cache. One side
> > > effect is that a folio may stay in swap cache for a longer time due to
> > > lazy freeing (vm_swap_full()). This can help save some CPU / IO if folios
> > > are being swapped out very frequently right after swapin, hence improving
> > > the performance. But the long pinning of swap slots also increases the
> > > fragmentation rate of the swap device significantly, and currently,
> > > all in-tree SWP_SYNCHRONOUS_IO devices are RAM disks, so it also
> > > causes the backing memory to be pinned, increasing the memory pressure.
> > >
> > > So drop the swap cache immediately for SWP_SYNCHRONOUS_IO devices
> > > after swapin finishes. Swap cache has served its role as a
> > > synchronization layer to prevent any parallel swapin from wasting
> > > CPU or memory allocation, and the redundant IO is not a major concern
> > > for SWP_SYNCHRONOUS_IO devices.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/memory.c | 13 +++++++++++--
> > > 1 file changed, 11 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 9a43d4811781..78457347ae60 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4359,12 +4359,21 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > return 0;
> > > }
> > >
> > > -static inline bool should_try_to_free_swap(struct folio *folio,
> > > +static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > + struct folio *folio,
> > > struct vm_area_struct *vma,
> > > unsigned int fault_flags)
> > > {
> > > if (!folio_test_swapcache(folio))
> > > return false;
> > > + /*
> > > + * Try to free swap cache for SWP_SYNCHRONOUS_IO devices.
> > > + * Redundant IO is unlikely to be an issue for them, but a
> > > + * slot being pinned by swap cache may cause more fragmentation
> > > + * and delayed freeing of swap metadata.
> > > + */
> >
> > I don’t like the claim about “redundant I/O” — it sounds misleading. Those
> > I/Os are not redundant; they are simply saved by swapcache, which prevents
> > some swap-out I/O when a recently swap-in folio is swapped out again.
> >
> > So, could we make it a bit more specific in both the comment and the commit
> > message?
>
> Sorry, on second thought—consider a case where process A mmaps 100 MB and writes
> to it to populate memory, then forks process B. If that 100 MB gets swapped out,
> and A and B later swap it in separately for reading, with this change it seems
> they would each get their own 100 MB copy (total 2 × 100 MB), whereas previously
> they could share the same 100 MB?
It's a bit tricky here: folio_free_swap only frees the swap cache if a
folio's swap count is 0, so if A swaps these folios in first, the swap
cache won't be freed until B has also mapped these folios and dropped
the swap count.
And this function is called should_try_to_free_swap: it only tries to
free the swap cache, and the actual free happens only when the swap
count is 0. I think I can add some comments on that.
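For illustration, here is a minimal sketch of the rule being described
(a simplified, hypothetical helper, not the exact kernel code):

/*
 * Hypothetical condensation of the behaviour: the swap cache copy is
 * only dropped once no swap PTE still references the entry, i.e. the
 * swap count has already gone down to zero.
 */
static bool try_drop_swap_cache(struct folio *folio)
{
	if (!folio_test_swapcache(folio))
		return false;
	/* B has not swapped in yet -> swap count > 0 -> keep the cache */
	if (folio_swapped(folio))
		return false;
	swap_cache_del_folio(folio);	/* nobody else can need this copy */
	return true;
}

So in the fork example, A and B keep sharing the same folio through the
swap cache for as long as either of them still holds a swap entry.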
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 05/19] mm, swap: simplify the code and reduce indention
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (3 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 04/19] mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
` (15 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the swap cache is always used, multiple swap cache checks are no
longer useful. Remove them and reduce the code indention.
No behavior change.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 89 +++++++++++++++++++++++++++++--------------------------------
1 file changed, 43 insertions(+), 46 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 78457347ae60..6c5cd86c4a66 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4763,55 +4763,52 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_release;
page = folio_file_page(folio, swp_offset(entry));
- if (swapcache) {
- /*
- * Make sure folio_free_swap() or swapoff did not release the
- * swapcache from under us. The page pin, and pte_same test
- * below, are not enough to exclude that. Even if it is still
- * swapcache, we need to check that the page's swap has not
- * changed.
- */
- if (unlikely(!folio_matches_swap_entry(folio, entry)))
- goto out_page;
-
- if (unlikely(PageHWPoison(page))) {
- /*
- * hwpoisoned dirty swapcache pages are kept for killing
- * owner processes (which may be unknown at hwpoison time)
- */
- ret = VM_FAULT_HWPOISON;
- goto out_page;
- }
-
- /*
- * KSM sometimes has to copy on read faults, for example, if
- * folio->index of non-ksm folios would be nonlinear inside the
- * anon VMA -- the ksm flag is lost on actual swapout.
- */
- folio = ksm_might_need_to_copy(folio, vma, vmf->address);
- if (unlikely(!folio)) {
- ret = VM_FAULT_OOM;
- folio = swapcache;
- goto out_page;
- } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
- ret = VM_FAULT_HWPOISON;
- folio = swapcache;
- goto out_page;
- }
- if (folio != swapcache)
- page = folio_page(folio, 0);
+ /*
+ * Make sure folio_free_swap() or swapoff did not release the
+ * swapcache from under us. The page pin, and pte_same test
+ * below, are not enough to exclude that. Even if it is still
+ * swapcache, we need to check that the page's swap has not
+ * changed.
+ */
+ if (unlikely(!folio_matches_swap_entry(folio, entry)))
+ goto out_page;
+ if (unlikely(PageHWPoison(page))) {
/*
- * If we want to map a page that's in the swapcache writable, we
- * have to detect via the refcount if we're really the exclusive
- * owner. Try removing the extra reference from the local LRU
- * caches if required.
+ * hwpoisoned dirty swapcache pages are kept for killing
+ * owner processes (which may be unknown at hwpoison time)
*/
- if ((vmf->flags & FAULT_FLAG_WRITE) && folio == swapcache &&
- !folio_test_ksm(folio) && !folio_test_lru(folio))
- lru_add_drain();
+ ret = VM_FAULT_HWPOISON;
+ goto out_page;
}
+ /*
+ * KSM sometimes has to copy on read faults, for example, if
+ * folio->index of non-ksm folios would be nonlinear inside the
+ * anon VMA -- the ksm flag is lost on actual swapout.
+ */
+ folio = ksm_might_need_to_copy(folio, vma, vmf->address);
+ if (unlikely(!folio)) {
+ ret = VM_FAULT_OOM;
+ folio = swapcache;
+ goto out_page;
+ } else if (unlikely(folio == ERR_PTR(-EHWPOISON))) {
+ ret = VM_FAULT_HWPOISON;
+ folio = swapcache;
+ goto out_page;
+ } else if (folio != swapcache)
+ page = folio_page(folio, 0);
+
+ /*
+ * If we want to map a page that's in the swapcache writable, we
+ * have to detect via the refcount if we're really the exclusive
+ * owner. Try removing the extra reference from the local LRU
+ * caches if required.
+ */
+ if ((vmf->flags & FAULT_FLAG_WRITE) &&
+ !folio_test_ksm(folio) && !folio_test_lru(folio))
+ lru_add_drain();
+
folio_throttle_swaprate(folio, GFP_KERNEL);
/*
@@ -5001,7 +4998,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
pte, pte, nr_pages);
folio_unlock(folio);
- if (folio != swapcache && swapcache) {
+ if (unlikely(folio != swapcache)) {
/*
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
@@ -5039,7 +5036,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_unlock(folio);
out_release:
folio_put(folio);
- if (folio != swapcache && swapcache) {
+ if (folio != swapcache) {
folio_unlock(swapcache);
folio_put(swapcache);
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (4 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 05/19] mm, swap: simplify the code and reduce indention Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-04 9:14 ` Barry Song
2025-10-29 15:58 ` [PATCH 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
` (14 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
To prevent repeated faults of parallel swapin of the same PTE, remove
the folio from the swap cache after the folio is mapped. So any user
faulting from the swap PTE should see the folio in the swap cache and
wait on it.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/memory.c | 21 +++++++++++----------
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 6c5cd86c4a66..589d6fc3d424 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
static inline bool should_try_to_free_swap(struct swap_info_struct *si,
struct folio *folio,
struct vm_area_struct *vma,
+ unsigned int extra_refs,
unsigned int fault_flags)
{
if (!folio_test_swapcache(folio))
@@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
* reference only in case it's likely that we'll be the exclusive user.
*/
return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
- folio_ref_count(folio) == (1 + folio_nr_pages(folio));
+ folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
}
static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
@@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
arch_swap_restore(folio_swap(entry, folio), folio);
- /*
- * Remove the swap entry and conditionally try to free up the swapcache.
- * We're already holding a reference on the page but haven't mapped it
- * yet.
- */
- swap_free_nr(entry, nr_pages);
- if (should_try_to_free_swap(si, folio, vma, vmf->flags))
- folio_free_swap(folio);
-
add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
pte = mk_pte(page, vma->vm_page_prot);
@@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
arch_do_swap_page_nr(vma->vm_mm, vma, address,
pte, pte, nr_pages);
+ /*
+ * Remove the swap entry and conditionally try to free up the
+ * swapcache. Do it after mapping so any raced page fault will
+ * see the folio in swap cache and wait for us.
+ */
+ swap_free_nr(entry, nr_pages);
+ if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
+ folio_free_swap(folio);
+
folio_unlock(folio);
if (unlikely(folio != swapcache)) {
/*
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-10-29 15:58 ` [PATCH 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
@ 2025-11-04 9:14 ` Barry Song
2025-11-04 10:50 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-04 9:14 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> To prevent repeated faults of parallel swapin of the same PTE, remove
> the folio from the swap cache after the folio is mapped. So any user
> faulting from the swap PTE should see the folio in the swap cache and
> wait on it.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/memory.c | 21 +++++++++++----------
> 1 file changed, 11 insertions(+), 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 6c5cd86c4a66..589d6fc3d424 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> struct folio *folio,
> struct vm_area_struct *vma,
> + unsigned int extra_refs,
> unsigned int fault_flags)
> {
> if (!folio_test_swapcache(folio))
> @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> * reference only in case it's likely that we'll be the exclusive user.
> */
> return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> - folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> + folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
> }
>
> static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> */
> arch_swap_restore(folio_swap(entry, folio), folio);
>
> - /*
> - * Remove the swap entry and conditionally try to free up the swapcache.
> - * We're already holding a reference on the page but haven't mapped it
> - * yet.
> - */
> - swap_free_nr(entry, nr_pages);
> - if (should_try_to_free_swap(si, folio, vma, vmf->flags))
> - folio_free_swap(folio);
> -
> add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> pte = mk_pte(page, vma->vm_page_prot);
> @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> arch_do_swap_page_nr(vma->vm_mm, vma, address,
> pte, pte, nr_pages);
>
> + /*
> + * Remove the swap entry and conditionally try to free up the
> + * swapcache. Do it after mapping so any raced page fault will
> + * see the folio in swap cache and wait for us.
This seems like the right optimization—it reduces the race window where
we might allocate a folio, perform the read, and then attempt to map it,
only to find after taking the PTL that the PTE has already changed.
Although I am not entirely sure that “any raced page fault will see the folio in
swapcache,” it seems there could still be cases where a fault occurs after
folio_free_swap(), and thus can’t see the swapcache entry.
T1:
swap in PF, allocate and add swapcache, map PTE, delete swapcache
T2:
swap in PF before PTE is changed;
...........................................................;
check swapcache after T1 deletes swapcache -> no swapcache found.
> + */
> + swap_free_nr(entry, nr_pages);
> + if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
> + folio_free_swap(folio);
> +
> folio_unlock(folio);
> if (unlikely(folio != swapcache)) {
> /*
>
> --
> 2.51.1
>
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-11-04 9:14 ` Barry Song
@ 2025-11-04 10:50 ` Kairui Song
2025-11-04 19:52 ` Barry Song
0 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-11-04 10:50 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > To prevent repeated faults of parallel swapin of the same PTE, remove
> > the folio from the swap cache after the folio is mapped. So any user
> > faulting from the swap PTE should see the folio in the swap cache and
> > wait on it.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/memory.c | 21 +++++++++++----------
> > 1 file changed, 11 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6c5cd86c4a66..589d6fc3d424 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > struct folio *folio,
> > struct vm_area_struct *vma,
> > + unsigned int extra_refs,
> > unsigned int fault_flags)
> > {
> > if (!folio_test_swapcache(folio))
> > @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > * reference only in case it's likely that we'll be the exclusive user.
> > */
> > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > - folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > + folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
> > }
> >
> > static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> > @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > */
> > arch_swap_restore(folio_swap(entry, folio), folio);
> >
> > - /*
> > - * Remove the swap entry and conditionally try to free up the swapcache.
> > - * We're already holding a reference on the page but haven't mapped it
> > - * yet.
> > - */
> > - swap_free_nr(entry, nr_pages);
> > - if (should_try_to_free_swap(si, folio, vma, vmf->flags))
> > - folio_free_swap(folio);
> > -
> > add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > pte = mk_pte(page, vma->vm_page_prot);
> > @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > arch_do_swap_page_nr(vma->vm_mm, vma, address,
> > pte, pte, nr_pages);
> >
> > + /*
> > + * Remove the swap entry and conditionally try to free up the
> > + * swapcache. Do it after mapping so any raced page fault will
> > + * see the folio in swap cache and wait for us.
>
> This seems like the right optimization—it reduces the race window where we might
> allocate a folio, perform the read, and then attempt to map it, only
> to find after
> taking the PTL that the PTE has already changed.
>
> Although I am not entirely sure that “any raced page fault will see the folio in
> swapcache,” it seems there could still be cases where a fault occurs after
> folio_free_swap(), and thus can’t see the swapcache entry.
>
> T1:
> swap in PF, allocate and add swapcache, map PTE, delete swapcache
>
> T2:
> swap in PF before PTE is changed;
> ...........................................................;
> check swapcache after T1 deletes swapcache -> no swapcache found.
Right, that's true. But we will have at most one repeated fault,
and the time window is much smaller. T2 will see PTE != orig_pte and
then return just fine.
So this patch is only reducing the race time window for potentially
better performance, and this race is basically harmless anyway. I
think it's good enough.
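For reference, a rough sketch of why the repeated fault in T2 is
harmless (simplified from the fault path, error handling omitted):

/* T2, after it fails to find the folio in the swap cache */
vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
			       vmf->address, &vmf->ptl);
if (vmf->pte && !pte_same(ptep_get(vmf->pte), vmf->orig_pte)) {
	/* T1 already installed a present PTE, nothing left to do */
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return 0;
}

So the worst case is one extra trip through do_swap_page() that bails
out at the pte_same() check.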
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 06/19] mm, swap: free the swap cache after folio is mapped
2025-11-04 10:50 ` Kairui Song
@ 2025-11-04 19:52 ` Barry Song
0 siblings, 0 replies; 50+ messages in thread
From: Barry Song @ 2025-11-04 19:52 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Tue, Nov 4, 2025 at 6:51 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Nov 4, 2025 at 5:15 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:59 PM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > From: Kairui Song <kasong@tencent.com>
> > >
> > > To prevent repeated faults of parallel swapin of the same PTE, remove
> > > the folio from the swap cache after the folio is mapped. So any user
> > > faulting from the swap PTE should see the folio in the swap cache and
> > > wait on it.
> > >
> > > Signed-off-by: Kairui Song <kasong@tencent.com>
> > > ---
> > > mm/memory.c | 21 +++++++++++----------
> > > 1 file changed, 11 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6c5cd86c4a66..589d6fc3d424 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4362,6 +4362,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
> > > static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > struct folio *folio,
> > > struct vm_area_struct *vma,
> > > + unsigned int extra_refs,
> > > unsigned int fault_flags)
> > > {
> > > if (!folio_test_swapcache(folio))
> > > @@ -4384,7 +4385,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
> > > * reference only in case it's likely that we'll be the exclusive user.
> > > */
> > > return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
> > > - folio_ref_count(folio) == (1 + folio_nr_pages(folio));
> > > + folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
> > > }
> > >
> > > static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
> > > @@ -4935,15 +4936,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > */
> > > arch_swap_restore(folio_swap(entry, folio), folio);
> > >
> > > - /*
> > > - * Remove the swap entry and conditionally try to free up the swapcache.
> > > - * We're already holding a reference on the page but haven't mapped it
> > > - * yet.
> > > - */
> > > - swap_free_nr(entry, nr_pages);
> > > - if (should_try_to_free_swap(si, folio, vma, vmf->flags))
> > > - folio_free_swap(folio);
> > > -
> > > add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
> > > add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
> > > pte = mk_pte(page, vma->vm_page_prot);
> > > @@ -4997,6 +4989,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > > arch_do_swap_page_nr(vma->vm_mm, vma, address,
> > > pte, pte, nr_pages);
> > >
> > > + /*
> > > + * Remove the swap entry and conditionally try to free up the
> > > + * swapcache. Do it after mapping so any raced page fault will
> > > + * see the folio in swap cache and wait for us.
> >
> > This seems like the right optimization—it reduces the race window where we might
> > allocate a folio, perform the read, and then attempt to map it, only
> > to find after
> > taking the PTL that the PTE has already changed.
> >
> > Although I am not entirely sure that “any raced page fault will see the folio in
> > swapcache,” it seems there could still be cases where a fault occurs after
> > folio_free_swap(), and thus can’t see the swapcache entry.
> >
> > T1:
> > swap in PF, allocate and add swapcache, map PTE, delete swapcache
> >
> > T2:
> > swap in PF before PTE is changed;
> > ...........................................................;
> > check swapcache after T1 deletes swapcache -> no swapcache found.
>
> Right, that's true. But we will at most only have one repeated fault,
> and the time window is much smaller. T2 will PTE != orig_pte and then
> return just fine.
>
> So this patch is only reducing the race time window for a potentially
> better performance, and this race is basically harmless anyway. I
> think it's good enough.
Right. What I really disagree with is "Do it after mapping so any raced
page fault will see the folio in swap cache and wait for us". It sounds
like it guarantees no race at all, so I’d rather we change it to
something like "reduced race window".
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (5 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 06/19] mm, swap: free the swap cache after folio is mapped Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
` (13 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now that the overhead of the swap cache is trivial to none, bypassing
the swap cache is no longer a valid optimization.
We have removed the cache bypass swapin for anon memory, now do the same
for shmem. Many helpers and functions can be dropped now.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/shmem.c | 65 +++++++++++++++++------------------------------------------
mm/swap.h | 4 ----
mm/swapfile.c | 35 +++++++++-----------------------
3 files changed, 27 insertions(+), 77 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 6580f3cd24bb..759981435953 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2012,10 +2012,9 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
swp_entry_t entry, int order, gfp_t gfp)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ struct folio *new, *swapcache;
int nr_pages = 1 << order;
- struct folio *new;
gfp_t alloc_gfp;
- void *shadow;
/*
* We have arrived here because our zones are constrained, so don't
@@ -2055,34 +2054,19 @@ static struct folio *shmem_swap_alloc_folio(struct inode *inode,
goto fallback;
}
- /*
- * Prevent parallel swapin from proceeding with the swap cache flag.
- *
- * Of course there is another possible concurrent scenario as well,
- * that is to say, the swap cache flag of a large folio has already
- * been set by swapcache_prepare(), while another thread may have
- * already split the large swap entry stored in the shmem mapping.
- * In this case, shmem_add_to_page_cache() will help identify the
- * concurrent swapin and return -EEXIST.
- */
- if (swapcache_prepare(entry, nr_pages)) {
+ swapcache = swapin_folio(entry, new);
+ if (swapcache != new) {
folio_put(new);
- new = ERR_PTR(-EEXIST);
- /* Try smaller folio to avoid cache conflict */
- goto fallback;
+ if (!swapcache) {
+ /*
+ * The new folio is charged already, swapin can
+ * only fail due to another raced swapin.
+ */
+ new = ERR_PTR(-EEXIST);
+ goto fallback;
+ }
}
-
- __folio_set_locked(new);
- __folio_set_swapbacked(new);
- new->swap = entry;
-
- memcg1_swapin(entry, nr_pages);
- shadow = swap_cache_get_shadow(entry);
- if (shadow)
- workingset_refault(new, shadow);
- folio_add_lru(new);
- swap_read_folio(new, NULL);
- return new;
+ return swapcache;
fallback:
/* Order 0 swapin failed, nothing to fallback to, abort */
if (!order)
@@ -2172,8 +2156,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
}
static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
- struct folio *folio, swp_entry_t swap,
- bool skip_swapcache)
+ struct folio *folio, swp_entry_t swap)
{
struct address_space *mapping = inode->i_mapping;
swp_entry_t swapin_error;
@@ -2189,8 +2172,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
- if (!skip_swapcache)
- swap_cache_del_folio(folio);
+ swap_cache_del_folio(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
* won't be 0 when inode is released and thus trigger WARN_ON(i_blocks)
@@ -2289,7 +2271,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
swp_entry_t swap, index_entry;
struct swap_info_struct *si;
struct folio *folio = NULL;
- bool skip_swapcache = false;
int error, nr_pages, order;
pgoff_t offset;
@@ -2332,7 +2313,6 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
folio = NULL;
goto failed;
}
- skip_swapcache = true;
} else {
/* Cached swapin only supports order 0 folio */
folio = shmem_swapin_cluster(swap, gfp, info, index);
@@ -2388,9 +2368,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
* and swap cache folios are never partially freed.
*/
folio_lock(folio);
- if ((!skip_swapcache && !folio_test_swapcache(folio)) ||
- shmem_confirm_swap(mapping, index, swap) < 0 ||
- folio->swap.val != swap.val) {
+ if (!folio_matches_swap_entry(folio, swap) ||
+ shmem_confirm_swap(mapping, index, swap) < 0) {
error = -EEXIST;
goto unlock;
}
@@ -2422,12 +2401,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
- if (skip_swapcache) {
- folio->swap.val = 0;
- swapcache_clear(si, swap, nr_pages);
- } else {
- swap_cache_del_folio(folio);
- }
+ swap_cache_del_folio(folio);
folio_mark_dirty(folio);
swap_free_nr(swap, nr_pages);
put_swap_device(si);
@@ -2438,14 +2412,11 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (shmem_confirm_swap(mapping, index, swap) < 0)
error = -EEXIST;
if (error == -EIO)
- shmem_set_folio_swapin_error(inode, index, folio, swap,
- skip_swapcache);
+ shmem_set_folio_swapin_error(inode, index, folio, swap);
unlock:
if (folio)
folio_unlock(folio);
failed_nolock:
- if (skip_swapcache)
- swapcache_clear(si, folio->swap, folio_nr_pages(folio));
if (folio)
folio_put(folio);
put_swap_device(si);
diff --git a/mm/swap.h b/mm/swap.h
index 214e7d041030..e0f05babe13a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -403,10 +403,6 @@ static inline int swap_writeout(struct folio *folio,
return 0;
}
-static inline void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
-}
-
static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 849be32377d9..3898c3a2be62 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1613,22 +1613,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-static void swap_entries_put_cache(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- unsigned long offset = swp_offset(entry);
- struct swap_cluster_info *ci;
-
- ci = swap_cluster_lock(si, offset);
- if (swap_only_has_cache(si, offset, nr)) {
- swap_entries_free(si, ci, entry, nr);
- } else {
- for (int i = 0; i < nr; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
- }
- swap_cluster_unlock(ci);
-}
-
static bool swap_entries_put_map(struct swap_info_struct *si,
swp_entry_t entry, int nr)
{
@@ -1764,13 +1748,21 @@ void swap_free_nr(swp_entry_t entry, int nr_pages)
void put_swap_folio(struct folio *folio, swp_entry_t entry)
{
struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset = swp_offset(entry);
int size = 1 << swap_entry_order(folio_order(folio));
si = _swap_info_get(entry);
if (!si)
return;
- swap_entries_put_cache(si, entry, size);
+ ci = swap_cluster_lock(si, offset);
+ if (swap_only_has_cache(si, offset, size))
+ swap_entries_free(si, ci, entry, size);
+ else
+ for (int i = 0; i < size; i++, entry.val++)
+ swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ swap_cluster_unlock(ci);
}
int __swap_count(swp_entry_t entry)
@@ -3778,15 +3770,6 @@ int swapcache_prepare(swp_entry_t entry, int nr)
return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
}
-/*
- * Caller should ensure entries belong to the same folio so
- * the entries won't span cross cluster boundary.
- */
-void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr)
-{
- swap_entries_put_cache(si, entry, nr);
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (6 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 07/19] mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Kairui Song
` (12 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Nhat Pham <nphamcs@gmail.com>
The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a
("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry
belongs to shmem during swapoff.
However, swapoff has since been rewritten in the commit b56a2d8af914
("mm: rid swapoff of quadratic complexity"). Now having swap count ==
SWAP_MAP_SHMEM value is basically the same as having swap count == 1,
and swap_shmem_alloc() behaves analogously to swap_duplicate(). The only
difference of note is that swap_shmem_alloc() does not check for
-ENOMEM returned from __swap_duplicate(), but it is OK because shmem
never re-duplicates any swap entry it owns. This will still be safe if we
use (batched) swap_duplicate() instead.
This commit adds swap_duplicate_nr(), the batched variant of
swap_duplicate(), and removes the SWAP_MAP_SHMEM state and the
associated swap_shmem_alloc() helper to simplify the state machine (both
mentally and in terms of actual code). We will also have an extra
state/special value that can be repurposed (for swap entries that never
get re-duplicated).
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 15 +++++++--------
mm/shmem.c | 2 +-
mm/swapfile.c | 42 +++++++++++++++++-------------------------
3 files changed, 25 insertions(+), 34 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38ca3df68716..bf72b548a96d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -230,7 +230,6 @@ enum {
/* Special value in first swap_map */
#define SWAP_MAP_MAX 0x3e /* Max count */
#define SWAP_MAP_BAD 0x3f /* Note page is bad */
-#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */
/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX 0x7f /* Max count */
@@ -458,8 +457,7 @@ bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t, int);
-extern int swap_duplicate(swp_entry_t);
+extern int swap_duplicate_nr(swp_entry_t entry, int nr);
extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
@@ -514,11 +512,7 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
-{
-}
-
-static inline int swap_duplicate(swp_entry_t swp)
+static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
{
return 0;
}
@@ -569,6 +563,11 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
}
#endif /* CONFIG_SWAP */
+static inline int swap_duplicate(swp_entry_t entry)
+{
+ return swap_duplicate_nr(entry, 1);
+}
+
static inline void free_swap_and_cache(swp_entry_t entry)
{
free_swap_and_cache_nr(entry, 1);
diff --git a/mm/shmem.c b/mm/shmem.c
index 759981435953..46d54a1288fd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1665,7 +1665,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
spin_unlock(&shmem_swaplist_lock);
}
- swap_shmem_alloc(folio->swap, nr_pages);
+ swap_duplicate_nr(folio->swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
BUG_ON(folio_mapped(folio));
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3898c3a2be62..55362bb2a781 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -201,7 +201,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
unsigned char *map_end = map + nr_pages;
unsigned char count = *map;
- if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
+ if (swap_count(count) != 1)
return false;
while (++map < map_end) {
@@ -1522,12 +1522,6 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
if (usage == SWAP_HAS_CACHE) {
VM_BUG_ON(!has_cache);
has_cache = 0;
- } else if (count == SWAP_MAP_SHMEM) {
- /*
- * Or we could insist on shmem.c using a special
- * swap_shmem_free() and free_shmem_swap_and_cache()...
- */
- count = 0;
} else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
if (count == COUNT_CONTINUED) {
if (swap_count_continued(si, offset, count))
@@ -1625,7 +1619,7 @@ static bool swap_entries_put_map(struct swap_info_struct *si,
if (nr <= 1)
goto fallback;
count = swap_count(data_race(si->swap_map[offset]));
- if (count != 1 && count != SWAP_MAP_SHMEM)
+ if (count != 1)
goto fallback;
ci = swap_cluster_lock(si, offset);
@@ -1679,12 +1673,10 @@ static bool swap_entries_put_map_nr(struct swap_info_struct *si,
/*
* Check if it's the last ref of swap entry in the freeing path.
- * Qualified value includes 1, SWAP_HAS_CACHE or SWAP_MAP_SHMEM.
*/
static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
{
- return (count == SWAP_HAS_CACHE) || (count == 1) ||
- (count == SWAP_MAP_SHMEM);
+ return (count == SWAP_HAS_CACHE) || (count == 1);
}
/*
@@ -3672,7 +3664,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
offset = swp_offset(entry);
VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- VM_WARN_ON(usage == 1 && nr > 1);
ci = swap_cluster_lock(si, offset);
err = 0;
@@ -3732,27 +3723,28 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
return err;
}
-/*
- * Help swapoff by noting that swap entry belongs to shmem/tmpfs
- * (in which case its reference count is never incremented).
- */
-void swap_shmem_alloc(swp_entry_t entry, int nr)
-{
- __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
-}
-
-/*
- * Increase reference count of swap entry by 1.
+/**
+ * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
+ * by 1.
+ *
+ * @entry: first swap entry from which we want to increase the refcount.
+ * @nr: Number of entries in range.
+ *
* Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
* but could not be atomically allocated. Returns 0, just as if it succeeded,
* if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
* might occur if a page table entry has got corrupted.
+ *
+ * Note that we are currently not handling the case where nr > 1 and we need to
+ * add swap count continuation. This is OK, because no such user exists - shmem
+ * is the only user that can pass nr > 1, and it never re-duplicates any swap
+ * entry it owns.
*/
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate_nr(swp_entry_t entry, int nr)
{
int err = 0;
- while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
+ while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (7 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 08/19] mm/shmem, swap: remove SWAP_MAP_SHMEM Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic Kairui Song
` (11 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
When checking if a swap entry is swapped out, we simply check if the
bitwise result of the count value is larger than 0. But SWAP_MAP_BAD
will also be considered as a swap count value larger than 0.
SWAP_MAP_BAD being considered as a count value larger than 0 is useful
for the swap allocator: such slots are seen as used, so the
allocator will skip them. But for the swapped out check, this
isn't correct.
There is currently no observable issue. The swapped out check is only
useful for readahead and folio swapped-out status check. For readahead,
the swap cache layer will abort upon checking and updating the swap map.
For the folio swapped out status check, the swap allocator will never
allocate an entry of bad slots to folio, so that part is fine too. The
worst that could happen now is redundant allocation/freeing of folios
and waste CPU time.
This also makes it easier to get rid of swap map checking and update
during folio insertion in the swap cache layer.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 6 ++++--
mm/swap_state.c | 4 ++--
mm/swapfile.c | 22 +++++++++++-----------
3 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bf72b548a96d..936fa8f9e5f3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -466,7 +466,8 @@ int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
extern sector_t swapdev_block(int, pgoff_t);
extern int __swap_count(swp_entry_t entry);
-extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
+extern bool swap_entry_swapped(struct swap_info_struct *si,
+ unsigned long offset);
extern int swp_swapcount(swp_entry_t entry);
struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
@@ -535,7 +536,8 @@ static inline int __swap_count(swp_entry_t entry)
return 0;
}
-static inline bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
+static inline bool swap_entry_swapped(struct swap_info_struct *si,
+ unsigned long offset)
{
return false;
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b3737c60aad9..aaf8d202434d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -526,8 +526,8 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
if (folio)
return folio;
- /* Skip allocation for unused swap slot for readahead path. */
- if (!swap_entry_swapped(si, entry))
+ /* Skip allocation for unused and bad swap slot for readahead. */
+ if (!swap_entry_swapped(si, swp_offset(entry)))
return NULL;
/* Allocate a new folio to be added into the swap cache. */
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 55362bb2a781..d66141f1c452 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1765,21 +1765,21 @@ int __swap_count(swp_entry_t entry)
return swap_count(si->swap_map[offset]);
}
-/*
- * How many references to @entry are currently swapped out?
- * This does not give an exact answer when swap count is continued,
- * but does include the high COUNT_CONTINUED flag to allow for that.
+/**
+ * swap_entry_swapped - Check if the swap entry at @offset is swapped.
+ * @si: the swap device.
+ * @offset: offset of the swap entry.
*/
-bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry)
+bool swap_entry_swapped(struct swap_info_struct *si, unsigned long offset)
{
- pgoff_t offset = swp_offset(entry);
struct swap_cluster_info *ci;
int count;
ci = swap_cluster_lock(si, offset);
count = swap_count(si->swap_map[offset]);
swap_cluster_unlock(ci);
- return !!count;
+
+ return count && count != SWAP_MAP_BAD;
}
/*
@@ -1865,7 +1865,7 @@ static bool folio_swapped(struct folio *folio)
return false;
if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
- return swap_entry_swapped(si, entry);
+ return swap_entry_swapped(si, swp_offset(entry));
return swap_page_trans_huge_swapped(si, entry, folio_order(folio));
}
@@ -3671,10 +3671,10 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
count = si->swap_map[offset + i];
/*
- * swapin_readahead() doesn't check if a swap entry is valid, so the
- * swap entry could be SWAP_MAP_BAD. Check here with lock held.
+ * Allocator never allocates bad slots, and readahead is guarded
+ * by swap_entry_swapped.
*/
- if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
+ if (WARN_ON(swap_count(count) == SWAP_MAP_BAD)) {
err = -ENOENT;
goto unlock_out;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (8 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 09/19] mm, swap: swap entry of a bad slot should not be considered as swapped out Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-31 5:25 ` YoungJun Park
2025-10-29 15:58 ` [PATCH 11/19] mm, swap: split locked entry duplicating into a standalone helper Kairui Song
` (10 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Swap cluster cache reclaim requires releasing the lock, so some extra
checks are needed after the reclaim. To prepare for checking swap cache
using the swap table directly, consolidate the swap cluster reclaim and
check the logic.
Also, adjust it very slightly. By moving the cluster empty and usable
check into the reclaim helper, it will avoid a redundant scan of the
slots if the cluster is empty.
And always scan the whole region during reclaim, don't skip slots
covered by a reclaimed folio. Because the reclaim is lockless, it's
possible that new cache lands at any time. And for allocation, we want
all caches to be reclaimed to avoid fragmentation. And besides, if the
scan offset is not aligned with the size of the reclaimed folio, we are
skipping some existing caches.
There should be no observable behavior change, which might slightly
improve the fragmentation issue or performance.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
1 file changed, 23 insertions(+), 24 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index d66141f1c452..e4c521528817 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
return 0;
}
-static bool cluster_reclaim_range(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long start, unsigned long end)
+static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long start, unsigned int order)
{
+ unsigned int nr_pages = 1 << order;
+ unsigned long offset = start, end = start + nr_pages;
unsigned char *map = si->swap_map;
- unsigned long offset = start;
int nr_reclaim;
spin_unlock(&ci->lock);
do {
switch (READ_ONCE(map[offset])) {
case 0:
- offset++;
break;
case SWAP_HAS_CACHE:
nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
- if (nr_reclaim > 0)
- offset += nr_reclaim;
- else
+ if (nr_reclaim < 0)
goto out;
break;
default:
goto out;
}
- } while (offset < end);
+ } while (++offset < end);
out:
spin_lock(&ci->lock);
+
+ /*
+ * We just dropped ci->lock so cluster could be used by another
+ * order or got freed, check if it's still usable or empty.
+ */
+ if (!cluster_is_usable(ci, order))
+ return SWAP_ENTRY_INVALID;
+ if (cluster_is_empty(ci))
+ return cluster_offset(si, ci);
+
/*
* Recheck the range no matter reclaim succeeded or not, the slot
* could have been be freed while we are not holding the lock.
*/
for (offset = start; offset < end; offset++)
if (READ_ONCE(map[offset]))
- return false;
+ return SWAP_ENTRY_INVALID;
- return true;
+ return start;
}
static bool cluster_scan_range(struct swap_info_struct *si,
@@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
unsigned int nr_pages = 1 << order;
- bool need_reclaim, ret;
+ bool need_reclaim;
lockdep_assert_held(&ci->lock);
@@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
continue;
if (need_reclaim) {
- ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
- /*
- * Reclaim drops ci->lock and cluster could be used
- * by another order. Not checking flag as off-list
- * cluster has no flag set, and change of list
- * won't cause fragmentation.
- */
- if (!cluster_is_usable(ci, order))
- goto out;
- if (cluster_is_empty(ci))
- offset = start;
+ found = cluster_reclaim_range(si, ci, offset, order);
/* Reclaim failed but cluster is usable, try next */
- if (!ret)
+ if (!found)
continue;
+ offset = found;
}
if (!cluster_alloc_range(si, ci, offset, usage, order))
break;
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic
2025-10-29 15:58 ` [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic Kairui Song
@ 2025-10-31 5:25 ` YoungJun Park
2025-10-31 7:11 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: YoungJun Park @ 2025-10-31 5:25 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:36PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
>
Hello Kairui, great work on your patchwork. :)
> Swap cluster cache reclaim requires releasing the lock, so some extra
> checks are needed after the reclaim. To prepare for checking swap cache
> using the swap table directly, consolidate the swap cluster reclaim and
> check the logic.
>
> Also, adjust it very slightly. By moving the cluster empty and usable
> check into the reclaim helper, it will avoid a redundant scan of the
> slots if the cluster is empty.
This is Change 1
> And always scan the whole region during reclaim, don't skip slots
> covered by a reclaimed folio. Because the reclaim is lockless, it's
> possible that new cache lands at any time. And for allocation, we want
> all caches to be reclaimed to avoid fragmentation. And besides, if the
> scan offset is not aligned with the size of the reclaimed folio, we are
> skipping some existing caches.
This is Change 2
> There should be no observable behavior change, which might slightly
> improve the fragmentation issue or performance.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
> 1 file changed, 23 insertions(+), 24 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index d66141f1c452..e4c521528817 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> return 0;
> }
>
> -static bool cluster_reclaim_range(struct swap_info_struct *si,
> - struct swap_cluster_info *ci,
> - unsigned long start, unsigned long end)
> +static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
> + struct swap_cluster_info *ci,
> + unsigned long start, unsigned int order)
> {
> + unsigned int nr_pages = 1 << order;
> + unsigned long offset = start, end = start + nr_pages;
> unsigned char *map = si->swap_map;
> - unsigned long offset = start;
> int nr_reclaim;
>
> spin_unlock(&ci->lock);
> do {
> switch (READ_ONCE(map[offset])) {
> case 0:
> - offset++;
> break;
> case SWAP_HAS_CACHE:
> nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> - if (nr_reclaim > 0)
> - offset += nr_reclaim;
> - else
> + if (nr_reclaim < 0)
> goto out;
> break;
> default:
> goto out;
> }
> - } while (offset < end);
> + } while (++offset < end);
Change 2
> out:
> spin_lock(&ci->lock);
> +
> + /*
> + * We just dropped ci->lock so cluster could be used by another
> + * order or got freed, check if it's still usable or empty.
> + */
> + if (!cluster_is_usable(ci, order))
> + return SWAP_ENTRY_INVALID;
> + if (cluster_is_empty(ci))
> + return cluster_offset(si, ci);
> +
Change 1
> /*
> * Recheck the range no matter reclaim succeeded or not, the slot
> * could have been be freed while we are not holding the lock.
> */
> for (offset = start; offset < end; offset++)
> if (READ_ONCE(map[offset]))
> - return false;
> + return SWAP_ENTRY_INVALID;
>
> - return true;
> + return start;
> }
>
> static bool cluster_scan_range(struct swap_info_struct *si,
> @@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> unsigned int nr_pages = 1 << order;
> - bool need_reclaim, ret;
> + bool need_reclaim;
>
> lockdep_assert_held(&ci->lock);
>
> @@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
> continue;
> if (need_reclaim) {
> - ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> - /*
> - * Reclaim drops ci->lock and cluster could be used
> - * by another order. Not checking flag as off-list
> - * cluster has no flag set, and change of list
> - * won't cause fragmentation.
> - */
> - if (!cluster_is_usable(ci, order))
> - goto out;
> - if (cluster_is_empty(ci))
> - offset = start;
> + found = cluster_reclaim_range(si, ci, offset, order);
> /* Reclaim failed but cluster is usable, try next */
> - if (!ret)
Part of Change 1 (apply return value change)
As I understand it, Change 1 just removes a redundant check.
But I think another part changed as well.
(Maybe I don't fully understand the comment or something.)
cluster_reclaim_range can return SWAP_ENTRY_INVALID
if the cluster becomes unusable for the requested order
(!cluster_is_usable returns SWAP_ENTRY_INVALID),
and the caller then continues the loop to the next offset for another reclaim try.
Is this the intended behavior?
If this is the intended behavior, the comment:
/* Reclaim failed but cluster is usable, try next */
might be a bit misleading, as the cluster could be unusable in this
failure case. Perhaps it could be updated to reflect this?
Or does something else need to be changed?
(a cluster_is_usable function name change, etc.)
Thanks.
Youngjun Park
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic
2025-10-31 5:25 ` YoungJun Park
@ 2025-10-31 7:11 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-31 7:11 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Oct 31, 2025 at 1:25 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:36PM +0800, Kairui Song wrote:
>
> > From: Kairui Song <kasong@tencent.com>
> >
>
> Hello Kairui, great work on your patch series. :)
> > Swap cluster cache reclaim requires releasing the lock, so some extra
> > checks are needed after the reclaim. To prepare for checking swap cache
> > using the swap table directly, consolidate the swap cluster reclaim and
> > check the logic.
> >
> > Also, adjust it very slightly. By moving the cluster empty and usable
> > check into the reclaim helper, it will avoid a redundant scan of the
> > slots if the cluster is empty.
>
> This is Change 1
>
> > And always scan the whole region during reclaim, don't skip slots
> > covered by a reclaimed folio. Because the reclaim is lockless, it's
> > possible that new cache lands at any time. And for allocation, we want
> > all caches to be reclaimed to avoid fragmentation. And besides, if the
> > scan offset is not aligned with the size of the reclaimed folio, we are
> > skipping some existing caches.
>
> This is Change 2
>
> > There should be no observable behavior change, which might slightly
> > improve the fragmentation issue or performance.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swapfile.c | 47 +++++++++++++++++++++++------------------------
> > 1 file changed, 23 insertions(+), 24 deletions(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index d66141f1c452..e4c521528817 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -778,42 +778,50 @@ static int swap_cluster_setup_bad_slot(struct swap_cluster_info *cluster_info,
> > return 0;
> > }
> >
> > -static bool cluster_reclaim_range(struct swap_info_struct *si,
> > - struct swap_cluster_info *ci,
> > - unsigned long start, unsigned long end)
> > +static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
> > + struct swap_cluster_info *ci,
> > + unsigned long start, unsigned int order)
> > {
> > + unsigned int nr_pages = 1 << order;
> > + unsigned long offset = start, end = start + nr_pages;
> > unsigned char *map = si->swap_map;
> > - unsigned long offset = start;
> > int nr_reclaim;
> >
> > spin_unlock(&ci->lock);
> > do {
> > switch (READ_ONCE(map[offset])) {
> > case 0:
> > - offset++;
> > break;
> > case SWAP_HAS_CACHE:
> > nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> > - if (nr_reclaim > 0)
> > - offset += nr_reclaim;
> > - else
> > + if (nr_reclaim < 0)
> > goto out;
> > break;
> > default:
> > goto out;
> > }
> > - } while (offset < end);
> > + } while (++offset < end);
>
> Change 2
>
> > out:
> > spin_lock(&ci->lock);
> > +
> > + /*
> > + * We just dropped ci->lock so cluster could be used by another
> > + * order or got freed, check if it's still usable or empty.
> > + */
> > + if (!cluster_is_usable(ci, order))
> > + return SWAP_ENTRY_INVALID;
> > + if (cluster_is_empty(ci))
> > + return cluster_offset(si, ci);
> > +
>
> Change 1
>
> > /*
> > * Recheck the range no matter reclaim succeeded or not, the slot
> > * could have been be freed while we are not holding the lock.
> > */
> > for (offset = start; offset < end; offset++)
> > if (READ_ONCE(map[offset]))
> > - return false;
> > + return SWAP_ENTRY_INVALID;
> >
> > - return true;
> > + return start;
> > }
> >
> > static bool cluster_scan_range(struct swap_info_struct *si,
> > @@ -901,7 +909,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> > unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> > unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> > unsigned int nr_pages = 1 << order;
> > - bool need_reclaim, ret;
> > + bool need_reclaim;
> >
> > lockdep_assert_held(&ci->lock);
> >
> > @@ -913,20 +921,11 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> > if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
> > continue;
> > if (need_reclaim) {
> > - ret = cluster_reclaim_range(si, ci, offset, offset + nr_pages);
> > - /*
> > - * Reclaim drops ci->lock and cluster could be used
> > - * by another order. Not checking flag as off-list
> > - * cluster has no flag set, and change of list
> > - * won't cause fragmentation.
> > - */
> > - if (!cluster_is_usable(ci, order))
> > - goto out;
> > - if (cluster_is_empty(ci))
> > - offset = start;
> > + found = cluster_reclaim_range(si, ci, offset, order);
> > /* Reclaim failed but cluster is usable, try next */
> > - if (!ret)
>
> Part of Change 1 (apply return value change)
>
> As I understand it, Change 1 just removes a redundant check.
> But I think another part changed as well.
> (Maybe I don't fully understand the comment or something.)
>
> cluster_reclaim_range can return SWAP_ENTRY_INVALID
> if the cluster becomes unusable for the requested order
> (!cluster_is_usable returns SWAP_ENTRY_INVALID),
> and the caller then continues the loop to the next offset for another reclaim try.
> Is this the intended behavior?
Thanks for the very careful review! I should keep the
cluster_is_usable check or abort in other ways to avoid touching an
unusable cluster, will fix it.
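Roughly something like this, keeping the usability check in the caller so
an unusable cluster aborts the scan instead of being retried (just a
sketch, not the final fix):

		if (need_reclaim) {
			found = cluster_reclaim_range(si, ci, offset, order);
			if (!found) {
				/* Cluster taken by another order or freed. */
				if (!cluster_is_usable(ci, order))
					goto out;
				/* Reclaim failed but cluster still usable, try next. */
				continue;
			}
			offset = found;
		}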
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 11/19] mm, swap: split locked entry duplicating into a standalone helper
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (9 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 10/19] mm, swap: consolidate cluster reclaim and check logic Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
` (9 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
No feature change; split the common logic into a standalone helper to
be reused later.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 62 +++++++++++++++++++++++++++++------------------------------
1 file changed, 31 insertions(+), 31 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index e4c521528817..56054af12afd 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3646,26 +3646,14 @@ void si_swapinfo(struct sysinfo *val)
* - swap-cache reference is requested but the entry is not used. -> ENOENT
* - swap-mapped reference requested but needs continued swap count. -> ENOMEM
*/
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int swap_dup_entries(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset,
+ unsigned char usage, int nr)
{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset;
- unsigned char count;
- unsigned char has_cache;
- int err, i;
-
- si = swap_entry_to_info(entry);
- if (WARN_ON_ONCE(!si)) {
- pr_err("%s%08lx\n", Bad_file, entry.val);
- return -EINVAL;
- }
-
- offset = swp_offset(entry);
- VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- ci = swap_cluster_lock(si, offset);
+ int i;
+ unsigned char count, has_cache;
- err = 0;
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
@@ -3673,25 +3661,20 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* Allocator never allocates bad slots, and readahead is guarded
* by swap_entry_swapped.
*/
- if (WARN_ON(swap_count(count) == SWAP_MAP_BAD)) {
- err = -ENOENT;
- goto unlock_out;
- }
+ if (WARN_ON(swap_count(count) == SWAP_MAP_BAD))
+ return -ENOENT;
has_cache = count & SWAP_HAS_CACHE;
count &= ~SWAP_HAS_CACHE;
if (!count && !has_cache) {
- err = -ENOENT;
+ return -ENOENT;
} else if (usage == SWAP_HAS_CACHE) {
if (has_cache)
- err = -EEXIST;
+ return -EEXIST;
} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
- err = -EINVAL;
+ return -EINVAL;
}
-
- if (err)
- goto unlock_out;
}
for (i = 0; i < nr; i++) {
@@ -3710,14 +3693,31 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
* Don't need to rollback changes, because if
* usage == 1, there must be nr == 1.
*/
- err = -ENOMEM;
- goto unlock_out;
+ return -ENOMEM;
}
WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
}
-unlock_out:
+ return 0;
+}
+
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+{
+ int err;
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ unsigned long offset = swp_offset(entry);
+
+ si = swap_entry_to_info(entry);
+ if (WARN_ON_ONCE(!si)) {
+ pr_err("%s%08lx\n", Bad_file, entry.val);
+ return -EINVAL;
+ }
+
+ VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
+ ci = swap_cluster_lock(si, offset);
+ err = swap_dup_entries(si, ci, offset, usage, nr);
swap_cluster_unlock(ci);
return err;
}
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (10 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 11/19] mm, swap: split locked entry duplicating into a standalone helper Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 19:25 ` kernel test robot
2025-10-29 15:58 ` [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
` (8 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The current swap-in synchronization mostly uses the swap_map's
SWAP_HAS_CACHE bit. Whoever sets the bit first does the actual
work to swap in a folio.
This has been causing many issues, as it's just a poor implementation
of a bit lock. Raced users have no idea what is pinning a slot, so
they have to loop with a schedule_timeout_uninterruptible(1), which is
ugly and causes long-tail latency and other performance issues. Besides,
the abuse of SWAP_HAS_CACHE has been causing many other troubles for
synchronization and maintenance.
This is the first step toward removing this bit completely. It will also
free up one bit in the 8-bit swap count field.
We have just removed all swap-in paths that bypass the swap cache, and
now both the swap cache and the swap map are protected by the cluster
lock. So we can now resolve swap-in synchronization in the swap cache
layer directly, using the cluster lock. Whoever inserts a folio into the
swap cache first does the swap-in work. And because folios are locked
during swap operations, other raced users will simply wait on the folio
lock.
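To illustrate, swapin now roughly works like this (a simplified sketch,
with error handling and memcg charging omitted):

	lookup:
		folio = swap_cache_get_folio(entry);
		if (folio) {
			/* Raced users block on the folio lock until the
			 * winner finishes the read, no 1-tick sleep loop. */
			folio_lock(folio);
			return folio;
		}
		folio = folio_alloc(gfp, 0);
		__folio_set_locked(folio);
		__folio_set_swapbacked(folio);
		err = swap_cache_add_folio(folio, entry, &shadow, false);
		if (err) {
			folio_unlock(folio);
			folio_put(folio);
			if (err == -EEXIST)
				goto lookup;	/* lost the race, winner has it */
			return NULL;		/* entry was freed under us */
		}
		/* The winner does the read with the folio still locked. */
		swap_read_folio(folio, NULL);
		return folio;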
SWAP_HAS_CACHE will be removed in a later commit. For now, we still set
it for some remaining users, but the bit setting and the swap cache folio
insertion now happen in the same critical section, after the swap cache
is ready. No one has to spin on the SWAP_HAS_CACHE bit anymore.
This both simplifies the logic and should improve the performance,
eliminating issues like the one solved in commit 01626a1823024
("mm: avoid unconditional one-tick sleep when swapcache_prepare fails"),
or the "skip_if_exists" from commit a65b0e7607ccb
("zswap: make shrinking memcg-aware"), which will be removed very soon.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 6 ---
mm/swap.h | 14 ++++++-
mm/swap_state.c | 103 +++++++++++++++++++++++++++++----------------------
mm/swapfile.c | 39 ++++++++++++-------
mm/vmscan.c | 1 -
5 files changed, 95 insertions(+), 68 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 936fa8f9e5f3..69025b473472 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -458,7 +458,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
@@ -518,11 +517,6 @@ static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
return 0;
}
-static inline int swapcache_prepare(swp_entry_t swp, int nr)
-{
- return 0;
-}
-
static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
{
}
diff --git a/mm/swap.h b/mm/swap.h
index e0f05babe13a..3cd99850bbaf 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -234,6 +234,14 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
return folio_entry.val == round_down(entry.val, nr_pages);
}
+/* Temporary internal helpers */
+void __swapcache_set_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry);
+void __swapcache_clear_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int nr);
+
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stablize the device by any of the following ways:
@@ -247,7 +255,8 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
*/
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadow, bool alloc);
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx,
@@ -413,7 +422,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
-static inline void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow)
+static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadow, bool alloc)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index aaf8d202434d..2d53e3b5e8e9 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -128,34 +128,66 @@ void *swap_cache_get_shadow(swp_entry_t entry)
* @entry: The swap entry corresponding to the folio.
* @gfp: gfp_mask for XArray node allocation.
* @shadowp: If a shadow is found, return the shadow.
+ * @alloc: If it's the allocator that is trying to insert a folio. Allocator
+ * sets SWAP_HAS_CACHE to pin slots before insert so skip map update.
*
* Context: Caller must ensure @entry is valid and protect the swap device
* with reference count or locks.
* The caller also needs to update the corresponding swap_map slots with
* SWAP_HAS_CACHE bit to avoid race or conflict.
*/
-void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp)
+int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadowp, bool alloc)
{
+ int err;
void *shadow = NULL;
+ struct swap_info_struct *si;
unsigned long old_tb, new_tb;
struct swap_cluster_info *ci;
- unsigned int ci_start, ci_off, ci_end;
+ unsigned int ci_start, ci_off, ci_end, offset;
unsigned long nr_pages = folio_nr_pages(folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+ si = __swap_entry_to_info(entry);
new_tb = folio_to_swp_tb(folio);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
ci_off = ci_start;
- ci = swap_cluster_lock(__swap_entry_to_info(entry), swp_offset(entry));
+ offset = swp_offset(entry);
+ ci = swap_cluster_lock(si, swp_offset(entry));
+ if (unlikely(!ci->table)) {
+ err = -ENOENT;
+ goto failed;
+ }
do {
- old_tb = __swap_table_xchg(ci, ci_off, new_tb);
- WARN_ON_ONCE(swp_tb_is_folio(old_tb));
+ old_tb = __swap_table_get(ci, ci_off);
+ if (unlikely(swp_tb_is_folio(old_tb))) {
+ err = -EEXIST;
+ goto failed;
+ }
+ if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
+ err = -ENOENT;
+ goto failed;
+ }
if (swp_tb_is_shadow(old_tb))
shadow = swp_tb_to_shadow(old_tb);
+ offset++;
+ } while (++ci_off < ci_end);
+
+ ci_off = ci_start;
+ offset = swp_offset(entry);
+ do {
+ /*
+ * Still need to pin the slots with SWAP_HAS_CACHE since
+ * swap allocator depends on that.
+ */
+ if (!alloc)
+ __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
+ __swap_table_set(ci, ci_off, new_tb);
+ offset++;
} while (++ci_off < ci_end);
folio_ref_add(folio, nr_pages);
@@ -168,6 +200,11 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
if (shadowp)
*shadowp = shadow;
+ return 0;
+
+failed:
+ swap_cluster_unlock(ci);
+ return err;
}
/**
@@ -186,6 +223,7 @@ void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadowp
void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
swp_entry_t entry, void *shadow)
{
+ struct swap_info_struct *si;
unsigned long old_tb, new_tb;
unsigned int ci_start, ci_off, ci_end;
unsigned long nr_pages = folio_nr_pages(folio);
@@ -195,6 +233,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
VM_WARN_ON_ONCE_FOLIO(folio_test_writeback(folio), folio);
+ si = __swap_entry_to_info(entry);
new_tb = shadow_swp_to_tb(shadow);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
@@ -210,6 +249,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
folio_clear_swapcache(folio);
node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
+ __swapcache_clear_cached(si, ci, entry, nr_pages);
}
/**
@@ -231,7 +271,6 @@ void swap_cache_del_folio(struct folio *folio)
__swap_cache_del_folio(ci, folio, entry, NULL);
swap_cluster_unlock(ci);
- put_swap_folio(folio, entry);
folio_ref_sub(folio, folio_nr_pages(folio));
}
@@ -423,67 +462,37 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
gfp_t gfp, bool charged,
bool skip_if_exists)
{
- struct folio *swapcache;
+ struct folio *swapcache = NULL;
void *shadow;
int ret;
- /*
- * Check and pin the swap map with SWAP_HAS_CACHE, then add the folio
- * into the swap cache. Loop with a schedule delay if raced with
- * another process setting SWAP_HAS_CACHE. This hackish loop will
- * be fixed very soon.
- */
+ __folio_set_locked(folio);
+ __folio_set_swapbacked(folio);
for (;;) {
- ret = swapcache_prepare(entry, folio_nr_pages(folio));
+ ret = swap_cache_add_folio(folio, entry, &shadow, false);
if (!ret)
break;
/*
- * The skip_if_exists is for protecting against a recursive
- * call to this helper on the same entry waiting forever
- * here because SWAP_HAS_CACHE is set but the folio is not
- * in the swap cache yet. This can happen today if
- * mem_cgroup_swapin_charge_folio() below triggers reclaim
- * through zswap, which may call this helper again in the
- * writeback path.
- *
- * Large order allocation also needs special handling on
+ * Large order allocation needs special handling on
* race: if a smaller folio exists in cache, swapin needs
* to fallback to order 0, and doing a swap cache lookup
* might return a folio that is irrelevant to the faulting
* entry because @entry is aligned down. Just return NULL.
*/
if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
- return NULL;
+ goto failed;
- /*
- * Check the swap cache again, we can only arrive
- * here because swapcache_prepare returns -EEXIST.
- */
swapcache = swap_cache_get_folio(entry);
if (swapcache)
- return swapcache;
-
- /*
- * We might race against __swap_cache_del_folio(), and
- * stumble across a swap_map entry whose SWAP_HAS_CACHE
- * has not yet been cleared. Or race against another
- * swap_cache_alloc_folio(), which has set SWAP_HAS_CACHE
- * in swap_map, but not yet added its folio to swap cache.
- */
- schedule_timeout_uninterruptible(1);
+ goto failed;
}
- __folio_set_locked(folio);
- __folio_set_swapbacked(folio);
-
if (!charged && mem_cgroup_swapin_charge_folio(folio, NULL, gfp, entry)) {
- put_swap_folio(folio, entry);
- folio_unlock(folio);
- return NULL;
+ swap_cache_del_folio(folio);
+ goto failed;
}
- swap_cache_add_folio(folio, entry, &shadow);
memcg1_swapin(entry, folio_nr_pages(folio));
if (shadow)
workingset_refault(folio, shadow);
@@ -491,6 +500,10 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
/* Caller will initiate read into locked folio */
folio_add_lru(folio);
return folio;
+
+failed:
+ folio_unlock(folio);
+ return swapcache;
}
/**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 56054af12afd..415db36d85d3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1461,7 +1461,11 @@ int folio_alloc_swap(struct folio *folio)
if (!entry.val)
return -ENOMEM;
- swap_cache_add_folio(folio, entry, NULL);
+ /*
+ * Allocator has pinned the slots with SWAP_HAS_CACHE
+ * so it should never fail
+ */
+ WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
return 0;
@@ -1567,9 +1571,8 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
* do_swap_page()
* ... swapoff+swapon
* swap_cache_alloc_folio()
- * swapcache_prepare()
- * __swap_duplicate()
- * // check swap_map
+ * swap_cache_add_folio()
+ * // check swap_map
* // verify PTE not changed
*
* In __swap_duplicate(), the swap_map need to be checked before
@@ -3748,17 +3751,25 @@ int swap_duplicate_nr(swp_entry_t entry, int nr)
return err;
}
-/*
- * @entry: first swap entry from which we allocate nr swap cache.
- *
- * Called when allocating swap cache for existing swap entries,
- * This can return error codes. Returns 0 at success.
- * -EEXIST means there is a swap cache.
- * Note: return code is different from swap_duplicate().
- */
-int swapcache_prepare(swp_entry_t entry, int nr)
+/* Mark the swap map as HAS_CACHE, caller needs to hold the cluster lock */
+void __swapcache_set_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry)
+{
+ WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1));
+}
+
+/* Clear the swap map as !HAS_CACHE, caller needs to hold the cluster lock */
+void __swapcache_clear_cached(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ swp_entry_t entry, unsigned int nr)
{
- return __swap_duplicate(entry, SWAP_HAS_CACHE, nr);
+ if (swap_only_has_cache(si, swp_offset(entry), nr)) {
+ swap_entries_free(si, ci, entry, nr);
+ } else {
+ for (int i = 0; i < nr; i++, entry.val++)
+ swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ }
}
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e74a2807930..76b9c21a7fe2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -762,7 +762,6 @@ static int __remove_mapping(struct address_space *mapping, struct folio *folio,
__swap_cache_del_folio(ci, folio, swap, shadow);
memcg1_swapout(folio, swap);
swap_cluster_unlock_irq(ci);
- put_swap_folio(folio, swap);
} else {
void (*free_folio)(struct folio *);
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer
2025-10-29 15:58 ` [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
@ 2025-10-29 19:25 ` kernel test robot
0 siblings, 0 replies; 50+ messages in thread
From: kernel test robot @ 2025-10-29 19:25 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner,
Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins,
Baolin Wang, Huang, Ying, Kemeng Shi, Lorenzo Stoakes,
Matthew Wilcox (Oracle), linux-kernel, Kairui Song
Hi Kairui,
kernel test robot noticed the following build warnings:
[auto build test WARNING on f30d294530d939fa4b77d61bc60f25c4284841fa]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
base: f30d294530d939fa4b77d61bc60f25c4284841fa
patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-12-3d43f3b6ec32%40tencent.com
patch subject: [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer
config: arm-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300338.GvcdaiCz-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project d1c086e82af239b245fe8d7832f2753436634990)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300338.GvcdaiCz-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510300338.GvcdaiCz-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from mm/filemap.c:66:
>> mm/swap.h:428:1: warning: non-void function does not return a value [-Wreturn-type]
428 | }
| ^
1 warning generated.
--
In file included from mm/gup.c:29:
>> mm/swap.h:428:1: warning: non-void function does not return a value [-Wreturn-type]
428 | }
| ^
mm/gup.c:74:29: warning: unused function 'try_get_folio' [-Wunused-function]
74 | static inline struct folio *try_get_folio(struct page *page, int refs)
| ^~~~~~~~~~~~~
2 warnings generated.
vim +428 mm/swap.h
014bb1de4fc17d5 NeilBrown 2022-05-09 424
2eaa2d7ed6e0caa Kairui Song 2025-10-29 425 static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
2eaa2d7ed6e0caa Kairui Song 2025-10-29 426 void **shadow, bool alloc)
014bb1de4fc17d5 NeilBrown 2022-05-09 427 {
014bb1de4fc17d5 NeilBrown 2022-05-09 @428 }
014bb1de4fc17d5 NeilBrown 2022-05-09 429
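The warning points at the new !CONFIG_SWAP stub of swap_cache_add_folio(),
which is now declared to return int but still has an empty body; presumably
the stub just needs to return a value, for example:

static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
				       void **shadow, bool alloc)
{
	return -EINVAL;
}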
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (11 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 12/19] mm, swap: use swap cache as the swap in synchronize layer Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-07 3:07 ` Barry Song
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
` (7 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Remove the "skip if exists" check from commit a65b0e7607ccb ("zswap:
make shrinking memcg-aware"). It was needed because there used to be a
tiny time window between setting the SWAP_HAS_CACHE bit and actually
adding the folio to the swap cache. If one user tried to add a folio to
the swap cache while another user had set SWAP_HAS_CACHE but had not yet
added its folio to the swap cache, it could lead to a deadlock.
We have moved the bit setting to the same critical section as adding the
folio, so this is no longer needed. Remove it and clean it up.
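In other words, the ordering changed roughly like this (simplified):

  Before:                               After:
    swapcache_prepare()                   swap_cluster_lock(ci)
      set SWAP_HAS_CACHE                    check swap count / swap table
    <window: bit set, no folio yet>         set SWAP_HAS_CACHE
    swap_cache_add_folio()                  install folio in the swap table
      folio becomes visible               swap_cluster_unlock(ci)

A zswap writeback recursing into swapin during that old window could wait
forever; now a racer sees either no cache at all or a locked folio.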
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 2 +-
mm/swap_state.c | 27 ++++++++++-----------------
mm/zswap.c | 2 +-
3 files changed, 12 insertions(+), 19 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 3cd99850bbaf..a3c5f2dca0d5 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -260,7 +260,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx,
- bool *alloced, bool skip_if_exists);
+ bool *alloced);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_del_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry, void *shadow);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 2d53e3b5e8e9..d2bcca92b6e0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -447,8 +447,6 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
* @folio: folio to be added.
* @gfp: memory allocation flags for charge, can be 0 if @charged if true.
* @charged: if the folio is already charged.
- * @skip_if_exists: if the slot is in a cached state, return NULL.
- * This is an old workaround that will be removed shortly.
*
* Update the swap_map and add folio as swap cache, typically before swapin.
* All swap slots covered by the folio must have a non-zero swap count.
@@ -459,8 +457,7 @@ void swap_update_readahead(struct folio *folio, struct vm_area_struct *vma,
*/
static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
struct folio *folio,
- gfp_t gfp, bool charged,
- bool skip_if_exists)
+ gfp_t gfp, bool charged)
{
struct folio *swapcache = NULL;
void *shadow;
@@ -480,7 +477,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
* might return a folio that is irrelevant to the faulting
* entry because @entry is aligned down. Just return NULL.
*/
- if (ret != -EEXIST || skip_if_exists || folio_test_large(folio))
+ if (ret != -EEXIST || folio_test_large(folio))
goto failed;
swapcache = swap_cache_get_folio(entry);
@@ -513,8 +510,6 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
* @mpol: NUMA memory allocation policy to be applied
* @ilx: NUMA interleave index, for use only when MPOL_INTERLEAVE
* @new_page_allocated: sets true if allocation happened, false otherwise
- * @skip_if_exists: if the slot is a partially cached state, return NULL.
- * This is a workaround that would be removed shortly.
*
* Allocate a folio in the swap cache for one swap slot, typically before
* doing IO (swap in or swap out). The swap slot indicated by @entry must
@@ -526,8 +521,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
*/
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
struct mempolicy *mpol, pgoff_t ilx,
- bool *new_page_allocated,
- bool skip_if_exists)
+ bool *new_page_allocated)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
struct folio *folio;
@@ -548,8 +542,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
if (!folio)
return NULL;
/* Try add the new folio, returns existing folio or NULL on failure. */
- result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
- false, skip_if_exists);
+ result = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
if (result == folio)
*new_page_allocated = true;
else
@@ -578,7 +571,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
unsigned long nr_pages = folio_nr_pages(folio);
entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
- swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
+ swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true);
if (swapcache == folio)
swap_read_folio(folio, NULL);
return swapcache;
@@ -606,7 +599,7 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
mpol = get_vma_policy(vma, addr, 0, &ilx);
folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
mpol_cond_put(mpol);
if (page_allocated)
@@ -725,7 +718,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
/* Ok, do the async read-ahead now */
folio = swap_cache_alloc_folio(
swp_entry(swp_type(entry), offset), gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (!folio)
continue;
if (page_allocated) {
@@ -743,7 +736,7 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
skip:
/* The page was likely read above, so no need for plugging here */
folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
return folio;
@@ -838,7 +831,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
pte_unmap(pte);
pte = NULL;
folio = swap_cache_alloc_folio(entry, gfp_mask, mpol, ilx,
- &page_allocated, false);
+ &page_allocated);
if (!folio)
continue;
if (page_allocated) {
@@ -858,7 +851,7 @@ static struct folio *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
skip:
/* The folio was likely read above, so no need for plugging here */
folio = swap_cache_alloc_folio(targ_entry, gfp_mask, mpol, targ_ilx,
- &page_allocated, false);
+ &page_allocated);
if (unlikely(page_allocated))
swap_read_folio(folio, NULL);
return folio;
diff --git a/mm/zswap.c b/mm/zswap.c
index a7a2443912f4..d8a33db9d3cc 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1015,7 +1015,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
mpol = get_task_policy(current);
folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, mpol,
- NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
+ NO_INTERLEAVE_INDEX, &folio_was_allocated);
put_swap_device(si);
if (!folio)
return -ENOMEM;
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state
2025-10-29 15:58 ` [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
@ 2025-11-07 3:07 ` Barry Song
0 siblings, 0 replies; 50+ messages in thread
From: Barry Song @ 2025-11-07 3:07 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
> struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
> struct mempolicy *mpol, pgoff_t ilx,
> - bool *new_page_allocated,
> - bool skip_if_exists)
> + bool *new_page_allocated)
> {
> struct swap_info_struct *si = __swap_entry_to_info(entry);
> struct folio *folio;
> @@ -548,8 +542,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_mask,
> if (!folio)
> return NULL;
> /* Try add the new folio, returns existing folio or NULL on failure. */
> - result = __swap_cache_prepare_and_add(entry, folio, gfp_mask,
> - false, skip_if_exists);
> + result = __swap_cache_prepare_and_add(entry, folio, gfp_mask, false);
> if (result == folio)
> *new_page_allocated = true;
> else
> @@ -578,7 +571,7 @@ struct folio *swapin_folio(swp_entry_t entry, struct folio *folio)
> unsigned long nr_pages = folio_nr_pages(folio);
>
> entry = swp_entry(swp_type(entry), round_down(offset, nr_pages));
> - swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true, false);
> + swapcache = __swap_cache_prepare_and_add(entry, folio, 0, true);
> if (swapcache == folio)
> swap_read_folio(folio, NULL);
> return swapcache;
I wonder if we could also drop the "charged" argument; it doesn't seem
difficult to move the charging step before
__swap_cache_prepare_and_add(), even for swap_cache_alloc_folio()?
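Something like this, maybe (untested, just to illustrate the idea of
charging up front so the flag can go away):

	/* in swap_cache_alloc_folio(), right after allocating the folio: */
	if (mem_cgroup_swapin_charge_folio(folio, NULL, gfp_mask, entry)) {
		folio_put(folio);
		return NULL;
	}
	/* ... __swap_cache_prepare_and_add() then no longer needs @charged */
	result = __swap_cache_prepare_and_add(entry, folio, gfp_mask);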
Thanks
Barry
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (12 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 13/19] mm, swap: remove workaround for unsynchronized swap map cache state Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 19:25 ` kernel test robot
` (2 more replies)
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
` (6 subsequent siblings)
20 siblings, 3 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The current swap entry allocation/freeing workflow has never had a clear
definition. This makes it hard to debug or add new optimizations.
This commit introduces a proper definition of how swap entries are
allocated and freed. Now, most operations are folio based, so they will
never exceed one swap cluster, and we have a cleaner border between
swap and the rest of mm, making it much easier to follow and debug,
especially with the newly added sanity checks. This also makes more
optimizations possible.
Swap entries will mostly be allocated and freed with a folio bound to them.
The folio lock is useful for resolving many swap related races.
Now swap allocation (except hibernation) always starts with a folio in
the swap cache, and gets duped/freed protected by the folio lock:
- folio_alloc_swap() - The only allocation entry point now.
Context: The folio must be locked.
This allocates one or a set of continuous swap slots for a folio and
binds them to the folio by adding the folio to the swap cache. The
swap slots' swap count start with zero value.
- folio_dup_swap() - Increase the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This increases the ref count of swap entries allocated to a folio.
The count of newly allocated swap slots has to be increased by this
helper as the folio gets unmapped (and swap entries get installed).
- folio_put_swap() - Decrease the swap count of one or more entries.
Context: The folio must be locked and in the swap cache. For now, the
caller still has to lock the new swap entry owner (e.g., PTL).
This decreases the ref count of swap entries allocated to a folio.
Typically, swapin decreases the swap count as the folio gets
installed back and the swap entry gets uninstalled.
This won't remove the folio from the swap cache or free the
slot. Lazy freeing of swap cache is helpful for reducing IO.
There is already a folio_free_swap() for immediate cache reclaim.
This part could be further optimized later.
The above locking constraints could be further relaxed once the swap
table is fully implemented. Currently, dup still needs the caller
to lock the swap entry container (e.g. PTL), or a concurrent zap
may underflow the swap count.
Some swap users need to interact with the swap count without involving a
folio (e.g. forking/zapping the page table, or truncating a mapping
without swapin). In such cases, the caller has to ensure there is no race
condition on whatever owns the swap count and use the helpers below:
- swap_put_entries_direct() - Decrease the swap count directly.
Context: The caller must lock whatever is referencing the slots to
avoid a race.
Typically, page table zapping or shmem mapping truncation needs
to free swap slots directly. If a slot is cached (has a folio bound),
this will also try to release the swap cache.
- swap_dup_entry_direct() - Increase the swap count directly.
Context: The caller must lock whatever is referencing the entries to
avoid race, and the entries must already have a swap count > 1.
Typically, forking will need to copy the page table and hence needs to
increase the swap count of the entries in the table. The page table is
locked while referencing the swap entries, so the entries all have a
swap count > 1 and can't be freed.
The hibernation subsystem is a bit different, so there are two special wrappers:
- swap_alloc_hibernation_slot() - Allocate one entry from one device.
- swap_free_hibernation_slot() - Free one entry allocated by the above
helper.
All hibernation entries are exclusive to the hibernation subsystem and
should not interact with ordinary swap routines.
By separating the workflows, it will be possible to bind folios more
tightly to the swap cache and get rid of SWAP_HAS_CACHE as a temporary
pin.
This commit should not introduce any behavior change.
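A rough sketch of how these helpers fit together (simplified; locking,
error handling and the swap cache details omitted):

	/* Swap-out: folio locked throughout. */
	folio_alloc_swap(folio);		/* slots allocated, pinned by swap cache, count == 0 */
	folio_dup_swap(folio, subpage);		/* per unmapped PTE: count 0 -> 1 (under PTL) */

	/* Swap-in (e.g. do_swap_page): folio locked and in swap cache. */
	folio_put_swap(folio, subpage);		/* PTE restored: count 1 -> 0, cache still pins slot */
	folio_free_swap(folio);			/* optional: drop swap cache and free the slots */

	/* No folio at hand (fork / zap / shmem truncate), caller holds PTL etc. */
	swap_dup_entry_direct(entry);		/* fork: copy a swap PTE */
	swap_put_entries_direct(entry, nr);	/* zap/truncate: drop refs, reclaim cache if any */

	/* Hibernation keeps its own exclusive entries. */
	entry = swap_alloc_hibernation_slot(type);
	swap_free_hibernation_slot(entry);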
Signed-off-by: Kairui Song <kasong@tencent.com>
---
arch/s390/mm/pgtable.c | 2 +-
include/linux/swap.h | 58 +++++++++----------
kernel/power/swap.c | 10 ++--
mm/madvise.c | 2 +-
mm/memory.c | 15 +++--
mm/rmap.c | 7 ++-
mm/shmem.c | 10 ++--
mm/swap.h | 37 +++++++++++++
mm/swapfile.c | 148 ++++++++++++++++++++++++++++++++++---------------
9 files changed, 192 insertions(+), 97 deletions(-)
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 0fde20bbc50b..c51304a4418e 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -692,7 +692,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
dec_mm_counter(mm, mm_counter(folio));
}
- free_swap_and_cache(entry);
+ swap_put_entries_direct(entry, 1);
}
void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 69025b473472..ac3caa4c6999 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,14 +452,8 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
-int folio_alloc_swap(struct folio *folio);
-bool folio_free_swap(struct folio *folio);
void put_swap_folio(struct folio *folio, swp_entry_t entry);
-extern swp_entry_t get_swap_page_of_type(int);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern int swap_duplicate_nr(swp_entry_t entry, int nr);
-extern void swap_free_nr(swp_entry_t entry, int nr_pages);
-extern void free_swap_and_cache_nr(swp_entry_t entry, int nr);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
extern unsigned int count_swap_pages(int, int);
@@ -472,6 +466,29 @@ struct backing_dev_info;
extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
sector_t swap_folio_sector(struct folio *folio);
+/*
+ * If there is an existing swap slot reference (swap entry) and the caller
+ * guarantees that there is no race modification of it (e.g., PTL
+ * protecting the swap entry in page table; shmem's cmpxchg protects t
+ * he swap entry in shmem mapping), these two helpers below can be used
+ * to put/dup the entries directly.
+ *
+ * All entries must be allocated by folio_alloc_swap(). And they must have
+ * a swap count > 1. See comments of folio_*_swap helpers for more info.
+ */
+int swap_dup_entry_direct(swp_entry_t entry);
+void swap_put_entries_direct(swp_entry_t entry, int nr);
+
+/*
+ * folio_free_swap tries to free the swap entries pinned by a swap cache
+ * folio, it has to be here to be called by other components.
+ */
+bool folio_free_swap(struct folio *folio);
+
+/* Allocate / free (hibernation) exclusive entries */
+swp_entry_t swap_alloc_hibernation_slot(int type);
+void swap_free_hibernation_slot(swp_entry_t entry);
+
static inline void put_swap_device(struct swap_info_struct *si)
{
percpu_ref_put(&si->users);
@@ -499,10 +516,6 @@ static inline void put_swap_device(struct swap_info_struct *si)
#define free_pages_and_swap_cache(pages, nr) \
release_pages((pages), (nr));
-static inline void free_swap_and_cache_nr(swp_entry_t entry, int nr)
-{
-}
-
static inline void free_swap_cache(struct folio *folio)
{
}
@@ -512,12 +525,12 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline int swap_duplicate_nr(swp_entry_t swp, int nr_pages)
+static inline int swap_dup_entry_direct(swp_entry_t ent)
{
return 0;
}
-static inline void swap_free_nr(swp_entry_t entry, int nr_pages)
+static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
{
}
@@ -541,11 +554,6 @@ static inline int swp_swapcount(swp_entry_t entry)
return 0;
}
-static inline int folio_alloc_swap(struct folio *folio)
-{
- return -EINVAL;
-}
-
static inline bool folio_free_swap(struct folio *folio)
{
return false;
@@ -558,22 +566,6 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
return -EINVAL;
}
#endif /* CONFIG_SWAP */
-
-static inline int swap_duplicate(swp_entry_t entry)
-{
- return swap_duplicate_nr(entry, 1);
-}
-
-static inline void free_swap_and_cache(swp_entry_t entry)
-{
- free_swap_and_cache_nr(entry, 1);
-}
-
-static inline void swap_free(swp_entry_t entry)
-{
- swap_free_nr(entry, 1);
-}
-
#ifdef CONFIG_MEMCG
static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
{
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 0beff7eeaaba..546a0c701970 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -179,10 +179,10 @@ sector_t alloc_swapdev_block(int swap)
{
unsigned long offset;
- offset = swp_offset(get_swap_page_of_type(swap));
+ offset = swp_offset(swap_alloc_hibernation_slot(swap));
if (offset) {
if (swsusp_extents_insert(offset))
- swap_free(swp_entry(swap, offset));
+ swap_free_hibernation_slot(swp_entry(swap, offset));
else
return swapdev_block(swap, offset);
}
@@ -197,6 +197,7 @@ sector_t alloc_swapdev_block(int swap)
void free_all_swap_pages(int swap)
{
+ unsigned long offset;
struct rb_node *node;
while ((node = swsusp_extents.rb_node)) {
@@ -204,8 +205,9 @@ void free_all_swap_pages(int swap)
ext = rb_entry(node, struct swsusp_extent, node);
rb_erase(node, &swsusp_extents);
- swap_free_nr(swp_entry(swap, ext->start),
- ext->end - ext->start + 1);
+
+ for (offset = ext->start; offset < ext->end; offset++)
+ swap_free_hibernation_slot(swp_entry(swap, offset));
kfree(ext);
}
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..3cf2097d2085 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -697,7 +697,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
max_nr = (end - addr) / PAGE_SIZE;
nr = swap_pte_batch(pte, max_nr, ptent);
nr_swap -= nr;
- free_swap_and_cache_nr(entry, nr);
+ swap_put_entries_direct(entry, nr);
clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
} else if (is_hwpoison_entry(entry) ||
is_poisoned_swp_entry(entry)) {
diff --git a/mm/memory.c b/mm/memory.c
index 589d6fc3d424..27d91ae3648a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -933,7 +933,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
swp_entry_t entry = pte_to_swp_entry(orig_pte);
if (likely(!non_swap_entry(entry))) {
- if (swap_duplicate(entry) < 0)
+ if (swap_dup_entry_direct(entry) < 0)
return -EIO;
/* make sure dst_mm is on swapoff's mmlist. */
@@ -1746,7 +1746,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
nr = swap_pte_batch(pte, max_nr, ptent);
rss[MM_SWAPENTS] -= nr;
- free_swap_and_cache_nr(entry, nr);
+ swap_put_entries_direct(entry, nr);
} else if (is_migration_entry(entry)) {
struct folio *folio = pfn_swap_entry_folio(entry);
@@ -4932,7 +4932,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
- * so this must be called before swap_free().
+ * so this must be called before folio_put_swap().
*/
arch_swap_restore(folio_swap(entry, folio), folio);
@@ -4970,6 +4970,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (unlikely(folio != swapcache)) {
folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
folio_add_lru_vma(folio, vma);
+ folio_put_swap(swapcache, NULL);
} else if (!folio_test_anon(folio)) {
/*
* We currently only expect !anon folios that are fully
@@ -4978,9 +4979,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
+ folio_put_swap(folio, NULL);
} else {
+ VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
- rmap_flags);
+ rmap_flags);
+ folio_put_swap(folio, nr_pages == 1 ? page : NULL);
}
VM_BUG_ON(!folio_test_anon(folio) ||
@@ -4994,7 +4998,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* swapcache. Do it after mapping so any raced page fault will
* see the folio in swap cache and wait for us.
*/
- swap_free_nr(entry, nr_pages);
if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
folio_free_swap(folio);
@@ -5004,7 +5007,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
* Hold the lock to avoid the swap entry to be reused
* until we take the PT lock for the pte_same() check
* (to avoid false positives from pte_same). For
- * further safety release the lock after the swap_free
+ * further safety release the lock after the folio_put_swap
* so that the swap count won't change under a
* parallel locked swapcache.
*/
diff --git a/mm/rmap.c b/mm/rmap.c
index 1954c538a991..844864831797 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -82,6 +82,7 @@
#include <trace/events/migrate.h>
#include "internal.h"
+#include "swap.h"
static struct kmem_cache *anon_vma_cachep;
static struct kmem_cache *anon_vma_chain_cachep;
@@ -2146,7 +2147,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
goto discard;
}
- if (swap_duplicate(entry) < 0) {
+ if (folio_dup_swap(folio, subpage) < 0) {
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
@@ -2157,7 +2158,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
* so we'll not check/care.
*/
if (arch_unmap_one(mm, vma, address, pteval) < 0) {
- swap_free(entry);
+ folio_put_swap(folio, subpage);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
@@ -2165,7 +2166,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
/* See folio_try_share_anon_rmap(): clear PTE first. */
if (anon_exclusive &&
folio_try_share_anon_rmap_pte(folio, subpage)) {
- swap_free(entry);
+ folio_put_swap(folio, subpage);
set_pte_at(mm, address, pvmw.pte, pteval);
goto walk_abort;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index 46d54a1288fd..5e6cb763d945 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -982,7 +982,7 @@ static long shmem_free_swap(struct address_space *mapping,
old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
if (old != radswap)
return 0;
- free_swap_and_cache_nr(radix_to_swp_entry(radswap), 1 << order);
+ swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
return 1 << order;
}
@@ -1665,7 +1665,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
spin_unlock(&shmem_swaplist_lock);
}
- swap_duplicate_nr(folio->swap, nr_pages);
+ folio_dup_swap(folio, NULL);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
BUG_ON(folio_mapped(folio));
@@ -1686,7 +1686,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
/* Swap entry might be erased by racing shmem_free_swap() */
if (!error) {
shmem_recalc_inode(inode, 0, -nr_pages);
- swap_free_nr(folio->swap, nr_pages);
+ folio_put_swap(folio, NULL);
}
/*
@@ -2172,6 +2172,7 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
nr_pages = folio_nr_pages(folio);
folio_wait_writeback(folio);
+ folio_put_swap(folio, NULL);
swap_cache_del_folio(folio);
/*
* Don't treat swapin error folio as alloced. Otherwise inode->i_blocks
@@ -2179,7 +2180,6 @@ static void shmem_set_folio_swapin_error(struct inode *inode, pgoff_t index,
* in shmem_evict_inode().
*/
shmem_recalc_inode(inode, -nr_pages, -nr_pages);
- swap_free_nr(swap, nr_pages);
}
static int shmem_split_large_entry(struct inode *inode, pgoff_t index,
@@ -2401,9 +2401,9 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
folio_mark_accessed(folio);
+ folio_put_swap(folio, NULL);
swap_cache_del_folio(folio);
folio_mark_dirty(folio);
- swap_free_nr(swap, nr_pages);
put_swap_device(si);
*foliop = folio;
diff --git a/mm/swap.h b/mm/swap.h
index a3c5f2dca0d5..74c61129d7b7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -183,6 +183,28 @@ static inline void swap_cluster_unlock_irq(struct swap_cluster_info *ci)
spin_unlock_irq(&ci->lock);
}
+/*
+ * Below are the core routines for doing swap for a folio.
+ * All helpers require the folio to be locked, and a locked folio
+ * in the swap cache pins the swap entries / slots allocated to the
+ * folio, swap relies heavily on the swap cache and folio lock for
+ * synchronization.
+ *
+ * folio_alloc_swap(): the entry point for a folio to be swapped
+ * out. It allocates swap slots and pins the slots with swap cache.
+ * The slots start with a swap count of zero.
+ *
+ * folio_dup_swap(): increases the swap count of a folio, usually
+ * as it gets unmapped and a swap entry is installed to replace
+ * it (e.g., swap entry in page table). A swap slot with swap
+ * count == 0 should only be increased by this helper.
+ *
+ * folio_put_swap(): does the opposite thing of folio_dup_swap().
+ */
+int folio_alloc_swap(struct folio *folio);
+int folio_dup_swap(struct folio *folio, struct page *subpage);
+void folio_put_swap(struct folio *folio, struct page *subpage);
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -363,9 +385,24 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
return NULL;
}
+static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
+{
+ return -EINVAL;
+}
+
+static inline int folio_dup_swap(struct folio *folio, struct page *page)
+{
+ return -EINVAL;
+}
+
+static inline void folio_put_swap(struct folio *folio, struct page *page)
+{
+}
+
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
{
}
+
static inline void swap_write_unplug(struct swap_iocb *sio)
{
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 415db36d85d3..426b0b6d583f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -58,6 +58,9 @@ static void swap_entries_free(struct swap_info_struct *si,
swp_entry_t entry, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
+static bool swap_entries_put_map(struct swap_info_struct *si,
+ swp_entry_t entry, int nr);
static bool folio_swapcache_freeable(struct folio *folio);
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
@@ -1467,6 +1470,12 @@ int folio_alloc_swap(struct folio *folio)
*/
WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
+ /*
+ * The allocator should always allocate aligned entries so folio-based
+ * operations never cross more than one cluster.
+ */
+ VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
+
return 0;
out_free:
@@ -1474,6 +1483,62 @@ int folio_alloc_swap(struct folio *folio)
return -ENOMEM;
}
+/**
+ * folio_dup_swap() - Increase swap count of swap entries of a folio.
+ * @folio: folio with swap entries bound to it.
+ * @subpage: if not NULL, only increase the swap count of this subpage.
+ *
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ * The caller also has to ensure there is no racing call to
+ * swap_put_entries_direct before this helper returns, or the swap
+ * map may underflow (TODO: maybe we should allow or avoid underflow to
+ * make swap refcount lockless).
+ */
+int folio_dup_swap(struct folio *folio, struct page *subpage)
+{
+ int err = 0;
+ swp_entry_t entry = folio->swap;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (subpage) {
+ entry.val += folio_page_idx(folio, subpage);
+ nr_pages = 1;
+ }
+
+ while (!err && __swap_duplicate(entry, 1, nr_pages) == -ENOMEM)
+ err = add_swap_count_continuation(entry, GFP_ATOMIC);
+
+ return err;
+}
+
+/**
+ * folio_put_swap() - Decrease swap count of swap entries of a folio.
+ * @folio: folio with swap entries bound to it, must be in the swap cache and locked.
+ * @subpage: if not NULL, only decrease the swap count of this subpage.
+ *
+ * This won't free the swap slots even if the swap count drops to zero; they
+ * are still pinned by the swap cache. The user may call folio_free_swap() to
+ * free them.
+ * Context: Caller must ensure the folio is locked and in the swap cache.
+ */
+void folio_put_swap(struct folio *folio, struct page *subpage)
+{
+ swp_entry_t entry = folio->swap;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+ if (subpage) {
+ entry.val += folio_page_idx(folio, subpage);
+ nr_pages = 1;
+ }
+
+ swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages);
+}
+
static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
{
struct swap_info_struct *si;
@@ -1714,28 +1779,6 @@ static void swap_entries_free(struct swap_info_struct *si,
partial_free_cluster(si, ci);
}
-/*
- * Caller has made sure that the swap device corresponding to entry
- * is still around or has not been recycled.
- */
-void swap_free_nr(swp_entry_t entry, int nr_pages)
-{
- int nr;
- struct swap_info_struct *sis;
- unsigned long offset = swp_offset(entry);
-
- sis = _swap_info_get(entry);
- if (!sis)
- return;
-
- while (nr_pages) {
- nr = min_t(int, nr_pages, SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
- swap_entries_put_map(sis, swp_entry(sis->type, offset), nr);
- offset += nr;
- nr_pages -= nr;
- }
-}
-
/*
* Called after dropping swapcache to decrease refcnt to swap entries.
*/
@@ -1924,16 +1967,19 @@ bool folio_free_swap(struct folio *folio)
}
/**
- * free_swap_and_cache_nr() - Release reference on range of swap entries and
- * reclaim their cache if no more references remain.
+ * swap_put_entries_direct() - Release reference on range of swap entries and
+ * reclaim their cache if no more references remain.
* @entry: First entry of range.
* @nr: Number of entries in range.
*
* For each swap entry in the contiguous range, release a reference. If any swap
* entries become free, try to reclaim their underlying folios, if present. The
* offset range is defined by [entry.offset, entry.offset + nr).
+ *
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner, e.g., by locking the PTL of a PTE containing the entry being released.
*/
-void free_swap_and_cache_nr(swp_entry_t entry, int nr)
+void swap_put_entries_direct(swp_entry_t entry, int nr)
{
const unsigned long start_offset = swp_offset(entry);
const unsigned long end_offset = start_offset + nr;
@@ -1942,10 +1988,9 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
unsigned long offset;
si = get_swap_device(entry);
- if (!si)
+ if (WARN_ON_ONCE(!si))
return;
-
- if (WARN_ON(end_offset > si->max))
+ if (WARN_ON_ONCE(end_offset > si->max))
goto out;
/*
@@ -1989,8 +2034,8 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr)
}
#ifdef CONFIG_HIBERNATION
-
-swp_entry_t get_swap_page_of_type(int type)
+/* Allocate a slot for hibernation */
+swp_entry_t swap_alloc_hibernation_slot(int type)
{
struct swap_info_struct *si = swap_type_to_info(type);
unsigned long offset;
@@ -2020,6 +2065,27 @@ swp_entry_t get_swap_page_of_type(int type)
return entry;
}
+/* Free a slot allocated by swap_alloc_hibernation_slot */
+void swap_free_hibernation_slot(swp_entry_t entry)
+{
+ struct swap_info_struct *si;
+ struct swap_cluster_info *ci;
+ pgoff_t offset = swp_offset(entry);
+
+ si = get_swap_device(entry);
+ if (WARN_ON(!si))
+ return;
+
+ ci = swap_cluster_lock(si, offset);
+ swap_entry_put_locked(si, ci, entry, 1);
+ WARN_ON(swap_entry_swapped(si, offset));
+ swap_cluster_unlock(ci);
+
+ /* In theory readahead might add it to the swap cache by accident */
+ __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
+ put_swap_device(si);
+}
+
/*
* Find the swap type that corresponds to given device (if any).
*
@@ -2181,7 +2247,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
/*
* Some architectures may have to restore extra metadata to the page
* when reading from swap. This metadata may be indexed by swap entry
- * so this must be called before swap_free().
+ * so this must be called before folio_put_swap().
*/
arch_swap_restore(folio_swap(entry, folio), folio);
@@ -2222,7 +2288,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
new_pte = pte_mkuffd_wp(new_pte);
setpte:
set_pte_at(vma->vm_mm, addr, pte, new_pte);
- swap_free(entry);
+ folio_put_swap(folio, page);
out:
if (pte)
pte_unmap_unlock(pte, ptl);
@@ -3725,28 +3791,22 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
return err;
}
-/**
- * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
- * by 1.
- *
+/*
+ * swap_dup_entry_direct() - Increase reference count of a swap entry by one.
* @entry: first swap entry from which we want to increase the refcount.
- * @nr: Number of entries in range.
*
* Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
* but could not be atomically allocated. Returns 0, just as if it succeeded,
* if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
* might occur if a page table entry has got corrupted.
*
- * Note that we are currently not handling the case where nr > 1 and we need to
- * add swap count continuation. This is OK, because no such user exists - shmem
- * is the only user that can pass nr > 1, and it never re-duplicates any swap
- * entry it owns.
+ * Context: Caller must ensure there is no race condition on the reference
+ * owner, e.g., by locking the PTL of a PTE containing the entry being increased.
*/
-int swap_duplicate_nr(swp_entry_t entry, int nr)
+int swap_dup_entry_direct(swp_entry_t entry)
{
int err = 0;
-
- while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
+ while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
--
2.51.1
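A minimal usage sketch of the three helpers documented in the mm/swap.h hunk above, not part of the patch: swap_out_sketch() is a hypothetical caller, error handling and the real reclaim context are elided, and a NULL subpage argument means the whole folio (real callers in this series are e.g. shrink_folio_list() and try_to_unmap_one()).

/*
 * Sketch only: expected ordering of the folio swap helpers.
 */
#include <linux/mm.h>
#include "swap.h"	/* mm-internal header carrying the declarations */

static void swap_out_sketch(struct folio *folio)
{
	/* The folio must be locked; allocation pins the slots via the swap cache. */
	if (folio_alloc_swap(folio))
		return;		/* no swap slots available */

	/*
	 * Raise the swap count when a mapping is replaced by a swap entry.
	 * NULL covers the whole folio; try_to_unmap_one() passes the subpage
	 * backing the individual PTE instead.
	 */
	if (folio_dup_swap(folio, NULL))
		return;		/* swap count continuation could not be allocated */

	/*
	 * Drop the count again, e.g. on swapin. The slots stay pinned by the
	 * swap cache until folio_free_swap() or swap_cache_del_folio().
	 */
	folio_put_swap(folio, NULL);
}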
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
@ 2025-10-29 19:25 ` kernel test robot
2025-10-30 5:25 ` Kairui Song
2025-10-29 19:25 ` kernel test robot
2025-11-01 4:51 ` YoungJun Park
2 siblings, 1 reply; 50+ messages in thread
From: kernel test robot @ 2025-10-29 19:25 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner,
Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins,
Baolin Wang, Huang, Ying, Kemeng Shi, Lorenzo Stoakes,
Matthew Wilcox (Oracle), linux-kernel, Kairui Song
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on f30d294530d939fa4b77d61bc60f25c4284841fa]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
base: f30d294530d939fa4b77d61bc60f25c4284841fa
patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-14-3d43f3b6ec32%40tencent.com
patch subject: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510300316.UL4gxAlC-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from mm/vmscan.c:70:
mm/swap.h: In function 'swap_cache_add_folio':
mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
465 | }
| ^
mm/vmscan.c: In function 'shrink_folio_list':
>> mm/vmscan.c:1298:37: error: too few arguments to function 'folio_alloc_swap'
1298 | if (folio_alloc_swap(folio)) {
| ^~~~~~~~~~~~~~~~
mm/swap.h:388:19: note: declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^~~~~~~~~~~~~~~~
mm/vmscan.c:1314:45: error: too few arguments to function 'folio_alloc_swap'
1314 | if (folio_alloc_swap(folio))
| ^~~~~~~~~~~~~~~~
mm/swap.h:388:19: note: declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^~~~~~~~~~~~~~~~
--
In file included from mm/shmem.c:44:
mm/swap.h: In function 'swap_cache_add_folio':
mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
465 | }
| ^
mm/shmem.c: In function 'shmem_writeout':
>> mm/shmem.c:1649:14: error: too few arguments to function 'folio_alloc_swap'
1649 | if (!folio_alloc_swap(folio)) {
| ^~~~~~~~~~~~~~~~
mm/swap.h:388:19: note: declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^~~~~~~~~~~~~~~~
vim +/folio_alloc_swap +1298 mm/vmscan.c
d791ea676b6648 NeilBrown 2022-05-09 1072
^1da177e4c3f41 Linus Torvalds 2005-04-16 1073 /*
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1074) * shrink_folio_list() returns the number of reclaimed pages
^1da177e4c3f41 Linus Torvalds 2005-04-16 1075 */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1076) static unsigned int shrink_folio_list(struct list_head *folio_list,
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1077) struct pglist_data *pgdat, struct scan_control *sc,
7d709f49babc28 Gregory Price 2025-04-24 1078 struct reclaim_stat *stat, bool ignore_references,
7d709f49babc28 Gregory Price 2025-04-24 1079 struct mem_cgroup *memcg)
^1da177e4c3f41 Linus Torvalds 2005-04-16 1080 {
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1081) struct folio_batch free_folios;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1082) LIST_HEAD(ret_folios);
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1083) LIST_HEAD(demote_folios);
a479b078fddb0a Li Zhijian 2025-01-10 1084 unsigned int nr_reclaimed = 0, nr_demoted = 0;
730ec8c01a2bd6 Maninder Singh 2020-06-03 1085 unsigned int pgactivate = 0;
26aa2d199d6f2c Dave Hansen 2021-09-02 1086 bool do_demote_pass;
2282679fb20bf0 NeilBrown 2022-05-09 1087 struct swap_iocb *plug = NULL;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1088
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1089) folio_batch_init(&free_folios);
060f005f074791 Kirill Tkhai 2019-03-05 1090 memset(stat, 0, sizeof(*stat));
^1da177e4c3f41 Linus Torvalds 2005-04-16 1091 cond_resched();
7d709f49babc28 Gregory Price 2025-04-24 1092 do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1093
26aa2d199d6f2c Dave Hansen 2021-09-02 1094 retry:
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1095) while (!list_empty(folio_list)) {
^1da177e4c3f41 Linus Torvalds 2005-04-16 1096 struct address_space *mapping;
be7c07d60e13ac Matthew Wilcox (Oracle 2021-12-23 1097) struct folio *folio;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1098) enum folio_references references = FOLIOREF_RECLAIM;
d791ea676b6648 NeilBrown 2022-05-09 1099 bool dirty, writeback;
98879b3b9edc16 Yang Shi 2019-07-11 1100 unsigned int nr_pages;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1101
^1da177e4c3f41 Linus Torvalds 2005-04-16 1102 cond_resched();
^1da177e4c3f41 Linus Torvalds 2005-04-16 1103
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1104) folio = lru_to_folio(folio_list);
be7c07d60e13ac Matthew Wilcox (Oracle 2021-12-23 1105) list_del(&folio->lru);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1106
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1107) if (!folio_trylock(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1108 goto keep;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1109
1b0449544c6482 Jinjiang Tu 2025-03-18 1110 if (folio_contain_hwpoisoned_page(folio)) {
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1111 /*
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1112 * unmap_poisoned_folio() can't handle large
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1113 * folio, just skip it. memory_failure() will
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1114 * handle it if the UCE is triggered again.
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1115 */
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1116 if (folio_test_large(folio))
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1117 goto keep_locked;
9f1e8cd0b7c4c9 Jinjiang Tu 2025-06-27 1118
1b0449544c6482 Jinjiang Tu 2025-03-18 1119 unmap_poisoned_folio(folio, folio_pfn(folio), false);
1b0449544c6482 Jinjiang Tu 2025-03-18 1120 folio_unlock(folio);
1b0449544c6482 Jinjiang Tu 2025-03-18 1121 folio_put(folio);
1b0449544c6482 Jinjiang Tu 2025-03-18 1122 continue;
1b0449544c6482 Jinjiang Tu 2025-03-18 1123 }
1b0449544c6482 Jinjiang Tu 2025-03-18 1124
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1125) VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1126
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1127) nr_pages = folio_nr_pages(folio);
98879b3b9edc16 Yang Shi 2019-07-11 1128
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1129) /* Account the number of base pages */
98879b3b9edc16 Yang Shi 2019-07-11 1130 sc->nr_scanned += nr_pages;
80e4342601abfa Christoph Lameter 2006-02-11 1131
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1132) if (unlikely(!folio_evictable(folio)))
ad6b67041a4549 Minchan Kim 2017-05-03 1133 goto activate_locked;
894bc310419ac9 Lee Schermerhorn 2008-10-18 1134
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1135) if (!sc->may_unmap && folio_mapped(folio))
80e4342601abfa Christoph Lameter 2006-02-11 1136 goto keep_locked;
80e4342601abfa Christoph Lameter 2006-02-11 1137
e2be15f6c3eece Mel Gorman 2013-07-03 1138 /*
894befec4d70b1 Andrey Ryabinin 2018-04-10 1139 * The number of dirty pages determines if a node is marked
8cd7c588decf47 Mel Gorman 2021-11-05 1140 * reclaim_congested. kswapd will stall and start writing
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1141) * folios if the tail of the LRU is all dirty unqueued folios.
e2be15f6c3eece Mel Gorman 2013-07-03 1142 */
e20c41b1091a24 Matthew Wilcox (Oracle 2022-01-17 1143) folio_check_dirty_writeback(folio, &dirty, &writeback);
e2be15f6c3eece Mel Gorman 2013-07-03 1144 if (dirty || writeback)
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1145) stat->nr_dirty += nr_pages;
e2be15f6c3eece Mel Gorman 2013-07-03 1146
e2be15f6c3eece Mel Gorman 2013-07-03 1147 if (dirty && !writeback)
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1148) stat->nr_unqueued_dirty += nr_pages;
e2be15f6c3eece Mel Gorman 2013-07-03 1149
d04e8acd03e5c3 Mel Gorman 2013-07-03 1150 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1151) * Treat this folio as congested if folios are cycling
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1152) * through the LRU so quickly that the folios marked
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1153) * for immediate reclaim are making it to the end of
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1154) * the LRU a second time.
d04e8acd03e5c3 Mel Gorman 2013-07-03 1155 */
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1156) if (writeback && folio_test_reclaim(folio))
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1157) stat->nr_congested += nr_pages;
e2be15f6c3eece Mel Gorman 2013-07-03 1158
e62e384e9da8d9 Michal Hocko 2012-07-31 1159 /*
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1160) * If a folio at the tail of the LRU is under writeback, there
283aba9f9e0e48 Mel Gorman 2013-07-03 1161 * are three cases to consider.
283aba9f9e0e48 Mel Gorman 2013-07-03 1162 *
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1163) * 1) If reclaim is encountering an excessive number
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1164) * of folios under writeback and this folio has both
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1165) * the writeback and reclaim flags set, then it
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1166) * indicates that folios are being queued for I/O but
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1167) * are being recycled through the LRU before the I/O
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1168) * can complete. Waiting on the folio itself risks an
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1169) * indefinite stall if it is impossible to writeback
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1170) * the folio due to I/O error or disconnected storage
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1171) * so instead note that the LRU is being scanned too
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1172) * quickly and the caller can stall after the folio
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1173) * list has been processed.
283aba9f9e0e48 Mel Gorman 2013-07-03 1174 *
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1175) * 2) Global or new memcg reclaim encounters a folio that is
ecf5fc6e9654cd Michal Hocko 2015-08-04 1176 * not marked for immediate reclaim, or the caller does not
ecf5fc6e9654cd Michal Hocko 2015-08-04 1177 * have __GFP_FS (or __GFP_IO if it's simply going to swap,
0c4f8ed498cea1 Joanne Koong 2025-04-14 1178 * not to fs), or the folio belongs to a mapping where
0c4f8ed498cea1 Joanne Koong 2025-04-14 1179 * waiting on writeback during reclaim may lead to a deadlock.
0c4f8ed498cea1 Joanne Koong 2025-04-14 1180 * In this case mark the folio for immediate reclaim and
0c4f8ed498cea1 Joanne Koong 2025-04-14 1181 * continue scanning.
283aba9f9e0e48 Mel Gorman 2013-07-03 1182 *
d791ea676b6648 NeilBrown 2022-05-09 1183 * Require may_enter_fs() because we would wait on fs, which
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1184) * may not have submitted I/O yet. And the loop driver might
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1185) * enter reclaim, and deadlock if it waits on a folio for
283aba9f9e0e48 Mel Gorman 2013-07-03 1186 * which it is needed to do the write (loop masks off
283aba9f9e0e48 Mel Gorman 2013-07-03 1187 * __GFP_IO|__GFP_FS for this reason); but more thought
283aba9f9e0e48 Mel Gorman 2013-07-03 1188 * would probably show more reasons.
283aba9f9e0e48 Mel Gorman 2013-07-03 1189 *
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1190) * 3) Legacy memcg encounters a folio that already has the
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1191) * reclaim flag set. memcg does not have any dirty folio
283aba9f9e0e48 Mel Gorman 2013-07-03 1192 * throttling so we could easily OOM just because too many
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1193) * folios are in writeback and there is nothing else to
283aba9f9e0e48 Mel Gorman 2013-07-03 1194 * reclaim. Wait for the writeback to complete.
c55e8d035b28b2 Johannes Weiner 2017-02-24 1195 *
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1196) * In cases 1) and 2) we activate the folios to get them out of
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1197) * the way while we continue scanning for clean folios on the
c55e8d035b28b2 Johannes Weiner 2017-02-24 1198 * inactive list and refilling from the active list. The
c55e8d035b28b2 Johannes Weiner 2017-02-24 1199 * observation here is that waiting for disk writes is more
c55e8d035b28b2 Johannes Weiner 2017-02-24 1200 * expensive than potentially causing reloads down the line.
c55e8d035b28b2 Johannes Weiner 2017-02-24 1201 * Since they're marked for immediate reclaim, they won't put
c55e8d035b28b2 Johannes Weiner 2017-02-24 1202 * memory pressure on the cache working set any longer than it
c55e8d035b28b2 Johannes Weiner 2017-02-24 1203 * takes to write them to disk.
e62e384e9da8d9 Michal Hocko 2012-07-31 1204 */
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1205) if (folio_test_writeback(folio)) {
0c4f8ed498cea1 Joanne Koong 2025-04-14 1206 mapping = folio_mapping(folio);
0c4f8ed498cea1 Joanne Koong 2025-04-14 1207
283aba9f9e0e48 Mel Gorman 2013-07-03 1208 /* Case 1 above */
283aba9f9e0e48 Mel Gorman 2013-07-03 1209 if (current_is_kswapd() &&
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1210) folio_test_reclaim(folio) &&
599d0c954f91d0 Mel Gorman 2016-07-28 1211 test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1212) stat->nr_immediate += nr_pages;
c55e8d035b28b2 Johannes Weiner 2017-02-24 1213 goto activate_locked;
283aba9f9e0e48 Mel Gorman 2013-07-03 1214
283aba9f9e0e48 Mel Gorman 2013-07-03 1215 /* Case 2 above */
b5ead35e7e1d34 Johannes Weiner 2019-11-30 1216 } else if (writeback_throttling_sane(sc) ||
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1217) !folio_test_reclaim(folio) ||
0c4f8ed498cea1 Joanne Koong 2025-04-14 1218 !may_enter_fs(folio, sc->gfp_mask) ||
0c4f8ed498cea1 Joanne Koong 2025-04-14 1219 (mapping &&
0c4f8ed498cea1 Joanne Koong 2025-04-14 1220 mapping_writeback_may_deadlock_on_reclaim(mapping))) {
c3b94f44fcb072 Hugh Dickins 2012-07-31 1221 /*
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1222) * This is slightly racy -
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1223) * folio_end_writeback() might have
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1224) * just cleared the reclaim flag, then
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1225) * setting the reclaim flag here ends up
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1226) * interpreted as the readahead flag - but
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1227) * that does not matter enough to care.
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1228) * What we do want is for this folio to
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1229) * have the reclaim flag set next time
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1230) * memcg reclaim reaches the tests above,
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1231) * so it will then wait for writeback to
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1232) * avoid OOM; and it's also appropriate
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1233) * in global reclaim.
c3b94f44fcb072 Hugh Dickins 2012-07-31 1234 */
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1235) folio_set_reclaim(folio);
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1236) stat->nr_writeback += nr_pages;
c55e8d035b28b2 Johannes Weiner 2017-02-24 1237 goto activate_locked;
283aba9f9e0e48 Mel Gorman 2013-07-03 1238
283aba9f9e0e48 Mel Gorman 2013-07-03 1239 /* Case 3 above */
283aba9f9e0e48 Mel Gorman 2013-07-03 1240 } else {
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1241) folio_unlock(folio);
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1242) folio_wait_writeback(folio);
d33e4e1412c8b6 Matthew Wilcox (Oracle 2022-05-12 1243) /* then go back and try same folio again */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1244) list_add_tail(&folio->lru, folio_list);
7fadc820222497 Hugh Dickins 2015-09-08 1245 continue;
e62e384e9da8d9 Michal Hocko 2012-07-31 1246 }
283aba9f9e0e48 Mel Gorman 2013-07-03 1247 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1248
8940b34a4e082a Minchan Kim 2019-09-25 1249 if (!ignore_references)
d92013d1e5e47f Matthew Wilcox (Oracle 2022-02-15 1250) references = folio_check_references(folio, sc);
02c6de8d757cb3 Minchan Kim 2012-10-08 1251
dfc8d636cdb95f Johannes Weiner 2010-03-05 1252 switch (references) {
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1253) case FOLIOREF_ACTIVATE:
^1da177e4c3f41 Linus Torvalds 2005-04-16 1254 goto activate_locked;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1255) case FOLIOREF_KEEP:
98879b3b9edc16 Yang Shi 2019-07-11 1256 stat->nr_ref_keep += nr_pages;
645747462435d8 Johannes Weiner 2010-03-05 1257 goto keep_locked;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1258) case FOLIOREF_RECLAIM:
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1259) case FOLIOREF_RECLAIM_CLEAN:
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1260) ; /* try to reclaim the folio below */
dfc8d636cdb95f Johannes Weiner 2010-03-05 1261 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1262
26aa2d199d6f2c Dave Hansen 2021-09-02 1263 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1264) * Before reclaiming the folio, try to relocate
26aa2d199d6f2c Dave Hansen 2021-09-02 1265 * its contents to another node.
26aa2d199d6f2c Dave Hansen 2021-09-02 1266 */
26aa2d199d6f2c Dave Hansen 2021-09-02 1267 if (do_demote_pass &&
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1268) (thp_migration_supported() || !folio_test_large(folio))) {
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1269) list_add(&folio->lru, &demote_folios);
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1270) folio_unlock(folio);
26aa2d199d6f2c Dave Hansen 2021-09-02 1271 continue;
26aa2d199d6f2c Dave Hansen 2021-09-02 1272 }
26aa2d199d6f2c Dave Hansen 2021-09-02 1273
^1da177e4c3f41 Linus Torvalds 2005-04-16 1274 /*
^1da177e4c3f41 Linus Torvalds 2005-04-16 1275 * Anonymous process memory has backing store?
^1da177e4c3f41 Linus Torvalds 2005-04-16 1276 * Try to allocate it some swap space here.
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1277) * Lazyfree folio could be freed directly
^1da177e4c3f41 Linus Torvalds 2005-04-16 1278 */
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1279) if (folio_test_anon(folio) && folio_test_swapbacked(folio)) {
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1280) if (!folio_test_swapcache(folio)) {
63eb6b93ce725e Hugh Dickins 2008-11-19 1281 if (!(sc->gfp_mask & __GFP_IO))
63eb6b93ce725e Hugh Dickins 2008-11-19 1282 goto keep_locked;
d4b4084ac3154c Matthew Wilcox (Oracle 2022-02-04 1283) if (folio_maybe_dma_pinned(folio))
feb889fb40fafc Linus Torvalds 2021-01-16 1284 goto keep_locked;
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1285) if (folio_test_large(folio)) {
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1286) /* cannot split folio, skip it */
8710f6ed34e7bc David Hildenbrand 2024-08-02 1287 if (!can_split_folio(folio, 1, NULL))
b8f593cd0896b8 Ying Huang 2017-07-06 1288 goto activate_locked;
747552b1e71b40 Ying Huang 2017-07-06 1289 /*
5ed890ce514785 Ryan Roberts 2024-04-08 1290 * Split partially mapped folios right away.
5ed890ce514785 Ryan Roberts 2024-04-08 1291 * We can free the unmapped pages without IO.
747552b1e71b40 Ying Huang 2017-07-06 1292 */
8422acdc97ed58 Usama Arif 2024-08-30 1293 if (data_race(!list_empty(&folio->_deferred_list) &&
8422acdc97ed58 Usama Arif 2024-08-30 1294 folio_test_partially_mapped(folio)) &&
5ed890ce514785 Ryan Roberts 2024-04-08 1295 split_folio_to_list(folio, folio_list))
747552b1e71b40 Ying Huang 2017-07-06 1296 goto activate_locked;
747552b1e71b40 Ying Huang 2017-07-06 1297 }
7d14492199f93c Kairui Song 2025-10-24 @1298 if (folio_alloc_swap(folio)) {
d0f048ac39f6a7 Barry Song 2024-04-12 1299 int __maybe_unused order = folio_order(folio);
d0f048ac39f6a7 Barry Song 2024-04-12 1300
09c02e56327bda Matthew Wilcox (Oracle 2022-05-12 1301) if (!folio_test_large(folio))
98879b3b9edc16 Yang Shi 2019-07-11 1302 goto activate_locked_split;
bd4c82c22c367e Ying Huang 2017-09-06 1303 /* Fallback to swap normal pages */
5ed890ce514785 Ryan Roberts 2024-04-08 1304 if (split_folio_to_list(folio, folio_list))
0f0746589e4be0 Minchan Kim 2017-07-06 1305 goto activate_locked;
fe490cc0fe9e6e Ying Huang 2017-09-06 1306 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
5ed890ce514785 Ryan Roberts 2024-04-08 1307 if (nr_pages >= HPAGE_PMD_NR) {
5ed890ce514785 Ryan Roberts 2024-04-08 1308 count_memcg_folio_events(folio,
5ed890ce514785 Ryan Roberts 2024-04-08 1309 THP_SWPOUT_FALLBACK, 1);
fe490cc0fe9e6e Ying Huang 2017-09-06 1310 count_vm_event(THP_SWPOUT_FALLBACK);
5ed890ce514785 Ryan Roberts 2024-04-08 1311 }
fe490cc0fe9e6e Ying Huang 2017-09-06 1312 #endif
e26060d1fbd31a Kanchana P Sridhar 2024-10-02 1313 count_mthp_stat(order, MTHP_STAT_SWPOUT_FALLBACK);
7d14492199f93c Kairui Song 2025-10-24 1314 if (folio_alloc_swap(folio))
98879b3b9edc16 Yang Shi 2019-07-11 1315 goto activate_locked_split;
0f0746589e4be0 Minchan Kim 2017-07-06 1316 }
b487a2da3575b6 Kairui Song 2025-03-14 1317 /*
b487a2da3575b6 Kairui Song 2025-03-14 1318 * Normally the folio will be dirtied in unmap because its
b487a2da3575b6 Kairui Song 2025-03-14 1319 * pte should be dirty. A special case is MADV_FREE page. The
b487a2da3575b6 Kairui Song 2025-03-14 1320 * page's pte could have dirty bit cleared but the folio's
b487a2da3575b6 Kairui Song 2025-03-14 1321 * SwapBacked flag is still set because clearing the dirty bit
b487a2da3575b6 Kairui Song 2025-03-14 1322 * and SwapBacked flag has no lock protected. For such folio,
b487a2da3575b6 Kairui Song 2025-03-14 1323 * unmap will not set dirty bit for it, so folio reclaim will
b487a2da3575b6 Kairui Song 2025-03-14 1324 * not write the folio out. This can cause data corruption when
b487a2da3575b6 Kairui Song 2025-03-14 1325 * the folio is swapped in later. Always setting the dirty flag
b487a2da3575b6 Kairui Song 2025-03-14 1326 * for the folio solves the problem.
b487a2da3575b6 Kairui Song 2025-03-14 1327 */
b487a2da3575b6 Kairui Song 2025-03-14 1328 folio_mark_dirty(folio);
bd4c82c22c367e Ying Huang 2017-09-06 1329 }
e2be15f6c3eece Mel Gorman 2013-07-03 1330 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1331
98879b3b9edc16 Yang Shi 2019-07-11 1332 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1333) * If the folio was split above, the tail pages will make
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1334) * their own pass through this function and be accounted
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1335) * then.
98879b3b9edc16 Yang Shi 2019-07-11 1336 */
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1337) if ((nr_pages > 1) && !folio_test_large(folio)) {
98879b3b9edc16 Yang Shi 2019-07-11 1338 sc->nr_scanned -= (nr_pages - 1);
98879b3b9edc16 Yang Shi 2019-07-11 1339 nr_pages = 1;
98879b3b9edc16 Yang Shi 2019-07-11 1340 }
98879b3b9edc16 Yang Shi 2019-07-11 1341
^1da177e4c3f41 Linus Torvalds 2005-04-16 1342 /*
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1343) * The folio is mapped into the page tables of one or more
^1da177e4c3f41 Linus Torvalds 2005-04-16 1344 * processes. Try to unmap it here.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1345 */
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1346) if (folio_mapped(folio)) {
013339df116c2e Shakeel Butt 2020-12-14 1347 enum ttu_flags flags = TTU_BATCH_FLUSH;
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1348) bool was_swapbacked = folio_test_swapbacked(folio);
bd4c82c22c367e Ying Huang 2017-09-06 1349
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1350) if (folio_test_pmd_mappable(folio))
bd4c82c22c367e Ying Huang 2017-09-06 1351 flags |= TTU_SPLIT_HUGE_PMD;
73bc32875ee9b1 Barry Song 2024-03-06 1352 /*
73bc32875ee9b1 Barry Song 2024-03-06 1353 * Without TTU_SYNC, try_to_unmap will only begin to
73bc32875ee9b1 Barry Song 2024-03-06 1354 * hold PTL from the first present PTE within a large
73bc32875ee9b1 Barry Song 2024-03-06 1355 * folio. Some initial PTEs might be skipped due to
73bc32875ee9b1 Barry Song 2024-03-06 1356 * races with parallel PTE writes in which PTEs can be
73bc32875ee9b1 Barry Song 2024-03-06 1357 * cleared temporarily before being written new present
73bc32875ee9b1 Barry Song 2024-03-06 1358 * values. This will lead to a large folio is still
73bc32875ee9b1 Barry Song 2024-03-06 1359 * mapped while some subpages have been partially
73bc32875ee9b1 Barry Song 2024-03-06 1360 * unmapped after try_to_unmap; TTU_SYNC helps
73bc32875ee9b1 Barry Song 2024-03-06 1361 * try_to_unmap acquire PTL from the first PTE,
73bc32875ee9b1 Barry Song 2024-03-06 1362 * eliminating the influence of temporary PTE values.
73bc32875ee9b1 Barry Song 2024-03-06 1363 */
e5a119c4a6835a Barry Song 2024-06-30 1364 if (folio_test_large(folio))
73bc32875ee9b1 Barry Song 2024-03-06 1365 flags |= TTU_SYNC;
1f318a9b0dc399 Jaewon Kim 2020-06-03 1366
869f7ee6f64773 Matthew Wilcox (Oracle 2022-02-15 1367) try_to_unmap(folio, flags);
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1368) if (folio_mapped(folio)) {
98879b3b9edc16 Yang Shi 2019-07-11 1369 stat->nr_unmap_fail += nr_pages;
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1370) if (!was_swapbacked &&
1bee2c1677bcb5 Matthew Wilcox (Oracle 2022-05-12 1371) folio_test_swapbacked(folio))
1f318a9b0dc399 Jaewon Kim 2020-06-03 1372 stat->nr_lazyfree_fail += nr_pages;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1373 goto activate_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1374 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1375 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1376
d824ec2a154677 Jan Kara 2023-04-28 1377 /*
d824ec2a154677 Jan Kara 2023-04-28 1378 * Folio is unmapped now so it cannot be newly pinned anymore.
d824ec2a154677 Jan Kara 2023-04-28 1379 * No point in trying to reclaim folio if it is pinned.
d824ec2a154677 Jan Kara 2023-04-28 1380 * Furthermore we don't want to reclaim underlying fs metadata
d824ec2a154677 Jan Kara 2023-04-28 1381 * if the folio is pinned and thus potentially modified by the
d824ec2a154677 Jan Kara 2023-04-28 1382 * pinning process as that may upset the filesystem.
d824ec2a154677 Jan Kara 2023-04-28 1383 */
d824ec2a154677 Jan Kara 2023-04-28 1384 if (folio_maybe_dma_pinned(folio))
d824ec2a154677 Jan Kara 2023-04-28 1385 goto activate_locked;
d824ec2a154677 Jan Kara 2023-04-28 1386
5441d4902f9692 Matthew Wilcox (Oracle 2022-05-12 1387) mapping = folio_mapping(folio);
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1388) if (folio_test_dirty(folio)) {
e2a80749555d73 Baolin Wang 2025-10-17 1389 if (folio_is_file_lru(folio)) {
49ea7eb65e7c50 Mel Gorman 2011-10-31 1390 /*
49ea7eb65e7c50 Mel Gorman 2011-10-31 1391 * Immediately reclaim when written back.
5a9e34747c9f73 Vishal Moola (Oracle 2022-12-21 1392) * Similar in principle to folio_deactivate()
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1393) * except we already have the folio isolated
49ea7eb65e7c50 Mel Gorman 2011-10-31 1394 * and know it's dirty
49ea7eb65e7c50 Mel Gorman 2011-10-31 1395 */
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1396) node_stat_mod_folio(folio, NR_VMSCAN_IMMEDIATE,
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1397) nr_pages);
e2a80749555d73 Baolin Wang 2025-10-17 1398 if (!folio_test_reclaim(folio))
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1399) folio_set_reclaim(folio);
49ea7eb65e7c50 Mel Gorman 2011-10-31 1400
c55e8d035b28b2 Johannes Weiner 2017-02-24 1401 goto activate_locked;
ee72886d8ed5d9 Mel Gorman 2011-10-31 1402 }
ee72886d8ed5d9 Mel Gorman 2011-10-31 1403
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1404) if (references == FOLIOREF_RECLAIM_CLEAN)
^1da177e4c3f41 Linus Torvalds 2005-04-16 1405 goto keep_locked;
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1406) if (!may_enter_fs(folio, sc->gfp_mask))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1407 goto keep_locked;
52a8363eae3872 Christoph Lameter 2006-02-01 1408 if (!sc->may_writepage)
^1da177e4c3f41 Linus Torvalds 2005-04-16 1409 goto keep_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1410
d950c9477d51f0 Mel Gorman 2015-09-04 1411 /*
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1412) * Folio is dirty. Flush the TLB if a writable entry
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1413) * potentially exists to avoid CPU writes after I/O
d950c9477d51f0 Mel Gorman 2015-09-04 1414 * starts and then write it out here.
d950c9477d51f0 Mel Gorman 2015-09-04 1415 */
d950c9477d51f0 Mel Gorman 2015-09-04 1416 try_to_unmap_flush_dirty();
809bc86517cc40 Baolin Wang 2024-08-12 1417 switch (pageout(folio, mapping, &plug, folio_list)) {
^1da177e4c3f41 Linus Torvalds 2005-04-16 1418 case PAGE_KEEP:
^1da177e4c3f41 Linus Torvalds 2005-04-16 1419 goto keep_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1420 case PAGE_ACTIVATE:
809bc86517cc40 Baolin Wang 2024-08-12 1421 /*
809bc86517cc40 Baolin Wang 2024-08-12 1422 * If shmem folio is split when writeback to swap,
809bc86517cc40 Baolin Wang 2024-08-12 1423 * the tail pages will make their own pass through
809bc86517cc40 Baolin Wang 2024-08-12 1424 * this function and be accounted then.
809bc86517cc40 Baolin Wang 2024-08-12 1425 */
809bc86517cc40 Baolin Wang 2024-08-12 1426 if (nr_pages > 1 && !folio_test_large(folio)) {
809bc86517cc40 Baolin Wang 2024-08-12 1427 sc->nr_scanned -= (nr_pages - 1);
809bc86517cc40 Baolin Wang 2024-08-12 1428 nr_pages = 1;
809bc86517cc40 Baolin Wang 2024-08-12 1429 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1430 goto activate_locked;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1431 case PAGE_SUCCESS:
809bc86517cc40 Baolin Wang 2024-08-12 1432 if (nr_pages > 1 && !folio_test_large(folio)) {
809bc86517cc40 Baolin Wang 2024-08-12 1433 sc->nr_scanned -= (nr_pages - 1);
809bc86517cc40 Baolin Wang 2024-08-12 1434 nr_pages = 1;
809bc86517cc40 Baolin Wang 2024-08-12 1435 }
c79b7b96db8b12 Matthew Wilcox (Oracle 2022-01-17 1436) stat->nr_pageout += nr_pages;
96f8bf4fb1dd26 Johannes Weiner 2020-06-03 1437
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1438) if (folio_test_writeback(folio))
41ac1999c3e356 Mel Gorman 2012-05-29 1439 goto keep;
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1440) if (folio_test_dirty(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1441 goto keep;
7d3579e8e61937 KOSAKI Motohiro 2010-10-26 1442
^1da177e4c3f41 Linus Torvalds 2005-04-16 1443 /*
^1da177e4c3f41 Linus Torvalds 2005-04-16 1444 * A synchronous write - probably a ramdisk. Go
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1445) * ahead and try to reclaim the folio.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1446 */
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1447) if (!folio_trylock(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1448 goto keep;
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1449) if (folio_test_dirty(folio) ||
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1450) folio_test_writeback(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1451 goto keep_locked;
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1452) mapping = folio_mapping(folio);
01359eb2013b4b Gustavo A. R. Silva 2020-12-14 1453 fallthrough;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1454 case PAGE_CLEAN:
49bd2bf9679f4a Matthew Wilcox (Oracle 2022-05-12 1455) ; /* try to free the folio below */
^1da177e4c3f41 Linus Torvalds 2005-04-16 1456 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1457 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1458
^1da177e4c3f41 Linus Torvalds 2005-04-16 1459 /*
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1460) * If the folio has buffers, try to free the buffer
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1461) * mappings associated with this folio. If we succeed
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1462) * we try to free the folio as well.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1463 *
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1464) * We do this even if the folio is dirty.
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1465) * filemap_release_folio() does not perform I/O, but it
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1466) * is possible for a folio to have the dirty flag set,
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1467) * but it is actually clean (all its buffers are clean).
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1468) * This happens if the buffers were written out directly,
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1469) * with submit_bh(). ext3 will do this, as well as
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1470) * the blockdev mapping. filemap_release_folio() will
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1471) * discover that cleanness and will drop the buffers
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1472) * and mark the folio clean - it can be freed.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1473 *
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1474) * Rarely, folios can have buffers and no ->mapping.
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1475) * These are the folios which were not successfully
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1476) * invalidated in truncate_cleanup_folio(). We try to
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1477) * drop those buffers here and if that worked, and the
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1478) * folio is no longer mapped into process address space
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1479) * (refcount == 1) it can be freed. Otherwise, leave
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1480) * the folio on the LRU so it is swappable.
^1da177e4c3f41 Linus Torvalds 2005-04-16 1481 */
0201ebf274a306 David Howells 2023-06-28 1482 if (folio_needs_release(folio)) {
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1483) if (!filemap_release_folio(folio, sc->gfp_mask))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1484 goto activate_locked;
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1485) if (!mapping && folio_ref_count(folio) == 1) {
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1486) folio_unlock(folio);
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1487) if (folio_put_testzero(folio))
^1da177e4c3f41 Linus Torvalds 2005-04-16 1488 goto free_it;
e286781d5f2e9c Nicholas Piggin 2008-07-25 1489 else {
e286781d5f2e9c Nicholas Piggin 2008-07-25 1490 /*
e286781d5f2e9c Nicholas Piggin 2008-07-25 1491 * rare race with speculative reference.
e286781d5f2e9c Nicholas Piggin 2008-07-25 1492 * the speculative reference will free
0a36111c8c20b2 Matthew Wilcox (Oracle 2022-05-12 1493) * this folio shortly, so we may
e286781d5f2e9c Nicholas Piggin 2008-07-25 1494 * increment nr_reclaimed here (and
e286781d5f2e9c Nicholas Piggin 2008-07-25 1495 * leave it off the LRU).
e286781d5f2e9c Nicholas Piggin 2008-07-25 1496 */
9aafcffc18785f Miaohe Lin 2022-05-12 1497 nr_reclaimed += nr_pages;
e286781d5f2e9c Nicholas Piggin 2008-07-25 1498 continue;
e286781d5f2e9c Nicholas Piggin 2008-07-25 1499 }
e286781d5f2e9c Nicholas Piggin 2008-07-25 1500 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1501 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1502
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1503) if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
802a3a92ad7ac0 Shaohua Li 2017-05-03 1504 /* follow __remove_mapping for reference */
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1505) if (!folio_ref_freeze(folio, 1))
49d2e9cc454436 Christoph Lameter 2006-01-08 1506 goto keep_locked;
d17be2d9ff6c68 Miaohe Lin 2021-09-02 1507 /*
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1508) * The folio has only one reference left, which is
d17be2d9ff6c68 Miaohe Lin 2021-09-02 1509 * from the isolation. After the caller puts the
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1510) * folio back on the lru and drops the reference, the
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1511) * folio will be freed anyway. It doesn't matter
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1512) * which lru it goes on. So we don't bother checking
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1513) * the dirty flag here.
d17be2d9ff6c68 Miaohe Lin 2021-09-02 1514 */
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1515) count_vm_events(PGLAZYFREED, nr_pages);
64daa5d818ae34 Matthew Wilcox (Oracle 2022-05-12 1516) count_memcg_folio_events(folio, PGLAZYFREED, nr_pages);
be7c07d60e13ac Matthew Wilcox (Oracle 2021-12-23 1517) } else if (!mapping || !__remove_mapping(mapping, folio, true,
b910718a948a91 Johannes Weiner 2019-11-30 1518 sc->target_mem_cgroup))
802a3a92ad7ac0 Shaohua Li 2017-05-03 1519 goto keep_locked;
9a1ea439b16b92 Hugh Dickins 2018-12-28 1520
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1521) folio_unlock(folio);
e286781d5f2e9c Nicholas Piggin 2008-07-25 1522 free_it:
98879b3b9edc16 Yang Shi 2019-07-11 1523 /*
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1524) * Folio may get swapped out as a whole, need to account
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1525) * all pages in it.
98879b3b9edc16 Yang Shi 2019-07-11 1526 */
98879b3b9edc16 Yang Shi 2019-07-11 1527 nr_reclaimed += nr_pages;
abe4c3b50c3f25 Mel Gorman 2010-08-09 1528
f8f931bba0f920 Hugh Dickins 2024-10-27 1529 folio_unqueue_deferred_split(folio);
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1530) if (folio_batch_add(&free_folios, folio) == 0) {
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1531) mem_cgroup_uncharge_folios(&free_folios);
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1532) try_to_unmap_flush();
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1533) free_unref_folios(&free_folios);
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1534) }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1535 continue;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1536
98879b3b9edc16 Yang Shi 2019-07-11 1537 activate_locked_split:
98879b3b9edc16 Yang Shi 2019-07-11 1538 /*
98879b3b9edc16 Yang Shi 2019-07-11 1539 * The tail pages that are failed to add into swap cache
98879b3b9edc16 Yang Shi 2019-07-11 1540 * reach here. Fixup nr_scanned and nr_pages.
98879b3b9edc16 Yang Shi 2019-07-11 1541 */
98879b3b9edc16 Yang Shi 2019-07-11 1542 if (nr_pages > 1) {
98879b3b9edc16 Yang Shi 2019-07-11 1543 sc->nr_scanned -= (nr_pages - 1);
98879b3b9edc16 Yang Shi 2019-07-11 1544 nr_pages = 1;
98879b3b9edc16 Yang Shi 2019-07-11 1545 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1546 activate_locked:
68a22394c286a2 Rik van Riel 2008-10-18 1547 /* Not a candidate for swapping, so reclaim swap space. */
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1548) if (folio_test_swapcache(folio) &&
9202d527b715f6 Matthew Wilcox (Oracle 2022-09-02 1549) (mem_cgroup_swap_full(folio) || folio_test_mlocked(folio)))
bdb0ed54a4768d Matthew Wilcox (Oracle 2022-09-02 1550) folio_free_swap(folio);
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1551) VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1552) if (!folio_test_mlocked(folio)) {
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1553) int type = folio_is_file_lru(folio);
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1554) folio_set_active(folio);
98879b3b9edc16 Yang Shi 2019-07-11 1555 stat->nr_activate[type] += nr_pages;
246b648038096c Matthew Wilcox (Oracle 2022-05-12 1556) count_memcg_folio_events(folio, PGACTIVATE, nr_pages);
ad6b67041a4549 Minchan Kim 2017-05-03 1557 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1558 keep_locked:
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1559) folio_unlock(folio);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1560 keep:
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1561) list_add(&folio->lru, &ret_folios);
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1562) VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1563) folio_test_unevictable(folio), folio);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1564 }
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1565) /* 'folio_list' is always empty here */
26aa2d199d6f2c Dave Hansen 2021-09-02 1566
c28a0e9695b724 Matthew Wilcox (Oracle 2022-05-12 1567) /* Migrate folios selected for demotion */
a479b078fddb0a Li Zhijian 2025-01-10 1568 nr_demoted = demote_folio_list(&demote_folios, pgdat);
a479b078fddb0a Li Zhijian 2025-01-10 1569 nr_reclaimed += nr_demoted;
a479b078fddb0a Li Zhijian 2025-01-10 1570 stat->nr_demoted += nr_demoted;
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1571) /* Folios that could not be demoted are still in @demote_folios */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1572) if (!list_empty(&demote_folios)) {
6b426d071419a4 Mina Almasry 2022-12-01 1573 /* Folios which weren't demoted go back on @folio_list */
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1574) list_splice_init(&demote_folios, folio_list);
6b426d071419a4 Mina Almasry 2022-12-01 1575
6b426d071419a4 Mina Almasry 2022-12-01 1576 /*
6b426d071419a4 Mina Almasry 2022-12-01 1577 * goto retry to reclaim the undemoted folios in folio_list if
6b426d071419a4 Mina Almasry 2022-12-01 1578 * desired.
6b426d071419a4 Mina Almasry 2022-12-01 1579 *
6b426d071419a4 Mina Almasry 2022-12-01 1580 * Reclaiming directly from top tier nodes is not often desired
6b426d071419a4 Mina Almasry 2022-12-01 1581 * due to it breaking the LRU ordering: in general memory
6b426d071419a4 Mina Almasry 2022-12-01 1582 * should be reclaimed from lower tier nodes and demoted from
6b426d071419a4 Mina Almasry 2022-12-01 1583 * top tier nodes.
6b426d071419a4 Mina Almasry 2022-12-01 1584 *
6b426d071419a4 Mina Almasry 2022-12-01 1585 * However, disabling reclaim from top tier nodes entirely
6b426d071419a4 Mina Almasry 2022-12-01 1586 * would cause ooms in edge scenarios where lower tier memory
6b426d071419a4 Mina Almasry 2022-12-01 1587 * is unreclaimable for whatever reason, eg memory being
6b426d071419a4 Mina Almasry 2022-12-01 1588 * mlocked or too hot to reclaim. We can disable reclaim
6b426d071419a4 Mina Almasry 2022-12-01 1589 * from top tier nodes in proactive reclaim though as that is
6b426d071419a4 Mina Almasry 2022-12-01 1590 * not real memory pressure.
6b426d071419a4 Mina Almasry 2022-12-01 1591 */
6b426d071419a4 Mina Almasry 2022-12-01 1592 if (!sc->proactive) {
26aa2d199d6f2c Dave Hansen 2021-09-02 1593 do_demote_pass = false;
26aa2d199d6f2c Dave Hansen 2021-09-02 1594 goto retry;
26aa2d199d6f2c Dave Hansen 2021-09-02 1595 }
6b426d071419a4 Mina Almasry 2022-12-01 1596 }
abe4c3b50c3f25 Mel Gorman 2010-08-09 1597
98879b3b9edc16 Yang Shi 2019-07-11 1598 pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
98879b3b9edc16 Yang Shi 2019-07-11 1599
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1600) mem_cgroup_uncharge_folios(&free_folios);
72b252aed506b8 Mel Gorman 2015-09-04 1601 try_to_unmap_flush();
bc2ff4cbc3294c Matthew Wilcox (Oracle 2024-02-27 1602) free_unref_folios(&free_folios);
abe4c3b50c3f25 Mel Gorman 2010-08-09 1603
49fd9b6df54e61 Matthew Wilcox (Oracle 2022-09-02 1604) list_splice(&ret_folios, folio_list);
886cf1901db962 Kirill Tkhai 2019-05-13 1605 count_vm_events(PGACTIVATE, pgactivate);
060f005f074791 Kirill Tkhai 2019-03-05 1606
2282679fb20bf0 NeilBrown 2022-05-09 1607 if (plug)
2282679fb20bf0 NeilBrown 2022-05-09 1608 swap_write_unplug(plug);
05ff51376f01fd Andrew Morton 2006-03-22 1609 return nr_reclaimed;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1610 }
^1da177e4c3f41 Linus Torvalds 2005-04-16 1611
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 19:25 ` kernel test robot
@ 2025-10-30 5:25 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-30 5:25 UTC (permalink / raw)
To: kernel test robot
Cc: linux-mm, oe-kbuild-all, Andrew Morton, Baoquan He, Barry Song,
Chris Li, Nhat Pham, Johannes Weiner, Yosry Ahmed,
David Hildenbrand, Youngjun Park, Hugh Dickins, Baolin Wang,
Huang, Ying, Kemeng Shi, Lorenzo Stoakes, Matthew Wilcox (Oracle),
linux-kernel
On Thu, Oct 30, 2025 at 3:30 AM kernel test robot <lkp@intel.com> wrote:
>
> Hi Kairui,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on f30d294530d939fa4b77d61bc60f25c4284841fa]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
> base: f30d294530d939fa4b77d61bc60f25c4284841fa
> patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-14-3d43f3b6ec32%40tencent.com
> patch subject: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
> config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/config)
> compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300316.UL4gxAlC-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202510300316.UL4gxAlC-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> In file included from mm/vmscan.c:70:
> mm/swap.h: In function 'swap_cache_add_folio':
> mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
> 465 | }
> | ^
> mm/vmscan.c: In function 'shrink_folio_list':
> >> mm/vmscan.c:1298:37: error: too few arguments to function 'folio_alloc_swap'
> 1298 | if (folio_alloc_swap(folio)) {
> | ^~~~~~~~~~~~~~~~
> mm/swap.h:388:19: note: declared here
> 388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> | ^~~~~~~~~~~~~~~~
> mm/vmscan.c:1314:45: error: too few arguments to function 'folio_alloc_swap'
> 1314 | if (folio_alloc_swap(folio))
> | ^~~~~~~~~~~~~~~~
> mm/swap.h:388:19: note: declared here
> 388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> | ^~~~~~~~~~~~~~~~
> --
> In file included from mm/shmem.c:44:
> mm/swap.h: In function 'swap_cache_add_folio':
> mm/swap.h:465:1: warning: no return statement in function returning non-void [-Wreturn-type]
> 465 | }
> | ^
> mm/shmem.c: In function 'shmem_writeout':
> >> mm/shmem.c:1649:14: error: too few arguments to function 'folio_alloc_swap'
> 1649 | if (!folio_alloc_swap(folio)) {
> | ^~~~~~~~~~~~~~~~
> mm/swap.h:388:19: note: declared here
> 388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
> | ^~~~~~~~~~~~~~~~
>
Thanks, I forgot to update the empty placeholder for folio_alloc_swap
during rebase:
diff --git a/mm/swap.h b/mm/swap.h
index 74c61129d7b7..9aa99061573a 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -385,7 +385,7 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
return NULL;
}
-static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
+static inline int folio_alloc_swap(struct folio *folio)
{
return -EINVAL;
}
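Put together with the original hunk, the !CONFIG_SWAP stubs then read roughly as follows (a reconstruction for reference only; the fixup changes just the folio_alloc_swap() signature):

/* mm/swap.h, !CONFIG_SWAP stubs after the fixup (reconstructed sketch) */
static inline int folio_alloc_swap(struct folio *folio)
{
	return -EINVAL;
}

static inline int folio_dup_swap(struct folio *folio, struct page *page)
{
	return -EINVAL;
}

static inline void folio_put_swap(struct folio *folio, struct page *page)
{
}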
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
2025-10-29 19:25 ` kernel test robot
@ 2025-10-29 19:25 ` kernel test robot
2025-11-01 4:51 ` YoungJun Park
2 siblings, 0 replies; 50+ messages in thread
From: kernel test robot @ 2025-10-29 19:25 UTC (permalink / raw)
To: Kairui Song, linux-mm
Cc: llvm, oe-kbuild-all, Andrew Morton, Linux Memory Management List,
Baoquan He, Barry Song, Chris Li, Nhat Pham, Johannes Weiner,
Yosry Ahmed, David Hildenbrand, Youngjun Park, Hugh Dickins,
Baolin Wang, Huang, Ying, Kemeng Shi, Lorenzo Stoakes,
Matthew Wilcox (Oracle), linux-kernel, Kairui Song
Hi Kairui,
kernel test robot noticed the following build errors:
[auto build test ERROR on f30d294530d939fa4b77d61bc60f25c4284841fa]
url: https://github.com/intel-lab-lkp/linux/commits/Kairui-Song/mm-swap-rename-__read_swap_cache_async-to-swap_cache_alloc_folio/20251030-000506
base: f30d294530d939fa4b77d61bc60f25c4284841fa
patch link: https://lore.kernel.org/r/20251029-swap-table-p2-v1-14-3d43f3b6ec32%40tencent.com
patch subject: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20251030/202510300341.cOYqY4ki-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251030/202510300341.cOYqY4ki-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510300341.cOYqY4ki-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from mm/shmem.c:44:
mm/swap.h:465:1: warning: non-void function does not return a value [-Wreturn-type]
465 | }
| ^
>> mm/shmem.c:1649:29: error: too few arguments to function call, expected 2, have 1
1649 | if (!folio_alloc_swap(folio)) {
| ~~~~~~~~~~~~~~~~ ^
mm/swap.h:388:19: note: 'folio_alloc_swap' declared here
388 | static inline int folio_alloc_swap(struct folio *folio, gfp_t gfp)
| ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning and 1 error generated.
vim +1649 mm/shmem.c
^1da177e4c3f41 Linus Torvalds 2005-04-16 1563
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1564) /**
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1565) * shmem_writeout - Write the folio to swap
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1566) * @folio: The folio to write
44b1b073eb3614 Christoph Hellwig 2025-06-10 1567 * @plug: swap plug
44b1b073eb3614 Christoph Hellwig 2025-06-10 1568 * @folio_list: list to put back folios on split
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1569) *
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1570) * Move the folio from the page cache to the swap cache.
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1571) */
44b1b073eb3614 Christoph Hellwig 2025-06-10 1572 int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
44b1b073eb3614 Christoph Hellwig 2025-06-10 1573 struct list_head *folio_list)
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1574) {
8ccee8c19c605a Luis Chamberlain 2023-03-09 1575 struct address_space *mapping = folio->mapping;
8ccee8c19c605a Luis Chamberlain 2023-03-09 1576 struct inode *inode = mapping->host;
8ccee8c19c605a Luis Chamberlain 2023-03-09 1577 struct shmem_inode_info *info = SHMEM_I(inode);
2c6efe9cf2d784 Luis Chamberlain 2023-03-09 1578 struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
6922c0c7abd387 Hugh Dickins 2011-08-03 1579 pgoff_t index;
650180760be6bb Baolin Wang 2024-08-12 1580 int nr_pages;
809bc86517cc40 Baolin Wang 2024-08-12 1581 bool split = false;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1582
adae46ac1e38a2 Ricardo Cañuelo Navarro 2025-02-26 1583 if ((info->flags & VM_LOCKED) || sbinfo->noswap)
9a976f0c847b67 Luis Chamberlain 2023-03-09 1584 goto redirty;
9a976f0c847b67 Luis Chamberlain 2023-03-09 1585
9a976f0c847b67 Luis Chamberlain 2023-03-09 1586 if (!total_swap_pages)
9a976f0c847b67 Luis Chamberlain 2023-03-09 1587 goto redirty;
9a976f0c847b67 Luis Chamberlain 2023-03-09 1588
1e6decf30af5c5 Hugh Dickins 2021-09-02 1589 /*
809bc86517cc40 Baolin Wang 2024-08-12 1590 * If CONFIG_THP_SWAP is not enabled, the large folio should be
809bc86517cc40 Baolin Wang 2024-08-12 1591 * split when swapping.
809bc86517cc40 Baolin Wang 2024-08-12 1592 *
809bc86517cc40 Baolin Wang 2024-08-12 1593 * And shrinkage of pages beyond i_size does not split swap, so
809bc86517cc40 Baolin Wang 2024-08-12 1594 * swapout of a large folio crossing i_size needs to split too
809bc86517cc40 Baolin Wang 2024-08-12 1595 * (unless fallocate has been used to preallocate beyond EOF).
1e6decf30af5c5 Hugh Dickins 2021-09-02 1596 */
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1597) if (folio_test_large(folio)) {
809bc86517cc40 Baolin Wang 2024-08-12 1598 index = shmem_fallocend(inode,
809bc86517cc40 Baolin Wang 2024-08-12 1599 DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE));
809bc86517cc40 Baolin Wang 2024-08-12 1600 if ((index > folio->index && index < folio_next_index(folio)) ||
809bc86517cc40 Baolin Wang 2024-08-12 1601 !IS_ENABLED(CONFIG_THP_SWAP))
809bc86517cc40 Baolin Wang 2024-08-12 1602 split = true;
809bc86517cc40 Baolin Wang 2024-08-12 1603 }
809bc86517cc40 Baolin Wang 2024-08-12 1604
809bc86517cc40 Baolin Wang 2024-08-12 1605 if (split) {
809bc86517cc40 Baolin Wang 2024-08-12 1606 try_split:
1e6decf30af5c5 Hugh Dickins 2021-09-02 1607 /* Ensure the subpages are still dirty */
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1608) folio_test_set_dirty(folio);
44b1b073eb3614 Christoph Hellwig 2025-06-10 1609 if (split_folio_to_list(folio, folio_list))
1e6decf30af5c5 Hugh Dickins 2021-09-02 1610 goto redirty;
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1611) folio_clear_dirty(folio);
1e6decf30af5c5 Hugh Dickins 2021-09-02 1612 }
1e6decf30af5c5 Hugh Dickins 2021-09-02 1613
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1614) index = folio->index;
650180760be6bb Baolin Wang 2024-08-12 1615 nr_pages = folio_nr_pages(folio);
1635f6a74152f1 Hugh Dickins 2012-05-29 1616
1635f6a74152f1 Hugh Dickins 2012-05-29 1617 /*
1635f6a74152f1 Hugh Dickins 2012-05-29 1618 * This is somewhat ridiculous, but without plumbing a SWAP_MAP_FALLOC
1635f6a74152f1 Hugh Dickins 2012-05-29 1619 * value into swapfile.c, the only way we can correctly account for a
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1620) * fallocated folio arriving here is now to initialize it and write it.
1aac1400319d30 Hugh Dickins 2012-05-29 1621 *
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1622) * That's okay for a folio already fallocated earlier, but if we have
1aac1400319d30 Hugh Dickins 2012-05-29 1623 * not yet completed the fallocation, then (a) we want to keep track
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1624) * of this folio in case we have to undo it, and (b) it may not be a
1aac1400319d30 Hugh Dickins 2012-05-29 1625 * good idea to continue anyway, once we're pushing into swap. So
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1626) * reactivate the folio, and let shmem_fallocate() quit when too many.
1635f6a74152f1 Hugh Dickins 2012-05-29 1627 */
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1628) if (!folio_test_uptodate(folio)) {
1aac1400319d30 Hugh Dickins 2012-05-29 1629 if (inode->i_private) {
1aac1400319d30 Hugh Dickins 2012-05-29 1630 struct shmem_falloc *shmem_falloc;
1aac1400319d30 Hugh Dickins 2012-05-29 1631 spin_lock(&inode->i_lock);
1aac1400319d30 Hugh Dickins 2012-05-29 1632 shmem_falloc = inode->i_private;
1aac1400319d30 Hugh Dickins 2012-05-29 1633 if (shmem_falloc &&
8e205f779d1443 Hugh Dickins 2014-07-23 1634 !shmem_falloc->waitq &&
1aac1400319d30 Hugh Dickins 2012-05-29 1635 index >= shmem_falloc->start &&
1aac1400319d30 Hugh Dickins 2012-05-29 1636 index < shmem_falloc->next)
d77b90d2b26426 Baolin Wang 2024-12-19 1637 shmem_falloc->nr_unswapped += nr_pages;
1aac1400319d30 Hugh Dickins 2012-05-29 1638 else
1aac1400319d30 Hugh Dickins 2012-05-29 1639 shmem_falloc = NULL;
1aac1400319d30 Hugh Dickins 2012-05-29 1640 spin_unlock(&inode->i_lock);
1aac1400319d30 Hugh Dickins 2012-05-29 1641 if (shmem_falloc)
1aac1400319d30 Hugh Dickins 2012-05-29 1642 goto redirty;
1aac1400319d30 Hugh Dickins 2012-05-29 1643 }
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1644) folio_zero_range(folio, 0, folio_size(folio));
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1645) flush_dcache_folio(folio);
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1646) folio_mark_uptodate(folio);
1635f6a74152f1 Hugh Dickins 2012-05-29 1647 }
1635f6a74152f1 Hugh Dickins 2012-05-29 1648
7d14492199f93c Kairui Song 2025-10-24 @1649 if (!folio_alloc_swap(folio)) {
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1650 bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1651 int error;
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1652
b1dea800ac3959 Hugh Dickins 2011-05-11 1653 /*
b1dea800ac3959 Hugh Dickins 2011-05-11 1654 * Add inode to shmem_unuse()'s list of swapped-out inodes,
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1655) * if it's not already there. Do it now before the folio is
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1656 * removed from page cache, when its pagelock no longer
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1657 * protects the inode from eviction. And do it now, after
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1658 * we've incremented swapped, because shmem_unuse() will
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1659 * prune a !swapped inode from the swaplist.
b1dea800ac3959 Hugh Dickins 2011-05-11 1660 */
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1661 if (first_swapped) {
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1662 spin_lock(&shmem_swaplist_lock);
05bf86b4ccfd0f Hugh Dickins 2011-05-14 1663 if (list_empty(&info->swaplist))
b56a2d8af9147a Vineeth Remanan Pillai 2019-03-05 1664 list_add(&info->swaplist, &shmem_swaplist);
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1665 spin_unlock(&shmem_swaplist_lock);
ea693aaa5ce5ad Hugh Dickins 2025-07-16 1666 }
b1dea800ac3959 Hugh Dickins 2011-05-11 1667
80d6ed40156385 Kairui Song 2025-10-29 1668 folio_dup_swap(folio, NULL);
b487a2da3575b6 Kairui Song 2025-03-14 1669 shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
267a4c76bbdb95 Hugh Dickins 2015-12-11 1670
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1671) BUG_ON(folio_mapped(folio));
6344a6d9ce13ae Hugh Dickins 2025-07-16 1672 error = swap_writeout(folio, plug);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1673 if (error != AOP_WRITEPAGE_ACTIVATE) {
6344a6d9ce13ae Hugh Dickins 2025-07-16 1674 /* folio has been unlocked */
6344a6d9ce13ae Hugh Dickins 2025-07-16 1675 return error;
6344a6d9ce13ae Hugh Dickins 2025-07-16 1676 }
6344a6d9ce13ae Hugh Dickins 2025-07-16 1677
6344a6d9ce13ae Hugh Dickins 2025-07-16 1678 /*
6344a6d9ce13ae Hugh Dickins 2025-07-16 1679 * The intention here is to avoid holding on to the swap when
6344a6d9ce13ae Hugh Dickins 2025-07-16 1680 * zswap was unable to compress and unable to writeback; but
6344a6d9ce13ae Hugh Dickins 2025-07-16 1681 * it will be appropriate if other reactivate cases are added.
6344a6d9ce13ae Hugh Dickins 2025-07-16 1682 */
6344a6d9ce13ae Hugh Dickins 2025-07-16 1683 error = shmem_add_to_page_cache(folio, mapping, index,
6344a6d9ce13ae Hugh Dickins 2025-07-16 1684 swp_to_radix_entry(folio->swap),
6344a6d9ce13ae Hugh Dickins 2025-07-16 1685 __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1686 /* Swap entry might be erased by racing shmem_free_swap() */
6344a6d9ce13ae Hugh Dickins 2025-07-16 1687 if (!error) {
6344a6d9ce13ae Hugh Dickins 2025-07-16 1688 shmem_recalc_inode(inode, 0, -nr_pages);
80d6ed40156385 Kairui Song 2025-10-29 1689 folio_put_swap(folio, NULL);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1690 }
6344a6d9ce13ae Hugh Dickins 2025-07-16 1691
6344a6d9ce13ae Hugh Dickins 2025-07-16 1692 /*
fd8d4f862f8c27 Kairui Song 2025-09-17 1693 * The swap_cache_del_folio() below could be left for
6344a6d9ce13ae Hugh Dickins 2025-07-16 1694 * shrink_folio_list()'s folio_free_swap() to dispose of;
6344a6d9ce13ae Hugh Dickins 2025-07-16 1695 * but I'm a little nervous about letting this folio out of
6344a6d9ce13ae Hugh Dickins 2025-07-16 1696 * shmem_writeout() in a hybrid half-tmpfs-half-swap state
6344a6d9ce13ae Hugh Dickins 2025-07-16 1697 * e.g. folio_mapping(folio) might give an unexpected answer.
6344a6d9ce13ae Hugh Dickins 2025-07-16 1698 */
fd8d4f862f8c27 Kairui Song 2025-09-17 1699 swap_cache_del_folio(folio);
6344a6d9ce13ae Hugh Dickins 2025-07-16 1700 goto redirty;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1701 }
b487a2da3575b6 Kairui Song 2025-03-14 1702 if (nr_pages > 1)
b487a2da3575b6 Kairui Song 2025-03-14 1703 goto try_split;
^1da177e4c3f41 Linus Torvalds 2005-04-16 1704 redirty:
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1705) folio_mark_dirty(folio);
f530ed0e2d01aa Matthew Wilcox (Oracle 2022-09-02 1706) return AOP_WRITEPAGE_ACTIVATE; /* Return with folio locked */
^1da177e4c3f41 Linus Torvalds 2005-04-16 1707 }
7b73c12c6ebf00 Matthew Wilcox (Oracle 2025-04-02 1708) EXPORT_SYMBOL_GPL(shmem_writeout);
^1da177e4c3f41 Linus Torvalds 2005-04-16 1709
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
2025-10-29 19:25 ` kernel test robot
2025-10-29 19:25 ` kernel test robot
@ 2025-11-01 4:51 ` YoungJun Park
2025-11-01 8:59 ` Kairui Song
2 siblings, 1 reply; 50+ messages in thread
From: YoungJun Park @ 2025-11-01 4:51 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:40PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
Hello Kairui!
> The current swap entry allocation/freeing workflow has never had a clear
> definition. This makes it hard to debug or add new optimizations.
>
> This commit introduces a proper definition of how swap entries would be
> allocated and freed. Now, most operations are folio based, so they will
> never exceed one swap cluster, and we now have a cleaner border between
> swap and the rest of mm, making it much easier to follow and debug,
> especially with new added sanity checks. Also making more optimization
> possible.
>
> Swap entry will be mostly allocated and free with a folio bound.
> The folio lock will be useful for resolving many swap ralated races.
>
> Now swap allocation (except hibernation) always starts with a folio in
> the swap cache, and gets duped/freed protected by the folio lock:
>
> - folio_alloc_swap() - The only allocation entry point now.
> Context: The folio must be locked.
> This allocates one or a set of continuous swap slots for a folio and
> binds them to the folio by adding the folio to the swap cache. The
> swap slots' swap count start with zero value.
>
> - folio_dup_swap() - Increase the swap count of one or more entries.
> Context: The folio must be locked and in the swap cache. For now, the
> caller still has to lock the new swap entry owner (e.g., PTL).
> This increases the ref count of swap entries allocated to a folio.
> Newly allocated swap slots' count has to be increased by this helper
> as the folio got unmapped (and swap entries got installed).
>
> - folio_put_swap() - Decrease the swap count of one or more entries.
> Context: The folio must be locked and in the swap cache. For now, the
> caller still has to lock the new swap entry owner (e.g., PTL).
> This decreases the ref count of swap entries allocated to a folio.
> Typically, swapin will decrease the swap count as the folio got
> installed back and the swap entry got uninstalled
>
> This won't remove the folio from the swap cache and free the
> slot. Lazy freeing of swap cache is helpful for reducing IO.
> There is already a folio_free_swap() for immediate cache reclaim.
> This part could be further optimized later.
>
> The above locking constraints could be further relaxed when the swap
> table if fully implemented. Currently dup still needs the caller
> to lock the swap entry container (e.g. PTL), or a concurrent zap
> may underflow the swap count.
>
> Some swap users need to interact with swap count without involving folio
> (e.g. forking/zapping the page table or mapping truncate without swapin).
> In such cases, the caller has to ensure there is no race condition on
> whatever owns the swap count and use the below helpers:
>
> - swap_put_entries_direct() - Decrease the swap count directly.
> Context: The caller must lock whatever is referencing the slots to
> avoid a race.
>
> Typically the page table zapping or shmem mapping truncate will need
> to free swap slots directly. If a slot is cached (has a folio bound),
> this will also try to release the swap cache.
>
> - swap_dup_entry_direct() - Increase the swap count directly.
> Context: The caller must lock whatever is referencing the entries to
> avoid race, and the entries must already have a swap count > 1.
>
> Typically, forking will need to copy the page table and hence needs to
> increase the swap count of the entries in the table. The page table is
> locked while referencing the swap entries, so the entries all have a
> swap count > 1 and can't be freed.
>
> Hibernation subsystem is a bit different, so two special wrappers are here:
>
> - swap_alloc_hibernation_slot() - Allocate one entry from one device.
> - swap_free_hibernation_slot() - Free one entry allocated by the above
> helper.
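Before my question, let me restate my reading of the workflow quoted
above as a small sketch (hypothetical callers, argument lists as used
in the shmem hunks of this series; not literal kernel code):

/* Swap-out: folio locked, allocation binds it to the swap cache */
static int swapout_side_sketch(struct folio *folio)
{
	if (folio_alloc_swap(folio))	/* slots allocated, count == 0 */
		return -ENOMEM;
	folio_dup_swap(folio, NULL);	/* count++ as swap entries get installed */
	return 0;
}

/* Swap-in: folio locked and still in the swap cache */
static void swapin_side_sketch(struct folio *folio)
{
	folio_put_swap(folio, NULL);	/* count-- as the folio is mapped back */
	/* the cache itself is freed lazily, or via folio_free_swap() */
}

/* No folio involved (zap/truncate, fork); caller holds e.g. the PTL */
static void direct_side_sketch(swp_entry_t entry, int nr)
{
	swap_put_entries_direct(entry, nr);
	/* swap_dup_entry_direct() is the fork-side counterpart */
}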
During the code review, I found something that needs to be verified.
It is not directly relevant to your patch, but I'm sending this email
to confirm it and to discuss a possible fix in this patch.
In the swap_alloc_hibernation_slot() function, nr_swap_pages is
decreased, but I think it is already decreased in swap_range_alloc().
nr_swap_pages is decremented along the call flow below:
cluster_alloc_swap_entry -> alloc_swap_scan_cluster
-> cluster_alloc_range -> swap_range_alloc
Introduced on
4f78252da887ee7e9d1875dd6e07d9baa936c04f
mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
#ifdef CONFIG_HIBERNATION
/* Allocate a slot for hibernation */
swp_entry_t swap_alloc_hibernation_slot(int type)
{
....
local_unlock(&percpu_swap_cluster.lock);
if (offset) {
entry = swp_entry(si->type, offset);
atomic_long_dec(&nr_swap_pages); // here
Thank you,
Youngjun Park
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-11-01 4:51 ` YoungJun Park
@ 2025-11-01 8:59 ` Kairui Song
2025-11-01 9:08 ` YoungJun Park
0 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-11-01 8:59 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Sat, Nov 1, 2025 at 12:51 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:40PM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
>
> Hello Kairui!
>
> > The current swap entry allocation/freeing workflow has never had a clear
> > definition. This makes it hard to debug or add new optimizations.
> >
> > This commit introduces a proper definition of how swap entries would be
> > allocated and freed. Now, most operations are folio based, so they will
> > never exceed one swap cluster, and we now have a cleaner border between
> > swap and the rest of mm, making it much easier to follow and debug,
> > especially with new added sanity checks. Also making more optimization
> > possible.
> >
> > Swap entry will be mostly allocated and free with a folio bound.
> > The folio lock will be useful for resolving many swap ralated races.
> >
> > Now swap allocation (except hibernation) always starts with a folio in
> > the swap cache, and gets duped/freed protected by the folio lock:
> >
> > - folio_alloc_swap() - The only allocation entry point now.
> > Context: The folio must be locked.
> > This allocates one or a set of continuous swap slots for a folio and
> > binds them to the folio by adding the folio to the swap cache. The
> > swap slots' swap count start with zero value.
> >
> > - folio_dup_swap() - Increase the swap count of one or more entries.
> > Context: The folio must be locked and in the swap cache. For now, the
> > caller still has to lock the new swap entry owner (e.g., PTL).
> > This increases the ref count of swap entries allocated to a folio.
> > Newly allocated swap slots' count has to be increased by this helper
> > as the folio got unmapped (and swap entries got installed).
> >
> > - folio_put_swap() - Decrease the swap count of one or more entries.
> > Context: The folio must be locked and in the swap cache. For now, the
> > caller still has to lock the new swap entry owner (e.g., PTL).
> > This decreases the ref count of swap entries allocated to a folio.
> > Typically, swapin will decrease the swap count as the folio got
> > installed back and the swap entry got uninstalled
> >
> > This won't remove the folio from the swap cache and free the
> > slot. Lazy freeing of swap cache is helpful for reducing IO.
> > There is already a folio_free_swap() for immediate cache reclaim.
> > This part could be further optimized later.
> >
> > The above locking constraints could be further relaxed when the swap
> > table if fully implemented. Currently dup still needs the caller
> > to lock the swap entry container (e.g. PTL), or a concurrent zap
> > may underflow the swap count.
> >
> > Some swap users need to interact with swap count without involving folio
> > (e.g. forking/zapping the page table or mapping truncate without swapin).
> > In such cases, the caller has to ensure there is no race condition on
> > whatever owns the swap count and use the below helpers:
> >
> > - swap_put_entries_direct() - Decrease the swap count directly.
> > Context: The caller must lock whatever is referencing the slots to
> > avoid a race.
> >
> > Typically the page table zapping or shmem mapping truncate will need
> > to free swap slots directly. If a slot is cached (has a folio bound),
> > this will also try to release the swap cache.
> >
> > - swap_dup_entry_direct() - Increase the swap count directly.
> > Context: The caller must lock whatever is referencing the entries to
> > avoid race, and the entries must already have a swap count > 1.
> >
> > Typically, forking will need to copy the page table and hence needs to
> > increase the swap count of the entries in the table. The page table is
> > locked while referencing the swap entries, so the entries all have a
> > swap count > 1 and can't be freed.
> >
> > Hibernation subsystem is a bit different, so two special wrappers are here:
> >
> > - swap_alloc_hibernation_slot() - Allocate one entry from one device.
> > - swap_free_hibernation_slot() - Free one entry allocated by the above
> > helper.
>
> During the code review, I found something that needs to be verified.
> It is not directly relevant to your patch, but I'm sending this email
> to confirm it and to discuss a possible fix in this patch.
>
> In the swap_alloc_hibernation_slot() function, nr_swap_pages is
> decreased, but I think it is already decreased in swap_range_alloc().
>
> nr_swap_pages is decremented along the call flow below:
>
> cluster_alloc_swap_entry -> alloc_swap_scan_cluster
> -> cluster_alloc_range -> swap_range_alloc
>
> Introduced on
> 4f78252da887ee7e9d1875dd6e07d9baa936c04f
> mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
>
Yeah, you are right, that's a bug introduced by 4f78252da887. Will you
send a patch to fix that? Or I can send one; just removing the
atomic_long_dec(&nr_swap_pages) in get_swap_page_of_type should be
enough.
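Something like this (just a sketch, context lines taken from the
snippet you quoted above; I'll leave the real hunk to whoever sends
the patch):

 	local_unlock(&percpu_swap_cluster.lock);
 	if (offset) {
 		entry = swp_entry(si->type, offset);
-		atomic_long_dec(&nr_swap_pages);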
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 14/19] mm, swap: sanitize swap entry management workflow
2025-11-01 8:59 ` Kairui Song
@ 2025-11-01 9:08 ` YoungJun Park
0 siblings, 0 replies; 50+ messages in thread
From: YoungJun Park @ 2025-11-01 9:08 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Sat, Nov 01, 2025 at 04:59:05PM +0800, Kairui Song wrote:
> On Sat, Nov 1, 2025 at 12:51 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 11:58:40PM +0800, Kairui Song wrote:
> > > From: Kairui Song <kasong@tencent.com>
> > During the code review, I found something that needs to be verified.
> > It is not directly relevant to your patch, but I'm sending this email
> > to confirm it and to discuss a possible fix in this patch.
> >
> > In the swap_alloc_hibernation_slot() function, nr_swap_pages is
> > decreased, but I think it is already decreased in swap_range_alloc().
> >
> > nr_swap_pages is decremented along the call flow below:
> >
> > cluster_alloc_swap_entry -> alloc_swap_scan_cluster
> > -> cluster_alloc_range -> swap_range_alloc
> >
> > Introduced on
> > 4f78252da887ee7e9d1875dd6e07d9baa936c04f
> > mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()
> >
>
> Yeah, you are right, that's a bug introduced by 4f78252da887. Will you
> send a patch to fix that? Or I can send one; just removing the
> atomic_long_dec(&nr_swap_pages) in get_swap_page_of_type should be
> enough.
Thank you for double-checking. I will send a patch soon.
Regards,
Youngjun Park
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (13 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 14/19] mm, swap: sanitize swap entry management workflow Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 16:52 ` Kairui Song
2025-10-31 5:56 ` YoungJun Park
2025-10-29 15:58 ` [PATCH 16/19] mm, swap: check swap table directly for checking cache Kairui Song
` (5 subsequent siblings)
20 siblings, 2 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
This pinning usage here can be dropped by adding the folio to swap
cache directly on allocation.
All swap allocations are folio-based now (except for hibernation), so
the swap allocator can always take the folio as the parameter. And now
both swap cache (swap table) and swap map are protected by the cluster
lock, scanning the map and inserting the folio can be done in the same
critical section. This eliminates the time window that a slot is pinned
by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock
multiple times.
This is both a cleanup and an optimization.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
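Caller-visible contract after this patch, condensed from the
folio_alloc_swap() hunk below (swapout_sketch() is a hypothetical
caller written for illustration, not code from this series):

static int swapout_sketch(struct folio *folio)
{
	/* folio is locked, uptodate and swapbacked */
	if (folio_alloc_swap(folio))
		return -ENOMEM;	/* nothing allocated, folio not in swap cache */

	/*
	 * Success: the allocator has already inserted the folio into the
	 * swap cache under the cluster lock, so folio->swap is set and
	 * folio_test_swapcache(folio) is true, with no window where a
	 * slot is pinned by SWAP_HAS_CACHE without a cache. The swap
	 * count starts at 0 and is raised later via folio_dup_swap()
	 * as the folio gets unmapped.
	 */
	return 0;
}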
include/linux/swap.h | 5 --
mm/swap.h | 8 +--
mm/swap_state.c | 56 +++++++++++-------
mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
4 files changed, 105 insertions(+), 125 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac3caa4c6999..4b4b81fbc6a3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
}
extern void si_swapinfo(struct sysinfo *);
-void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
int swap_type_of(dev_t device, sector_t offset);
int find_first_swap(dev_t *device);
@@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
{
}
-static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
-{
-}
-
static inline int __swap_count(swp_entry_t entry)
{
return 0;
diff --git a/mm/swap.h b/mm/swap.h
index 74c61129d7b7..03694ffa662f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
*/
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
- void **shadow, bool alloc);
void swap_cache_del_folio(struct folio *folio);
struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
struct mempolicy *mpol, pgoff_t ilx,
bool *alloced);
/* Below helpers require the caller to lock and pass in the swap cluster. */
+void __swap_cache_add_folio(struct swap_cluster_info *ci,
+ struct folio *folio, swp_entry_t entry);
void __swap_cache_del_folio(struct swap_cluster_info *ci,
struct folio *folio, swp_entry_t entry, void *shadow);
void __swap_cache_replace_folio(struct swap_cluster_info *ci,
@@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
-static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
- void **shadow, bool alloc)
+static inline void __swap_cache_add_folio(struct swap_cluster_info *ci,
+ struct folio *folio, swp_entry_t entry)
{
}
diff --git a/mm/swap_state.c b/mm/swap_state.c
index d2bcca92b6e0..85d9f99c384f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
return NULL;
}
+void __swap_cache_add_folio(struct swap_cluster_info *ci,
+ struct folio *folio, swp_entry_t entry)
+{
+ unsigned long new_tb;
+ unsigned int ci_start, ci_off, ci_end;
+ unsigned long nr_pages = folio_nr_pages(folio);
+
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
+
+ new_tb = folio_to_swp_tb(folio);
+ ci_start = swp_cluster_offset(entry);
+ ci_off = ci_start;
+ ci_end = ci_start + nr_pages;
+ do {
+ VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
+ __swap_table_set(ci, ci_off, new_tb);
+ } while (++ci_off < ci_end);
+
+ folio_ref_add(folio, nr_pages);
+ folio_set_swapcache(folio);
+ folio->swap = entry;
+
+ node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
+ lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
+}
+
/**
* swap_cache_add_folio - Add a folio into the swap cache.
* @folio: The folio to be added.
@@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
* The caller also needs to update the corresponding swap_map slots with
* SWAP_HAS_CACHE bit to avoid race or conflict.
*/
-int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
- void **shadowp, bool alloc)
+static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
+ void **shadowp)
{
int err;
void *shadow = NULL;
+ unsigned long old_tb;
struct swap_info_struct *si;
- unsigned long old_tb, new_tb;
struct swap_cluster_info *ci;
unsigned int ci_start, ci_off, ci_end, offset;
unsigned long nr_pages = folio_nr_pages(folio);
- VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
- VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
-
si = __swap_entry_to_info(entry);
- new_tb = folio_to_swp_tb(folio);
ci_start = swp_cluster_offset(entry);
ci_end = ci_start + nr_pages;
ci_off = ci_start;
@@ -168,7 +191,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
err = -EEXIST;
goto failed;
}
- if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
+ if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
err = -ENOENT;
goto failed;
}
@@ -184,20 +207,11 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
* Still need to pin the slots with SWAP_HAS_CACHE since
* swap allocator depends on that.
*/
- if (!alloc)
- __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
- __swap_table_set(ci, ci_off, new_tb);
+ __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
offset++;
} while (++ci_off < ci_end);
-
- folio_ref_add(folio, nr_pages);
- folio_set_swapcache(folio);
- folio->swap = entry;
+ __swap_cache_add_folio(ci, folio, entry);
swap_cluster_unlock(ci);
-
- node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
- lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
-
if (shadowp)
*shadowp = shadow;
return 0;
@@ -466,7 +480,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
for (;;) {
- ret = swap_cache_add_folio(folio, entry, &shadow, false);
+ ret = swap_cache_add_folio(folio, entry, &shadow);
if (!ret)
break;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 426b0b6d583f..8d98f28907bc 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -875,28 +875,53 @@ static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
}
}
-static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
- unsigned int start, unsigned char usage,
- unsigned int order)
+static bool cluster_alloc_range(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ struct folio *folio,
+ unsigned int offset)
{
- unsigned int nr_pages = 1 << order;
+ unsigned long nr_pages;
+ unsigned int order;
lockdep_assert_held(&ci->lock);
if (!(si->flags & SWP_WRITEOK))
return false;
+ /*
+ * All mm swap allocations start with a folio (folio_alloc_swap),
+ * which is also the only allocation path for large orders.
+ * Such swap slots start with count == 0, and the count is
+ * increased when the folio gets unmapped.
+ *
+ * Else, it's an exclusive order 0 allocation for hibernation.
+ * The slot starts with count == 1 and never increases.
+ */
+ if (likely(folio)) {
+ order = folio_order(folio);
+ nr_pages = 1 << order;
+ /*
+ * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries.
+ * This is the legacy allocation behavior, will drop it very soon.
+ */
+ memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
+ __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
+ } else {
+ order = 0;
+ nr_pages = 1;
+ WARN_ON_ONCE(si->swap_map[offset]);
+ si->swap_map[offset] = 1;
+ swap_cluster_assert_table_empty(ci, offset, 1);
+ }
+
/*
* The first allocation in a cluster makes the
* cluster exclusive to this order
*/
if (cluster_is_empty(ci))
ci->order = order;
-
- memset(si->swap_map + start, usage, nr_pages);
- swap_cluster_assert_table_empty(ci, start, nr_pages);
- swap_range_alloc(si, nr_pages);
ci->count += nr_pages;
+ swap_range_alloc(si, nr_pages);
return true;
}
@@ -904,13 +929,12 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
/* Try use a new cluster for current CPU and allocate from it. */
static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned int order,
- unsigned char usage)
+ struct folio *folio, unsigned long offset)
{
unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
+ unsigned int order = likely(folio) ? folio_order(folio) : 0;
unsigned int nr_pages = 1 << order;
bool need_reclaim;
@@ -930,7 +954,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
continue;
offset = found;
}
- if (!cluster_alloc_range(si, ci, offset, usage, order))
+ if (!cluster_alloc_range(si, ci, folio, offset))
break;
found = offset;
offset += nr_pages;
@@ -952,8 +976,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
struct list_head *list,
- unsigned int order,
- unsigned char usage,
+ struct folio *folio,
bool scan_all)
{
unsigned int found = SWAP_ENTRY_INVALID;
@@ -965,7 +988,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
if (!ci)
break;
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, offset);
if (found)
break;
} while (scan_all);
@@ -1026,10 +1049,11 @@ static void swap_reclaim_work(struct work_struct *work)
* Try to allocate swap entries with specified order and try set a new
* cluster for current CPU too.
*/
-static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
- unsigned char usage)
+static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
+ struct folio *folio)
{
struct swap_cluster_info *ci;
+ unsigned int order = likely(folio) ? folio_order(folio) : 0;
unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
/*
@@ -1051,8 +1075,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset,
- order, usage);
+ found = alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_cluster_unlock(ci);
}
@@ -1066,22 +1089,19 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* to spread out the writes.
*/
if (si->flags & SWP_PAGE_DISCARD) {
- found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
- false);
+ found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
if (found)
goto done;
}
if (order < PMD_ORDER) {
- found = alloc_swap_scan_list(si, &si->nonfull_clusters[order],
- order, usage, true);
+ found = alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, true);
if (found)
goto done;
}
if (!(si->flags & SWP_PAGE_DISCARD)) {
- found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
- false);
+ found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
if (found)
goto done;
}
@@ -1097,8 +1117,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* failure is not critical. Scanning one cluster still
* keeps the list rotated and reclaimed (for HAS_CACHE).
*/
- found = alloc_swap_scan_list(si, &si->frag_clusters[order], order,
- usage, false);
+ found = alloc_swap_scan_list(si, &si->frag_clusters[order], folio, false);
if (found)
goto done;
}
@@ -1112,13 +1131,11 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
* Clusters here have at least one usable slots and can't fail order 0
* allocation, but reclaim may drop si->lock and race with another user.
*/
- found = alloc_swap_scan_list(si, &si->frag_clusters[o],
- 0, usage, true);
+ found = alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true);
if (found)
goto done;
- found = alloc_swap_scan_list(si, &si->nonfull_clusters[o],
- 0, usage, true);
+ found = alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true);
if (found)
goto done;
}
@@ -1309,12 +1326,12 @@ static bool get_swap_device_info(struct swap_info_struct *si)
* Fast path try to get swap entries with specified order from current
* CPU's swap entry pool (a cluster).
*/
-static bool swap_alloc_fast(swp_entry_t *entry,
- int order)
+static bool swap_alloc_fast(struct folio *folio)
{
+ unsigned int order = folio_order(folio);
struct swap_cluster_info *ci;
struct swap_info_struct *si;
- unsigned int offset, found = SWAP_ENTRY_INVALID;
+ unsigned int offset;
/*
* Once allocated, swap_info_struct will never be completely freed,
@@ -1329,22 +1346,18 @@ static bool swap_alloc_fast(swp_entry_t *entry,
if (cluster_is_usable(ci, order)) {
if (cluster_is_empty(ci))
offset = cluster_offset(si, ci);
- found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
- if (found)
- *entry = swp_entry(si->type, found);
+ alloc_swap_scan_cluster(si, ci, folio, offset);
} else {
swap_cluster_unlock(ci);
}
put_swap_device(si);
- return !!found;
+ return folio_test_swapcache(folio);
}
/* Rotate the device and switch to a new cluster */
-static bool swap_alloc_slow(swp_entry_t *entry,
- int order)
+static void swap_alloc_slow(struct folio *folio)
{
- unsigned long offset;
struct swap_info_struct *si, *next;
spin_lock(&swap_avail_lock);
@@ -1354,14 +1367,12 @@ static bool swap_alloc_slow(swp_entry_t *entry,
plist_requeue(&si->avail_list, &swap_avail_head);
spin_unlock(&swap_avail_lock);
if (get_swap_device_info(si)) {
- offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
+ cluster_alloc_swap_entry(si, folio);
put_swap_device(si);
- if (offset) {
- *entry = swp_entry(si->type, offset);
- return true;
- }
- if (order)
- return false;
+ if (folio_test_swapcache(folio))
+ return;
+ if (folio_test_large(folio))
+ return;
}
spin_lock(&swap_avail_lock);
@@ -1423,7 +1434,6 @@ int folio_alloc_swap(struct folio *folio)
{
unsigned int order = folio_order(folio);
unsigned int size = 1 << order;
- swp_entry_t entry = {};
VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
@@ -1448,39 +1458,23 @@ int folio_alloc_swap(struct folio *folio)
again:
local_lock(&percpu_swap_cluster.lock);
- if (!swap_alloc_fast(&entry, order))
- swap_alloc_slow(&entry, order);
+ if (!swap_alloc_fast(folio))
+ swap_alloc_slow(folio);
local_unlock(&percpu_swap_cluster.lock);
- if (unlikely(!order && !entry.val)) {
+ if (!order && unlikely(!folio_test_swapcache(folio))) {
if (swap_sync_discard())
goto again;
}
/* Need to call this even if allocation failed, for MEMCG_SWAP_FAIL. */
- if (mem_cgroup_try_charge_swap(folio, entry))
- goto out_free;
+ if (unlikely(mem_cgroup_try_charge_swap(folio, folio->swap)))
+ swap_cache_del_folio(folio);
- if (!entry.val)
+ if (unlikely(!folio_test_swapcache(folio)))
return -ENOMEM;
- /*
- * Allocator has pinned the slots with SWAP_HAS_CACHE
- * so it should never fail
- */
- WARN_ON_ONCE(swap_cache_add_folio(folio, entry, NULL, true));
-
- /*
- * Allocator should always allocate aligned entries so folio based
- * operations never crossed more than one cluster.
- */
- VM_WARN_ON_ONCE_FOLIO(!IS_ALIGNED(folio->swap.val, size), folio);
-
return 0;
-
-out_free:
- put_swap_folio(folio, entry);
- return -ENOMEM;
}
/**
@@ -1779,29 +1773,6 @@ static void swap_entries_free(struct swap_info_struct *si,
partial_free_cluster(si, ci);
}
-/*
- * Called after dropping swapcache to decrease refcnt to swap entries.
- */
-void put_swap_folio(struct folio *folio, swp_entry_t entry)
-{
- struct swap_info_struct *si;
- struct swap_cluster_info *ci;
- unsigned long offset = swp_offset(entry);
- int size = 1 << swap_entry_order(folio_order(folio));
-
- si = _swap_info_get(entry);
- if (!si)
- return;
-
- ci = swap_cluster_lock(si, offset);
- if (swap_only_has_cache(si, offset, size))
- swap_entries_free(si, ci, entry, size);
- else
- for (int i = 0; i < size; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
- swap_cluster_unlock(ci);
-}
-
int __swap_count(swp_entry_t entry)
{
struct swap_info_struct *si = __swap_entry_to_info(entry);
@@ -2052,7 +2023,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
* with swap table allocation.
*/
local_lock(&percpu_swap_cluster.lock);
- offset = cluster_alloc_swap_entry(si, 0, 1);
+ offset = cluster_alloc_swap_entry(si, NULL);
local_unlock(&percpu_swap_cluster.lock);
if (offset) {
entry = swp_entry(si->type, offset);
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
@ 2025-10-29 16:52 ` Kairui Song
2025-10-31 5:56 ` YoungJun Park
1 sibling, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 16:52 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Thu, Oct 30, 2025 at 12:00 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
> SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
> This pinning usage here can be dropped by adding the folio to swap
> cache directly on allocation.
>
> All swap allocations are folio-based now (except for hibernation), so
> the swap allocator can always take the folio as a parameter. And since
> both the swap cache (swap table) and the swap map are now protected by
> the cluster lock, scanning the map and inserting the folio can be done
> in the same critical section. This eliminates the time window in which
> a slot is pinned by SWAP_HAS_CACHE but has no cache yet, and avoids
> taking the lock multiple times.
>
> This is both a cleanup and an optimization.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/swap.h | 5 --
> mm/swap.h | 8 +--
> mm/swap_state.c | 56 +++++++++++-------
> mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
> 4 files changed, 105 insertions(+), 125 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ac3caa4c6999..4b4b81fbc6a3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
> }
>
> extern void si_swapinfo(struct sysinfo *);
> -void put_swap_folio(struct folio *folio, swp_entry_t entry);
> extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> int swap_type_of(dev_t device, sector_t offset);
> int find_first_swap(dev_t *device);
> @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
> {
> }
>
> -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> -{
> -}
> -
> static inline int __swap_count(swp_entry_t entry)
> {
> return 0;
> diff --git a/mm/swap.h b/mm/swap.h
> index 74c61129d7b7..03694ffa662f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc);
> void swap_cache_del_folio(struct folio *folio);
> struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> struct mempolicy *mpol, pgoff_t ilx,
> bool *alloced);
> /* Below helpers require the caller to lock and pass in the swap cluster. */
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry);
> void __swap_cache_del_folio(struct swap_cluster_info *ci,
> struct folio *folio, swp_entry_t entry, void *shadow);
> void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> @@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc)
> +static inline void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> {
> }
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d2bcca92b6e0..85d9f99c384f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> +{
> + unsigned long new_tb;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> +
> + new_tb = folio_to_swp_tb(folio);
> + ci_start = swp_cluster_offset(entry);
> + ci_off = ci_start;
> + ci_end = ci_start + nr_pages;
> + do {
> + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> + __swap_table_set(ci, ci_off, new_tb);
> + } while (++ci_off < ci_end);
> +
> + folio_ref_add(folio, nr_pages);
> + folio_set_swapcache(folio);
> + folio->swap = entry;
> +
> + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> +}
> +
> /**
> * swap_cache_add_folio - Add a folio into the swap cache.
> * @folio: The folio to be added.
> @@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> * The caller also needs to update the corresponding swap_map slots with
> * SWAP_HAS_CACHE bit to avoid race or conflict.
> */
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadowp, bool alloc)
> +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> + void **shadowp)
> {
> int err;
> void *shadow = NULL;
> + unsigned long old_tb;
> struct swap_info_struct *si;
> - unsigned long old_tb, new_tb;
> struct swap_cluster_info *ci;
> unsigned int ci_start, ci_off, ci_end, offset;
> unsigned long nr_pages = folio_nr_pages(folio);
>
> - VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> - VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> - VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> -
> si = __swap_entry_to_info(entry);
> - new_tb = folio_to_swp_tb(folio);
> ci_start = swp_cluster_offset(entry);
> ci_end = ci_start + nr_pages;
> ci_off = ci_start;
> @@ -168,7 +191,7 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> err = -EEXIST;
> goto failed;
> }
> - if (!alloc && unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
> + if (unlikely(!__swap_count(swp_entry(swp_type(entry), offset)))) {
> err = -ENOENT;
> goto failed;
> }
> @@ -184,20 +207,11 @@ int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> * Still need to pin the slots with SWAP_HAS_CACHE since
> * swap allocator depends on that.
> */
> - if (!alloc)
> - __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
> - __swap_table_set(ci, ci_off, new_tb);
> + __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
> offset++;
> } while (++ci_off < ci_end);
> -
> - folio_ref_add(folio, nr_pages);
> - folio_set_swapcache(folio);
> - folio->swap = entry;
> + __swap_cache_add_folio(ci, folio, entry);
> swap_cluster_unlock(ci);
> -
> - node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> - lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> -
> if (shadowp)
> *shadowp = shadow;
> return 0;
> @@ -466,7 +480,7 @@ static struct folio *__swap_cache_prepare_and_add(swp_entry_t entry,
> __folio_set_locked(folio);
> __folio_set_swapbacked(folio);
> for (;;) {
> - ret = swap_cache_add_folio(folio, entry, &shadow, false);
> + ret = swap_cache_add_folio(folio, entry, &shadow);
> if (!ret)
> break;
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 426b0b6d583f..8d98f28907bc 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -875,28 +875,53 @@ static void swap_cluster_assert_table_empty(struct swap_cluster_info *ci,
> }
> }
>
> -static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci,
> - unsigned int start, unsigned char usage,
> - unsigned int order)
> +static bool cluster_alloc_range(struct swap_info_struct *si,
> + struct swap_cluster_info *ci,
> + struct folio *folio,
> + unsigned int offset)
> {
> - unsigned int nr_pages = 1 << order;
> + unsigned long nr_pages;
> + unsigned int order;
>
> lockdep_assert_held(&ci->lock);
>
> if (!(si->flags & SWP_WRITEOK))
> return false;
>
> + /*
> + * All mm swap allocations start with a folio (folio_alloc_swap),
> + * which is also the only allocation path for large orders.
> + * Such swap slots start with count == 0, and the count is
> + * increased when the folio gets unmapped.
> + *
> + * Else, it's an exclusive order 0 allocation for hibernation.
> + * The slot starts with count == 1 and never increases.
> + */
> + if (likely(folio)) {
> + order = folio_order(folio);
> + nr_pages = 1 << order;
> + /*
> + * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries.
> + * This is the legacy allocation behavior, will drop it very soon.
> + */
> + memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
> + __swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
> + } else {
> + order = 0;
> + nr_pages = 1;
> + WARN_ON_ONCE(si->swap_map[offset]);
> + si->swap_map[offset] = 1;
> + swap_cluster_assert_table_empty(ci, offset, 1);
> + }
> +
> /*
> * The first allocation in a cluster makes the
> * cluster exclusive to this order
> */
> if (cluster_is_empty(ci))
> ci->order = order;
> -
> - memset(si->swap_map + start, usage, nr_pages);
> - swap_cluster_assert_table_empty(ci, start, nr_pages);
> - swap_range_alloc(si, nr_pages);
> ci->count += nr_pages;
> + swap_range_alloc(si, nr_pages);
>
> return true;
> }
> @@ -904,13 +929,12 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
> /* Try use a new cluster for current CPU and allocate from it. */
> static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> struct swap_cluster_info *ci,
> - unsigned long offset,
> - unsigned int order,
> - unsigned char usage)
> + struct folio *folio, unsigned long offset)
> {
> unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
> unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER);
> unsigned long end = min(start + SWAPFILE_CLUSTER, si->max);
> + unsigned int order = likely(folio) ? folio_order(folio) : 0;
> unsigned int nr_pages = 1 << order;
> bool need_reclaim;
>
> @@ -930,7 +954,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
> continue;
> offset = found;
> }
> - if (!cluster_alloc_range(si, ci, offset, usage, order))
> + if (!cluster_alloc_range(si, ci, folio, offset))
> break;
> found = offset;
> offset += nr_pages;
> @@ -952,8 +976,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
>
> static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
> struct list_head *list,
> - unsigned int order,
> - unsigned char usage,
> + struct folio *folio,
> bool scan_all)
> {
> unsigned int found = SWAP_ENTRY_INVALID;
> @@ -965,7 +988,7 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
> if (!ci)
> break;
> offset = cluster_offset(si, ci);
> - found = alloc_swap_scan_cluster(si, ci, offset, order, usage);
> + found = alloc_swap_scan_cluster(si, ci, folio, offset);
> if (found)
> break;
> } while (scan_all);
> @@ -1026,10 +1049,11 @@ static void swap_reclaim_work(struct work_struct *work)
> * Try to allocate swap entries with specified order and try set a new
> * cluster for current CPU too.
> */
> -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> - unsigned char usage)
> +static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
> + struct folio *folio)
> {
> struct swap_cluster_info *ci;
> + unsigned int order = likely(folio) ? folio_order(folio) : 0;
> unsigned int offset = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID;
>
> /*
> @@ -1051,8 +1075,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> if (cluster_is_usable(ci, order)) {
> if (cluster_is_empty(ci))
> offset = cluster_offset(si, ci);
> - found = alloc_swap_scan_cluster(si, ci, offset,
> - order, usage);
> + found = alloc_swap_scan_cluster(si, ci, folio, offset);
> } else {
> swap_cluster_unlock(ci);
> }
> @@ -1066,22 +1089,19 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> * to spread out the writes.
> */
> if (si->flags & SWP_PAGE_DISCARD) {
> - found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
> - false);
> + found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
> if (found)
> goto done;
> }
>
> if (order < PMD_ORDER) {
> - found = alloc_swap_scan_list(si, &si->nonfull_clusters[order],
> - order, usage, true);
> + found = alloc_swap_scan_list(si, &si->nonfull_clusters[order], folio, true);
> if (found)
> goto done;
> }
>
> if (!(si->flags & SWP_PAGE_DISCARD)) {
> - found = alloc_swap_scan_list(si, &si->free_clusters, order, usage,
> - false);
> + found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
> if (found)
> goto done;
> }
> @@ -1097,8 +1117,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> * failure is not critical. Scanning one cluster still
> * keeps the list rotated and reclaimed (for HAS_CACHE).
> */
> - found = alloc_swap_scan_list(si, &si->frag_clusters[order], order,
> - usage, false);
> + found = alloc_swap_scan_list(si, &si->frag_clusters[order], folio, false);
> if (found)
> goto done;
> }
> @@ -1112,13 +1131,11 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> * Clusters here have at least one usable slots and can't fail order 0
> * allocation, but reclaim may drop si->lock and race with another user.
> */
> - found = alloc_swap_scan_list(si, &si->frag_clusters[o],
> - 0, usage, true);
> + found = alloc_swap_scan_list(si, &si->frag_clusters[o], folio, true);
> if (found)
> goto done;
>
> - found = alloc_swap_scan_list(si, &si->nonfull_clusters[o],
> - 0, usage, true);
> + found = alloc_swap_scan_list(si, &si->nonfull_clusters[o], folio, true);
> if (found)
> goto done;
> }
> @@ -1309,12 +1326,12 @@ static bool get_swap_device_info(struct swap_info_struct *si)
> * Fast path try to get swap entries with specified order from current
> * CPU's swap entry pool (a cluster).
> */
> -static bool swap_alloc_fast(swp_entry_t *entry,
> - int order)
> +static bool swap_alloc_fast(struct folio *folio)
> {
> + unsigned int order = folio_order(folio);
> struct swap_cluster_info *ci;
> struct swap_info_struct *si;
> - unsigned int offset, found = SWAP_ENTRY_INVALID;
> + unsigned int offset;
>
> /*
> * Once allocated, swap_info_struct will never be completely freed,
> @@ -1329,22 +1346,18 @@ static bool swap_alloc_fast(swp_entry_t *entry,
> if (cluster_is_usable(ci, order)) {
> if (cluster_is_empty(ci))
> offset = cluster_offset(si, ci);
> - found = alloc_swap_scan_cluster(si, ci, offset, order, SWAP_HAS_CACHE);
> - if (found)
> - *entry = swp_entry(si->type, found);
> + alloc_swap_scan_cluster(si, ci, folio, offset);
> } else {
> swap_cluster_unlock(ci);
> }
>
> put_swap_device(si);
> - return !!found;
> + return folio_test_swapcache(folio);
> }
>
> /* Rotate the device and switch to a new cluster */
> -static bool swap_alloc_slow(swp_entry_t *entry,
> - int order)
> +static void swap_alloc_slow(struct folio *folio)
> {
> - unsigned long offset;
> struct swap_info_struct *si, *next;
>
> spin_lock(&swap_avail_lock);
> @@ -1354,14 +1367,12 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> plist_requeue(&si->avail_list, &swap_avail_head);
> spin_unlock(&swap_avail_lock);
> if (get_swap_device_info(si)) {
> - offset = cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE);
> + cluster_alloc_swap_entry(si, folio);
> put_swap_device(si);
> - if (offset) {
> - *entry = swp_entry(si->type, offset);
> - return true;
> - }
> - if (order)
> - return false;
> + if (folio_test_swapcache(folio))
> + return;
> + if (folio_test_large(folio))
> + return;
> }
>
> spin_lock(&swap_avail_lock);
My bad, the following diff was lost during the rebase to mm-new;
swap_alloc_slow should return void now:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8d98f28907bc..0bc734eb32c4 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1391,7 +1391,6 @@ static void swap_alloc_slow(struct folio *folio)
goto start_over;
}
spin_unlock(&swap_avail_lock);
- return false;
}
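With this fixup, callers only need to look at the folio's swapcache state
to tell whether allocation succeeded. Roughly (just a sketch; presumably
the real caller is folio_alloc_swap(), which is not quoted here, and the
exact error value is assumed):

	if (!swap_alloc_fast(folio))
		swap_alloc_slow(folio);
	if (!folio_test_swapcache(folio))
		return -ENOMEM;	/* assumed error value */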
^ permalink raw reply related	[flat|nested] 50+ messages in thread
* Re: [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
2025-10-29 16:52 ` Kairui Song
@ 2025-10-31 5:56 ` YoungJun Park
2025-10-31 7:02 ` Kairui Song
1 sibling, 1 reply; 50+ messages in thread
From: YoungJun Park @ 2025-10-31 5:56 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:41PM +0800, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
Hello Kairui
> The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
> SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
> This pinning usage here can be dropped by adding the folio to swap
> cache directly on allocation.
>
> All swap allocations are folio-based now (except for hibernation), so
> the swap allocator can always take the folio as the parameter. And now
> both swap cache (swap table) and swap map are protected by the cluster
> lock, scanning the map and inserting the folio can be done in the same
> critical section. This eliminates the time window that a slot is pinned
> by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock
> multiple times.
>
> This is both a cleanup and an optimization.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> include/linux/swap.h | 5 --
> mm/swap.h | 8 +--
> mm/swap_state.c | 56 +++++++++++-------
> mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
> 4 files changed, 105 insertions(+), 125 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index ac3caa4c6999..4b4b81fbc6a3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
> }
>
> extern void si_swapinfo(struct sysinfo *);
> -void put_swap_folio(struct folio *folio, swp_entry_t entry);
> extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> int swap_type_of(dev_t device, sector_t offset);
> int find_first_swap(dev_t *device);
> @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
> {
> }
>
> -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> -{
> -}
> -
> static inline int __swap_count(swp_entry_t entry)
> {
> return 0;
> diff --git a/mm/swap.h b/mm/swap.h
> index 74c61129d7b7..03694ffa662f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> */
> struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc);
> void swap_cache_del_folio(struct folio *folio);
> struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> struct mempolicy *mpol, pgoff_t ilx,
> bool *alloced);
> /* Below helpers require the caller to lock and pass in the swap cluster. */
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry);
> void __swap_cache_del_folio(struct swap_cluster_info *ci,
> struct folio *folio, swp_entry_t entry, void *shadow);
> void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> @@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadow, bool alloc)
> +static inline void *__swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> {
> }
Just a nit: the !CONFIG_SWAP stub is declared to return void * but
returns nothing. Either change it to plain void (the original prototype
returned void), or simply remove the stub if it is no longer used when
!CONFIG_SWAP.
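e.g. a minimal sketch of the first option, purely illustrative, if the
stub is kept at all:

static inline void __swap_cache_add_folio(struct swap_cluster_info *ci,
					   struct folio *folio, swp_entry_t entry)
{
	/* no-op when !CONFIG_SWAP */
}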
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index d2bcca92b6e0..85d9f99c384f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> return NULL;
> }
>
> +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> + struct folio *folio, swp_entry_t entry)
> +{
> + unsigned long new_tb;
> + unsigned int ci_start, ci_off, ci_end;
> + unsigned long nr_pages = folio_nr_pages(folio);
> +
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> +
> + new_tb = folio_to_swp_tb(folio);
> + ci_start = swp_cluster_offset(entry);
> + ci_off = ci_start;
> + ci_end = ci_start + nr_pages;
> + do {
> + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> + __swap_table_set(ci, ci_off, new_tb);
> + } while (++ci_off < ci_end);
> +
> + folio_ref_add(folio, nr_pages);
> + folio_set_swapcache(folio);
> + folio->swap = entry;
> +
> + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> +}
> +
> /**
> * swap_cache_add_folio - Add a folio into the swap cache.
> * @folio: The folio to be added.
> @@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> * The caller also needs to update the corresponding swap_map slots with
> * SWAP_HAS_CACHE bit to avoid race or conflict.
> */
> -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> - void **shadowp, bool alloc)
> +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> + void **shadowp)
Another small thing: since the "alloc" parameter is removed, the comment
above should be updated as well.
Thanks,
Youngjun Park
^ permalink raw reply	[flat|nested] 50+ messages in thread
* Re: [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation
2025-10-31 5:56 ` YoungJun Park
@ 2025-10-31 7:02 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-31 7:02 UTC (permalink / raw)
To: YoungJun Park
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, Yosry Ahmed, David Hildenbrand,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Oct 31, 2025 at 1:56 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:41PM +0800, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
>
> Hello Kairui
>
> > The allocator uses SWAP_HAS_CACHE to pin a swap slot upon allocation.
> > SWAP_HAS_CACHE is being deprecated as it caused a lot of confusion.
> > This pinning usage here can be dropped by adding the folio to swap
> > cache directly on allocation.
> >
> > All swap allocations are folio-based now (except for hibernation), so
> > the swap allocator can always take the folio as the parameter. And now
> > both swap cache (swap table) and swap map are protected by the cluster
> > lock, scanning the map and inserting the folio can be done in the same
> > critical section. This eliminates the time window that a slot is pinned
> > by SWAP_HAS_CACHE, but it has no cache, and avoids touching the lock
> > multiple times.
> >
> > This is both a cleanup and an optimization.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > include/linux/swap.h | 5 --
> > mm/swap.h | 8 +--
> > mm/swap_state.c | 56 +++++++++++-------
> > mm/swapfile.c | 161 +++++++++++++++++++++------------------------------
> > 4 files changed, 105 insertions(+), 125 deletions(-)
> >
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index ac3caa4c6999..4b4b81fbc6a3 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -452,7 +452,6 @@ static inline long get_nr_swap_pages(void)
> > }
> >
> > extern void si_swapinfo(struct sysinfo *);
> > -void put_swap_folio(struct folio *folio, swp_entry_t entry);
> > extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> > int swap_type_of(dev_t device, sector_t offset);
> > int find_first_swap(dev_t *device);
> > @@ -534,10 +533,6 @@ static inline void swap_put_entries_direct(swp_entry_t ent, int nr)
> > {
> > }
> >
> > -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp)
> > -{
> > -}
> > -
> > static inline int __swap_count(swp_entry_t entry)
> > {
> > return 0;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 74c61129d7b7..03694ffa662f 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -277,13 +277,13 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> > */
> > struct folio *swap_cache_get_folio(swp_entry_t entry);
> > void *swap_cache_get_shadow(swp_entry_t entry);
> > -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > - void **shadow, bool alloc);
> > void swap_cache_del_folio(struct folio *folio);
> > struct folio *swap_cache_alloc_folio(swp_entry_t entry, gfp_t gfp_flags,
> > struct mempolicy *mpol, pgoff_t ilx,
> > bool *alloced);
> > /* Below helpers require the caller to lock and pass in the swap cluster. */
> > +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> > + struct folio *folio, swp_entry_t entry);
> > void __swap_cache_del_folio(struct swap_cluster_info *ci,
> > struct folio *folio, swp_entry_t entry, void *shadow);
> > void __swap_cache_replace_folio(struct swap_cluster_info *ci,
> > @@ -459,8 +459,8 @@ static inline void *swap_cache_get_shadow(swp_entry_t entry)
> > return NULL;
> > }
> >
> > -static inline int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > - void **shadow, bool alloc)
> > +static inline void *__swap_cache_add_folio(struct swap_cluster_info *ci,
> > + struct folio *folio, swp_entry_t entry)
> > {
> > }
>
> Just a nit: the !CONFIG_SWAP stub is declared to return void * but
> returns nothing. Either change it to plain void (the original prototype
> returned void), or simply remove the stub if it is no longer used when
> !CONFIG_SWAP.
Thanks! Yeah, it can just be removed; nothing is using it under
!CONFIG_SWAP after this commit. Will clean it up.
>
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index d2bcca92b6e0..85d9f99c384f 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -122,6 +122,34 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> > return NULL;
> > }
> >
> > +void __swap_cache_add_folio(struct swap_cluster_info *ci,
> > + struct folio *folio, swp_entry_t entry)
> > +{
> > + unsigned long new_tb;
> > + unsigned int ci_start, ci_off, ci_end;
> > + unsigned long nr_pages = folio_nr_pages(folio);
> > +
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(folio_test_swapcache(folio), folio);
> > + VM_WARN_ON_ONCE_FOLIO(!folio_test_swapbacked(folio), folio);
> > +
> > + new_tb = folio_to_swp_tb(folio);
> > + ci_start = swp_cluster_offset(entry);
> > + ci_off = ci_start;
> > + ci_end = ci_start + nr_pages;
> > + do {
> > + VM_WARN_ON_ONCE(swp_tb_is_folio(__swap_table_get(ci, ci_off)));
> > + __swap_table_set(ci, ci_off, new_tb);
> > + } while (++ci_off < ci_end);
> > +
> > + folio_ref_add(folio, nr_pages);
> > + folio_set_swapcache(folio);
> > + folio->swap = entry;
> > +
> > + node_stat_mod_folio(folio, NR_FILE_PAGES, nr_pages);
> > + lruvec_stat_mod_folio(folio, NR_SWAPCACHE, nr_pages);
> > +}
> > +
> > /**
> > * swap_cache_add_folio - Add a folio into the swap cache.
> > * @folio: The folio to be added.
> > @@ -136,23 +164,18 @@ void *swap_cache_get_shadow(swp_entry_t entry)
> > * The caller also needs to update the corresponding swap_map slots with
> > * SWAP_HAS_CACHE bit to avoid race or conflict.
> > */
> > -int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > - void **shadowp, bool alloc)
> > +static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
> > + void **shadowp)
>
> Another small thing: since the "alloc" parameter is removed, the
> comment above should be updated as well.
Nice suggestion, will clean up the comment too.
>
> Thanks,
> Youngjun Park
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 16/19] mm, swap: check swap table directly for checking cache
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (14 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 15/19] mm, swap: add folio to swap cache directly on allocation Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-11-06 21:02 ` Barry Song
2025-10-29 15:58 ` [PATCH 17/19] mm, swap: clean up and improve swap entries freeing Kairui Song
` (4 subsequent siblings)
20 siblings, 1 reply; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Instead of looking at the swap map, check the swap table directly to
tell if a swap slot is cached. This prepares for the removal of
SWAP_HAS_CACHE.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swap.h | 11 ++++++++---
mm/swap_state.c | 16 ++++++++++++++++
mm/swapfile.c | 55 +++++++++++++++++++++++++++++--------------------------
mm/userfaultfd.c | 10 +++-------
4 files changed, 56 insertions(+), 36 deletions(-)
diff --git a/mm/swap.h b/mm/swap.h
index 03694ffa662f..73f07bcea5f0 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
* swap entries in the page table, similar to locking swap cache folio.
* - See the comment of get_swap_device() for more complex usage.
*/
+bool swap_cache_check_folio(swp_entry_t entry);
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_del_folio(struct folio *folio);
@@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
{
- struct swap_info_struct *si = __swap_entry_to_info(entry);
- pgoff_t offset = swp_offset(entry);
int i;
/*
@@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
* be in conflict with the folio in swap cache.
*/
for (i = 0; i < max_nr; i++) {
- if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
+ if (swap_cache_check_folio(entry))
return i;
+ entry.val++;
}
return i;
@@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
return 0;
}
+static inline bool swap_cache_check_folio(swp_entry_t entry)
+{
+ return false;
+}
+
static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
{
return NULL;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 85d9f99c384f..41d4fa056203 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
return NULL;
}
+/**
+ * swap_cache_check_folio - Check if a swap slot has cache.
+ * @entry: swap entry indicating the slot.
+ *
+ * Context: Caller must ensure @entry is valid and protect the swap
+ * device with reference count or locks.
+ */
+bool swap_cache_check_folio(swp_entry_t entry)
+{
+ unsigned long swp_tb;
+
+ swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
+ swp_cluster_offset(entry));
+ return swp_tb_is_folio(swp_tb);
+}
+
/**
* swap_cache_get_shadow - Looks up a shadow in the swap cache.
* @entry: swap entry used for the lookup.
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8d98f28907bc..3b7df5768d7f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -788,23 +788,18 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
unsigned int nr_pages = 1 << order;
unsigned long offset = start, end = start + nr_pages;
unsigned char *map = si->swap_map;
- int nr_reclaim;
+ unsigned long swp_tb;
spin_unlock(&ci->lock);
do {
- switch (READ_ONCE(map[offset])) {
- case 0:
+ if (swap_count(READ_ONCE(map[offset])))
break;
- case SWAP_HAS_CACHE:
- nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
- if (nr_reclaim < 0)
- goto out;
- break;
- default:
- goto out;
+ swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ if (swp_tb_is_folio(swp_tb)) {
+ if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0)
+ break;
}
} while (++offset < end);
-out:
spin_lock(&ci->lock);
/*
@@ -820,37 +815,41 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
* Recheck the range no matter reclaim succeeded or not, the slot
* could have been be freed while we are not holding the lock.
*/
- for (offset = start; offset < end; offset++)
- if (READ_ONCE(map[offset]))
+ for (offset = start; offset < end; offset++) {
+ swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb))
return SWAP_ENTRY_INVALID;
+ }
return start;
}
static bool cluster_scan_range(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long start, unsigned int nr_pages,
+ unsigned long offset, unsigned int nr_pages,
bool *need_reclaim)
{
- unsigned long offset, end = start + nr_pages;
+ unsigned long end = offset + nr_pages;
unsigned char *map = si->swap_map;
+ unsigned long swp_tb;
if (cluster_is_empty(ci))
return true;
- for (offset = start; offset < end; offset++) {
- switch (READ_ONCE(map[offset])) {
- case 0:
- continue;
- case SWAP_HAS_CACHE:
+ do {
+ if (swap_count(map[offset]))
+ return false;
+ swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ if (swp_tb_is_folio(swp_tb)) {
+ WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE));
if (!vm_swap_full())
return false;
*need_reclaim = true;
- continue;
- default:
- return false;
+ } else {
+ /* An entry with no count and no cache must be null */
+ VM_WARN_ON_ONCE(!swp_tb_is_null(swp_tb));
}
- }
+ } while (++offset < end);
return true;
}
@@ -1013,7 +1012,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) {
+ if (!swap_count(READ_ONCE(map[offset])) &&
+ swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
@@ -1957,6 +1957,7 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
struct swap_info_struct *si;
bool any_only_cache = false;
unsigned long offset;
+ unsigned long swp_tb;
si = get_swap_device(entry);
if (WARN_ON_ONCE(!si))
@@ -1981,7 +1982,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
*/
for (offset = start_offset; offset < end_offset; offset += nr) {
nr = 1;
- if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) {
+ swp_tb = swap_table_get(__swap_offset_to_cluster(si, offset),
+ offset % SWAPFILE_CLUSTER);
+ if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_tb)) {
/*
* Folios are always naturally aligned in swap so
* advance forward to the next boundary. Zero means no
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 00122f42718c..5411fd340ac3 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -1184,17 +1184,13 @@ static int move_swap_pte(struct mm_struct *mm, struct vm_area_struct *dst_vma,
* Check if the swap entry is cached after acquiring the src_pte
* lock. Otherwise, we might miss a newly loaded swap cache folio.
*
- * Check swap_map directly to minimize overhead, READ_ONCE is sufficient.
* We are trying to catch newly added swap cache, the only possible case is
* when a folio is swapped in and out again staying in swap cache, using the
* same entry before the PTE check above. The PTL is acquired and released
- * twice, each time after updating the swap_map's flag. So holding
- * the PTL here ensures we see the updated value. False positive is possible,
- * e.g. SWP_SYNCHRONOUS_IO swapin may set the flag without touching the
- * cache, or during the tiny synchronization window between swap cache and
- * swap_map, but it will be gone very quickly, worst result is retry jitters.
+ * twice, each time after updating the swap table. So holding
+ * the PTL here ensures we see the updated value.
*/
- if (READ_ONCE(si->swap_map[swp_offset(entry)]) & SWAP_HAS_CACHE) {
+ if (swap_cache_check_folio(entry)) {
double_pt_unlock(dst_ptl, src_ptl);
return -EAGAIN;
}
--
2.51.1
^ permalink raw reply related	[flat|nested] 50+ messages in thread
* Re: [PATCH 16/19] mm, swap: check swap table directly for checking cache
2025-10-29 15:58 ` [PATCH 16/19] mm, swap: check swap table directly for checking cache Kairui Song
@ 2025-11-06 21:02 ` Barry Song
2025-11-07 3:13 ` Kairui Song
0 siblings, 1 reply; 50+ messages in thread
From: Barry Song @ 2025-11-06 21:02 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Thu, Oct 30, 2025 at 12:00 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> Instead of looking at the swap map, check swap table directly to tell
> if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> mm/swap.h | 11 ++++++++---
> mm/swap_state.c | 16 ++++++++++++++++
> mm/swapfile.c | 55 +++++++++++++++++++++++++++++--------------------------
> mm/userfaultfd.c | 10 +++-------
> 4 files changed, 56 insertions(+), 36 deletions(-)
>
> diff --git a/mm/swap.h b/mm/swap.h
> index 03694ffa662f..73f07bcea5f0 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> * swap entries in the page table, similar to locking swap cache folio.
> * - See the comment of get_swap_device() for more complex usage.
> */
> +bool swap_cache_check_folio(swp_entry_t entry);
> struct folio *swap_cache_get_folio(swp_entry_t entry);
> void *swap_cache_get_shadow(swp_entry_t entry);
> void swap_cache_del_folio(struct folio *folio);
> @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
>
> static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> {
> - struct swap_info_struct *si = __swap_entry_to_info(entry);
> - pgoff_t offset = swp_offset(entry);
> int i;
>
> /*
> @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> * be in conflict with the folio in swap cache.
> */
> for (i = 0; i < max_nr; i++) {
> - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> + if (swap_cache_check_folio(entry))
> return i;
> + entry.val++;
> }
>
> return i;
> @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
> return 0;
> }
>
> +static inline bool swap_cache_check_folio(swp_entry_t entry)
> +{
> + return false;
> +}
> +
> static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> {
> return NULL;
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 85d9f99c384f..41d4fa056203 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
> return NULL;
> }
>
> +/**
> + * swap_cache_check_folio - Check if a swap slot has cache.
> + * @entry: swap entry indicating the slot.
> + *
> + * Context: Caller must ensure @entry is valid and protect the swap
> + * device with reference count or locks.
> + */
> +bool swap_cache_check_folio(swp_entry_t entry)
> +{
> + unsigned long swp_tb;
> +
> + swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
> + swp_cluster_offset(entry));
> + return swp_tb_is_folio(swp_tb);
> +}
> +
The name swap_cache_check_folio() sounds a bit odd to me — what we’re
actually doing is checking whether the swapcache contains (or is)
a folio, i.e., whether there’s a folio hit in the swapcache.
The word "check" could misleadingly suggest verifying the folio’s health
or validity instead.
what about swap_cache_has_folio() or simply:
struct folio *__swap_cache_get_folio(swp_entry_t entry);
This would return the folio without taking the lock, or NULL if not found?
Thanks
Barry
^ permalink raw reply	[flat|nested] 50+ messages in thread
* Re: [PATCH 16/19] mm, swap: check swap table directly for checking cache
2025-11-06 21:02 ` Barry Song
@ 2025-11-07 3:13 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-11-07 3:13 UTC (permalink / raw)
To: Barry Song
Cc: linux-mm, Andrew Morton, Baoquan He, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Nov 7, 2025 at 5:03 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Thu, Oct 30, 2025 at 12:00 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > From: Kairui Song <kasong@tencent.com>
> >
> > Instead of looking at the swap map, check swap table directly to tell
> > if a swap slot is cached. Prepares for the removal of SWAP_HAS_CACHE.
> >
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> > mm/swap.h | 11 ++++++++---
> > mm/swap_state.c | 16 ++++++++++++++++
> > mm/swapfile.c | 55 +++++++++++++++++++++++++++++--------------------------
> > mm/userfaultfd.c | 10 +++-------
> > 4 files changed, 56 insertions(+), 36 deletions(-)
> >
> > diff --git a/mm/swap.h b/mm/swap.h
> > index 03694ffa662f..73f07bcea5f0 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -275,6 +275,7 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
> > * swap entries in the page table, similar to locking swap cache folio.
> > * - See the comment of get_swap_device() for more complex usage.
> > */
> > +bool swap_cache_check_folio(swp_entry_t entry);
> > struct folio *swap_cache_get_folio(swp_entry_t entry);
> > void *swap_cache_get_shadow(swp_entry_t entry);
> > void swap_cache_del_folio(struct folio *folio);
> > @@ -335,8 +336,6 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> >
> > static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > {
> > - struct swap_info_struct *si = __swap_entry_to_info(entry);
> > - pgoff_t offset = swp_offset(entry);
> > int i;
> >
> > /*
> > @@ -345,8 +344,9 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
> > * be in conflict with the folio in swap cache.
> > */
> > for (i = 0; i < max_nr; i++) {
> > - if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
> > + if (swap_cache_check_folio(entry))
> > return i;
> > + entry.val++;
> > }
> >
> > return i;
> > @@ -449,6 +449,11 @@ static inline int swap_writeout(struct folio *folio,
> > return 0;
> > }
> >
> > +static inline bool swap_cache_check_folio(swp_entry_t entry)
> > +{
> > + return false;
> > +}
> > +
> > static inline struct folio *swap_cache_get_folio(swp_entry_t entry)
> > {
> > return NULL;
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 85d9f99c384f..41d4fa056203 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -103,6 +103,22 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
> > return NULL;
> > }
> >
> > +/**
> > + * swap_cache_check_folio - Check if a swap slot has cache.
> > + * @entry: swap entry indicating the slot.
> > + *
> > + * Context: Caller must ensure @entry is valid and protect the swap
> > + * device with reference count or locks.
> > + */
> > +bool swap_cache_check_folio(swp_entry_t entry)
> > +{
> > + unsigned long swp_tb;
> > +
> > + swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
> > + swp_cluster_offset(entry));
> > + return swp_tb_is_folio(swp_tb);
> > +}
> > +
>
> The name swap_cache_check_folio() sounds a bit odd to me — what we’re
> actually doing is checking whether the swapcache contains (or is)
> a folio, i.e., whether there’s a folio hit in the swapcache.
> The word "check" could misleadingly suggest verifying the folio’s health
> or validity instead.
>
> what about swap_cache_has_folio() or simply:
>
> struct folio *__swap_cache_get_folio(swp_entry_t entry);
I was worried people might misuse this: the returned folio could be
invalidated at any time if the caller is not holding the RCU lock.
I think swap_cache_has_folio seems better indeed.
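Something like the below, keeping the same body as in this patch and
only changing the name (a sketch):

bool swap_cache_has_folio(swp_entry_t entry)
{
	unsigned long swp_tb;

	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
				swp_cluster_offset(entry));
	return swp_tb_is_folio(swp_tb);
}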
^ permalink raw reply [flat|nested] 50+ messages in thread
* [PATCH 17/19] mm, swap: clean up and improve swap entries freeing
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (15 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 16/19] mm, swap: check swap table directly for checking cache Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 18/19] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
` (3 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
There are a few problems with the current freeing of swap entries.
When freeing a set of swap entries directly (swap_put_entries_direct,
typically from zapping the page table), the current code scans the whole
swap region multiple times. First, it scans the whole region to check
whether it can be batch freed and whether any entry has a cached folio.
Then it does a batch free only if every entry in the region has a swap
count of 1. And if any entry is cached, even just one, it has to walk
the whole region again to clean up the cache.
If any entry is not in a consistent state with the other entries, it
falls back to order 0 freeing. For example, if only one of them is
cached, the batch free falls back.
The current batch freeing workflow also relies on the swap map's
SWAP_HAS_CACHE bit for both the consistency check and the batch free
itself, which isn't compatible with the swap table design.
Tidy this up by introducing a new cluster-scoped helper for all swap
entry freeing work. It batch-frees all contiguous entries, and simply
starts a new batch whenever an inconsistent entry is found. This may
improve the batch size when the clusters are fragmented. It should also
be more robust with more sanity checks, and it makes clear that a slot
pinned by the swap cache will be cleared upon cache reclaim.
The cache reclaim scan is now also limited to each cluster. If a cluster
has any clean swap cache left after putting the swap count, only that
cluster is reclaimed instead of the whole region.
And since a folio's entries are always in the same cluster, putting swap
entries from a folio can also use the new helper directly.
This should be both an optimization and a cleanup, and the new helper is
adapted to the swap table.
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 238 +++++++++++++++++++++++-----------------------------------
1 file changed, 96 insertions(+), 142 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 3b7df5768d7f..12a1ab6f7b32 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -55,12 +55,14 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
static void free_swap_count_continuations(struct swap_info_struct *);
static void swap_entries_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages);
+ unsigned long start, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
-static bool swap_entries_put_map(struct swap_info_struct *si,
- swp_entry_t entry, int nr);
+static void swap_put_entry_locked(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset,
+ unsigned char usage);
static bool folio_swapcache_freeable(struct folio *folio);
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
@@ -197,25 +199,6 @@ static bool swap_only_has_cache(struct swap_info_struct *si,
return true;
}
-static bool swap_is_last_map(struct swap_info_struct *si,
- unsigned long offset, int nr_pages, bool *has_cache)
-{
- unsigned char *map = si->swap_map + offset;
- unsigned char *map_end = map + nr_pages;
- unsigned char count = *map;
-
- if (swap_count(count) != 1)
- return false;
-
- while (++map < map_end) {
- if (*map != count)
- return false;
- }
-
- *has_cache = !!(count & SWAP_HAS_CACHE);
- return true;
-}
-
/*
* returns number of pages in the folio that backs the swap entry. If positive,
* the folio was reclaimed. If negative, the folio was not reclaimed. If 0, no
@@ -1420,6 +1403,76 @@ static bool swap_sync_discard(void)
return false;
}
+/**
+ * swap_put_entries_cluster - Decrease the swap count of a set of slots.
+ * @si: The swap device.
+ * @start: start offset of slots.
+ * @nr: number of slots.
+ * @reclaim_cache: if true, also reclaim the swap cache.
+ *
+ * This helper decreases the swap count of a set of slots and tries to
+ * batch free them. Also reclaims the swap cache if @reclaim_cache is true.
+ * Context: The caller must ensure that all slots belong to the same
+ * cluster and their swap count doesn't underflow.
+ */
+static void swap_put_entries_cluster(struct swap_info_struct *si,
+ unsigned long start, int nr,
+ bool reclaim_cache)
+{
+ unsigned long offset = start, end = start + nr;
+ unsigned long batch_start = SWAP_ENTRY_INVALID;
+ struct swap_cluster_info *ci;
+ bool need_reclaim = false;
+ unsigned int nr_reclaimed;
+ unsigned long swp_tb;
+ unsigned int count;
+
+ ci = swap_cluster_lock(si, offset);
+ do {
+ swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
+ count = si->swap_map[offset];
+ VM_WARN_ON(swap_count(count) < 1 || count == SWAP_MAP_BAD);
+ if (swap_count(count) == 1) {
+ /* count == 1 and non-cached slots will be batch freed. */
+ if (!swp_tb_is_folio(swp_tb)) {
+ if (!batch_start)
+ batch_start = offset;
+ continue;
+ }
+ /* count will be 0 after put, slot can be reclaimed */
+ VM_WARN_ON(!(count & SWAP_HAS_CACHE));
+ need_reclaim = true;
+ }
+ /*
+ * A count != 1 or cached slot can't be freed. Put its swap
+ * count and then free the interrupted pending batch. Cached
+ * slots will be freed when folio is removed from swap cache
+ * (__swap_cache_del_folio).
+ */
+ swap_put_entry_locked(si, ci, offset, 1);
+ if (batch_start) {
+ swap_entries_free(si, ci, batch_start, offset - batch_start);
+ batch_start = SWAP_ENTRY_INVALID;
+ }
+ } while (++offset < end);
+
+ if (batch_start)
+ swap_entries_free(si, ci, batch_start, offset - batch_start);
+ swap_cluster_unlock(ci);
+
+ if (!need_reclaim || !reclaim_cache)
+ return;
+
+ offset = start;
+ do {
+ nr_reclaimed = __try_to_reclaim_swap(si, offset,
+ TTRS_UNMAPPED | TTRS_FULL);
+ offset++;
+ if (nr_reclaimed)
+ offset = round_up(offset, abs(nr_reclaimed));
+ } while (offset < end);
+}
+
/**
* folio_alloc_swap - allocate swap space for a folio
* @folio: folio we want to move to swap
@@ -1521,6 +1574,7 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
{
swp_entry_t entry = folio->swap;
unsigned long nr_pages = folio_nr_pages(folio);
+ struct swap_info_struct *si = __swap_entry_to_info(entry);
VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
VM_WARN_ON_FOLIO(!folio_test_swapcache(folio), folio);
@@ -1530,7 +1584,7 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
nr_pages = 1;
}
- swap_entries_put_map(__swap_entry_to_info(entry), entry, nr_pages);
+ swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
}
static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
@@ -1567,12 +1621,11 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
return NULL;
}
-static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry,
- unsigned char usage)
+static void swap_put_entry_locked(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset,
+ unsigned char usage)
{
- unsigned long offset = swp_offset(entry);
unsigned char count;
unsigned char has_cache;
@@ -1598,9 +1651,7 @@ static unsigned char swap_entry_put_locked(struct swap_info_struct *si,
if (usage)
WRITE_ONCE(si->swap_map[offset], usage);
else
- swap_entries_free(si, ci, entry, 1);
-
- return usage;
+ swap_entries_free(si, ci, offset, 1);
}
/*
@@ -1668,70 +1719,6 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-static bool swap_entries_put_map(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- unsigned long offset = swp_offset(entry);
- struct swap_cluster_info *ci;
- bool has_cache = false;
- unsigned char count;
- int i;
-
- if (nr <= 1)
- goto fallback;
- count = swap_count(data_race(si->swap_map[offset]));
- if (count != 1)
- goto fallback;
-
- ci = swap_cluster_lock(si, offset);
- if (!swap_is_last_map(si, offset, nr, &has_cache)) {
- goto locked_fallback;
- }
- if (!has_cache)
- swap_entries_free(si, ci, entry, nr);
- else
- for (i = 0; i < nr; i++)
- WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE);
- swap_cluster_unlock(ci);
-
- return has_cache;
-
-fallback:
- ci = swap_cluster_lock(si, offset);
-locked_fallback:
- for (i = 0; i < nr; i++, entry.val++) {
- count = swap_entry_put_locked(si, ci, entry, 1);
- if (count == SWAP_HAS_CACHE)
- has_cache = true;
- }
- swap_cluster_unlock(ci);
- return has_cache;
-}
-
-/*
- * Only functions with "_nr" suffix are able to free entries spanning
- * cross multi clusters, so ensure the range is within a single cluster
- * when freeing entries with functions without "_nr" suffix.
- */
-static bool swap_entries_put_map_nr(struct swap_info_struct *si,
- swp_entry_t entry, int nr)
-{
- int cluster_nr, cluster_rest;
- unsigned long offset = swp_offset(entry);
- bool has_cache = false;
-
- cluster_rest = SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER;
- while (nr) {
- cluster_nr = min(nr, cluster_rest);
- has_cache |= swap_entries_put_map(si, entry, cluster_nr);
- cluster_rest = SWAPFILE_CLUSTER;
- nr -= cluster_nr;
- entry.val += cluster_nr;
- }
-
- return has_cache;
-}
-
/*
* Check if it's the last ref of swap entry in the freeing path.
*/
@@ -1746,9 +1733,9 @@ static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
*/
static void swap_entries_free(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr_pages)
+ unsigned long offset, unsigned int nr_pages)
{
- unsigned long offset = swp_offset(entry);
+ swp_entry_t entry = swp_entry(si->type, offset);
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
@@ -1954,10 +1941,8 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
{
const unsigned long start_offset = swp_offset(entry);
const unsigned long end_offset = start_offset + nr;
+ unsigned long offset, cluster_end;
struct swap_info_struct *si;
- bool any_only_cache = false;
- unsigned long offset;
- unsigned long swp_tb;
si = get_swap_device(entry);
if (WARN_ON_ONCE(!si))
@@ -1965,44 +1950,13 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
if (WARN_ON_ONCE(end_offset > si->max))
goto out;
- /*
- * First free all entries in the range.
- */
- any_only_cache = swap_entries_put_map_nr(si, entry, nr);
-
- /*
- * Short-circuit the below loop if none of the entries had their
- * reference drop to zero.
- */
- if (!any_only_cache)
- goto out;
-
- /*
- * Now go back over the range trying to reclaim the swap cache.
- */
- for (offset = start_offset; offset < end_offset; offset += nr) {
- nr = 1;
- swp_tb = swap_table_get(__swap_offset_to_cluster(si, offset),
- offset % SWAPFILE_CLUSTER);
- if (!swap_count(READ_ONCE(si->swap_map[offset])) && swp_tb_is_folio(swp_tb)) {
- /*
- * Folios are always naturally aligned in swap so
- * advance forward to the next boundary. Zero means no
- * folio was found for the swap entry, so advance by 1
- * in this case. Negative value means folio was found
- * but could not be reclaimed. Here we can still advance
- * to the next boundary.
- */
- nr = __try_to_reclaim_swap(si, offset,
- TTRS_UNMAPPED | TTRS_FULL);
- if (nr == 0)
- nr = 1;
- else if (nr < 0)
- nr = -nr;
- nr = ALIGN(offset + 1, nr) - offset;
- }
- }
-
+ /* Put entries and reclaim cache in each cluster */
+ offset = start_offset;
+ do {
+ cluster_end = min(round_up(offset + 1, SWAPFILE_CLUSTER), end_offset);
+ swap_put_entries_cluster(si, offset, cluster_end - offset, true);
+ offset = cluster_end;
+ } while (offset < end_offset);
out:
put_swap_device(si);
}
@@ -2051,7 +2005,7 @@ void swap_free_hibernation_slot(swp_entry_t entry)
return;
ci = swap_cluster_lock(si, offset);
- swap_entry_put_locked(si, ci, entry, 1);
+ swap_put_entry_locked(si, ci, offset, 1);
WARN_ON(swap_entry_swapped(si, offset));
swap_cluster_unlock(ci);
@@ -3799,10 +3753,10 @@ void __swapcache_clear_cached(struct swap_info_struct *si,
swp_entry_t entry, unsigned int nr)
{
if (swap_only_has_cache(si, swp_offset(entry), nr)) {
- swap_entries_free(si, ci, entry, nr);
+ swap_entries_free(si, ci, swp_offset(entry), nr);
} else {
for (int i = 0; i < nr; i++, entry.val++)
- swap_entry_put_locked(si, ci, entry, SWAP_HAS_CACHE);
+ swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE);
}
}
@@ -3923,7 +3877,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
* into, carry if so, or else fail until a new continuation page is allocated;
* when the original swap_map count is decremented from 0 with continuation,
* borrow from the continuation and report whether it still holds more.
- * Called while __swap_duplicate() or caller of swap_entry_put_locked()
+ * Called while __swap_duplicate() or caller of swap_put_entry_locked()
* holds cluster lock.
*/
static bool swap_count_continued(struct swap_info_struct *si,
--
2.51.1
^ permalink raw reply related	[flat|nested] 50+ messages in thread
* [PATCH 18/19] mm, swap: drop the SWAP_HAS_CACHE flag
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (16 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 17/19] mm, swap: clean up and improve swap entries freeing Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-29 15:58 ` [PATCH 19/19] mm, swap: remove no longer needed _swap_info_get Kairui Song
` (2 subsequent siblings)
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
Now the swap cache is managed by the swap table, and all swap cache
users query the swap table directly for the swap cache state.
SWAP_HAS_CACHE is now just a temporary pin, either before the first
increase of a slot's swap count from 0 to 1 (swap_dup_entries), or
before the final free of slots pinned by a folio in the swap cache
(put_swap_folio).
Drop these two usages. For the first dup, the SWAP_HAS_CACHE pinning was
hard to kill because the flag used to have multiple meanings beyond just
"a slot is cached". We have simplified that and defined that the first
dup is always done with the folio locked in the swap cache
(folio_dup_swap), so it can just check the swap cache (swap table)
directly.
As for freeing, just let the swap cache free all swap entries of a folio
that have a swap count of zero directly upon folio removal. Freeing has
also just been cleaned up to cover the swap cache usage in the swap
table: a slot with swap cache will not be freed until its cache is gone.
Now that removing a folio and freeing its slots are done in the same
critical section, this should improve performance and gets rid of the
SWAP_HAS_CACHE pin.
After these two changes, SWAP_HAS_CACHE no longer has any users. Remove
all related logic and helpers. swap_map is now only used for tracking
the count, so all swap_map users can just read it directly, ignoring
the swap_count helper, which was previously used to filter out the
SWAP_HAS_CACHE bit.
The idea of dropping SWAP_HAS_CACHE and using the swap table directly
initially came from Chris's idea of merging all the metadata usage of
all swaps into one place.
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
include/linux/swap.h | 1 -
mm/swap.h | 13 ++--
mm/swap_state.c | 28 +++++----
mm/swapfile.c | 163 ++++++++++++++++-----------------------------------
4 files changed, 71 insertions(+), 134 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4b4b81fbc6a3..dcb1760e36c3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -224,7 +224,6 @@ enum {
#define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX
/* Bit flag in swap_map */
-#define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */
#define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */
/* Special value in first swap_map */
diff --git a/mm/swap.h b/mm/swap.h
index 73f07bcea5f0..331424366487 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -205,6 +205,11 @@ int folio_alloc_swap(struct folio *folio);
int folio_dup_swap(struct folio *folio, struct page *subpage);
void folio_put_swap(struct folio *folio, struct page *subpage);
+/* For internal use */
+extern void swap_entries_free(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset, unsigned int nr_pages);
+
/* linux/mm/page_io.c */
int sio_pool_init(void);
struct swap_iocb;
@@ -256,14 +261,6 @@ static inline bool folio_matches_swap_entry(const struct folio *folio,
return folio_entry.val == round_down(entry.val, nr_pages);
}
-/* Temporary internal helpers */
-void __swapcache_set_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry);
-void __swapcache_clear_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr);
-
/*
* All swap cache helpers below require the caller to ensure the swap entries
* used are valid and stablize the device by any of the following ways:
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 41d4fa056203..2bf72d58f6ee 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -215,17 +215,6 @@ static int swap_cache_add_folio(struct folio *folio, swp_entry_t entry,
shadow = swp_tb_to_shadow(old_tb);
offset++;
} while (++ci_off < ci_end);
-
- ci_off = ci_start;
- offset = swp_offset(entry);
- do {
- /*
- * Still need to pin the slots with SWAP_HAS_CACHE since
- * swap allocator depends on that.
- */
- __swapcache_set_cached(si, ci, swp_entry(swp_type(entry), offset));
- offset++;
- } while (++ci_off < ci_end);
__swap_cache_add_folio(ci, folio, entry);
swap_cluster_unlock(ci);
if (shadowp)
@@ -256,6 +245,7 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
struct swap_info_struct *si;
unsigned long old_tb, new_tb;
unsigned int ci_start, ci_off, ci_end;
+ bool folio_swapped = false, need_free = false;
unsigned long nr_pages = folio_nr_pages(folio);
VM_WARN_ON_ONCE(__swap_entry_to_cluster(entry) != ci);
@@ -273,13 +263,27 @@ void __swap_cache_del_folio(struct swap_cluster_info *ci, struct folio *folio,
old_tb = __swap_table_xchg(ci, ci_off, new_tb);
WARN_ON_ONCE(!swp_tb_is_folio(old_tb) ||
swp_tb_to_folio(old_tb) != folio);
+ if (__swap_count(swp_entry(si->type,
+ swp_offset(entry) + ci_off - ci_start)))
+ folio_swapped = true;
+ else
+ need_free = true;
} while (++ci_off < ci_end);
folio->swap.val = 0;
folio_clear_swapcache(folio);
node_stat_mod_folio(folio, NR_FILE_PAGES, -nr_pages);
lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr_pages);
- __swapcache_clear_cached(si, ci, entry, nr_pages);
+
+ if (!folio_swapped) {
+ swap_entries_free(si, ci, swp_offset(entry), nr_pages);
+ } else if (need_free) {
+ do {
+ if (!__swap_count(entry))
+ swap_entries_free(si, ci, swp_offset(entry), 1);
+ entry.val++;
+ } while (--nr_pages);
+ }
}
/**
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 12a1ab6f7b32..49916fdb8b70 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -48,21 +48,18 @@
#include <linux/swap_cgroup.h>
#include "swap_table.h"
#include "internal.h"
+#include "swap_table.h"
#include "swap.h"
static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
unsigned char);
static void free_swap_count_continuations(struct swap_info_struct *);
-static void swap_entries_free(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long start, unsigned int nr_pages);
static void swap_range_alloc(struct swap_info_struct *si,
unsigned int nr_entries);
static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr);
static void swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned char usage);
+ unsigned long offset);
static bool folio_swapcache_freeable(struct folio *folio);
static void move_cluster(struct swap_info_struct *si,
struct swap_cluster_info *ci, struct list_head *list,
@@ -149,11 +146,6 @@ static struct swap_info_struct *swap_entry_to_info(swp_entry_t entry)
return swap_type_to_info(swp_type(entry));
}
-static inline unsigned char swap_count(unsigned char ent)
-{
- return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */
-}
-
/*
* Use the second highest bit of inuse_pages counter as the indicator
* if one swap device is on the available plist, so the atomic can
@@ -185,15 +177,20 @@ static long swap_usage_in_pages(struct swap_info_struct *si)
#define TTRS_FULL 0x4
static bool swap_only_has_cache(struct swap_info_struct *si,
- unsigned long offset, int nr_pages)
+ struct swap_cluster_info *ci,
+ unsigned long offset, int nr_pages)
{
+ unsigned int ci_off = offset % SWAPFILE_CLUSTER;
unsigned char *map = si->swap_map + offset;
unsigned char *map_end = map + nr_pages;
+ unsigned long swp_tb;
do {
- VM_BUG_ON(!(*map & SWAP_HAS_CACHE));
- if (*map != SWAP_HAS_CACHE)
+ swp_tb = __swap_table_get(ci, ci_off);
+ VM_WARN_ON_ONCE(!swp_tb_is_folio(swp_tb));
+ if (*map)
return false;
+ ++ci_off;
} while (++map < map_end);
return true;
@@ -254,7 +251,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
* reference or pending writeback, and can't be allocated to others.
*/
ci = swap_cluster_lock(si, offset);
- need_reclaim = swap_only_has_cache(si, offset, nr_pages);
+ need_reclaim = swap_only_has_cache(si, ci, offset, nr_pages);
swap_cluster_unlock(ci);
if (!need_reclaim)
goto out_unlock;
@@ -775,7 +772,7 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
spin_unlock(&ci->lock);
do {
- if (swap_count(READ_ONCE(map[offset])))
+ if (READ_ONCE(map[offset]))
break;
swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
if (swp_tb_is_folio(swp_tb)) {
@@ -800,7 +797,7 @@ static unsigned int cluster_reclaim_range(struct swap_info_struct *si,
*/
for (offset = start; offset < end; offset++) {
swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
- if (swap_count(map[offset]) || !swp_tb_is_null(swp_tb))
+ if (map[offset] || !swp_tb_is_null(swp_tb))
return SWAP_ENTRY_INVALID;
}
@@ -820,11 +817,10 @@ static bool cluster_scan_range(struct swap_info_struct *si,
return true;
do {
- if (swap_count(map[offset]))
+ if (map[offset])
return false;
swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
if (swp_tb_is_folio(swp_tb)) {
- WARN_ON_ONCE(!(map[offset] & SWAP_HAS_CACHE));
if (!vm_swap_full())
return false;
*need_reclaim = true;
@@ -882,11 +878,6 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
if (likely(folio)) {
order = folio_order(folio);
nr_pages = 1 << order;
- /*
- * Pin the slot with SWAP_HAS_CACHE to satisfy swap_dup_entries.
- * This is the legacy allocation behavior, will drop it very soon.
- */
- memset(si->swap_map + offset, SWAP_HAS_CACHE, nr_pages);
__swap_cache_add_folio(ci, folio, swp_entry(si->type, offset));
} else {
order = 0;
@@ -995,8 +986,8 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
to_scan--;
while (offset < end) {
- if (!swap_count(READ_ONCE(map[offset])) &&
- swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
+ if (!READ_ONCE(map[offset]) &&
+ swp_tb_is_folio(swap_table_get(ci, offset % SWAPFILE_CLUSTER))) {
spin_unlock(&ci->lock);
nr_reclaim = __try_to_reclaim_swap(si, offset,
TTRS_ANYWAY);
@@ -1431,8 +1422,8 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
do {
swp_tb = __swap_table_get(ci, offset % SWAPFILE_CLUSTER);
count = si->swap_map[offset];
- VM_WARN_ON(swap_count(count) < 1 || count == SWAP_MAP_BAD);
- if (swap_count(count) == 1) {
+ VM_WARN_ON(count < 1 || count == SWAP_MAP_BAD);
+ if (count == 1) {
/* count == 1 and non-cached slots will be batch freed. */
if (!swp_tb_is_folio(swp_tb)) {
if (!batch_start)
@@ -1440,7 +1431,6 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
continue;
}
/* count will be 0 after put, slot can be reclaimed */
- VM_WARN_ON(!(count & SWAP_HAS_CACHE));
need_reclaim = true;
}
/*
@@ -1449,7 +1439,7 @@ static void swap_put_entries_cluster(struct swap_info_struct *si,
* slots will be freed when folio is removed from swap cache
* (__swap_cache_del_folio).
*/
- swap_put_entry_locked(si, ci, offset, 1);
+ swap_put_entry_locked(si, ci, offset);
if (batch_start) {
swap_entries_free(si, ci, batch_start, offset - batch_start);
batch_start = SWAP_ENTRY_INVALID;
@@ -1602,13 +1592,8 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
offset = swp_offset(entry);
if (offset >= si->max)
goto bad_offset;
- if (data_race(!si->swap_map[swp_offset(entry)]))
- goto bad_free;
return si;
-bad_free:
- pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
- goto out;
bad_offset:
pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
goto out;
@@ -1623,21 +1608,12 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
static void swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
- unsigned long offset,
- unsigned char usage)
+ unsigned long offset)
{
unsigned char count;
- unsigned char has_cache;
count = si->swap_map[offset];
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE) {
- VM_BUG_ON(!has_cache);
- has_cache = 0;
- } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
+ if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
if (count == COUNT_CONTINUED) {
if (swap_count_continued(si, offset, count))
count = SWAP_MAP_MAX | COUNT_CONTINUED;
@@ -1647,10 +1623,8 @@ static void swap_put_entry_locked(struct swap_info_struct *si,
count--;
}
- usage = count | has_cache;
- if (usage)
- WRITE_ONCE(si->swap_map[offset], usage);
- else
+ WRITE_ONCE(si->swap_map[offset], count);
+ if (!count && !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)))
swap_entries_free(si, ci, offset, 1);
}
@@ -1719,21 +1693,13 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
return NULL;
}
-/*
- * Check if it's the last ref of swap entry in the freeing path.
- */
-static inline bool __maybe_unused swap_is_last_ref(unsigned char count)
-{
- return (count == SWAP_HAS_CACHE) || (count == 1);
-}
-
/*
* Drop the last ref of swap entries, caller have to ensure all entries
* belong to the same cgroup and cluster.
*/
-static void swap_entries_free(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- unsigned long offset, unsigned int nr_pages)
+void swap_entries_free(struct swap_info_struct *si,
+ struct swap_cluster_info *ci,
+ unsigned long offset, unsigned int nr_pages)
{
swp_entry_t entry = swp_entry(si->type, offset);
unsigned char *map = si->swap_map + offset;
@@ -1746,7 +1712,7 @@ static void swap_entries_free(struct swap_info_struct *si,
ci->count -= nr_pages;
do {
- VM_BUG_ON(!swap_is_last_ref(*map));
+ VM_WARN_ON(*map > 1);
*map = 0;
} while (++map < map_end);
@@ -1765,7 +1731,7 @@ int __swap_count(swp_entry_t entry)
struct swap_info_struct *si = __swap_entry_to_info(entry);
pgoff_t offset = swp_offset(entry);
- return swap_count(si->swap_map[offset]);
+ return si->swap_map[offset];
}
/**
@@ -1779,7 +1745,7 @@ bool swap_entry_swapped(struct swap_info_struct *si, unsigned long offset)
int count;
ci = swap_cluster_lock(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
swap_cluster_unlock(ci);
return count && count != SWAP_MAP_BAD;
@@ -1806,7 +1772,7 @@ int swp_swapcount(swp_entry_t entry)
ci = swap_cluster_lock(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
if (!(count & COUNT_CONTINUED))
goto out;
@@ -1844,12 +1810,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
ci = swap_cluster_lock(si, offset);
if (nr_pages == 1) {
- if (swap_count(map[roffset]))
+ if (map[roffset])
ret = true;
goto unlock_out;
}
for (i = 0; i < nr_pages; i++) {
- if (swap_count(map[offset + i])) {
+ if (map[offset + i]) {
ret = true;
break;
}
@@ -2005,7 +1971,7 @@ void swap_free_hibernation_slot(swp_entry_t entry)
return;
ci = swap_cluster_lock(si, offset);
- swap_put_entry_locked(si, ci, offset, 1);
+ swap_put_entry_locked(si, ci, offset);
WARN_ON(swap_entry_swapped(si, offset));
swap_cluster_unlock(ci);
@@ -2412,6 +2378,7 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
unsigned int prev)
{
unsigned int i;
+ unsigned long swp_tb;
unsigned char count;
/*
@@ -2422,7 +2389,11 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
*/
for (i = prev + 1; i < si->max; i++) {
count = READ_ONCE(si->swap_map[i]);
- if (count && swap_count(count) != SWAP_MAP_BAD)
+ swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
+ i % SWAPFILE_CLUSTER);
+ if (count == SWAP_MAP_BAD)
+ continue;
+ if (count || swp_tb_is_folio(swp_tb))
break;
if ((i % LATENCY_LIMIT) == 0)
cond_resched();
@@ -3649,39 +3620,26 @@ static int swap_dup_entries(struct swap_info_struct *si,
unsigned char usage, int nr)
{
int i;
- unsigned char count, has_cache;
+ unsigned char count;
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
-
/*
* Allocator never allocates bad slots, and readahead is guarded
* by swap_entry_swapped.
*/
- if (WARN_ON(swap_count(count) == SWAP_MAP_BAD))
- return -ENOENT;
-
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (!count && !has_cache) {
- return -ENOENT;
- } else if (usage == SWAP_HAS_CACHE) {
- if (has_cache)
- return -EEXIST;
- } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
- return -EINVAL;
- }
+ VM_WARN_ON(count == SWAP_MAP_BAD);
+ /*
+ * Swap count duplication is guaranteed by either locked swap cache
+ * folio (folio_dup_swap) or external lock (swap_dup_entry_direct).
+ */
+ VM_WARN_ON(!count &&
+ !swp_tb_is_folio(__swap_table_get(ci, offset % SWAPFILE_CLUSTER)));
}
for (i = 0; i < nr; i++) {
count = si->swap_map[offset + i];
- has_cache = count & SWAP_HAS_CACHE;
- count &= ~SWAP_HAS_CACHE;
-
- if (usage == SWAP_HAS_CACHE)
- has_cache = SWAP_HAS_CACHE;
- else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
+ if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
count += usage;
else if (swap_count_continued(si, offset + i, count))
count = COUNT_CONTINUED;
@@ -3693,7 +3651,7 @@ static int swap_dup_entries(struct swap_info_struct *si,
return -ENOMEM;
}
- WRITE_ONCE(si->swap_map[offset + i], count | has_cache);
+ WRITE_ONCE(si->swap_map[offset + i], count);
}
return 0;
@@ -3739,27 +3697,6 @@ int swap_dup_entry_direct(swp_entry_t entry)
return err;
}
-/* Mark the swap map as HAS_CACHE, caller need to hold the cluster lock */
-void __swapcache_set_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry)
-{
- WARN_ON(swap_dup_entries(si, ci, swp_offset(entry), SWAP_HAS_CACHE, 1));
-}
-
-/* Clear the swap map as !HAS_CACHE, caller need to hold the cluster lock */
-void __swapcache_clear_cached(struct swap_info_struct *si,
- struct swap_cluster_info *ci,
- swp_entry_t entry, unsigned int nr)
-{
- if (swap_only_has_cache(si, swp_offset(entry), nr)) {
- swap_entries_free(si, ci, swp_offset(entry), nr);
- } else {
- for (int i = 0; i < nr; i++, entry.val++)
- swap_put_entry_locked(si, ci, swp_offset(entry), SWAP_HAS_CACHE);
- }
-}
-
/*
* add_swap_count_continuation - called when a swap count is duplicated
* beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's
@@ -3805,7 +3742,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask)
ci = swap_cluster_lock(si, offset);
- count = swap_count(si->swap_map[offset]);
+ count = si->swap_map[offset];
if ((count & ~COUNT_CONTINUED) != SWAP_MAP_MAX) {
/*
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* [PATCH 19/19] mm, swap: remove no longer needed _swap_info_get
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (17 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 18/19] mm, swap: drop the SWAP_HAS_CACHE flag Kairui Song
@ 2025-10-29 15:58 ` Kairui Song
2025-10-30 23:04 ` [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Yosry Ahmed
2025-11-05 7:39 ` Chris Li
20 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-29 15:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
From: Kairui Song <kasong@tencent.com>
There are now only two users of _swap_info_get left after consolidating
its callers: folio_try_reclaim_swap and swp_swapcount.
folio_free_swap already holds the folio lock and the folio is in the
swap cache, so _swap_info_get is redundant there.
For swp_swapcount, it can just use get_swap_device instead. It only
needs to check the swap count; both are fine, except that
get_swap_device increases the device ref count, which is actually a bit
safer. Its only current user is the smaps walk, so the performance
change here is tiny.
And after these changes, _swap_info_get is no longer used, so we can
safely remove it.
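For reference, the resulting swp_swapcount locking pattern is roughly
the following. This is only a minimal sketch based on the hunks below;
the COUNT_CONTINUED walk and error handling are omitted, and it is not
the exact kernel code:

int swp_swapcount(swp_entry_t entry)
{
	struct swap_info_struct *si;
	struct swap_cluster_info *ci;
	pgoff_t offset = swp_offset(entry);
	int count;

	/* Pin the swap device and validate the entry. */
	si = get_swap_device(entry);
	if (!si)
		return 0;

	ci = swap_cluster_lock(si, offset);
	count = si->swap_map[offset];
	/*
	 * If COUNT_CONTINUED is set, the real code walks the continuation
	 * pages to compute the full count; that part is omitted here.
	 */
	swap_cluster_unlock(ci);

	/* Drop the reference taken by get_swap_device(). */
	put_swap_device(si);
	return count;
}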
Signed-off-by: Kairui Song <kasong@tencent.com>
---
mm/swapfile.c | 39 ++++++---------------------------------
1 file changed, 6 insertions(+), 33 deletions(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 49916fdb8b70..150916f4640c 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1577,35 +1577,6 @@ void folio_put_swap(struct folio *folio, struct page *subpage)
swap_put_entries_cluster(si, swp_offset(entry), nr_pages, false);
}
-static struct swap_info_struct *_swap_info_get(swp_entry_t entry)
-{
- struct swap_info_struct *si;
- unsigned long offset;
-
- if (!entry.val)
- goto out;
- si = swap_entry_to_info(entry);
- if (!si)
- goto bad_nofile;
- if (data_race(!(si->flags & SWP_USED)))
- goto bad_device;
- offset = swp_offset(entry);
- if (offset >= si->max)
- goto bad_offset;
- return si;
-
-bad_offset:
- pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
- goto out;
-bad_device:
- pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val);
- goto out;
-bad_nofile:
- pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val);
-out:
- return NULL;
-}
-
static void swap_put_entry_locked(struct swap_info_struct *si,
struct swap_cluster_info *ci,
unsigned long offset)
@@ -1764,7 +1735,7 @@ int swp_swapcount(swp_entry_t entry)
pgoff_t offset;
unsigned char *map;
- si = _swap_info_get(entry);
+ si = get_swap_device(entry);
if (!si)
return 0;
@@ -1794,6 +1765,7 @@ int swp_swapcount(swp_entry_t entry)
} while (tmp_count & COUNT_CONTINUED);
out:
swap_cluster_unlock(ci);
+ put_swap_device(si);
return count;
}
@@ -1828,11 +1800,12 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
static bool folio_swapped(struct folio *folio)
{
swp_entry_t entry = folio->swap;
- struct swap_info_struct *si = _swap_info_get(entry);
+ struct swap_info_struct *si;
- if (!si)
- return false;
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+ VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
+ si = __swap_entry_to_info(entry);
if (!IS_ENABLED(CONFIG_THP_SWAP) || likely(!folio_test_large(folio)))
return swap_entry_swapped(si, swp_offset(entry));
--
2.51.1
^ permalink raw reply related [flat|nested] 50+ messages in thread
* Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II)
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (18 preceding siblings ...)
2025-10-29 15:58 ` [PATCH 19/19] mm, swap: remove no longer needed _swap_info_get Kairui Song
@ 2025-10-30 23:04 ` Yosry Ahmed
2025-10-31 6:58 ` Kairui Song
2025-11-05 7:39 ` Chris Li
20 siblings, 1 reply; 50+ messages in thread
From: Yosry Ahmed @ 2025-10-30 23:04 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
On Wed, Oct 29, 2025 at 11:58:26PM +0800, Kairui Song wrote:
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
> special swap bits including SWAP_HAS_CACHE, along with many historical
> issues. The performance is about ~20% better for some workloads, like
> Redis with persistence. This also cleans up the code to prepare for
> later phases, some patches are from a previously posted series.
>
> Swap cache bypassing and swap synchronization in general had many
> issues. Some are solved as workarounds, and some are still there [1]. To
> resolve them in a clean way, one good solution is to always use swap
> cache as the synchronization layer [2]. So we have to remove the swap
> cache bypass swap-in path first. It wasn't very doable due to
> performance issues, but now combined with the swap table, removing
> the swap cache bypass path will instead improve the performance,
> there is no reason to keep it.
>
> Now we can rework the swap entry and cache synchronization following
> the new design. Swap cache synchronization was heavily relying on
> SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> of special swap map bits and related workarounds, we get a cleaner code
> base and prepare for merging the swap count into the swap table in the
> next step.
>
> Test results:
>
> Redis / Valkey bench:
> =====================
>
> Testing on a ARM64 VM 1.5G memory:
> Server: valkey-server --maxmemory 2560M
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
> no persistence with BGSAVE
> Before: 460475.84 RPS 311591.19 RPS
> After: 451943.34 RPS (-1.9%) 371379.06 RPS (+19.2%)
>
> Testing on a x86_64 VM with 4G memory (system components takes about 2G):
> Server:
> Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
> no persistence with BGSAVE
> Before: 306044.38 RPS 102745.88 RPS
> After: 309645.44 RPS (+1.2%) 125313.28 RPS (+22.0%)
>
> The performance is a lot better when persistence is applied. This should
> apply to many other workloads that involve sharing memory and COW. A
> slight performance drop was observed for the ARM64 Redis test: We are
> still using swap_map to track the swap count, which is causing redundant
> cache and CPU overhead and is not very performance-friendly for some
> arches. This will be improved once we merge the swap map into the swap
> table (as already demonstrated previously [3]).
>
> vm-scalability
> ==============
> usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> simulated PMEM as swap), average result of 6 test run:
>
> Before: After:
> System time: 282.22s 283.47s
> Sum Throughput: 5677.35 MB/s 5688.78 MB/s
> Single process Throughput: 176.41 MB/s 176.23 MB/s
> Free latency: 518477.96 us 521488.06 us
>
> Which is almost identical.
>
> Build kernel test:
> ==================
> Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
>
> Before After:
> System time: 1379.91s 1364.22s (-0.11%)
>
> Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
> with 4G RAM, under global pressure, avg of 32 test run:
>
> Before After:
> System time: 1822.52s 1803.33s (-0.11%)
>
> Which is almost identical.
>
> MySQL:
> ======
> sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> 512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).
>
> Before: 318162.18 qps
> After: 318512.01 qps (+0.01%)
>
> In conclusion, the result is looking better or identical for most cases,
> and it's especially better for workloads with swap count > 1 on SYNC_IO
> devices, about ~20% gain in above test. Next phases will start to merge
> swap count into swap table and reduce memory usage.
>
> One more gain here is that we now have better support for THP swapin.
> Previously, the THP swapin was bound with swap cache bypassing, which
> only works for single-mapped folios. Removing the bypassing path also
> enabled THP swapin for all folios. It's still limited to SYNC_IO
> devices, though, this limitation can will be removed later. This may
> cause more serious thrashing for certain workloads, but that's not an
> issue caused by this series, it's a common THP issue we should resolve
> separately.
>
> Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
>
> Suggested-by: Chris Li <chrisl@kernel.org>
> Signed-off-by: Kairui Song <kasong@tencent.com>
Unfortunately I don't have time to go through the series and review it,
but I wanted to just say awesome work here. The special cases in the
swap code to avoid using the swapcache have always been a pain.
In fact, there's one more special case that we can probably remove in
zswap_load() now, the one introduced by commit 25cd241408a2 ("mm: zswap:
fix data loss on SWP_SYNCHRONOUS_IO devices").
> ---
> Kairui Song (18):
> mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
> mm, swap: split swap cache preparation loop into a standalone helper
> mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
> mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
> mm, swap: simplify the code and reduce indention
> mm, swap: free the swap cache after folio is mapped
> mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
> mm, swap: swap entry of a bad slot should not be considered as swapped out
> mm, swap: consolidate cluster reclaim and check logic
> mm, swap: split locked entry duplicating into a standalone helper
> mm, swap: use swap cache as the swap in synchronize layer
> mm, swap: remove workaround for unsynchronized swap map cache state
> mm, swap: sanitize swap entry management workflow
> mm, swap: add folio to swap cache directly on allocation
> mm, swap: check swap table directly for checking cache
> mm, swap: clean up and improve swap entries freeing
> mm, swap: drop the SWAP_HAS_CACHE flag
> mm, swap: remove no longer needed _swap_info_get
>
> Nhat Pham (1):
> mm/shmem, swap: remove SWAP_MAP_SHMEM
>
> arch/s390/mm/pgtable.c | 2 +-
> include/linux/swap.h | 77 ++---
> kernel/power/swap.c | 10 +-
> mm/madvise.c | 2 +-
> mm/memory.c | 270 +++++++---------
> mm/rmap.c | 7 +-
> mm/shmem.c | 75 ++---
> mm/swap.h | 69 +++-
> mm/swap_state.c | 341 +++++++++++++-------
> mm/swapfile.c | 849 +++++++++++++++++++++----------------------------
> mm/userfaultfd.c | 10 +-
> mm/vmscan.c | 1 -
> mm/zswap.c | 4 +-
> 13 files changed, 840 insertions(+), 877 deletions(-)
> ---
> base-commit: f30d294530d939fa4b77d61bc60f25c4284841fa
> change-id: 20251007-swap-table-p2-7d3086e5c38a
>
> Best regards,
> --
> Kairui Song <kasong@tencent.com>
>
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II)
2025-10-30 23:04 ` [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Yosry Ahmed
@ 2025-10-31 6:58 ` Kairui Song
0 siblings, 0 replies; 50+ messages in thread
From: Kairui Song @ 2025-10-31 6:58 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Chris Li,
Nhat Pham, Johannes Weiner, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel
On Fri, Oct 31, 2025 at 7:05 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> On Wed, Oct 29, 2025 at 11:58:26PM +0800, Kairui Song wrote:
> > This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
> > special swap bits including SWAP_HAS_CACHE, along with many historical
> > issues. The performance is about ~20% better for some workloads, like
> > Redis with persistence. This also cleans up the code to prepare for
> > later phases, some patches are from a previously posted series.
> >
> > Swap cache bypassing and swap synchronization in general had many
> > issues. Some are solved as workarounds, and some are still there [1]. To
> > resolve them in a clean way, one good solution is to always use swap
> > cache as the synchronization layer [2]. So we have to remove the swap
> > cache bypass swap-in path first. It wasn't very doable due to
> > performance issues, but now combined with the swap table, removing
> > the swap cache bypass path will instead improve the performance,
> > there is no reason to keep it.
> >
> > Now we can rework the swap entry and cache synchronization following
> > the new design. Swap cache synchronization was heavily relying on
> > SWAP_HAS_CACHE, which is the cause of many issues. By dropping the usage
> > of special swap map bits and related workarounds, we get a cleaner code
> > base and prepare for merging the swap count into the swap table in the
> > next step.
> >
> > Test results:
> >
> > Redis / Valkey bench:
> > =====================
> >
> > Testing on a ARM64 VM 1.5G memory:
> > Server: valkey-server --maxmemory 2560M
> > Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> >
> > no persistence with BGSAVE
> > Before: 460475.84 RPS 311591.19 RPS
> > After: 451943.34 RPS (-1.9%) 371379.06 RPS (+19.2%)
> >
> > Testing on a x86_64 VM with 4G memory (system components takes about 2G):
> > Server:
> > Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
> >
> > no persistence with BGSAVE
> > Before: 306044.38 RPS 102745.88 RPS
> > After: 309645.44 RPS (+1.2%) 125313.28 RPS (+22.0%)
> >
> > The performance is a lot better when persistence is applied. This should
> > apply to many other workloads that involve sharing memory and COW. A
> > slight performance drop was observed for the ARM64 Redis test: We are
> > still using swap_map to track the swap count, which is causing redundant
> > cache and CPU overhead and is not very performance-friendly for some
> > arches. This will be improved once we merge the swap map into the swap
> > table (as already demonstrated previously [3]).
> >
> > vm-scabiity
> > ===========
> > usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
> > simulated PMEM as swap), average result of 6 test run:
> >
> > Before: After:
> > System time: 282.22s 283.47s
> > Sum Throughput: 5677.35 MB/s 5688.78 MB/s
> > Single process Throughput: 176.41 MB/s 176.23 MB/s
> > Free latency: 518477.96 us 521488.06 us
> >
> > Which is almost identical.
> >
> > Build kernel test:
> > ==================
> > Test using ZRAM as SWAP, make -j48, defconfig, on a x86_64 VM
> > with 4G RAM, under global pressure, avg of 32 test run:
> >
> > Before After:
> > System time: 1379.91s 1364.22s (-0.11%)
> >
> > Test using ZSWAP with NVME SWAP, make -j48, defconfig, on a x86_64 VM
> > with 4G RAM, under global pressure, avg of 32 test run:
> >
> > Before After:
> > System time: 1822.52s 1803.33s (-0.11%)
> >
> > Which is almost identical.
> >
> > MySQL:
> > ======
> > sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
> > --table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
> > 512M memory cgroup, buffer pool set to 3G, 3 test run and 180s warm up).
> >
> > Before: 318162.18 qps
> > After: 318512.01 qps (+0.01%)
> >
> > In conclusion, the result is looking better or identical for most cases,
> > and it's especially better for workloads with swap count > 1 on SYNC_IO
> > devices, about ~20% gain in above test. Next phases will start to merge
> > swap count into swap table and reduce memory usage.
> >
> > One more gain here is that we now have better support for THP swapin.
> > Previously, the THP swapin was bound with swap cache bypassing, which
> > only works for single-mapped folios. Removing the bypassing path also
> > enabled THP swapin for all folios. It's still limited to SYNC_IO
> > devices, though, this limitation can will be removed later. This may
> > cause more serious thrashing for certain workloads, but that's not an
> > issue caused by this series, it's a common THP issue we should resolve
> > separately.
> >
> > Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
> > Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
> > Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
> >
> > Suggested-by: Chris Li <chrisl@kernel.org>
> > Signed-off-by: Kairui Song <kasong@tencent.com>
>
> Unfortunately I don't have time to go through the series and review it,
> but I wanted to just say awesome work here. The special cases in the
> swap code to avoid using the swapcache have always been a pain.
>
> In fact, there's one more special case that we can probably remove in
> zswap_load() now, the one introduced by commit 25cd241408a2 ("mm: zswap:
> fix data loss on SWP_SYNCHRONOUS_IO devices").
Thanks! Oh, now I remember that one; it can indeed be removed. There
are several more cleanups and optimizations that can be done after this
series; it's getting too long already, so I didn't include everything.
But removing 25cd241408a2 is easy to do and easy to review, so I can
include it in the next update.
^ permalink raw reply [flat|nested] 50+ messages in thread
* Re: [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II)
2025-10-29 15:58 [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Kairui Song
` (19 preceding siblings ...)
2025-10-30 23:04 ` [PATCH 00/19] mm, swap: never bypass swap cache and cleanup flags (swap table phase II) Yosry Ahmed
@ 2025-11-05 7:39 ` Chris Li
20 siblings, 0 replies; 50+ messages in thread
From: Chris Li @ 2025-11-05 7:39 UTC (permalink / raw)
To: Kairui Song
Cc: linux-mm, Andrew Morton, Baoquan He, Barry Song, Nhat Pham,
Johannes Weiner, Yosry Ahmed, David Hildenbrand, Youngjun Park,
Hugh Dickins, Baolin Wang, Huang, Ying, Kemeng Shi,
Lorenzo Stoakes, Matthew Wilcox (Oracle), linux-kernel,
Kairui Song
Sorry, I have been super busy and am late to the review party.
I am still catching up on my backlog.
The cover letter title is a bit too long; I suggest putting "swap table
phase II" at the beginning of the title rather than at the end, since
"phase II" currently gets wrapped onto another line. Maybe just using
"swap table phase II" as the cover letter title is good enough. You can
explain what this series does in more detail in the body of the cover
letter.
Also, we could mention the estimated total number of phases for the
swap table work (4-5 phases?). It does not need to be precise; it just
serves as an overall indication of the swap table progress bar.
On Wed, Oct 29, 2025 at 8:59 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
Great job!
> special swap bits including SWAP_HAS_CACHE, along with many historical
> issues. The performance is about ~20% better for some workloads, like
> Redis with persistence. This also cleans up the code to prepare for
> later phases, some patches are from a previously posted series.
It is wonderful that we can remove SWAP_HAS_CACHE and the sync IO swap
cache bypass. The swap table is so fast that the bypass does not make
sense any more.
> [...]
>
> One more gain here is that we now have better support for THP swapin.
> Previously, the THP swapin was bound with swap cache bypassing, which
> only works for single-mapped folios. Removing the bypassing path also
> enabled THP swapin for all folios. It's still limited to SYNC_IO
> devices, though, this limitation can will be removed later. This may
Grammar: "though, this", "can will be".
The THP swapin is still limited to SYNC_IO devices. This limitation
can be removed later.
Chris
^ permalink raw reply [flat|nested] 50+ messages in thread