* [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less
@ 2025-07-16 8:05 Hugh Dickins
2025-07-16 8:08 ` [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates Hugh Dickins
` (3 more replies)
0 siblings, 4 replies; 10+ messages in thread
From: Hugh Dickins @ 2025-07-16 8:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Baolin Wang, Baoquan He, Barry Song, Chris Li, David Rientjes,
Kairui Song, Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm

A flamegraph (from an MGLRU load) showed shmem_writeout()'s use of the
global shmem_swaplist_mutex worryingly hot: improvement is long overdue.

3.1 commit 6922c0c7abd3 ("tmpfs: convert shmem_writepage and enable swap")
apologized for extending shmem_swaplist_mutex across add_to_swap_cache(),
and hoped to find another way: yes, there may be lots of work to allocate
radix tree nodes in there. Then 6.15 commit b487a2da3575 ("mm, swap:
simplify folio swap allocation") will have made it worse, by moving
shmem_writeout()'s swap allocation under that mutex too (but the worrying
flamegraph was observed even before that change).

There's a useful comment about pagelock no longer protecting from eviction
once moved to swap cache: but it's good till shmem_delete_from_page_cache()
replaces page pointer by swap entry, so move the swaplist add between them.

We would much prefer to take the global lock once per inode than once per
page: given the possible races with shmem_unuse() pruning when !swapped
(and other tasks racing to swap other pages out or in), try the swaplist
add whenever swapped was incremented from 0 (but inode may already be on
the list - only unuse and evict bother to remove it).

This technique is more subtle than it looks (we're avoiding the very lock
which would make it easy), but works: whereas an unlocked list_empty()
check runs a risk of the inode being unqueued and left off the swaplist
forever, swapoff only completing when the page is faulted in or removed.

The need for a sleepable mutex went away in 5.1 commit b56a2d8af914
("mm: rid swapoff of quadratic complexity"): a spinlock works better now.

This commit is certain to take shmem_swaplist_mutex out of contention,
and has been seen to make a practical improvement (but there is likely
to have been an underlying issue which made its contention so visible).

Signed-off-by: Hugh Dickins <hughd@google.com>
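
A minimal userspace sketch of the locking pattern described above, with
pthread spinlocks standing in for the kernel primitives (every struct,
field and function name below is an illustrative stand-in, not the
kernel's): the global list lock is taken only on the 0 -> N transition
of the per-inode swapped count, which the per-inode lock serializes.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the kernel structures (illustrative only). */
struct inode_info {
	pthread_spinlock_t lock;	/* plays the role of info->lock */
	long swapped;			/* pages of this inode now in swap */
	bool on_swaplist;		/* plays the role of !list_empty(&info->swaplist) */
	struct inode_info *next;	/* singly linked swaplist, for brevity */
};

static pthread_spinlock_t swaplist_lock;	/* plays the role of shmem_swaplist_lock */
static struct inode_info *swaplist_head;

/* Account nr newly swapped pages; report whether swapped went 0 -> nr. */
static bool account_swapped(struct inode_info *info, long nr)
{
	bool first_swapped;

	pthread_spin_lock(&info->lock);
	info->swapped += nr;
	first_swapped = (info->swapped == nr);
	pthread_spin_unlock(&info->lock);
	return first_swapped;
}

/* Called once per pageout: global lock taken only on the 0 -> nr transition. */
static void note_swapout(struct inode_info *info, long nr)
{
	if (!account_swapped(info, nr))
		return;			/* common case: no global lock taken */

	pthread_spin_lock(&swaplist_lock);
	if (!info->on_swaplist) {	/* may already be queued: that is fine */
		info->on_swaplist = true;
		info->next = swaplist_head;
		swaplist_head = info;
	}
	pthread_spin_unlock(&swaplist_lock);
}

int main(void)
{
	struct inode_info info = { .swapped = 0 };

	pthread_spin_init(&swaplist_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&info.lock, PTHREAD_PROCESS_PRIVATE);

	note_swapout(&info, 4);		/* 0 -> 4: queues the inode */
	note_swapout(&info, 2);		/* 4 -> 6: skips the global lock */
	printf("swapped=%ld on_swaplist=%d\n", info.swapped, info.on_swaplist);
	return 0;
}

Built with 'cc -pthread', it prints swapped=6 on_swaplist=1: two
writeouts, but the global lock taken only once.
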
---
mm/shmem.c | 59 ++++++++++++++++++++++++++++++------------------------
1 file changed, 33 insertions(+), 26 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 60247dc48505..33675361031b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -292,7 +292,7 @@ bool vma_is_shmem(struct vm_area_struct *vma)
}
static LIST_HEAD(shmem_swaplist);
-static DEFINE_MUTEX(shmem_swaplist_mutex);
+static DEFINE_SPINLOCK(shmem_swaplist_lock);
#ifdef CONFIG_TMPFS_QUOTA
@@ -432,10 +432,13 @@ static void shmem_free_inode(struct super_block *sb, size_t freed_ispace)
*
* But normally info->alloced == inode->i_mapping->nrpages + info->swapped
* So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
+ *
+ * Return: true if swapped was incremented from 0, for shmem_writeout().
*/
-static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
+static bool shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
{
struct shmem_inode_info *info = SHMEM_I(inode);
+ bool first_swapped = false;
long freed;
spin_lock(&info->lock);
@@ -450,8 +453,11 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
* to stop a racing shmem_recalc_inode() from thinking that a page has
* been freed. Compensate here, to avoid the need for a followup call.
*/
- if (swapped > 0)
+ if (swapped > 0) {
+ if (info->swapped == swapped)
+ first_swapped = true;
freed += swapped;
+ }
if (freed > 0)
info->alloced -= freed;
spin_unlock(&info->lock);
@@ -459,6 +465,7 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
/* The quota case may block */
if (freed > 0)
shmem_inode_unacct_blocks(inode, freed);
+ return first_swapped;
}
bool shmem_charge(struct inode *inode, long pages)
@@ -1399,11 +1406,11 @@ static void shmem_evict_inode(struct inode *inode)
/* Wait while shmem_unuse() is scanning this inode... */
wait_var_event(&info->stop_eviction,
!atomic_read(&info->stop_eviction));
- mutex_lock(&shmem_swaplist_mutex);
+ spin_lock(&shmem_swaplist_lock);
/* ...but beware of the race if we peeked too early */
if (!atomic_read(&info->stop_eviction))
list_del_init(&info->swaplist);
- mutex_unlock(&shmem_swaplist_mutex);
+ spin_unlock(&shmem_swaplist_lock);
}
}
@@ -1526,7 +1533,7 @@ int shmem_unuse(unsigned int type)
if (list_empty(&shmem_swaplist))
return 0;
- mutex_lock(&shmem_swaplist_mutex);
+ spin_lock(&shmem_swaplist_lock);
start_over:
list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
if (!info->swapped) {
@@ -1540,12 +1547,12 @@ int shmem_unuse(unsigned int type)
* (igrab() would protect from unlink, but not from unmount).
*/
atomic_inc(&info->stop_eviction);
- mutex_unlock(&shmem_swaplist_mutex);
+ spin_unlock(&shmem_swaplist_lock);
error = shmem_unuse_inode(&info->vfs_inode, type);
cond_resched();
- mutex_lock(&shmem_swaplist_mutex);
+ spin_lock(&shmem_swaplist_lock);
if (atomic_dec_and_test(&info->stop_eviction))
wake_up_var(&info->stop_eviction);
if (error)
@@ -1556,7 +1563,7 @@ int shmem_unuse(unsigned int type)
if (!info->swapped)
list_del_init(&info->swaplist);
}
- mutex_unlock(&shmem_swaplist_mutex);
+ spin_unlock(&shmem_swaplist_lock);
return error;
}
@@ -1646,30 +1653,30 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
folio_mark_uptodate(folio);
}
- /*
- * Add inode to shmem_unuse()'s list of swapped-out inodes,
- * if it's not already there. Do it now before the folio is
- * moved to swap cache, when its pagelock no longer protects
- * the inode from eviction. But don't unlock the mutex until
- * we've incremented swapped, because shmem_unuse_inode() will
- * prune a !swapped inode from the swaplist under this mutex.
- */
- mutex_lock(&shmem_swaplist_mutex);
- if (list_empty(&info->swaplist))
- list_add(&info->swaplist, &shmem_swaplist);
-
if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
- shmem_recalc_inode(inode, 0, nr_pages);
+ bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
+
+ /*
+ * Add inode to shmem_unuse()'s list of swapped-out inodes,
+ * if it's not already there. Do it now before the folio is
+ * removed from page cache, when its pagelock no longer
+ * protects the inode from eviction. And do it now, after
+ * we've incremented swapped, because shmem_unuse() will
+ * prune a !swapped inode from the swaplist.
+ */
+ if (first_swapped) {
+ spin_lock(&shmem_swaplist_lock);
+ if (list_empty(&info->swaplist))
+ list_add(&info->swaplist, &shmem_swaplist);
+ spin_unlock(&shmem_swaplist_lock);
+ }
+
swap_shmem_alloc(folio->swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
- mutex_unlock(&shmem_swaplist_mutex);
BUG_ON(folio_mapped(folio));
return swap_writeout(folio, plug);
}
- if (!info->swapped)
- list_del_init(&info->swaplist);
- mutex_unlock(&shmem_swaplist_mutex);
if (nr_pages > 1)
goto try_split;
redirty:
--
2.43.0
* [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates
2025-07-16 8:05 [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less Hugh Dickins
@ 2025-07-16 8:08 ` Hugh Dickins
2025-07-17 9:44 ` Baolin Wang
2025-07-20 7:07 ` [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates David Rientjes
2025-07-17 8:46 ` [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less Baolin Wang
` (2 subsequent siblings)
3 siblings, 2 replies; 10+ messages in thread
From: Hugh Dickins @ 2025-07-16 8:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Baolin Wang, Baoquan He, Barry Song, Chris Li, David Rientjes,
Kairui Song, Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm

If swap_writeout() returns AOP_WRITEPAGE_ACTIVATE (for example, because
zswap cannot compress and memcg disables writeback), there is no virtue
in keeping that folio in swap cache and holding the swap allocation:
shmem_writeout() switches it back to shmem page cache before returning.

Folio lock is held, and folio->memcg_data remains set throughout, so
there is no need to get into any memcg or memsw charge complications:
swap_free_nr() and delete_from_swap_cache() do as much as is needed (but
beware the race with shmem_free_swap() when inode truncated or evicted).

Doing the same for an anonymous folio is harder, since it will usually
have been unmapped, with references to the swap left in the page tables.
Adding a function to remap the folio would be fun, but not worthwhile
unless it has other uses, or an urgent bug with anon is demonstrated.

Signed-off-by: Hugh Dickins <hughd@google.com>
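
The recovery path in the diff below leans on shmem_add_to_page_cache()
taking an expected entry: the folio is re-installed only if the slot
still holds the swap entry, and the call fails if a racing
shmem_free_swap() already erased it. A loose, self-contained C11
analogue of that check (a single atomic word stands in for the XArray
slot; every name and value below is invented for illustration):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* One page-cache slot: in the kernel this is an XArray slot holding either
 * a swap entry or a folio pointer; a single atomic word stands in here. */
static _Atomic(unsigned long) slot;

#define SWAP_ENTRY	0x21UL		/* made-up encoded swap entry */
#define FOLIO_PTR	0x1000UL	/* made-up folio address */

/*
 * Re-install the folio only if the slot still holds the swap entry we expect;
 * loosely mirrors shmem_add_to_page_cache() failing once a racing
 * shmem_free_swap() has erased the entry.
 */
static bool add_back_if_expected(unsigned long expected_swap, unsigned long folio)
{
	unsigned long old = expected_swap;

	return atomic_compare_exchange_strong(&slot, &old, folio);
}

int main(void)
{
	atomic_store(&slot, SWAP_ENTRY);

	if (add_back_if_expected(SWAP_ENTRY, FOLIO_PTR))
		printf("re-added to page cache: now free the swap entry\n");
	else
		printf("raced with truncation: swap already gone, skip freeing\n");
	return 0;
}

Only when the exchange succeeds is it safe to go on and free the swap
entry, which is the shape of the 'if (!error)' branch in the patch.
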
---
mm/shmem.c | 33 ++++++++++++++++++++++++++++++++-
1 file changed, 32 insertions(+), 1 deletion(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 33675361031b..5a7ce4c8bad6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1655,6 +1655,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
+ int error;
/*
* Add inode to shmem_unuse()'s list of swapped-out inodes,
@@ -1675,7 +1676,37 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
BUG_ON(folio_mapped(folio));
- return swap_writeout(folio, plug);
+ error = swap_writeout(folio, plug);
+ if (error != AOP_WRITEPAGE_ACTIVATE) {
+ /* folio has been unlocked */
+ return error;
+ }
+
+ /*
+ * The intention here is to avoid holding on to the swap when
+ * zswap was unable to compress and unable to writeback; but
+ * it will be appropriate if other reactivate cases are added.
+ */
+ error = shmem_add_to_page_cache(folio, mapping, index,
+ swp_to_radix_entry(folio->swap),
+ __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
+ /* Swap entry might be erased by racing shmem_free_swap() */
+ if (!error) {
+ spin_lock(&info->lock);
+ info->swapped -= nr_pages;
+ spin_unlock(&info->lock);
+ swap_free_nr(folio->swap, nr_pages);
+ }
+
+ /*
+ * The delete_from_swap_cache() below could be left for
+ * shrink_folio_list()'s folio_free_swap() to dispose of;
+ * but I'm a little nervous about letting this folio out of
+ * shmem_writeout() in a hybrid half-tmpfs-half-swap state
+ * e.g. folio_mapping(folio) might give an unexpected answer.
+ */
+ delete_from_swap_cache(folio);
+ goto redirty;
}
if (nr_pages > 1)
goto try_split;
--
2.43.0
* Re: [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less
2025-07-16 8:05 [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less Hugh Dickins
2025-07-16 8:08 ` [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates Hugh Dickins
@ 2025-07-17 8:46 ` Baolin Wang
2025-07-20 7:07 ` David Rientjes
2025-07-21 17:54 ` Kairui Song
3 siblings, 0 replies; 10+ messages in thread
From: Baolin Wang @ 2025-07-17 8:46 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Baoquan He, Barry Song, Chris Li, David Rientjes, Kairui Song,
Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm
On 2025/7/16 16:05, Hugh Dickins wrote:
> A flamegraph (from an MGLRU load) showed shmem_writeout()'s use of the
> global shmem_swaplist_mutex worryingly hot: improvement is long overdue.
>
> 3.1 commit 6922c0c7abd3 ("tmpfs: convert shmem_writepage and enable swap")
> apologized for extending shmem_swaplist_mutex across add_to_swap_cache(),
> and hoped to find another way: yes, there may be lots of work to allocate
> radix tree nodes in there. Then 6.15 commit b487a2da3575 ("mm, swap:
> simplify folio swap allocation") will have made it worse, by moving
> shmem_writeout()'s swap allocation under that mutex too (but the worrying
> flamegraph was observed even before that change).
>
> There's a useful comment about pagelock no longer protecting from eviction
> once moved to swap cache: but it's good till shmem_delete_from_page_cache()
> replaces page pointer by swap entry, so move the swaplist add between them.
>
> We would much prefer to take the global lock once per inode than once per
> page: given the possible races with shmem_unuse() pruning when !swapped
> (and other tasks racing to swap other pages out or in), try the swaplist
> add whenever swapped was incremented from 0 (but inode may already be on
> the list - only unuse and evict bother to remove it).
>
> This technique is more subtle than it looks (we're avoiding the very lock
> which would make it easy), but works: whereas an unlocked list_empty()
> check runs a risk of the inode being unqueued and left off the swaplist
> forever, swapoff only completing when the page is faulted in or removed.
>
> The need for a sleepable mutex went away in 5.1 commit b56a2d8af914
> ("mm: rid swapoff of quadratic complexity"): a spinlock works better now.
>
> This commit is certain to take shmem_swaplist_mutex out of contention,
> and has been seen to make a practical improvement (but there is likely
> to have been an underlying issue which made its contention so visible).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Makes sense to me. Thanks.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> mm/shmem.c | 59 ++++++++++++++++++++++++++++++------------------------
> 1 file changed, 33 insertions(+), 26 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 60247dc48505..33675361031b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -292,7 +292,7 @@ bool vma_is_shmem(struct vm_area_struct *vma)
> }
>
> static LIST_HEAD(shmem_swaplist);
> -static DEFINE_MUTEX(shmem_swaplist_mutex);
> +static DEFINE_SPINLOCK(shmem_swaplist_lock);
>
> #ifdef CONFIG_TMPFS_QUOTA
>
> @@ -432,10 +432,13 @@ static void shmem_free_inode(struct super_block *sb, size_t freed_ispace)
> *
> * But normally info->alloced == inode->i_mapping->nrpages + info->swapped
> * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
> + *
> + * Return: true if swapped was incremented from 0, for shmem_writeout().
> */
> -static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> +static bool shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> {
> struct shmem_inode_info *info = SHMEM_I(inode);
> + bool first_swapped = false;
> long freed;
>
> spin_lock(&info->lock);
> @@ -450,8 +453,11 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> * to stop a racing shmem_recalc_inode() from thinking that a page has
> * been freed. Compensate here, to avoid the need for a followup call.
> */
> - if (swapped > 0)
> + if (swapped > 0) {
> + if (info->swapped == swapped)
> + first_swapped = true;
> freed += swapped;
> + }
> if (freed > 0)
> info->alloced -= freed;
> spin_unlock(&info->lock);
> @@ -459,6 +465,7 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> /* The quota case may block */
> if (freed > 0)
> shmem_inode_unacct_blocks(inode, freed);
> + return first_swapped;
> }
>
> bool shmem_charge(struct inode *inode, long pages)
> @@ -1399,11 +1406,11 @@ static void shmem_evict_inode(struct inode *inode)
> /* Wait while shmem_unuse() is scanning this inode... */
> wait_var_event(&info->stop_eviction,
> !atomic_read(&info->stop_eviction));
> - mutex_lock(&shmem_swaplist_mutex);
> + spin_lock(&shmem_swaplist_lock);
> /* ...but beware of the race if we peeked too early */
> if (!atomic_read(&info->stop_eviction))
> list_del_init(&info->swaplist);
> - mutex_unlock(&shmem_swaplist_mutex);
> + spin_unlock(&shmem_swaplist_lock);
> }
> }
>
> @@ -1526,7 +1533,7 @@ int shmem_unuse(unsigned int type)
> if (list_empty(&shmem_swaplist))
> return 0;
>
> - mutex_lock(&shmem_swaplist_mutex);
> + spin_lock(&shmem_swaplist_lock);
> start_over:
> list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
> if (!info->swapped) {
> @@ -1540,12 +1547,12 @@ int shmem_unuse(unsigned int type)
> * (igrab() would protect from unlink, but not from unmount).
> */
> atomic_inc(&info->stop_eviction);
> - mutex_unlock(&shmem_swaplist_mutex);
> + spin_unlock(&shmem_swaplist_lock);
>
> error = shmem_unuse_inode(&info->vfs_inode, type);
> cond_resched();
>
> - mutex_lock(&shmem_swaplist_mutex);
> + spin_lock(&shmem_swaplist_lock);
> if (atomic_dec_and_test(&info->stop_eviction))
> wake_up_var(&info->stop_eviction);
> if (error)
> @@ -1556,7 +1563,7 @@ int shmem_unuse(unsigned int type)
> if (!info->swapped)
> list_del_init(&info->swaplist);
> }
> - mutex_unlock(&shmem_swaplist_mutex);
> + spin_unlock(&shmem_swaplist_lock);
>
> return error;
> }
> @@ -1646,30 +1653,30 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> folio_mark_uptodate(folio);
> }
>
> - /*
> - * Add inode to shmem_unuse()'s list of swapped-out inodes,
> - * if it's not already there. Do it now before the folio is
> - * moved to swap cache, when its pagelock no longer protects
> - * the inode from eviction. But don't unlock the mutex until
> - * we've incremented swapped, because shmem_unuse_inode() will
> - * prune a !swapped inode from the swaplist under this mutex.
> - */
> - mutex_lock(&shmem_swaplist_mutex);
> - if (list_empty(&info->swaplist))
> - list_add(&info->swaplist, &shmem_swaplist);
> -
> if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
> - shmem_recalc_inode(inode, 0, nr_pages);
> + bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
> +
> + /*
> + * Add inode to shmem_unuse()'s list of swapped-out inodes,
> + * if it's not already there. Do it now before the folio is
> + * removed from page cache, when its pagelock no longer
> + * protects the inode from eviction. And do it now, after
> + * we've incremented swapped, because shmem_unuse() will
> + * prune a !swapped inode from the swaplist.
> + */
> + if (first_swapped) {
> + spin_lock(&shmem_swaplist_lock);
> + if (list_empty(&info->swaplist))
> + list_add(&info->swaplist, &shmem_swaplist);
> + spin_unlock(&shmem_swaplist_lock);
> + }
> +
> swap_shmem_alloc(folio->swap, nr_pages);
> shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>
> - mutex_unlock(&shmem_swaplist_mutex);
> BUG_ON(folio_mapped(folio));
> return swap_writeout(folio, plug);
> }
> - if (!info->swapped)
> - list_del_init(&info->swaplist);
> - mutex_unlock(&shmem_swaplist_mutex);
> if (nr_pages > 1)
> goto try_split;
> redirty:
* Re: [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates
2025-07-16 8:08 ` [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates Hugh Dickins
@ 2025-07-17 9:44 ` Baolin Wang
2025-07-19 0:51 ` Hugh Dickins
2025-07-19 0:56 ` [PATCH mm-unstable] mm/shmem: writeout free swap if swap_writeout() reactivates fix Hugh Dickins
2025-07-20 7:07 ` [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates David Rientjes
1 sibling, 2 replies; 10+ messages in thread
From: Baolin Wang @ 2025-07-17 9:44 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Baoquan He, Barry Song, Chris Li, David Rientjes, Kairui Song,
Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm
Hi Hugh,
On 2025/7/16 16:08, Hugh Dickins wrote:
> If swap_writeout() returns AOP_WRITEPAGE_ACTIVATE (for example, because
> zswap cannot compress and memcg disables writeback), there is no virtue
> in keeping that folio in swap cache and holding the swap allocation:
> shmem_writeout() switches it back to shmem page cache before returning.
>
> Folio lock is held, and folio->memcg_data remains set throughout, so
> there is no need to get into any memcg or memsw charge complications:
> swap_free_nr() and delete_from_swap_cache() do as much as is needed (but
> beware the race with shmem_free_swap() when inode truncated or evicted).
>
> Doing the same for an anonymous folio is harder, since it will usually
> have been unmapped, with references to the swap left in the page tables.
> Adding a function to remap the folio would be fun, but not worthwhile
> unless it has other uses, or an urgent bug with anon is demonstrated.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> mm/shmem.c | 33 ++++++++++++++++++++++++++++++++-
> 1 file changed, 32 insertions(+), 1 deletion(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 33675361031b..5a7ce4c8bad6 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1655,6 +1655,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
>
> if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
> bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
> + int error;
>
> /*
> * Add inode to shmem_unuse()'s list of swapped-out inodes,
> @@ -1675,7 +1676,37 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>
> BUG_ON(folio_mapped(folio));
> - return swap_writeout(folio, plug);
> + error = swap_writeout(folio, plug);
> + if (error != AOP_WRITEPAGE_ACTIVATE) {
> + /* folio has been unlocked */
> + return error;
> + }
> +
> + /*
> + * The intention here is to avoid holding on to the swap when
> + * zswap was unable to compress and unable to writeback; but
> + * it will be appropriate if other reactivate cases are added.
> + */
> + error = shmem_add_to_page_cache(folio, mapping, index,
> + swp_to_radix_entry(folio->swap),
> + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> + /* Swap entry might be erased by racing shmem_free_swap() */
> + if (!error) {
> + spin_lock(&info->lock);
> + info->swapped -= nr_pages;
> + spin_unlock(&info->lock);
Using the helper 'shmem_recalc_inode(inode, 0, -nr_pages)' seems more
readable?
> + swap_free_nr(folio->swap, nr_pages);
> + }
> +
> + /*
> + * The delete_from_swap_cache() below could be left for
> + * shrink_folio_list()'s folio_free_swap() to dispose of;
> + * but I'm a little nervous about letting this folio out of
> + * shmem_writeout() in a hybrid half-tmpfs-half-swap state
> + * e.g. folio_mapping(folio) might give an unexpected answer.
> + */
> + delete_from_swap_cache(folio);
IIUC, should the delete_from_swap_cache() also be moved into the 'if
(!error)' branch? If shmem_free_swap() has freed the swap entry,
it would also reclaim the swap cache, no?
* Re: [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates
2025-07-17 9:44 ` Baolin Wang
@ 2025-07-19 0:51 ` Hugh Dickins
2025-07-19 4:32 ` Baolin Wang
2025-07-19 0:56 ` [PATCH mm-unstable] mm/shmem: writeout free swap if swap_writeout() reactivates fix Hugh Dickins
1 sibling, 1 reply; 10+ messages in thread
From: Hugh Dickins @ 2025-07-19 0:51 UTC (permalink / raw)
To: Baolin Wang
Cc: Hugh Dickins, Andrew Morton, Baoquan He, Barry Song, Chris Li,
David Rientjes, Kairui Song, Kemeng Shi, Shakeel Butt,
linux-kernel, linux-mm
On Thu, 17 Jul 2025, Baolin Wang wrote:
> Hi Hugh,
>
> On 2025/7/16 16:08, Hugh Dickins wrote:
> > If swap_writeout() returns AOP_WRITEPAGE_ACTIVATE (for example, because
> > zswap cannot compress and memcg disables writeback), there is no virtue
> > in keeping that folio in swap cache and holding the swap allocation:
> > shmem_writeout() switches it back to shmem page cache before returning.
> >
> > Folio lock is held, and folio->memcg_data remains set throughout, so
> > there is no need to get into any memcg or memsw charge complications:
> > swap_free_nr() and delete_from_swap_cache() do as much as is needed (but
> > beware the race with shmem_free_swap() when inode truncated or evicted).
> >
> > Doing the same for an anonymous folio is harder, since it will usually
> > have been unmapped, with references to the swap left in the page tables.
> > Adding a function to remap the folio would be fun, but not worthwhile
> > unless it has other uses, or an urgent bug with anon is demonstrated.
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> > mm/shmem.c | 33 ++++++++++++++++++++++++++++++++-
> > 1 file changed, 32 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 33675361031b..5a7ce4c8bad6 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1655,6 +1655,7 @@ int shmem_writeout(struct folio *folio, struct
> > swap_iocb **plug,
> >
> > if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC |
> > __GFP_NOWARN)) {
> > bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
> > + int error;
> >
> > /*
> > * Add inode to shmem_unuse()'s list of swapped-out inodes,
> > @@ -1675,7 +1676,37 @@ int shmem_writeout(struct folio *folio, struct
> > swap_iocb **plug,
> > shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
> >
> > BUG_ON(folio_mapped(folio));
> > - return swap_writeout(folio, plug);
> > + error = swap_writeout(folio, plug);
> > + if (error != AOP_WRITEPAGE_ACTIVATE) {
> > + /* folio has been unlocked */
> > + return error;
> > + }
> > +
> > + /*
> > + * The intention here is to avoid holding on to the swap when
> > + * zswap was unable to compress and unable to writeback; but
> > + * it will be appropriate if other reactivate cases are added.
> > + */
> > + error = shmem_add_to_page_cache(folio, mapping, index,
> > + swp_to_radix_entry(folio->swap),
> > + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
> > + /* Swap entry might be erased by racing shmem_free_swap() */
> > + if (!error) {
> > + spin_lock(&info->lock);
> > + info->swapped -= nr_pages;
> > + spin_unlock(&info->lock);
>
> Using the helper 'shmem_recalc_inode(inode, 0, -nr_pages)' seems more
> readable?
Yes, that's better, thanks: I don't know if I'd say "more readable",
but it is much more in the spirit of shmem_recalc_inode(), bringing
the counts into balance sooner rather than later.
I'll follow up with a "fix" patch to Andrew.
>
> > + swap_free_nr(folio->swap, nr_pages);
> > + }
> > +
> > + /*
> > + * The delete_from_swap_cache() below could be left for
> > + * shrink_folio_list()'s folio_free_swap() to dispose of;
> > + * but I'm a little nervous about letting this folio out of
> > + * shmem_writeout() in a hybrid half-tmpfs-half-swap state
> > + * e.g. folio_mapping(folio) might give an unexpected answer.
> > + */
> > + delete_from_swap_cache(folio);
>
> IIUC, should the delete_from_swap_cache() also be moved into the 'if (!error)'
> branch? If shmem_free_swap() has freed the swap entry, it would also
> reclaim the swap cache, no?

No, but it was a good point to raise, and led into more research than
I had anticipated.

No: because shmem_free_swap->free_swap_and_cache_nr->__try_to_reclaim_swap
has to return after doing nothing if its folio_trylock fails: it cannot do
the delete_from_swap_cache() part of the job, which we do here - on this
AOP_WRITEPAGE_ACTIVATE path, we hold the folio_lock throughout.

But it led into more research, because I wanted to point you to the
equivalent coding in shmem_swapin_folio(): but, to my initial alarm,
the equivalent is not there; but used to be.

See 5.8 commit 14235ab36019 ("mm: shmem: remove rare optimization when
swapin races with hole punching"). There (in the deleted lines) you can
see the helpful comment on this case, with its delete_from_swap_cache()
when shmem_add_to_page_cache() fails. But for memcg-charging reasons,
5.8 found it simpler to drop that, and just let shrink_page_list()
clear up the debris later.

Here in shmem_writeout(), holding folio_lock throughout, we have no
memcg complications, and can go ahead with delete_from_swap_cache(),
both when successfully added back to page cache, and when that fails.

Thanks for checking!

Hugh
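
(To make the folio_trylock point above concrete: a minimal userspace
sketch, with a pthread mutex standing in for the folio lock and every
name below invented for illustration. The truncation side may only
trylock and back off, so removal from swap cache falls to whoever
holds the lock.)

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

/* Illustrative stand-in for a folio sitting in swap cache. */
struct folio {
	pthread_mutex_t lock;	/* plays the role of the folio lock */
	bool in_swap_cache;
};

/* What the truncation side can do: it must not sleep on someone else's
 * folio lock, so it only trylocks, and gives up if that fails
 * (loosely modelled on __try_to_reclaim_swap()). */
static bool reclaim_if_unlocked(struct folio *folio)
{
	if (pthread_mutex_trylock(&folio->lock) != 0)
		return false;		/* someone holds the lock: do nothing */
	folio->in_swap_cache = false;
	pthread_mutex_unlock(&folio->lock);
	return true;
}

int main(void)
{
	struct folio folio = { .in_swap_cache = true };

	pthread_mutex_init(&folio.lock, NULL);

	/* The writeout side holds the folio lock throughout... */
	pthread_mutex_lock(&folio.lock);

	/* ...so a racing truncation cannot remove it from swap cache... */
	printf("racing reclaim %s\n",
	       reclaim_if_unlocked(&folio) ? "succeeded" : "backed off");

	/* ...and the lock holder must do the removal itself. */
	folio.in_swap_cache = false;
	pthread_mutex_unlock(&folio.lock);
	printf("in_swap_cache=%d\n", folio.in_swap_cache);
	return 0;
}

That asymmetry is why, on this path, delete_from_swap_cache() stays
with shmem_writeout() whether or not the add back to page cache
succeeded.
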
* [PATCH mm-unstable] mm/shmem: writeout free swap if swap_writeout() reactivates fix
2025-07-17 9:44 ` Baolin Wang
2025-07-19 0:51 ` Hugh Dickins
@ 2025-07-19 0:56 ` Hugh Dickins
1 sibling, 0 replies; 10+ messages in thread
From: Hugh Dickins @ 2025-07-19 0:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Baolin Wang, Hugh Dickins, Baoquan He, Barry Song, Chris Li,
David Rientjes, Kairui Song, Kemeng Shi, Shakeel Butt,
linux-kernel, linux-mm
Per Baolin: use shmem_recalc_inode() rather than open coding.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/shmem.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/mm/shmem.c b/mm/shmem.c
index 5a7ce4c8bad6..927ccc4a6002 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1692,9 +1692,7 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
/* Swap entry might be erased by racing shmem_free_swap() */
if (!error) {
- spin_lock(&info->lock);
- info->swapped -= nr_pages;
- spin_unlock(&info->lock);
+ shmem_recalc_inode(inode, 0, -nr_pages);
swap_free_nr(folio->swap, nr_pages);
}
--
2.43.0
* Re: [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates
2025-07-19 0:51 ` Hugh Dickins
@ 2025-07-19 4:32 ` Baolin Wang
0 siblings, 0 replies; 10+ messages in thread
From: Baolin Wang @ 2025-07-19 4:32 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Baoquan He, Barry Song, Chris Li, David Rientjes,
Kairui Song, Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm
On 2025/7/19 08:51, Hugh Dickins wrote:
> On Thu, 17 Jul 2025, Baolin Wang wrote:
>
>> Hi Hugh,
>>
>> On 2025/7/16 16:08, Hugh Dickins wrote:
>>> If swap_writeout() returns AOP_WRITEPAGE_ACTIVATE (for example, because
>>> zswap cannot compress and memcg disables writeback), there is no virtue
>>> in keeping that folio in swap cache and holding the swap allocation:
>>> shmem_writeout() switches it back to shmem page cache before returning.
>>>
>>> Folio lock is held, and folio->memcg_data remains set throughout, so
>>> there is no need to get into any memcg or memsw charge complications:
>>> swap_free_nr() and delete_from_swap_cache() do as much as is needed (but
>>> beware the race with shmem_free_swap() when inode truncated or evicted).
>>>
>>> Doing the same for an anonymous folio is harder, since it will usually
>>> have been unmapped, with references to the swap left in the page tables.
>>> Adding a function to remap the folio would be fun, but not worthwhile
>>> unless it has other uses, or an urgent bug with anon is demonstrated.
>>>
>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> ---
>>> mm/shmem.c | 33 ++++++++++++++++++++++++++++++++-
>>> 1 file changed, 32 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 33675361031b..5a7ce4c8bad6 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -1655,6 +1655,7 @@ int shmem_writeout(struct folio *folio, struct
>>> swap_iocb **plug,
>>>
>>> if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC |
>>> __GFP_NOWARN)) {
>>> bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
>>> + int error;
>>>
>>> /*
>>> * Add inode to shmem_unuse()'s list of swapped-out inodes,
>>> @@ -1675,7 +1676,37 @@ int shmem_writeout(struct folio *folio, struct
>>> swap_iocb **plug,
>>> shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>>>
>>> BUG_ON(folio_mapped(folio));
>>> - return swap_writeout(folio, plug);
>>> + error = swap_writeout(folio, plug);
>>> + if (error != AOP_WRITEPAGE_ACTIVATE) {
>>> + /* folio has been unlocked */
>>> + return error;
>>> + }
>>> +
>>> + /*
>>> + * The intention here is to avoid holding on to the swap when
>>> + * zswap was unable to compress and unable to writeback; but
>>> + * it will be appropriate if other reactivate cases are added.
>>> + */
>>> + error = shmem_add_to_page_cache(folio, mapping, index,
>>> + swp_to_radix_entry(folio->swap),
>>> + __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN);
>>> + /* Swap entry might be erased by racing shmem_free_swap() */
>>> + if (!error) {
>>> + spin_lock(&info->lock);
>>> + info->swapped -= nr_pages;
>>> + spin_unlock(&info->lock);
>>
>> Using the helper 'shmem_recalc_inode(inode, 0, -nr_pages)' seems more
>> readable?
>
> Yes, that's better, thanks: I don't know if I'd say "more readable",
> but it is much more in the spirit of shmem_recalc_inode(), bringing
> the counts into balance sooner rather than later.
>
> I'll follow up with a "fix" patch to Andrew.
>
>>
>>> + swap_free_nr(folio->swap, nr_pages);
>>> + }
>>> +
>>> + /*
>>> + * The delete_from_swap_cache() below could be left for
>>> + * shrink_folio_list()'s folio_free_swap() to dispose of;
>>> + * but I'm a little nervous about letting this folio out of
>>> + * shmem_writeout() in a hybrid half-tmpfs-half-swap state
>>> + * e.g. folio_mapping(folio) might give an unexpected answer.
>>> + */
>>> + delete_from_swap_cache(folio);
>>
>> IIUC, should the delete_from_swap_cache() also be moved into the 'if (!error)'
>> branch? If shmem_free_swap() has freed the swap entry, it would also
>> reclaim the swap cache, no?
>
> No, but it was a good point to raise, and led into more research than
> I had anticipated.
>
> No: because shmem_free_swap->free_swap_and_cache_nr->__try_to_reclaim_swap
> has to return after doing nothing if its folio_trylock fails: it cannot do
> the delete_from_swap_cache() part of the job, which we do here - on this
> AOP_WRITEPAGE_ACTIVATE path, we hold the folio_lock throughout.
I missed the 'folio_trylock': yes, you are right. Thanks for the explanation.
> But it led into more research, because I wanted to point you to the
> equivalent coding in shmem_swapin_folio(): but, to my initial alarm,
> the equivalent is not there; but used to be.
>
> See 5.8 commit 14235ab36019 ("mm: shmem: remove rare optimization when
> swapin races with hole punching"). There (in the deleted lines) you can
> see the helpful comment on this case, with its delete_from_swap_cache()
> when shmem_add_to_page_cache() fails. But for memcg-charging reasons,
> 5.8 found it simpler to drop that, and just let shrink_page_list()
> clear up the debris later.
>
> Here in shmem_writeout(), holding folio_lock throughout, we have no
> memcg complications, and can go ahead with delete_from_swap_cache(),
> both when successfully added back to page cache, and when that fails.
OK. Thanks for pointing out the change history here, and I have no
further questions.
With your following changes, feel free to add:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
* Re: [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less
2025-07-16 8:05 [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less Hugh Dickins
2025-07-16 8:08 ` [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates Hugh Dickins
2025-07-17 8:46 ` [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less Baolin Wang
@ 2025-07-20 7:07 ` David Rientjes
2025-07-21 17:54 ` Kairui Song
3 siblings, 0 replies; 10+ messages in thread
From: David Rientjes @ 2025-07-20 7:07 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Baolin Wang, Baoquan He, Barry Song, Chris Li,
Kairui Song, Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm
On Wed, 16 Jul 2025, Hugh Dickins wrote:
> A flamegraph (from an MGLRU load) showed shmem_writeout()'s use of the
> global shmem_swaplist_mutex worryingly hot: improvement is long overdue.
>
> 3.1 commit 6922c0c7abd3 ("tmpfs: convert shmem_writepage and enable swap")
> apologized for extending shmem_swaplist_mutex across add_to_swap_cache(),
> and hoped to find another way: yes, there may be lots of work to allocate
> radix tree nodes in there. Then 6.15 commit b487a2da3575 ("mm, swap:
> simplify folio swap allocation") will have made it worse, by moving
> shmem_writeout()'s swap allocation under that mutex too (but the worrying
> flamegraph was observed even before that change).
>
> There's a useful comment about pagelock no longer protecting from eviction
> once moved to swap cache: but it's good till shmem_delete_from_page_cache()
> replaces page pointer by swap entry, so move the swaplist add between them.
>
> We would much prefer to take the global lock once per inode than once per
> page: given the possible races with shmem_unuse() pruning when !swapped
> (and other tasks racing to swap other pages out or in), try the swaplist
> add whenever swapped was incremented from 0 (but inode may already be on
> the list - only unuse and evict bother to remove it).
>
> This technique is more subtle than it looks (we're avoiding the very lock
> which would make it easy), but works: whereas an unlocked list_empty()
> check runs a risk of the inode being unqueued and left off the swaplist
> forever, swapoff only completing when the page is faulted in or removed.
>
> The need for a sleepable mutex went away in 5.1 commit b56a2d8af914
> ("mm: rid swapoff of quadratic complexity"): a spinlock works better now.
>
> This commit is certain to take shmem_swaplist_mutex out of contention,
> and has been seen to make a practical improvement (but there is likely
> to have been an underlying issue which made its contention so visible).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: David Rientjes <rientjes@google.com>
* Re: [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates
2025-07-16 8:08 ` [PATCH mm-new 2/2] mm/shmem: writeout free swap if swap_writeout() reactivates Hugh Dickins
2025-07-17 9:44 ` Baolin Wang
@ 2025-07-20 7:07 ` David Rientjes
1 sibling, 0 replies; 10+ messages in thread
From: David Rientjes @ 2025-07-20 7:07 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Baolin Wang, Baoquan He, Barry Song, Chris Li,
Kairui Song, Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm
On Wed, 16 Jul 2025, Hugh Dickins wrote:
> If swap_writeout() returns AOP_WRITEPAGE_ACTIVATE (for example, because
> zswap cannot compress and memcg disables writeback), there is no virtue
> in keeping that folio in swap cache and holding the swap allocation:
> shmem_writeout() switches it back to shmem page cache before returning.
>
> Folio lock is held, and folio->memcg_data remains set throughout, so
> there is no need to get into any memcg or memsw charge complications:
> swap_free_nr() and delete_from_swap_cache() do as much as is needed (but
> beware the race with shmem_free_swap() when inode truncated or evicted).
>
> Doing the same for an anonymous folio is harder, since it will usually
> have been unmapped, with references to the swap left in the page tables.
> Adding a function to remap the folio would be fun, but not worthwhile
> unless it has other uses, or an urgent bug with anon is demonstrated.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: David Rientjes <rientjes@google.com>
* Re: [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less
2025-07-16 8:05 [PATCH mm-new 1/2] mm/shmem: hold shmem_swaplist spinlock (not mutex) much less Hugh Dickins
` (2 preceding siblings ...)
2025-07-20 7:07 ` David Rientjes
@ 2025-07-21 17:54 ` Kairui Song
3 siblings, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-07-21 17:54 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, Baolin Wang, Baoquan He, Barry Song, Chris Li,
David Rientjes, Kemeng Shi, Shakeel Butt, linux-kernel, linux-mm
On Wed, Jul 16, 2025 at 4:06 PM Hugh Dickins <hughd@google.com> wrote:
>
> A flamegraph (from an MGLRU load) showed shmem_writeout()'s use of the
> global shmem_swaplist_mutex worryingly hot: improvement is long overdue.
>
> 3.1 commit 6922c0c7abd3 ("tmpfs: convert shmem_writepage and enable swap")
> apologized for extending shmem_swaplist_mutex across add_to_swap_cache(),
> and hoped to find another way: yes, there may be lots of work to allocate
> radix tree nodes in there. Then 6.15 commit b487a2da3575 ("mm, swap:
> simplify folio swap allocation") will have made it worse, by moving
> shmem_writeout()'s swap allocation under that mutex too (but the worrying
> flamegraph was observed even before that change).
>
> There's a useful comment about pagelock no longer protecting from eviction
> once moved to swap cache: but it's good till shmem_delete_from_page_cache()
> replaces page pointer by swap entry, so move the swaplist add between them.
>
> We would much prefer to take the global lock once per inode than once per
> page: given the possible races with shmem_unuse() pruning when !swapped
> (and other tasks racing to swap other pages out or in), try the swaplist
> add whenever swapped was incremented from 0 (but inode may already be on
> the list - only unuse and evict bother to remove it).
>
> This technique is more subtle than it looks (we're avoiding the very lock
> which would make it easy), but works: whereas an unlocked list_empty()
> check runs a risk of the inode being unqueued and left off the swaplist
> forever, swapoff only completing when the page is faulted in or removed.
>
> The need for a sleepable mutex went away in 5.1 commit b56a2d8af914
> ("mm: rid swapoff of quadratic complexity"): a spinlock works better now.
>
> This commit is certain to take shmem_swaplist_mutex out of contention,
> and has been seen to make a practical improvement (but there is likely
> to have been an underlying issue which made its contention so visible).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
> mm/shmem.c | 59 ++++++++++++++++++++++++++++++------------------------
> 1 file changed, 33 insertions(+), 26 deletions(-)
Thanks a lot! I've also seen this issue: we observed this contention
on 5.4 kernels, and I wasn't sure how we could optimize it. This is
very helpful.
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 60247dc48505..33675361031b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -292,7 +292,7 @@ bool vma_is_shmem(struct vm_area_struct *vma)
> }
>
> static LIST_HEAD(shmem_swaplist);
> -static DEFINE_MUTEX(shmem_swaplist_mutex);
> +static DEFINE_SPINLOCK(shmem_swaplist_lock);
>
> #ifdef CONFIG_TMPFS_QUOTA
>
> @@ -432,10 +432,13 @@ static void shmem_free_inode(struct super_block *sb, size_t freed_ispace)
> *
> * But normally info->alloced == inode->i_mapping->nrpages + info->swapped
> * So mm freed is info->alloced - (inode->i_mapping->nrpages + info->swapped)
> + *
> + * Return: true if swapped was incremented from 0, for shmem_writeout().
> */
> -static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> +static bool shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> {
> struct shmem_inode_info *info = SHMEM_I(inode);
> + bool first_swapped = false;
> long freed;
>
> spin_lock(&info->lock);
> @@ -450,8 +453,11 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> * to stop a racing shmem_recalc_inode() from thinking that a page has
> * been freed. Compensate here, to avoid the need for a followup call.
> */
> - if (swapped > 0)
> + if (swapped > 0) {
> + if (info->swapped == swapped)
> + first_swapped = true;
> freed += swapped;
> + }
> if (freed > 0)
> info->alloced -= freed;
> spin_unlock(&info->lock);
> @@ -459,6 +465,7 @@ static void shmem_recalc_inode(struct inode *inode, long alloced, long swapped)
> /* The quota case may block */
> if (freed > 0)
> shmem_inode_unacct_blocks(inode, freed);
> + return first_swapped;
> }
>
> bool shmem_charge(struct inode *inode, long pages)
> @@ -1399,11 +1406,11 @@ static void shmem_evict_inode(struct inode *inode)
> /* Wait while shmem_unuse() is scanning this inode... */
> wait_var_event(&info->stop_eviction,
> !atomic_read(&info->stop_eviction));
> - mutex_lock(&shmem_swaplist_mutex);
> + spin_lock(&shmem_swaplist_lock);
> /* ...but beware of the race if we peeked too early */
> if (!atomic_read(&info->stop_eviction))
> list_del_init(&info->swaplist);
> - mutex_unlock(&shmem_swaplist_mutex);
> + spin_unlock(&shmem_swaplist_lock);
> }
> }
>
> @@ -1526,7 +1533,7 @@ int shmem_unuse(unsigned int type)
> if (list_empty(&shmem_swaplist))
> return 0;
>
> - mutex_lock(&shmem_swaplist_mutex);
> + spin_lock(&shmem_swaplist_lock);
> start_over:
> list_for_each_entry_safe(info, next, &shmem_swaplist, swaplist) {
> if (!info->swapped) {
> @@ -1540,12 +1547,12 @@ int shmem_unuse(unsigned int type)
> * (igrab() would protect from unlink, but not from unmount).
> */
> atomic_inc(&info->stop_eviction);
> - mutex_unlock(&shmem_swaplist_mutex);
> + spin_unlock(&shmem_swaplist_lock);
>
> error = shmem_unuse_inode(&info->vfs_inode, type);
> cond_resched();
>
> - mutex_lock(&shmem_swaplist_mutex);
> + spin_lock(&shmem_swaplist_lock);
> if (atomic_dec_and_test(&info->stop_eviction))
> wake_up_var(&info->stop_eviction);
> if (error)
> @@ -1556,7 +1563,7 @@ int shmem_unuse(unsigned int type)
> if (!info->swapped)
> list_del_init(&info->swaplist);
> }
> - mutex_unlock(&shmem_swaplist_mutex);
> + spin_unlock(&shmem_swaplist_lock);
>
> return error;
> }
> @@ -1646,30 +1653,30 @@ int shmem_writeout(struct folio *folio, struct swap_iocb **plug,
> folio_mark_uptodate(folio);
> }
>
> - /*
> - * Add inode to shmem_unuse()'s list of swapped-out inodes,
> - * if it's not already there. Do it now before the folio is
> - * moved to swap cache, when its pagelock no longer protects
> - * the inode from eviction. But don't unlock the mutex until
> - * we've incremented swapped, because shmem_unuse_inode() will
> - * prune a !swapped inode from the swaplist under this mutex.
> - */
> - mutex_lock(&shmem_swaplist_mutex);
> - if (list_empty(&info->swaplist))
> - list_add(&info->swaplist, &shmem_swaplist);
> -
> if (!folio_alloc_swap(folio, __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN)) {
> - shmem_recalc_inode(inode, 0, nr_pages);
> + bool first_swapped = shmem_recalc_inode(inode, 0, nr_pages);
> +
> + /*
> + * Add inode to shmem_unuse()'s list of swapped-out inodes,
> + * if it's not already there. Do it now before the folio is
> + * removed from page cache, when its pagelock no longer
> + * protects the inode from eviction. And do it now, after
> + * we've incremented swapped, because shmem_unuse() will
> + * prune a !swapped inode from the swaplist.
> + */
> + if (first_swapped) {
> + spin_lock(&shmem_swaplist_lock);
> + if (list_empty(&info->swaplist))
> + list_add(&info->swaplist, &shmem_swaplist);
> + spin_unlock(&shmem_swaplist_lock);
> + }
> +
> swap_shmem_alloc(folio->swap, nr_pages);
> shmem_delete_from_page_cache(folio, swp_to_radix_entry(folio->swap));
>
> - mutex_unlock(&shmem_swaplist_mutex);
> BUG_ON(folio_mapped(folio));
> return swap_writeout(folio, plug);
> }
> - if (!info->swapped)
> - list_del_init(&info->swaplist);
> - mutex_unlock(&shmem_swaplist_mutex);
> if (nr_pages > 1)
> goto try_split;
> redirty:
> --
> 2.43.0
Reviewed-by: Kairui Song <kasong@tencent.com>