All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/4] mm: drop redundant lru_add_drain in anon folio reuse paths
@ 2026-06-23 23:16 Barry Song (Xiaomi)
  2026-06-23 23:16 ` [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio() Barry Song (Xiaomi)
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-06-23 23:16 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baoquan.he, chrisl, david, jp.kobryn, kasong, liam, linux-kernel,
	ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park, Barry Song (Xiaomi)

We are doing a large number of redundant lru_add_drain() calls in
both wp_can_reuse_anon_folio() and do_swap_page(), leading to LRU
lock contention and unnecessary overhead.

In wp_can_reuse_anon_folio(), we can check the refcount against the
lru_cache before deciding to drain. In do_swap_page(), the drain is
now entirely redundant after Kairui's work to route SYNC I/O through
the swapcache in the same way as ASYNC I/O.

Build the kernel within a 1GB memcg using 20 threads with zRAM swap.
The number of lru_add_drain() calls is reduced from 276,787 to
230,283, while sys time decreases slightly from 3m40.125s to
3m37.128s.

Build the kernel within an 800MB memcg using 20 threads with zRAM
swap. The number of lru_add_drain() calls is reduced from 796,661 to
537,262, while sys time decreases slightly from 6m25.981s to
6m22.678s.

-v2:
 * collect the reviewed-by and acked-by tags from Usama, Baoquan,
   Shakeel, Kairui, thanks!
 * add patch4 to free swapcache for non-LRU folios, as suggested
   by Kairui, thanks!

-RFC:
 https://lore.kernel.org/linux-mm/20260611105124.98668-1-baohua@kernel.org/ 

Barry Song (Xiaomi) (4):
  mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic
  mm: entirely remove lru_add_drain in do_swap_page
  mm: try to free swapcache for non-LRU folios

 mm/memory.c | 30 +++++++++++++-----------------
 1 file changed, 13 insertions(+), 17 deletions(-)

-- 
2.39.3 (Apple Git-146)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  2026-06-23 23:16 [PATCH v2 0/4] mm: drop redundant lru_add_drain in anon folio reuse paths Barry Song (Xiaomi)
@ 2026-06-23 23:16 ` Barry Song (Xiaomi)
  2026-06-24 10:14   ` Kairui Song
  2026-06-24 15:02   ` David Hildenbrand (Arm)
  2026-06-23 23:16 ` [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic Barry Song (Xiaomi)
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-06-23 23:16 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baoquan.he, chrisl, david, jp.kobryn, kasong, liam, linux-kernel,
	ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park, Barry Song (Xiaomi)

We always unconditionally drain the LRU before retrying anon folio
reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
are in lru_cache, and use the refcount to avoid many unnecessary LRU
drains.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe92..f6848f4234a6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
 	 */
 	if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
 		return false;
-	if (!folio_test_lru(folio))
+	if (!folio_test_lru(folio)) {
+		/*
+		 * Assume folio is on lru_cache and holds a cache reference.
+		 */
+		if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
+			return false;
 		/*
 		 * We cannot easily detect+handle references from
 		 * remote LRU caches or references to LRU folios.
 		 */
 		lru_add_drain();
+	}
 	if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
 		return false;
 	if (!folio_trylock(folio))
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic
  2026-06-23 23:16 [PATCH v2 0/4] mm: drop redundant lru_add_drain in anon folio reuse paths Barry Song (Xiaomi)
  2026-06-23 23:16 ` [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio() Barry Song (Xiaomi)
@ 2026-06-23 23:16 ` Barry Song (Xiaomi)
  2026-06-24 15:07   ` David Hildenbrand (Arm)
  2026-06-23 23:16 ` [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page Barry Song (Xiaomi)
  2026-06-23 23:16 ` [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios Barry Song (Xiaomi)
  3 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-06-23 23:16 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baoquan.he, chrisl, david, jp.kobryn, kasong, liam, linux-kernel,
	ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park, Barry Song (Xiaomi)

The "we just allocated them without exposing them to the swapcache"
case no longer exists, as Kairui has routed synchronous I/O through
the swapcache as well in his series "unify swapin use swap cache and
cleanup flags"[1]. As a result, folio_ref_count() should never be 1
in this path, since at least two references are held (base ref plus
swapcache). Remove the folio_ref_count()==1 check and update the
comment accordingly.

[1] https://lore.kernel.org/all/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com/

Acked-by: Usama Arif <usama.arif@linux.dev>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f6848f4234a6..abd0adcf65f0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5049,12 +5049,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 	/*
 	 * Same logic as in do_wp_page(); however, optimize for pages that are
-	 * certainly not shared either because we just allocated them without
-	 * exposing them to the swapcache or because the swap entry indicates
-	 * exclusivity.
+	 * certainly not because the swap entry indicates exclusivity.
 	 */
-	if (!folio_test_ksm(folio) &&
-	    (exclusive || folio_ref_count(folio) == 1)) {
+	if (!folio_test_ksm(folio) && exclusive) {
 		if ((vma->vm_flags & VM_WRITE) && !userfaultfd_pte_wp(vma, pte) &&
 		    !pte_needs_soft_dirty_wp(vma, pte)) {
 			pte = pte_mkwrite(pte, vma);
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page
  2026-06-23 23:16 [PATCH v2 0/4] mm: drop redundant lru_add_drain in anon folio reuse paths Barry Song (Xiaomi)
  2026-06-23 23:16 ` [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio() Barry Song (Xiaomi)
  2026-06-23 23:16 ` [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic Barry Song (Xiaomi)
@ 2026-06-23 23:16 ` Barry Song (Xiaomi)
  2026-06-24 10:16   ` Kairui Song
  2026-06-24 15:10   ` David Hildenbrand (Arm)
  2026-06-23 23:16 ` [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios Barry Song (Xiaomi)
  3 siblings, 2 replies; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-06-23 23:16 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baoquan.he, chrisl, david, jp.kobryn, kasong, liam, linux-kernel,
	ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park, Barry Song (Xiaomi)

We are doing a lot of redundant lru_add_drain() calls in
do_swap_page(), especially for synchronous I/O devices. For
example, the test program below currently ends up draining
lru_cache 100% of the time:

int main(int argc, char *argv[])
{
        int i;
 #define SIZE 100*1024*1024
	while(1) {
		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		for (int i = 0; i < SIZE/sizeof(int); i++)
			p[i] =  i%64;
		madvise((void *)p, SIZE, MADV_PAGEOUT);
		for (int i = 0; i < SIZE/sizeof(int); i++)
			p[i] =  i%64;
		munmap(p, SIZE);
	}
	return 0;
}

Folio reuse now relies primarily on the exclusive hint, making
lru_cache draining to drop the refcount in lru_cache largely
irrelevant.

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index abd0adcf65f0..2983a6baf474 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4903,16 +4903,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	} else if (folio != swapcache)
 		page = folio_page(folio, 0);
 
-	/*
-	 * If we want to map a page that's in the swapcache writable, we
-	 * have to detect via the refcount if we're really the exclusive
-	 * owner. Try removing the extra reference from the local LRU
-	 * caches if required.
-	 */
-	if ((vmf->flags & FAULT_FLAG_WRITE) &&
-	    !folio_test_ksm(folio) && !folio_test_lru(folio))
-		lru_add_drain();
-
 	folio_throttle_swaprate(folio, GFP_KERNEL);
 
 	/*
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios
  2026-06-23 23:16 [PATCH v2 0/4] mm: drop redundant lru_add_drain in anon folio reuse paths Barry Song (Xiaomi)
                   ` (2 preceding siblings ...)
  2026-06-23 23:16 ` [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page Barry Song (Xiaomi)
@ 2026-06-23 23:16 ` Barry Song (Xiaomi)
  2026-06-24 15:20   ` David Hildenbrand (Arm)
  3 siblings, 1 reply; 19+ messages in thread
From: Barry Song (Xiaomi) @ 2026-06-23 23:16 UTC (permalink / raw)
  To: akpm, linux-mm
  Cc: baoquan.he, chrisl, david, jp.kobryn, kasong, liam, linux-kernel,
	ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park, Barry Song (Xiaomi),
	Kairui Song

Originally, we unconditionally called lru_add_drain() for write
swap-in page faults. This might drop the reference held by the per-CPU
LRU cache if the folio happened to reside there. However, there was no
guarantee that the folio was actually cached on the current CPU.

Now that lru_add_drain() has been removed, we have lost one
opportunity to drop a reference held by the LRU cache. We could
instead incorporate that possibility into the condition evaluated by
should_try_to_free_swap().

Suggested-by: Kairui Song <ryncsn@gmail.com>
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
---
 mm/memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2983a6baf474..14577c67c61a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5087,8 +5087,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 * Remove the swap entry and conditionally try to free up the swapcache.
 	 * Do it after mapping, so raced page faults will likely see the folio
 	 * in swap cache and wait on the folio lock.
+	 * Assume non-LRU folios may be queued in the LRU cache, which contributes
+	 * an additional reference to the folio.
 	 */
-	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
+	if (should_try_to_free_swap(si, folio, vma, nr_pages +
+			!folio_test_lru(folio), vmf->flags))
 		folio_free_swap(folio);
 
 	folio_unlock(folio);
-- 
2.39.3 (Apple Git-146)


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  2026-06-23 23:16 ` [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio() Barry Song (Xiaomi)
@ 2026-06-24 10:14   ` Kairui Song
  2026-06-24 15:02   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 19+ messages in thread
From: Kairui Song @ 2026-06-24 10:14 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, baoquan.he, chrisl, david, jp.kobryn, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park

On Wed, Jun 24, 2026 at 7:16 AM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
>
> We always unconditionally drain the LRU before retrying anon folio
> reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
> are in lru_cache, and use the refcount to avoid many unnecessary LRU
> drains.
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..f6848f4234a6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
>          */
>         if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
>                 return false;
> -       if (!folio_test_lru(folio))
> +       if (!folio_test_lru(folio)) {
> +               /*
> +                * Assume folio is on lru_cache and holds a cache reference.
> +                */
> +               if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
> +                       return false;
>                 /*
>                  * We cannot easily detect+handle references from
>                  * remote LRU caches or references to LRU folios.
>                  */
>                 lru_add_drain();
> +       }
>         if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
>                 return false;
>         if (!folio_trylock(folio))
> --
> 2.39.3 (Apple Git-146)
>

Reviewed-by: Kairui Song <kasong@tencent.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page
  2026-06-23 23:16 ` [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page Barry Song (Xiaomi)
@ 2026-06-24 10:16   ` Kairui Song
  2026-06-24 15:10   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 19+ messages in thread
From: Kairui Song @ 2026-06-24 10:16 UTC (permalink / raw)
  To: Barry Song (Xiaomi)
  Cc: akpm, linux-mm, baoquan.he, chrisl, david, jp.kobryn, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park

On Wed, Jun 24, 2026 at 7:18 AM Barry Song (Xiaomi) <baohua@kernel.org> wrote:
>
> We are doing a lot of redundant lru_add_drain() calls in
> do_swap_page(), especially for synchronous I/O devices. For
> example, the test program below currently ends up draining
> lru_cache 100% of the time:
>
> int main(int argc, char *argv[])
> {
>         int i;
>  #define SIZE 100*1024*1024
>         while(1) {
>                 volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>
>                 for (int i = 0; i < SIZE/sizeof(int); i++)
>                         p[i] =  i%64;
>                 madvise((void *)p, SIZE, MADV_PAGEOUT);
>                 for (int i = 0; i < SIZE/sizeof(int); i++)
>                         p[i] =  i%64;
>                 munmap(p, SIZE);
>         }
>         return 0;
> }
>
> Folio reuse now relies primarily on the exclusive hint, making
> lru_cache draining to drop the refcount in lru_cache largely
> irrelevant.
>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 10 ----------
>  1 file changed, 10 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index abd0adcf65f0..2983a6baf474 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4903,16 +4903,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>         } else if (folio != swapcache)
>                 page = folio_page(folio, 0);
>
> -       /*
> -        * If we want to map a page that's in the swapcache writable, we
> -        * have to detect via the refcount if we're really the exclusive
> -        * owner. Try removing the extra reference from the local LRU
> -        * caches if required.
> -        */
> -       if ((vmf->flags & FAULT_FLAG_WRITE) &&
> -           !folio_test_ksm(folio) && !folio_test_lru(folio))
> -               lru_add_drain();
> -
>         folio_throttle_swaprate(folio, GFP_KERNEL);
>
>         /*
> --
> 2.39.3 (Apple Git-146)
>

Thanks, I saw the previous discussed problem is address in the next
patch, it's totally fine so:

Reviewed-by: Kairui Song <kasong@tencent.com>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  2026-06-23 23:16 ` [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio() Barry Song (Xiaomi)
  2026-06-24 10:14   ` Kairui Song
@ 2026-06-24 15:02   ` David Hildenbrand (Arm)
  2026-06-24 21:04     ` Barry Song
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-24 15:02 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> We always unconditionally drain the LRU before retrying anon folio
> reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
> are in lru_cache, and use the refcount to avoid many unnecessary LRU
> drains.
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index ff338c2abe92..f6848f4234a6 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
>  	 */
>  	if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
>  		return false;
> -	if (!folio_test_lru(folio))
> +	if (!folio_test_lru(folio)) {
> +		/*
> +		 * Assume folio is on lru_cache and holds a cache reference.
> +		 */
> +		if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
> +			return false;

I'm not keen on making this function even uglier, so no, not like that.

We have the earlier "folio_ref_count(folio) > 3" check.

In which scenarios can you trigger this such that we would care?

If the answer is "I don't know" there is no reason for a change.

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic
  2026-06-23 23:16 ` [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic Barry Song (Xiaomi)
@ 2026-06-24 15:07   ` David Hildenbrand (Arm)
  2026-06-24 21:29     ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-24 15:07 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> The "we just allocated them without exposing them to the swapcache"
> case no longer exists, as Kairui has routed synchronous I/O through
> the swapcache as well in his series "unify swapin use swap cache and
> cleanup flags"[1]. As a result, folio_ref_count() should never be 1
> in this path, since at least two references are held (base ref plus
> swapcache). Remove the folio_ref_count()==1 check and update the
> comment accordingly.
> 
> [1] https://lore.kernel.org/all/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com/
> 
> Acked-by: Usama Arif <usama.arif@linux.dev>
> Reviewed-by: Kairui Song <kasong@tencent.com>
> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f6848f4234a6..abd0adcf65f0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5049,12 +5049,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  
>  	/*
>  	 * Same logic as in do_wp_page(); however, optimize for pages that are

s/Same/Similar/ ?

> -	 * certainly not shared either because we just allocated them without
> -	 * exposing them to the swapcache or because the swap entry indicates
> -	 * exclusivity.
> +	 * certainly not because the swap entry indicates exclusivity.
>  	 */
> -	if (!folio_test_ksm(folio) &&
> -	    (exclusive || folio_ref_count(folio) == 1)) {
> +	if (!folio_test_ksm(folio) && exclusive) {

Hmm, but KSM folios should never have "exclusive" set. So I think you can drop
that as well (was only relevant with folio_ref_count==1 check IIRC).

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page
  2026-06-23 23:16 ` [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page Barry Song (Xiaomi)
  2026-06-24 10:16   ` Kairui Song
@ 2026-06-24 15:10   ` David Hildenbrand (Arm)
  1 sibling, 0 replies; 19+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-24 15:10 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> We are doing a lot of redundant lru_add_drain() calls in
> do_swap_page(), especially for synchronous I/O devices. For
> example, the test program below currently ends up draining
> lru_cache 100% of the time:
> 
> int main(int argc, char *argv[])
> {
>         int i;
>  #define SIZE 100*1024*1024
> 	while(1) {
> 		volatile int *p = mmap(0, SIZE, PROT_READ | PROT_WRITE,
>                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> 		for (int i = 0; i < SIZE/sizeof(int); i++)
> 			p[i] =  i%64;
> 		madvise((void *)p, SIZE, MADV_PAGEOUT);
> 		for (int i = 0; i < SIZE/sizeof(int); i++)
> 			p[i] =  i%64;
> 		munmap(p, SIZE);
> 	}
> 	return 0;
> }
> 
> Folio reuse now relies primarily on the exclusive hint, making
> lru_cache draining to drop the refcount in lru_cache largely
> irrelevant.

Makes sense, we'll fallback to do_wp_page() where we handle the non-exclusive
either way.

Acked-by: David Hildenbrand (Arm) <david@kernel.org>

-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios
  2026-06-23 23:16 ` [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios Barry Song (Xiaomi)
@ 2026-06-24 15:20   ` David Hildenbrand (Arm)
  2026-06-24 21:14     ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-24 15:20 UTC (permalink / raw)
  To: Barry Song (Xiaomi), akpm, linux-mm
  Cc: baoquan.he, chrisl, jp.kobryn, kasong, liam, linux-kernel, ljs,
	mhocko, nphamcs, rppt, shakeel.butt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park, Kairui Song

On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> Originally, we unconditionally called lru_add_drain() for write
> swap-in page faults. This might drop the reference held by the per-CPU
> LRU cache if the folio happened to reside there. However, there was no
> guarantee that the folio was actually cached on the current CPU.
> 
> Now that lru_add_drain() has been removed, we have lost one
> opportunity to drop a reference held by the LRU cache. We could
> instead incorporate that possibility into the condition evaluated by
> should_try_to_free_swap().
> 
> Suggested-by: Kairui Song <ryncsn@gmail.com>
> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> ---
>  mm/memory.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2983a6baf474..14577c67c61a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5087,8 +5087,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	 * Remove the swap entry and conditionally try to free up the swapcache.
>  	 * Do it after mapping, so raced page faults will likely see the folio
>  	 * in swap cache and wait on the folio lock.
> +	 * Assume non-LRU folios may be queued in the LRU cache, which contributes
> +	 * an additional reference to the folio.
>  	 */
> -	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
> +	if (should_try_to_free_swap(si, folio, vma, nr_pages +
> +			!folio_test_lru(folio), vmf->flags))
>  		folio_free_swap(folio);
>  
>  	folio_unlock(folio);

Hm, in wp_can_reuse_anon_folio() we'll try dropping the swapcache ourselves.

So I wonder if we still need that handling ("If we want to map a page that's in
the swapcache writable, we ...") at all?


Ahh, I see the problem now:

commit 4b34f1d82c6549837b2061096dea249e881a4495
Author: Kairui Song <kasong@tencent.com>
Date:   Sat Dec 20 03:43:35 2025 +0800

    mm, swap: free the swap cache after folio is mapped

    Currently, we remove the folio from the swap cache and free the swap cache
    before mapping the PTE.  To reduce repeated faults due to parallel swapins
    of the same PTE, change it to remove the folio from the swap cache after
    it is mapped.  So new faults from the swap PTE will be much more likely to
    see the folio in the swap cache and wait on it.

    This does not eliminate all swapin races: an ongoing swapin fault may
    still see an empty swap cache.  That's harmless, as the PTE is changed
    before the swap cache is cleared, so it will just return and not trigger
    any repeated faults.  This does help to reduce the chance.


That changed that behavior such that we *must* now always fallback to do_wp_page().

What a mess (I didn't ack)

-- 
Cheers,

David

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  2026-06-24 15:02   ` David Hildenbrand (Arm)
@ 2026-06-24 21:04     ` Barry Song
  2026-06-26 16:25       ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2026-06-24 21:04 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: akpm, linux-mm, baoquan.he, chrisl, jp.kobryn, kasong, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park

On Wed, Jun 24, 2026 at 11:02 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> > We always unconditionally drain the LRU before retrying anon folio
> > reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
> > are in lru_cache, and use the refcount to avoid many unnecessary LRU
> > drains.
> >
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
> >  mm/memory.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index ff338c2abe92..f6848f4234a6 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
> >        */
> >       if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
> >               return false;
> > -     if (!folio_test_lru(folio))
> > +     if (!folio_test_lru(folio)) {
> > +             /*
> > +              * Assume folio is on lru_cache and holds a cache reference.
> > +              */
> > +             if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
> > +                     return false;
>
> I'm not keen on making this function even uglier, so no, not like that.
>
> We have the earlier "folio_ref_count(folio) > 3" check.
>
> In which scenarios can you trigger this such that we would care?
>
> If the answer is "I don't know" there is no reason for a change.

As I replied to Shakeel in the v1 discussion [1], this can avoid a
large number of drains, both during Ubuntu boot and under normal
workloads:

"I booted the system into Ubuntu, and after

boot completed I observed:

wp_reuse_skipped_drain: 5542
do_swap_skipped_drain: 0

Then I built the kernel in a 1GB memcg using zRAM swap, and observed:

wp_reuse_skipped_drain: 25017
do_swap_skipped_drain: 43595

So in summary, even without swap-in we can save a significant number of
drains in wp_can_reuse_anon_folio(). With heavy swap-in workloads, most
of the savings come from avoiding drains in do_swap_page(), while we
still see a substantial number of skipped drains in wp_can_reuse_anon_folio()."

This is exactly the case where folio_ref_count(folio) == 3 and
!folio_test_swapcache(folio). That is why checking
folio_ref_count(folio) > 3 does not help.

[1] https://lore.kernel.org/linux-mm/CAGsJ_4yCeAmpfo_8ef4MrRyRD_UtV-xGm6F4GpSjvLVPQsbiEg@mail.gmail.com/

Best Regards
Barry

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios
  2026-06-24 15:20   ` David Hildenbrand (Arm)
@ 2026-06-24 21:14     ` Barry Song
  2026-06-25 14:40       ` Kairui Song
  0 siblings, 1 reply; 19+ messages in thread
From: Barry Song @ 2026-06-24 21:14 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: akpm, linux-mm, baoquan.he, chrisl, jp.kobryn, kasong, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park, Kairui Song

On Wed, Jun 24, 2026 at 11:20 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> > Originally, we unconditionally called lru_add_drain() for write
> > swap-in page faults. This might drop the reference held by the per-CPU
> > LRU cache if the folio happened to reside there. However, there was no
> > guarantee that the folio was actually cached on the current CPU.
> >
> > Now that lru_add_drain() has been removed, we have lost one
> > opportunity to drop a reference held by the LRU cache. We could
> > instead incorporate that possibility into the condition evaluated by
> > should_try_to_free_swap().
> >
> > Suggested-by: Kairui Song <ryncsn@gmail.com>
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
> >  mm/memory.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 2983a6baf474..14577c67c61a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5087,8 +5087,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >        * Remove the swap entry and conditionally try to free up the swapcache.
> >        * Do it after mapping, so raced page faults will likely see the folio
> >        * in swap cache and wait on the folio lock.
> > +      * Assume non-LRU folios may be queued in the LRU cache, which contributes
> > +      * an additional reference to the folio.
> >        */
> > -     if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
> > +     if (should_try_to_free_swap(si, folio, vma, nr_pages +
> > +                     !folio_test_lru(folio), vmf->flags))
> >               folio_free_swap(folio);
> >
> >       folio_unlock(folio);
>
> Hm, in wp_can_reuse_anon_folio() we'll try dropping the swapcache ourselves.
>
> So I wonder if we still need that handling ("If we want to map a page that's in
> the swapcache writable, we ...") at all?
>
>
> Ahh, I see the problem now:
>
> commit 4b34f1d82c6549837b2061096dea249e881a4495
> Author: Kairui Song <kasong@tencent.com>
> Date:   Sat Dec 20 03:43:35 2025 +0800
>
>     mm, swap: free the swap cache after folio is mapped
>
>     Currently, we remove the folio from the swap cache and free the swap cache
>     before mapping the PTE.  To reduce repeated faults due to parallel swapins
>     of the same PTE, change it to remove the folio from the swap cache after
>     it is mapped.  So new faults from the swap PTE will be much more likely to
>     see the folio in the swap cache and wait on it.
>
>     This does not eliminate all swapin races: an ongoing swapin fault may
>     still see an empty swap cache.  That's harmless, as the PTE is changed
>     before the swap cache is cleared, so it will just return and not trigger
>     any repeated faults.  This does help to reduce the chance.
>
>
> That changed that behavior such that we *must* now always fallback to do_wp_page().
>
> What a mess (I didn't ack)

So it seems we need a patch before patch 4 to fix the missed
swapcache freeing on write faults, because the write bit has already
been cleared by the time the folio is mapped and reused?

diff --git a/mm/memory.c b/mm/memory.c
index 14577c67c61a..e8fc5a86f6ea 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4752,6 +4752,7 @@ static void check_swap_exclusive(struct folio
*folio, swp_entry_t entry,
  */
 vm_fault_t do_swap_page(struct vm_fault *vmf)
 {
+       enum fault_flag fault_flags = vmf->flags;
        struct vm_area_struct *vma = vmf->vma;
        struct folio *swapcache = NULL, *folio;
        struct page *page;
@@ -5091,7 +5092,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         * an additional reference to the folio.
         */
        if (should_try_to_free_swap(si, folio, vma, nr_pages +
-                       !folio_test_lru(folio), vmf->flags))
+                       !folio_test_lru(folio), fault_flags))
                folio_free_swap(folio);

        folio_unlock(folio);


Best Regards
Barry

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic
  2026-06-24 15:07   ` David Hildenbrand (Arm)
@ 2026-06-24 21:29     ` Barry Song
  0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2026-06-24 21:29 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: akpm, linux-mm, baoquan.he, chrisl, jp.kobryn, kasong, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park

On Wed, Jun 24, 2026 at 11:08 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> > The "we just allocated them without exposing them to the swapcache"
> > case no longer exists, as Kairui has routed synchronous I/O through
> > the swapcache as well in his series "unify swapin use swap cache and
> > cleanup flags"[1]. As a result, folio_ref_count() should never be 1
> > in this path, since at least two references are held (base ref plus
> > swapcache). Remove the folio_ref_count()==1 check and update the
> > comment accordingly.
> >
> > [1] https://lore.kernel.org/all/20251220-swap-table-p2-v5-0-8862a265a033@tencent.com/
> >
> > Acked-by: Usama Arif <usama.arif@linux.dev>
> > Reviewed-by: Kairui Song <kasong@tencent.com>
> > Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > ---
> >  mm/memory.c | 7 ++-----
> >  1 file changed, 2 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f6848f4234a6..abd0adcf65f0 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5049,12 +5049,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >
> >       /*
> >        * Same logic as in do_wp_page(); however, optimize for pages that are
>
> s/Same/Similar/ ?

will change to similar in v3.

>
> > -      * certainly not shared either because we just allocated them without
> > -      * exposing them to the swapcache or because the swap entry indicates
> > -      * exclusivity.
> > +      * certainly not because the swap entry indicates exclusivity.
> >        */
> > -     if (!folio_test_ksm(folio) &&
> > -         (exclusive || folio_ref_count(folio) == 1)) {
> > +     if (!folio_test_ksm(folio) && exclusive) {
>
> Hmm, but KSM folios should never have "exclusive" set. So I think you can drop
> that as well (was only relevant with folio_ref_count==1 check IIRC).

right. will drop !folio_test_ksm(folio) in v3.

Best Regards
Barry

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios
  2026-06-24 21:14     ` Barry Song
@ 2026-06-25 14:40       ` Kairui Song
  2026-06-26 16:35         ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 19+ messages in thread
From: Kairui Song @ 2026-06-25 14:40 UTC (permalink / raw)
  To: Barry Song, David Hildenbrand (Arm)
  Cc: akpm, linux-mm, baoquan.he, chrisl, jp.kobryn, kasong, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park

On Thu, Jun 25, 2026 at 05:14:45AM +0800, Barry Song wrote:
> On Wed, Jun 24, 2026 at 11:20 PM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> > > Originally, we unconditionally called lru_add_drain() for write
> > > swap-in page faults. This might drop the reference held by the per-CPU
> > > LRU cache if the folio happened to reside there. However, there was no
> > > guarantee that the folio was actually cached on the current CPU.
> > >
> > > Now that lru_add_drain() has been removed, we have lost one
> > > opportunity to drop a reference held by the LRU cache. We could
> > > instead incorporate that possibility into the condition evaluated by
> > > should_try_to_free_swap().
> > >
> > > Suggested-by: Kairui Song <ryncsn@gmail.com>
> > > Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > > ---
> > >  mm/memory.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 2983a6baf474..14577c67c61a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -5087,8 +5087,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> > >        * Remove the swap entry and conditionally try to free up the swapcache.
> > >        * Do it after mapping, so raced page faults will likely see the folio
> > >        * in swap cache and wait on the folio lock.
> > > +      * Assume non-LRU folios may be queued in the LRU cache, which contributes
> > > +      * an additional reference to the folio.
> > >        */
> > > -     if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
> > > +     if (should_try_to_free_swap(si, folio, vma, nr_pages +
> > > +                     !folio_test_lru(folio), vmf->flags))
> > >               folio_free_swap(folio);
> > >
> > >       folio_unlock(folio);
> >
> > Hm, in wp_can_reuse_anon_folio() we'll try dropping the swapcache ourselves.
> >
> > So I wonder if we still need that handling ("If we want to map a page that's in
> > the swapcache writable, we ...") at all?
> >
> >
> > Ahh, I see the problem now:
> >
> > commit 4b34f1d82c6549837b2061096dea249e881a4495
> > Author: Kairui Song <kasong@tencent.com>
> > Date:   Sat Dec 20 03:43:35 2025 +0800
> >
> >     mm, swap: free the swap cache after folio is mapped
> >
> >     Currently, we remove the folio from the swap cache and free the swap cache
> >     before mapping the PTE.  To reduce repeated faults due to parallel swapins
> >     of the same PTE, change it to remove the folio from the swap cache after
> >     it is mapped.  So new faults from the swap PTE will be much more likely to
> >     see the folio in the swap cache and wait on it.
> >
> >     This does not eliminate all swapin races: an ongoing swapin fault may
> >     still see an empty swap cache.  That's harmless, as the PTE is changed
> >     before the swap cache is cleared, so it will just return and not trigger
> >     any repeated faults.  This does help to reduce the chance.
> >
> >
> > That changed that behavior such that we *must* now always fallback to do_wp_page().
> >
> > What a mess (I didn't ack)

That's awkward, my bad, terribly sorry about this :(

I sincerely apologize for the oversight and the trouble this has caused.

I haven't seen any performance regression in any workload recently
though, or any correctness issue, perhaps the round trip of
do_wp_page wasn't that bad. It should still catch the reuse folios,
just more costly than doing things in-place.

I think we should restore the original check first. We might also want to
avoid dropping the swap cache if the folio will not be reused, which
was discussed here:

https://lore.kernel.org/linux-mm/CAMgjq7BDfvNXdWH0cqarsujjUn3i3tDDhDkmSg01TR4h-tDorQ@mail.gmail.com/

Maybe extracting some common part into a helper can help make this
cleaner.

> 
> So it seems we need a patch before patch 4 to fix the missed
> swapcache freeing on write faults, because the write bit has already
> been cleared by the time the folio is mapped and reused?
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 14577c67c61a..e8fc5a86f6ea 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4752,6 +4752,7 @@ static void check_swap_exclusive(struct folio
> *folio, swp_entry_t entry,
>   */
>  vm_fault_t do_swap_page(struct vm_fault *vmf)
>  {
> +       enum fault_flag fault_flags = vmf->flags;
>         struct vm_area_struct *vma = vmf->vma;
>         struct folio *swapcache = NULL, *folio;
>         struct page *page;
> @@ -5091,7 +5092,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>          * an additional reference to the folio.
>          */
>         if (should_try_to_free_swap(si, folio, vma, nr_pages +
> -                       !folio_test_lru(folio), vmf->flags))
> +                       !folio_test_lru(folio), fault_flags))
>                 folio_free_swap(folio);
> 
>         folio_unlock(folio);
> 
> 
> Best Regards
> Barry

Hi Barry,

The problem is more than that, the `exclusive || folio_ref_count(folio) == 1`
in do_swap_page is also ineffective now.

But that earlier commit 4b34f1d82c65 is kind of meaningless after recent
refactor of swapin. It was trying to avoid redundant folio allocation
and thrashing from concurrent faulting process of the same PTE, but recent
swap table rework already gated that with unified folio allocation. We
are freeing the swap entry first here, that is effective enough so that
commit is no longer needed.

So perhaps we can first just revert it, how do you think?:

From 8009328e0782fc22365ef32474bae749beda637b Mon Sep 17 00:00:00 2001
From: Kairui Song <kasong@tencent.com>
Date: Thu, 25 Jun 2026 11:44:26 +0800
Subject: [PATCH] mm/swap: free swap cache before mapping to fix folio reuse

This is effectively a revert of commit 4b34f1d82c65 ("mm, swap: free the
swap cache after folio is mapped").

That commit was trying to reduce wasted fault and folio allocation
from parallel swapins of the same entry: keeping the folio in the swap
cache until the PTE is installed makes a racing fault more likely to find
it and wait on the folio lock, rather than allocate and charge a new folio
only to discover the race and throw it away, which leads to thrashing.

That benefit is marginal now. Folio allocation and swapin are both gated
on the swap entry count in the swap cache layer since commit 02d733a7ec1d
("mm, swap: unify large folio allocation"). Once the winning fault
drops the entry count, a racing fault that misses the cache will fail
the swap entry count check too and back off at the page table lock before
allocating anything. Note there is still a tiny window with or without
either patch, but it's already minimized now and should be fixed in
other ways later if needed.

Meanwhile the earlier commit has a real problem, as David pointed out [1].
The swap cache reference is still held during the write fault folio
reference check, so a sole-owned but non-exclusive write fault always
falls back to do_wp_page() instead of being reused in place. The
write-fault path of should_try_to_free_swap() is broken too as the
FAULT_FLAG_WRITE flag is missing.

There is no correctness problem though, only the reuse and cleanup
optimizations were lost.

So revert it. This makes the sole-owner check effective again and
recovers the reuse fast path. Also slightly adjust a sanity check
to be more rigorous.

Fixes: 4b34f1d82c65 ("mm, swap: free the swap cache after folio is mapped")
Reported-by: David Hildenbrand (Arm) <david@kernel.org>
Closes: https://lore.kernel.org/linux-mm/e56c4d73-ed4a-48bb-8d0a-97b1200d4a35@kernel.org/ [1]
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/memory.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index ff338c2abe92..f31d6b8e8c0b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4512,7 +4512,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 static inline bool should_try_to_free_swap(struct swap_info_struct *si,
 					   struct folio *folio,
 					   struct vm_area_struct *vma,
-					   unsigned int extra_refs,
 					   unsigned int fault_flags)
 {
 	if (!folio_test_swapcache(folio))
@@ -4535,7 +4534,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
 	 * reference only in case it's likely that we'll be the exclusive user.
 	 */
 	return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-		folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+		folio_ref_count(folio) == (1 + folio_nr_pages(folio));
 }
 
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
@@ -5033,6 +5032,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	 */
 	arch_swap_restore(folio_swap(entry, folio), folio);
 
+	/*
+	 * Remove the swap entry and conditionally try to free up the swapcache.
+	 * We're already holding a reference on the page but haven't mapped it
+	 * yet.
+	 *
+	 * The swap count has to be freed to 0 first so folio_free_swap
+	 * can free exclusive clean cache. Concurrent fault is very unlikely
+	 * to trigger redundant folio alloc, nor will it cause redundant IO,
+	 * as the swap entry count gates it.
+	 */
+	VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
+	if (folio != swapcache)
+		folio_put_swap(swapcache, NULL);
+	else
+		folio_put_swap(folio, nr_pages < folio_nr_pages(folio) ? page : NULL);
+	if (should_try_to_free_swap(si, folio, vma, vmf->flags))
+		folio_free_swap(folio);
+
 	add_mm_counter(vma->vm_mm, MM_ANONPAGES, nr_pages);
 	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -nr_pages);
 	pte = mk_pte(page, vma->vm_page_prot);
@@ -5067,7 +5084,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (unlikely(folio != swapcache)) {
 		folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
 		folio_add_lru_vma(folio, vma);
-		folio_put_swap(swapcache, NULL);
 	} else if (!folio_test_anon(folio)) {
 		/*
 		 * We currently only expect !anon folios that are fully
@@ -5076,12 +5092,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		VM_WARN_ON_ONCE_FOLIO(folio_nr_pages(folio) != nr_pages, folio);
 		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
 		folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
-		folio_put_swap(folio, NULL);
 	} else {
-		VM_WARN_ON_ONCE(nr_pages != 1 && nr_pages != folio_nr_pages(folio));
 		folio_add_anon_rmap_ptes(folio, page, nr_pages, vma, address,
 					 rmap_flags);
-		folio_put_swap(folio, nr_pages == 1 ? page : NULL);
 	}
 
 	VM_BUG_ON(!folio_test_anon(folio) ||
@@ -5090,14 +5103,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	arch_do_swap_page_nr(vma->vm_mm, vma, address,
 			pte, pte, nr_pages);
 
-	/*
-	 * Remove the swap entry and conditionally try to free up the swapcache.
-	 * Do it after mapping, so raced page faults will likely see the folio
-	 * in swap cache and wait on the folio lock.
-	 */
-	if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
-		folio_free_swap(folio);
-
 	folio_unlock(folio);
 	if (unlikely(folio != swapcache)) {
 		/*
-- 
2.54.0


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  2026-06-24 21:04     ` Barry Song
@ 2026-06-26 16:25       ` David Hildenbrand (Arm)
  2026-06-27  2:44         ` Shakeel Butt
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-26 16:25 UTC (permalink / raw)
  To: Barry Song
  Cc: akpm, linux-mm, baoquan.he, chrisl, jp.kobryn, kasong, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park

On 6/24/26 23:04, Barry Song wrote:
> On Wed, Jun 24, 2026 at 11:02 PM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
>>> We always unconditionally drain the LRU before retrying anon folio
>>> reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
>>> are in lru_cache, and use the refcount to avoid many unnecessary LRU
>>> drains.
>>>
>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
>>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
>>> ---
>>>  mm/memory.c | 8 +++++++-
>>>  1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index ff338c2abe92..f6848f4234a6 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
>>>        */
>>>       if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
>>>               return false;
>>> -     if (!folio_test_lru(folio))
>>> +     if (!folio_test_lru(folio)) {
>>> +             /*
>>> +              * Assume folio is on lru_cache and holds a cache reference.
>>> +              */
>>> +             if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
>>> +                     return false;
>>
>> I'm not keen on making this function even uglier, so no, not like that.
>>
>> We have the earlier "folio_ref_count(folio) > 3" check.
>>
>> In which scenarios can you trigger this such that we would care?
>>
>> If the answer is "I don't know" there is no reason for a change.
> 
> As I replied to Shakeel in the v1 discussion [1], this can avoid a
> large number of drains, both during Ubuntu boot and under normal
> workloads:
> 
> "I booted the system into Ubuntu, and after

is this with this patch only?

> 
> boot completed I observed:
> 
> wp_reuse_skipped_drain: 5542
> do_swap_skipped_drain: 0
> 
> Then I built the kernel in a 1GB memcg using zRAM swap, and observed:
> 
> wp_reuse_skipped_drain: 25017
> do_swap_skipped_drain: 43595

This is all data that belongs into this patch description, not hidden
somewhere on the internet :)


And why do we care about local draining? This is not a drain-all.

Really, this patch needs more motivation clearly spelled out.

> 
> So in summary, even without swap-in we can save a significant number of
> drains in wp_can_reuse_anon_folio(). With heavy swap-in workloads, most
> of the savings come from avoiding drains in do_swap_page(), while we
> still see a substantial number of skipped drains in wp_can_reuse_anon_folio()."
> 
> This is exactly the case where folio_ref_count(folio) == 3 and
> !folio_test_swapcache(folio). That is why checking
> folio_ref_count(folio) > 3 does not help.

diff --git a/mm/memory.c b/mm/memory.c
index fec2e9f2858a6..153a6166beeb4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4181,6 +4181,9 @@ static bool __wp_can_reuse_large_anon_folio(struct folio *folio,
 static bool wp_can_reuse_anon_folio(struct folio *folio,
                                    struct vm_area_struct *vma)
 {
+       const bool in_lru_cache = !folio_test_lru(folio);
+       const bool in_swapcache = folio_test_swapcache(folio);
+
        if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && folio_test_large(folio))
                return __wp_can_reuse_large_anon_folio(folio, vma);
 
@@ -4191,15 +4194,16 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
         *
         * KSM doesn't necessarily raise the folio refcount.
         */
-       if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
+       if (folio_test_ksm(folio) ||
+           folio_ref_count(folio) > 1 + in_lru_cache + in_swapcache)
                return false;
-       if (!folio_test_lru(folio))
+       if (in_lru_cache)
                /*
                 * We cannot easily detect+handle references from
                 * remote LRU caches or references to LRU folios.
                 */
                lru_add_drain();
-       if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
+       if (folio_ref_count(folio) > 1 + in_swapcache)
                return false;
        if (!folio_trylock(folio))
                return false;

?

But I also want to hear why we care about the local draining.

-- 
Cheers,

David


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios
  2026-06-25 14:40       ` Kairui Song
@ 2026-06-26 16:35         ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-26 16:35 UTC (permalink / raw)
  To: Kairui Song, Barry Song
  Cc: akpm, linux-mm, baoquan.he, chrisl, jp.kobryn, kasong, liam,
	linux-kernel, ljs, mhocko, nphamcs, rppt, shakeel.butt, shikemeng,
	surenb, usama.arif, vbabka, youngjun.park

>>> That changed that behavior such that we *must* now always fallback to do_wp_page().
>>>
>>> What a mess (I didn't ack)
> 
> That's awkward, my bad, terribly sorry about this :(
> 
> I sincerely apologize for the oversight and the trouble this has caused.
> 

All good, I was just surprised to see a previous optimization partially
reverted without a clear reasoning :)

Because it should have removed the handling in should_try_to_free_swap() as well.

It's good that we are discussing it now!

> I haven't seen any performance regression in any workload recently
> though, or any correctness issue, perhaps the round trip of
> do_wp_page wasn't that bad. It should still catch the reuse folios,
> just more costly than doing things in-place.

Right, do_wp_page() handles it, after the page was mapped. It adds some overhead,
but fortunately no TLB flush if we're just upgrading write permissions.

The optimization dates back to pre PageAnonExclusive handling.

> 
> I think we should restore the original check first. We might also want to
> avoid dropping the swap cache if the folio will not be reused, which
> was discussed here:
> 
> https://lore.kernel.org/linux-mm/CAMgjq7BDfvNXdWH0cqarsujjUn3i3tDDhDkmSg01TR4h-tDorQ@mail.gmail.com/
> 
> Maybe extracting some common part into a helper can help make this
> cleaner.
> 
>>
>> So it seems we need a patch before patch 4 to fix the missed
>> swapcache freeing on write faults, because the write bit has already
>> been cleared by the time the folio is mapped and reused?
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 14577c67c61a..e8fc5a86f6ea 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -4752,6 +4752,7 @@ static void check_swap_exclusive(struct folio
>> *folio, swp_entry_t entry,
>>   */
>>  vm_fault_t do_swap_page(struct vm_fault *vmf)
>>  {
>> +       enum fault_flag fault_flags = vmf->flags;
>>         struct vm_area_struct *vma = vmf->vma;
>>         struct folio *swapcache = NULL, *folio;
>>         struct page *page;
>> @@ -5091,7 +5092,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>>          * an additional reference to the folio.
>>          */
>>         if (should_try_to_free_swap(si, folio, vma, nr_pages +
>> -                       !folio_test_lru(folio), vmf->flags))
>> +                       !folio_test_lru(folio), fault_flags))
>>                 folio_free_swap(folio);
>>
>>         folio_unlock(folio);
>>
>>
>> Best Regards
>> Barry
> 
> Hi Barry,
> 
> The problem is more than that, the `exclusive || folio_ref_count(folio) == 1`
> in do_swap_page is also ineffective now.

Exactly.

If the roundtrip through do_wp_page() is good enough today, we can just do

@@ -4512,7 +4516,6 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 static inline bool should_try_to_free_swap(struct swap_info_struct *si,
                                           struct folio *folio,
                                           struct vm_area_struct *vma,
-                                          unsigned int extra_refs,
                                           unsigned int fault_flags)
 {
        if (!folio_test_swapcache(folio))
@@ -4528,14 +4531,7 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
        if (mem_cgroup_swap_full(folio) || (vma->vm_flags & VM_LOCKED) ||
            folio_test_mlocked(folio))
                return true;
-       /*
-        * If we want to map a page that's in the swapcache writable, we
-        * have to detect via the refcount if we're really the exclusive
-        * user. Try freeing the swapcache to get rid of the swapcache
-        * reference only in case it's likely that we'll be the exclusive user.
-        */
-       return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-               folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+       return false;
 }
 
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
@@ -5095,7 +5091,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         * Do it after mapping, so raced page faults will likely see the folio
         * in swap cache and wait on the folio lock.
         */
-       if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
+       if (should_try_to_free_swap(si, folio, vma, vmf->flags))
                folio_free_swap(folio);
 
        folio_unlock(folio);

But then, one question is whether we'd actually want to try removing the swapcache when
we mapped the page writable (iow: exclusive)?


@@ -4512,7 +4516,7 @@ static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
 static inline bool should_try_to_free_swap(struct swap_info_struct *si,
                                           struct folio *folio,
                                           struct vm_area_struct *vma,
-                                          unsigned int extra_refs,
+                                          bool exclusive,
                                           unsigned int fault_flags)
 {
        if (!folio_test_swapcache(folio))
@@ -4529,13 +4533,11 @@ static inline bool should_try_to_free_swap(struct swap_info_struct *si,
            folio_test_mlocked(folio))
                return true;
        /*
-        * If we want to map a page that's in the swapcache writable, we
-        * have to detect via the refcount if we're really the exclusive
-        * user. Try freeing the swapcache to get rid of the swapcache
-        * reference only in case it's likely that we'll be the exclusive user.
+        * We have an exclusive page that was mapped writable or will soon
+        * be mapped writable (as we are in a write fault). Let's just try
+        * to reclaim swap immediately.
         */
-       return (fault_flags & FAULT_FLAG_WRITE) && !folio_test_ksm(folio) &&
-               folio_ref_count(folio) == (extra_refs + folio_nr_pages(folio));
+       return (fault_flags & FAULT_FLAG_WRITE) && exclusive;
 }
 
 static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
@@ -5095,7 +5097,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
         * Do it after mapping, so raced page faults will likely see the folio
         * in swap cache and wait on the folio lock.
         */
-       if (should_try_to_free_swap(si, folio, vma, nr_pages, vmf->flags))
+       if (should_try_to_free_swap(si, folio, vma, exclusive, vmf->flags))
                folio_free_swap(folio);
 
        folio_unlock(folio);


-- 
Cheers,

David


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  2026-06-26 16:25       ` David Hildenbrand (Arm)
@ 2026-06-27  2:44         ` Shakeel Butt
  2026-06-27  7:20           ` Barry Song
  0 siblings, 1 reply; 19+ messages in thread
From: Shakeel Butt @ 2026-06-27  2:44 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Barry Song, akpm, linux-mm, baoquan.he, chrisl, jp.kobryn, kasong,
	liam, linux-kernel, ljs, mhocko, nphamcs, rppt, shikemeng, surenb,
	usama.arif, vbabka, youngjun.park

On Fri, Jun 26, 2026 at 06:25:48PM +0200, David Hildenbrand (Arm) wrote:
> On 6/24/26 23:04, Barry Song wrote:
> > On Wed, Jun 24, 2026 at 11:02 PM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> >>
> >> On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> >>> We always unconditionally drain the LRU before retrying anon folio
> >>> reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
> >>> are in lru_cache, and use the refcount to avoid many unnecessary LRU
> >>> drains.
> >>>
> >>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> >>> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> >>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> >>> ---
> >>>  mm/memory.c | 8 +++++++-
> >>>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/mm/memory.c b/mm/memory.c
> >>> index ff338c2abe92..f6848f4234a6 100644
> >>> --- a/mm/memory.c
> >>> +++ b/mm/memory.c
> >>> @@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
> >>>        */
> >>>       if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
> >>>               return false;
> >>> -     if (!folio_test_lru(folio))
> >>> +     if (!folio_test_lru(folio)) {
> >>> +             /*
> >>> +              * Assume folio is on lru_cache and holds a cache reference.
> >>> +              */
> >>> +             if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
> >>> +                     return false;
> >>
> >> I'm not keen on making this function even uglier, so no, not like that.
> >>
> >> We have the earlier "folio_ref_count(folio) > 3" check.
> >>
> >> In which scenarios can you trigger this such that we would care?
> >>
> >> If the answer is "I don't know" there is no reason for a change.
> > 
> > As I replied to Shakeel in the v1 discussion [1], this can avoid a
> > large number of drains, both during Ubuntu boot and under normal
> > workloads:
> > 
> > "I booted the system into Ubuntu, and after
> 
> is this with this patch only?
> 
> > 
> > boot completed I observed:
> > 
> > wp_reuse_skipped_drain: 5542
> > do_swap_skipped_drain: 0
> > 
> > Then I built the kernel in a 1GB memcg using zRAM swap, and observed:
> > 
> > wp_reuse_skipped_drain: 25017
> > do_swap_skipped_drain: 43595
> 
> This is all data that belongs into this patch description, not hidden
> somewhere on the internet :)

I asked the same i.e. to include this information in the commit message [1].

[1] https://lore.kernel.org/linux-mm/ajK4N2zK3sPZFQuf@linux.dev/

> 
> 
> And why do we care about local draining? This is not a drain-all.

Let me take a stab at it. Local draining can potentially make the lru cache's
batching ineffective. The whole reason for lru caches is to batch the lru
operations to keep the lru lock contention low but if we keep draining
prematurely we will continue to make lru lock contention worse. This is
particularly bad for workloads which chrun a lot of LRU pages through
allocation, free and/or reclaim.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio()
  2026-06-27  2:44         ` Shakeel Butt
@ 2026-06-27  7:20           ` Barry Song
  0 siblings, 0 replies; 19+ messages in thread
From: Barry Song @ 2026-06-27  7:20 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: David Hildenbrand (Arm), akpm, linux-mm, baoquan.he, chrisl,
	jp.kobryn, kasong, liam, linux-kernel, ljs, mhocko, nphamcs, rppt,
	shikemeng, surenb, usama.arif, vbabka, youngjun.park

On Sat, Jun 27, 2026 at 10:44 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Fri, Jun 26, 2026 at 06:25:48PM +0200, David Hildenbrand (Arm) wrote:
> > On 6/24/26 23:04, Barry Song wrote:
> > > On Wed, Jun 24, 2026 at 11:02 PM David Hildenbrand (Arm)
> > > <david@kernel.org> wrote:
> > >>
> > >> On 6/24/26 01:16, Barry Song (Xiaomi) wrote:
> > >>> We always unconditionally drain the LRU before retrying anon folio
> > >>> reuse in wp_can_reuse_anon_folio(). Instead, assume !LRU anon folios
> > >>> are in lru_cache, and use the refcount to avoid many unnecessary LRU
> > >>> drains.
> > >>>
> > >>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > >>> Reviewed-by: Baoquan He <baoquan.he@linux.dev>
> > >>> Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
> > >>> ---
> > >>>  mm/memory.c | 8 +++++++-
> > >>>  1 file changed, 7 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/mm/memory.c b/mm/memory.c
> > >>> index ff338c2abe92..f6848f4234a6 100644
> > >>> --- a/mm/memory.c
> > >>> +++ b/mm/memory.c
> > >>> @@ -4193,12 +4193,18 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
> > >>>        */
> > >>>       if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
> > >>>               return false;
> > >>> -     if (!folio_test_lru(folio))
> > >>> +     if (!folio_test_lru(folio)) {
> > >>> +             /*
> > >>> +              * Assume folio is on lru_cache and holds a cache reference.
> > >>> +              */
> > >>> +             if (folio_ref_count(folio) > 2 + folio_test_swapcache(folio))
> > >>> +                     return false;
> > >>
> > >> I'm not keen on making this function even uglier, so no, not like that.
> > >>
> > >> We have the earlier "folio_ref_count(folio) > 3" check.
> > >>
> > >> In which scenarios can you trigger this such that we would care?
> > >>
> > >> If the answer is "I don't know" there is no reason for a change.
> > >
> > > As I replied to Shakeel in the v1 discussion [1], this can avoid a
> > > large number of drains, both during Ubuntu boot and under normal
> > > workloads:
> > >
> > > "I booted the system into Ubuntu, and after
> >
> > is this with this patch only?

Right. wp_reuse_skipped_drain accounts for the behavior
introduced by this patch, while do_swap_skipped_drain comes
from patch 3/3.

> >
> > >
> > > boot completed I observed:
> > >
> > > wp_reuse_skipped_drain: 5542
> > > do_swap_skipped_drain: 0
> > >
> > > Then I built the kernel in a 1GB memcg using zRAM swap, and observed:
> > >
> > > wp_reuse_skipped_drain: 25017
> > > do_swap_skipped_drain: 43595
> >
> > This is all data that belongs into this patch description, not hidden
> > somewhere on the internet :)

Agreed. Sorry, I was being a bit lazy.

>
> I asked the same i.e. to include this information in the commit message [1].
>
> [1] https://lore.kernel.org/linux-mm/ajK4N2zK3sPZFQuf@linux.dev/

Sorry, I missed this one. I'll add it in v3.

>
> >
> >
> > And why do we care about local draining? This is not a drain-all.
>
> Let me take a stab at it. Local draining can potentially make the lru cache's
> batching ineffective. The whole reason for lru caches is to batch the lru
> operations to keep the lru lock contention low but if we keep draining
> prematurely we will continue to make lru lock contention worse. This is
> particularly bad for workloads which chrun a lot of LRU pages through
> allocation, free and/or reclaim.

I actually asked the same question myself in the RFC here[1]:

"On the other hand, the folio may be sitting in the lru_cache of
another CPU, which the current drain cannot flush. As things stand
today, the drain only succeeds if the folio happens to be queued on
the current CPU's lru_cache, giving it roughly a 1/nr_cpus chance of
working. drain_all would clearly be too expensive to avoid.
So another possibility is to drop this drain as well. The folio can
be released later, at the cost of missing some opportunities for
reuse."

After thinking about it for a couple of days, my gut feeling is
that keeping it might increase the chances of reuse in at least
two cases:
1. It could be a newly allocated folio in do_swap_page(), and
do_swap_page() may subsequently call do_wp_page(). In
that case, task migration is less likely to have occurred.
2. On systems with fewer CPUs (and typically less memory), the
chance of having the folio in the same CPU's lru_cache,
which is roughly 1 / nr_cpus, may still be reasonably high.

I'm not entirely sure about this, so I chose a relatively
conservative approach: making an obviously correct improvement
without introducing dramatic changes. Another reason is that it's
also hard to evaluate the impact of removing the local drain
across all systems and workloads.

Shakeel, David, what are your thoughts on this? Should we go
with a code refinement in v3 as David suggested, or completely
remove this drain?

[1] https://lore.kernel.org/linux-mm/CAGsJ_4yFgcvpTAZ_YV73+U=BEnOiSxaFtGNFpbf0DWBkVcOQ3Q@mail.gmail.com/

Best Regards
Barry

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2026-06-27  7:21 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-23 23:16 [PATCH v2 0/4] mm: drop redundant lru_add_drain in anon folio reuse paths Barry Song (Xiaomi)
2026-06-23 23:16 ` [PATCH v2 1/4] mm: avoid unnecessary lru drain for wp_can_reuse_anon_folio() Barry Song (Xiaomi)
2026-06-24 10:14   ` Kairui Song
2026-06-24 15:02   ` David Hildenbrand (Arm)
2026-06-24 21:04     ` Barry Song
2026-06-26 16:25       ` David Hildenbrand (Arm)
2026-06-27  2:44         ` Shakeel Butt
2026-06-27  7:20           ` Barry Song
2026-06-23 23:16 ` [PATCH v2 2/4] mm: drop stale folio_ref_count()==1 check in do_swap_page reuse logic Barry Song (Xiaomi)
2026-06-24 15:07   ` David Hildenbrand (Arm)
2026-06-24 21:29     ` Barry Song
2026-06-23 23:16 ` [PATCH v2 3/4] mm: entirely remove lru_add_drain in do_swap_page Barry Song (Xiaomi)
2026-06-24 10:16   ` Kairui Song
2026-06-24 15:10   ` David Hildenbrand (Arm)
2026-06-23 23:16 ` [PATCH v2 4/4] mm: try to free swapcache for non-LRU folios Barry Song (Xiaomi)
2026-06-24 15:20   ` David Hildenbrand (Arm)
2026-06-24 21:14     ` Barry Song
2026-06-25 14:40       ` Kairui Song
2026-06-26 16:35         ` David Hildenbrand (Arm)

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.