linux-mm.kvack.org archive mirror
* [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
@ 2024-07-04  6:52 yangge1116
  2024-07-12  8:56 ` Hugh Dickins
  2025-03-26 12:42 ` Jinjiang Tu
  0 siblings, 2 replies; 12+ messages in thread
From: yangge1116 @ 2024-07-04  6:52 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, david, baolin.wang,
	aneesh.kumar, liuzixing, yangge

From: yangge <yangge1116@126.com>

If a large amount of CMA memory is configured in the system (for example,
the CMA memory accounts for 50% of the system memory), starting a virtual
machine with device passthrough will
call pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory.
Normally, if a page is present and in a CMA area, pin_user_pages_remote()
will migrate the page from the CMA area to a non-CMA area because of the
FOLL_LONGTERM flag. But the current code can cause the migration to fail
due to unexpected page refcounts, and eventually causes the virtual
machine to fail to start.

Adding a page to an LRU batch increases its refcount by one, and removing
it from the batch decreases it by one. Page migration requires that the
page not be referenced by anything except the page mapping. Before
migrating a page, we should try to drain it from the LRU batch in case it
is sitting in one; however, folio_test_lru() is not sufficient to tell
whether the page is in an LRU batch, and if it is, the migration will fail.

To solve the problem above, we modify the logic of adding a page to an LRU
batch: before adding the page, we clear its LRU flag, so that
folio_test_lru() reliably tells us whether the page is in an LRU batch.
This is quite valuable, because we likely don't want to blindly drain the
LRU batches simply because there is some unexpected reference on a page,
as described above.

This change keeps the LRU flag of a page cleared for longer, which may
impact some callers. For example, as long as a page is on an LRU batch,
we cannot isolate it, and we cannot check whether it is an LRU page.
Further, a page can now be on only one LRU batch at a time. This doesn't
seem to matter much, because when a new page is allocated from the buddy
allocator and added to an LRU batch, or is isolated, its LRU flag may
also be invisible for a long time.
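For reference, the consumer this is aimed at is the FOLL_LONGTERM migration
path in mm/gup.c. A simplified sketch of the relevant check inside the loop
over the pinned folios (paraphrased, error handling and hugetlb handling
elided, so not the literal upstream code):

	/*
	 * With this change, a cleared LRU flag reliably means "the folio may
	 * be sitting on a per-CPU LRU batch", so draining the batches once is
	 * worthwhile. Previously the flag could still be set while the folio
	 * was also on a batch, hiding the batch's extra reference.
	 */
	if (!folio_test_lru(folio) && drain_allow) {
		lru_add_drain_all();
		drain_allow = false;
	}

	/* An unexpected extra reference would make the migration fail. */
	if (!folio_isolate_lru(folio))
		continue;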

Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Cc: <stable@vger.kernel.org>
Signed-off-by: yangge <yangge1116@126.com>
---
 mm/swap.c | 43 +++++++++++++++++++++++++++++++------------
 1 file changed, 31 insertions(+), 12 deletions(-)

V4:
   Adjust commit message according to David's comments
V3:
   Add fixes tag
V2:
   Adjust code and commit message according to David's comments

diff --git a/mm/swap.c b/mm/swap.c
index dc205bd..9caf6b0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -211,10 +211,6 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 	for (i = 0; i < folio_batch_count(fbatch); i++) {
 		struct folio *folio = fbatch->folios[i];
 
-		/* block memcg migration while the folio moves between lru */
-		if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
-			continue;
-
 		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
 		move_fn(lruvec, folio);
 
@@ -255,11 +251,16 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
 void folio_rotate_reclaimable(struct folio *folio)
 {
 	if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
-	    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
+	    !folio_test_unevictable(folio)) {
 		struct folio_batch *fbatch;
 		unsigned long flags;
 
 		folio_get(folio);
+		if (!folio_test_clear_lru(folio)) {
+			folio_put(folio);
+			return;
+		}
+
 		local_lock_irqsave(&lru_rotate.lock, flags);
 		fbatch = this_cpu_ptr(&lru_rotate.fbatch);
 		folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
@@ -352,11 +353,15 @@ static void folio_activate_drain(int cpu)
 
 void folio_activate(struct folio *folio)
 {
-	if (folio_test_lru(folio) && !folio_test_active(folio) &&
-	    !folio_test_unevictable(folio)) {
+	if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
 		struct folio_batch *fbatch;
 
 		folio_get(folio);
+		if (!folio_test_clear_lru(folio)) {
+			folio_put(folio);
+			return;
+		}
+
 		local_lock(&cpu_fbatches.lock);
 		fbatch = this_cpu_ptr(&cpu_fbatches.activate);
 		folio_batch_add_and_move(fbatch, folio, folio_activate_fn);
@@ -700,6 +705,11 @@ void deactivate_file_folio(struct folio *folio)
 		return;
 
 	folio_get(folio);
+	if (!folio_test_clear_lru(folio)) {
+		folio_put(folio);
+		return;
+	}
+
 	local_lock(&cpu_fbatches.lock);
 	fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
 	folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
@@ -716,11 +726,16 @@ void deactivate_file_folio(struct folio *folio)
  */
 void folio_deactivate(struct folio *folio)
 {
-	if (folio_test_lru(folio) && !folio_test_unevictable(folio) &&
-	    (folio_test_active(folio) || lru_gen_enabled())) {
+	if (!folio_test_unevictable(folio) && (folio_test_active(folio) ||
+	    lru_gen_enabled())) {
 		struct folio_batch *fbatch;
 
 		folio_get(folio);
+		if (!folio_test_clear_lru(folio)) {
+			folio_put(folio);
+			return;
+		}
+
 		local_lock(&cpu_fbatches.lock);
 		fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate);
 		folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn);
@@ -737,12 +752,16 @@ void folio_deactivate(struct folio *folio)
  */
 void folio_mark_lazyfree(struct folio *folio)
 {
-	if (folio_test_lru(folio) && folio_test_anon(folio) &&
-	    folio_test_swapbacked(folio) && !folio_test_swapcache(folio) &&
-	    !folio_test_unevictable(folio)) {
+	if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
+	    !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) {
 		struct folio_batch *fbatch;
 
 		folio_get(folio);
+		if (!folio_test_clear_lru(folio)) {
+			folio_put(folio);
+			return;
+		}
+
 		local_lock(&cpu_fbatches.lock);
 		fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree);
 		folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn);
-- 
2.7.4




* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2024-07-04  6:52 [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch yangge1116
@ 2024-07-12  8:56 ` Hugh Dickins
  2024-07-12 20:56   ` Andrew Morton
  2025-03-26 12:42 ` Jinjiang Tu
  1 sibling, 1 reply; 12+ messages in thread
From: Hugh Dickins @ 2024-07-12  8:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: yangge, linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
	aneesh.kumar, liuzixing, hughd

On Thu, 4 Jul 2024, yangge1116@126.com wrote:

> From: yangge <yangge1116@126.com>
> 
> If a large number of CMA memory are configured in system (for example, the
> CMA memory accounts for 50% of the system memory), starting a virtual
> virtual machine with device passthrough, it will
> call pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory.
> Normally if a page is present and in CMA area, pin_user_pages_remote()
> will migrate the page from CMA area to non-CMA area because of
> FOLL_LONGTERM flag. But the current code will cause the migration failure
> due to unexpected page refcounts, and eventually cause the virtual machine
> fail to start.
> 
> If a page is added in LRU batch, its refcount increases one, remove the
> page from LRU batch decreases one. Page migration requires the page is not
> referenced by others except page mapping. Before migrating a page, we
> should try to drain the page from LRU batch in case the page is in it,
> however, folio_test_lru() is not sufficient to tell whether the page is
> in LRU batch or not, if the page is in LRU batch, the migration will fail.
> 
> To solve the problem above, we modify the logic of adding to LRU batch.
> Before adding a page to LRU batch, we clear the LRU flag of the page so
> that we can check whether the page is in LRU batch by folio_test_lru(page).
> It's quite valuable, because likely we don't want to blindly drain the LRU
> batch simply because there is some unexpected reference on a page, as
> described above.
> 
> This change makes the LRU flag of a page invisible for longer, which
> may impact some programs. For example, as long as a page is on a LRU
> batch, we cannot isolate it, and we cannot check if it's an LRU page.
> Further, a page can now only be on exactly one LRU batch. This doesn't
> seem to matter much, because a new page is allocated from buddy and
> added to the lru batch, or be isolated, it's LRU flag may also be
> invisible for a long time.
> 
> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: yangge <yangge1116@126.com>

This is an interesting patch, and may (but might not) be a very good one.

I have no objection to it going forward to 6.11-rc, but I do object to
the Cc stable, and its current placing in mm-hotfixes-unstable (if that
means it's to be rushed into 6.10 final).

This is a subtle change to PG_lru handling, altering how that has been
managed since 5.11 (and before).  I have not observed any ill effect
from this patch, but I'm not at all confident in understanding its
implications - though perhaps braver and quicker minds are confident.

To judge by the commit message, it's entirely to suit the
		if (!folio_test_lru(folio) && drain_allow) {
line in collect_longterm_unpinnable_folios(). And it is attractive
to associate the PG_lru with the unraised refcount.

But I worry that it may not be to the benefit of others: for example,
page reclaim's isolate_lru_folios() will no longer be able to isolate
folios on the LRU while they're on an fbatch.  Which may be okay (in
the dirty case, pageout() loses interest once it finds refcount raised,
so nothing lost there), or might not be.

It's good to try this, but please don't rush it in.
Thanks,
Hugh



* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2024-07-12  8:56 ` Hugh Dickins
@ 2024-07-12 20:56   ` Andrew Morton
  0 siblings, 0 replies; 12+ messages in thread
From: Andrew Morton @ 2024-07-12 20:56 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: yangge, linux-mm, linux-kernel, 21cnbao, david, baolin.wang,
	aneesh.kumar, liuzixing

On Fri, 12 Jul 2024 01:56:04 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> I have no objection to it going forward to 6.11-rc, but I do object to
> the Cc stable, and its current placing in mm-hotfixes-unstable (if that
> means it's to be rushed into 6.10 final).

No probs, I moved it to mm-unstable and removed cc:stable.

The issue sounds significant so someone might request a -stable
backport in the future, but the workflow doesn't have a way of
expressing that.




* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2024-07-04  6:52 [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch yangge1116
  2024-07-12  8:56 ` Hugh Dickins
@ 2025-03-26 12:42 ` Jinjiang Tu
  2025-03-26 12:46   ` David Hildenbrand
  1 sibling, 1 reply; 12+ messages in thread
From: Jinjiang Tu @ 2025-03-26 12:42 UTC (permalink / raw)
  To: yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, david, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang

Hi,

We noticed a 12.3% performance regression for the LibMicro pwrite testcase due to
commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch").

The testcase is executed as follows, and the file is a tmpfs file.
    pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE

This testcase writes 1KB (only one page) to the tmpfs file and repeats this step many times. The flame
graph shows the performance regression comes from folio_mark_accessed() and workingset_activation().

folio_mark_accessed() is called for the same page many times. Before this patch, each call would
add the page to cpu_fbatches.activate. When the fbatch was full, it was drained and the page
was promoted to the active list. After that, folio_mark_accessed() did nothing in later calls.

But after this patch, the folio's LRU flag is cleared after it is added to cpu_fbatches.activate. From then on,
folio_mark_accessed() will never call folio_activate() again because the page has no LRU flag, so
the fbatch will not become full and the folio will not be marked active; later folio_mark_accessed()
calls will always call workingset_activation(), leading to the performance regression.

In addition, folio_mark_accessed() calls __lru_cache_activate_folio(), which behaves as its
comment describes:

/*
 * Search backwards on the optimistic assumption that the folio being
 * activated has just been added to this batch.
 */

However, after this patch, a folio without the LRU flag may also be in another fbatch, such as
cpu_fbatches.activate, which that search does not cover.
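For reference, the relevant part of folio_mark_accessed() looks roughly like
this (a simplified sketch from memory of mm/swap.c, with the MGLRU and
idle-page handling omitted; the exact code differs between versions):

void folio_mark_accessed(struct folio *folio)
{
	if (!folio_test_referenced(folio)) {
		folio_set_referenced(folio);
	} else if (!folio_test_active(folio) &&
		   !folio_test_unevictable(folio)) {
		/*
		 * With the LRU flag now cleared while the folio sits on
		 * cpu_fbatches.activate, later calls take the
		 * __lru_cache_activate_folio() path, which only searches the
		 * local lru_add batch and never finds the folio. The folio
		 * therefore stays !active, so subsequent accesses keep
		 * re-entering this branch (including workingset_activation())
		 * instead of becoming a no-op once the folio is active.
		 */
		if (folio_test_lru(folio))
			folio_activate(folio);
		else
			__lru_cache_activate_folio(folio);
		folio_clear_referenced(folio);
		workingset_activation(folio);
	}
}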

On 2024/7/4 14:52, yangge1116@126.com wrote:
> From: yangge <yangge1116@126.com>
>
> If a large number of CMA memory are configured in system (for example, the
> CMA memory accounts for 50% of the system memory), starting a virtual
> virtual machine with device passthrough, it will
> call pin_user_pages_remote(..., FOLL_LONGTERM, ...) to pin memory.
> Normally if a page is present and in CMA area, pin_user_pages_remote()
> will migrate the page from CMA area to non-CMA area because of
> FOLL_LONGTERM flag. But the current code will cause the migration failure
> due to unexpected page refcounts, and eventually cause the virtual machine
> fail to start.
>
> If a page is added in LRU batch, its refcount increases one, remove the
> page from LRU batch decreases one. Page migration requires the page is not
> referenced by others except page mapping. Before migrating a page, we
> should try to drain the page from LRU batch in case the page is in it,
> however, folio_test_lru() is not sufficient to tell whether the page is
> in LRU batch or not, if the page is in LRU batch, the migration will fail.
>
> To solve the problem above, we modify the logic of adding to LRU batch.
> Before adding a page to LRU batch, we clear the LRU flag of the page so
> that we can check whether the page is in LRU batch by folio_test_lru(page).
> It's quite valuable, because likely we don't want to blindly drain the LRU
> batch simply because there is some unexpected reference on a page, as
> described above.
>
> This change makes the LRU flag of a page invisible for longer, which
> may impact some programs. For example, as long as a page is on a LRU
> batch, we cannot isolate it, and we cannot check if it's an LRU page.
> Further, a page can now only be on exactly one LRU batch. This doesn't
> seem to matter much, because a new page is allocated from buddy and
> added to the lru batch, or be isolated, it's LRU flag may also be
> invisible for a long time.
>
> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> Cc: <stable@vger.kernel.org>
> Signed-off-by: yangge <yangge1116@126.com>
> ---
>   mm/swap.c | 43 +++++++++++++++++++++++++++++++------------
>   1 file changed, 31 insertions(+), 12 deletions(-)
>
> V4:
>     Adjust commit message according to David's comments
> V3:
>     Add fixes tag
> V2:
>     Adjust code and commit message according to David's comments
>
> diff --git a/mm/swap.c b/mm/swap.c
> index dc205bd..9caf6b0 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -211,10 +211,6 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>   	for (i = 0; i < folio_batch_count(fbatch); i++) {
>   		struct folio *folio = fbatch->folios[i];
>   
> -		/* block memcg migration while the folio moves between lru */
> -		if (move_fn != lru_add_fn && !folio_test_clear_lru(folio))
> -			continue;
> -
>   		folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
>   		move_fn(lruvec, folio);
>   
> @@ -255,11 +251,16 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>   void folio_rotate_reclaimable(struct folio *folio)
>   {
>   	if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
> -	    !folio_test_unevictable(folio) && folio_test_lru(folio)) {
> +	    !folio_test_unevictable(folio)) {
>   		struct folio_batch *fbatch;
>   		unsigned long flags;
>   
>   		folio_get(folio);
> +		if (!folio_test_clear_lru(folio)) {
> +			folio_put(folio);
> +			return;
> +		}
> +
>   		local_lock_irqsave(&lru_rotate.lock, flags);
>   		fbatch = this_cpu_ptr(&lru_rotate.fbatch);
>   		folio_batch_add_and_move(fbatch, folio, lru_move_tail_fn);
> @@ -352,11 +353,15 @@ static void folio_activate_drain(int cpu)
>   
>   void folio_activate(struct folio *folio)
>   {
> -	if (folio_test_lru(folio) && !folio_test_active(folio) &&
> -	    !folio_test_unevictable(folio)) {
> +	if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
>   		struct folio_batch *fbatch;
>   
>   		folio_get(folio);
> +		if (!folio_test_clear_lru(folio)) {
> +			folio_put(folio);
> +			return;
> +		}
> +
>   		local_lock(&cpu_fbatches.lock);
>   		fbatch = this_cpu_ptr(&cpu_fbatches.activate);
>   		folio_batch_add_and_move(fbatch, folio, folio_activate_fn);
> @@ -700,6 +705,11 @@ void deactivate_file_folio(struct folio *folio)
>   		return;
>   
>   	folio_get(folio);
> +	if (!folio_test_clear_lru(folio)) {
> +		folio_put(folio);
> +		return;
> +	}
> +
>   	local_lock(&cpu_fbatches.lock);
>   	fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate_file);
>   	folio_batch_add_and_move(fbatch, folio, lru_deactivate_file_fn);
> @@ -716,11 +726,16 @@ void deactivate_file_folio(struct folio *folio)
>    */
>   void folio_deactivate(struct folio *folio)
>   {
> -	if (folio_test_lru(folio) && !folio_test_unevictable(folio) &&
> -	    (folio_test_active(folio) || lru_gen_enabled())) {
> +	if (!folio_test_unevictable(folio) && (folio_test_active(folio) ||
> +	    lru_gen_enabled())) {
>   		struct folio_batch *fbatch;
>   
>   		folio_get(folio);
> +		if (!folio_test_clear_lru(folio)) {
> +			folio_put(folio);
> +			return;
> +		}
> +
>   		local_lock(&cpu_fbatches.lock);
>   		fbatch = this_cpu_ptr(&cpu_fbatches.lru_deactivate);
>   		folio_batch_add_and_move(fbatch, folio, lru_deactivate_fn);
> @@ -737,12 +752,16 @@ void folio_deactivate(struct folio *folio)
>    */
>   void folio_mark_lazyfree(struct folio *folio)
>   {
> -	if (folio_test_lru(folio) && folio_test_anon(folio) &&
> -	    folio_test_swapbacked(folio) && !folio_test_swapcache(folio) &&
> -	    !folio_test_unevictable(folio)) {
> +	if (folio_test_anon(folio) && folio_test_swapbacked(folio) &&
> +	    !folio_test_swapcache(folio) && !folio_test_unevictable(folio)) {
>   		struct folio_batch *fbatch;
>   
>   		folio_get(folio);
> +		if (!folio_test_clear_lru(folio)) {
> +			folio_put(folio);
> +			return;
> +		}
> +
>   		local_lock(&cpu_fbatches.lock);
>   		fbatch = this_cpu_ptr(&cpu_fbatches.lru_lazyfree);
>   		folio_batch_add_and_move(fbatch, folio, lru_lazyfree_fn);



* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-03-26 12:42 ` Jinjiang Tu
@ 2025-03-26 12:46   ` David Hildenbrand
  2025-03-27 11:16     ` Jinjiang Tu
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2025-03-26 12:46 UTC (permalink / raw)
  To: Jinjiang Tu, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang

On 26.03.25 13:42, Jinjiang Tu wrote:
> Hi,
> 

Hi!

> We notiched a 12.3% performance regression for LibMicro pwrite testcase due to
> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> 
> The testcase is executed as follows, and the file is tmpfs file.
>      pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE

Do we know how much that reflects real workloads? (IOW, how much should 
we care)

> 
> this testcase writes 1KB (only one page) to the tmpfs and repeats this step for many times. The Flame
> graph shows the performance regression comes from folio_mark_accessed() and workingset_activation().
> 
> folio_mark_accessed() is called for the same page for many times. Before this patch, each call will
> add the page to cpu_fbatches.activate. When the fbatch is full, the fbatch is drained and the page
> is promoted to active list. And then, folio_mark_accessed() does nothing in later calls.
> 
> But after this patch, the folio clear lru flags after it is added to cpu_fbatches.activate. After then,
> folio_mark_accessed will never call folio_activate() again due to the page is without lru flag, and
> the fbatch will not be full and the folio will not be marked active, later folio_mark_accessed()
> calls will always call workingset_activation(), leading to performance regression.

Would there be a good place to drain the LRU to effectively get that 
processed? (we can always try draining if the LRU flag is not set)


-- 
Cheers,

David / dhildenb




* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-03-26 12:46   ` David Hildenbrand
@ 2025-03-27 11:16     ` Jinjiang Tu
  2025-04-01 14:33       ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Jinjiang Tu @ 2025-03-27 11:16 UTC (permalink / raw)
  To: David Hildenbrand, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang


On 2025/3/26 20:46, David Hildenbrand wrote:
> On 26.03.25 13:42, Jinjiang Tu wrote:
>> Hi,
>>
>
> Hi!
>
>> We notiched a 12.3% performance regression for LibMicro pwrite 
>> testcase due to
>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before 
>> adding to LRU batch").
>>
>> The testcase is executed as follows, and the file is tmpfs file.
>>      pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>
> Do we know how much that reflects real workloads? (IOW, how much 
> should we care)

No, it's hard to say.

>
>>
>> this testcase writes 1KB (only one page) to the tmpfs and repeats 
>> this step for many times. The Flame
>> graph shows the performance regression comes from 
>> folio_mark_accessed() and workingset_activation().
>>
>> folio_mark_accessed() is called for the same page for many times. 
>> Before this patch, each call will
>> add the page to cpu_fbatches.activate. When the fbatch is full, the 
>> fbatch is drained and the page
>> is promoted to active list. And then, folio_mark_accessed() does 
>> nothing in later calls.
>>
>> But after this patch, the folio clear lru flags after it is added to 
>> cpu_fbatches.activate. After then,
>> folio_mark_accessed will never call folio_activate() again due to the 
>> page is without lru flag, and
>> the fbatch will not be full and the folio will not be marked active, 
>> later folio_mark_accessed()
>> calls will always call workingset_activation(), leading to 
>> performance regression.
>
> Would there be a good place to drain the LRU to effectively get that 
> processed? (we can always try draining if the LRU flag is not set)

Maybe we could search the cpu_fbatches.activate of the local CPU in __lru_cache_activate_folio()? Draining the other fbatches is meaningless.

>
>



* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-03-27 11:16     ` Jinjiang Tu
@ 2025-04-01 14:33       ` David Hildenbrand
  2025-04-07  7:41         ` Jinjiang Tu
  2025-04-08  8:47         ` Jinjiang Tu
  0 siblings, 2 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-04-01 14:33 UTC (permalink / raw)
  To: Jinjiang Tu, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang

On 27.03.25 12:16, Jinjiang Tu wrote:
> 
> On 2025/3/26 20:46, David Hildenbrand wrote:
>> On 26.03.25 13:42, Jinjiang Tu wrote:
>>> Hi,
>>>
>>
>> Hi!
>>
>>> We notiched a 12.3% performance regression for LibMicro pwrite
>>> testcase due to
>>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>>> adding to LRU batch").
>>>
>>> The testcase is executed as follows, and the file is tmpfs file.
>>>       pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>>
>> Do we know how much that reflects real workloads? (IOW, how much
>> should we care)
> 
> No, it's hard to say.
> 
>>
>>>
>>> this testcase writes 1KB (only one page) to the tmpfs and repeats
>>> this step for many times. The Flame
>>> graph shows the performance regression comes from
>>> folio_mark_accessed() and workingset_activation().
>>>
>>> folio_mark_accessed() is called for the same page for many times.
>>> Before this patch, each call will
>>> add the page to cpu_fbatches.activate. When the fbatch is full, the
>>> fbatch is drained and the page
>>> is promoted to active list. And then, folio_mark_accessed() does
>>> nothing in later calls.
>>>
>>> But after this patch, the folio clear lru flags after it is added to
>>> cpu_fbatches.activate. After then,
>>> folio_mark_accessed will never call folio_activate() again due to the
>>> page is without lru flag, and
>>> the fbatch will not be full and the folio will not be marked active,
>>> later folio_mark_accessed()
>>> calls will always call workingset_activation(), leading to
>>> performance regression.
>>
>> Would there be a good place to drain the LRU to effectively get that
>> processed? (we can always try draining if the LRU flag is not set)
> 
> Maybe we could drain the search the cpu_fbatches.activate of the local cpu in __lru_cache_activate_folio()? Drain other fbatches is meaningless .

So the current behavior is that folio_mark_accessed() will end up calling folio_activate()
once, and then __lru_cache_activate_folio() until the LRU was drained (which can
take a looong time).

The old behavior was that folio_mark_accessed() would keep calling folio_activate() until
the LRU was drained simply because it was full of "all the same pages". Only *after*
the LRU was drained, folio_mark_accessed() would actually not do anything (desired behavior).

So the overhead comes primarily from __lru_cache_activate_folio() searching through
the list "more" repeatedly because the LRU does get drained less frequently; and
it would never find it in there in this case.

So ... it used to be suboptimal before, now it's more suboptimal I guess?! :)

We'd need a way to better identify "this folio is already queued for activation". Searching
that list as well would be one option, but the whole "search the list" approach is nasty.

Maybe we can simply set the folio as active when staging it for activation, after having
cleared the LRU flag? We already do that during folio_add.

As the LRU flag was cleared, nobody should be messing with that folio until the cache was
drained and the move was successful.


Pretty sure this doesn't work, but just to throw out an idea:

 From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Tue, 1 Apr 2025 16:31:56 +0200
Subject: [PATCH] test

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/swap.c | 21 ++++++++++++++++-----
  1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index fc8281ef42415..bbf9aa76db87f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
  	folios_put(fbatch);
  }
  
+static void lru_activate(struct lruvec *lruvec, struct folio *folio);
+
  static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
  		struct folio *folio, move_fn_t move_fn,
  		bool on_lru, bool disable_irq)
@@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
  	else
  		local_lock(&cpu_fbatches.lock);
  
+	/* We'll only perform the actual list move deferred. */
+	if (move_fn == lru_activate)
+		folio_set_active(folio);
+
  	if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
  	    lru_cache_disabled())
  		folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
@@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
  {
  	long nr_pages = folio_nr_pages(folio);
  
-	if (folio_test_active(folio) || folio_test_unevictable(folio))
-		return;
-
+	/*
+	 * We set the folio active after clearing the LRU flag, and set the
+	 * LRU flag only after moving it to the right list.
+	 */
+	VM_WARN_ON_ONCE(!folio_test_active(folio));
+	VM_WARN_ON_ONCE(folio_test_unevictable(folio));
  
  	lruvec_del_folio(lruvec, folio);
-	folio_set_active(folio);
  	lruvec_add_folio(lruvec, folio);
  	trace_mm_lru_activate(folio);
  
@@ -342,7 +350,10 @@ void folio_activate(struct folio *folio)
  		return;
  
  	lruvec = folio_lruvec_lock_irq(folio);
-	lru_activate(lruvec, folio);
+	if (!folio_test_unevictable(folio)) {
+		folio_set_active(folio);
+		lru_activate(lruvec, folio);
+	}
  	unlock_page_lruvec_irq(lruvec);
  	folio_set_lru(folio);
  }
-- 
2.48.1


-- 
Cheers,

David / dhildenb




* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-04-01 14:33       ` David Hildenbrand
@ 2025-04-07  7:41         ` Jinjiang Tu
  2025-04-08  8:47         ` Jinjiang Tu
  1 sibling, 0 replies; 12+ messages in thread
From: Jinjiang Tu @ 2025-04-07  7:41 UTC (permalink / raw)
  To: David Hildenbrand, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang


On 2025/4/1 22:33, David Hildenbrand wrote:
> On 27.03.25 12:16, Jinjiang Tu wrote:
>>
>> On 2025/3/26 20:46, David Hildenbrand wrote:
>>> On 26.03.25 13:42, Jinjiang Tu wrote:
>>>> Hi,
>>>>
>>>
>>> Hi!
>>>
>>>> We notiched a 12.3% performance regression for LibMicro pwrite
>>>> testcase due to
>>>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>>>> adding to LRU batch").
>>>>
>>>> The testcase is executed as follows, and the file is tmpfs file.
>>>>       pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>>>
>>> Do we know how much that reflects real workloads? (IOW, how much
>>> should we care)
>>
>> No, it's hard to say.
>>
>>>
>>>>
>>>> this testcase writes 1KB (only one page) to the tmpfs and repeats
>>>> this step for many times. The Flame
>>>> graph shows the performance regression comes from
>>>> folio_mark_accessed() and workingset_activation().
>>>>
>>>> folio_mark_accessed() is called for the same page for many times.
>>>> Before this patch, each call will
>>>> add the page to cpu_fbatches.activate. When the fbatch is full, the
>>>> fbatch is drained and the page
>>>> is promoted to active list. And then, folio_mark_accessed() does
>>>> nothing in later calls.
>>>>
>>>> But after this patch, the folio clear lru flags after it is added to
>>>> cpu_fbatches.activate. After then,
>>>> folio_mark_accessed will never call folio_activate() again due to the
>>>> page is without lru flag, and
>>>> the fbatch will not be full and the folio will not be marked active,
>>>> later folio_mark_accessed()
>>>> calls will always call workingset_activation(), leading to
>>>> performance regression.
>>>
>>> Would there be a good place to drain the LRU to effectively get that
>>> processed? (we can always try draining if the LRU flag is not set)
>>
>> Maybe we could drain the search the cpu_fbatches.activate of the 
>> local cpu in __lru_cache_activate_folio()? Drain other fbatches is 
>> meaningless .
>
> So the current behavior is that folio_mark_accessed() will end up 
> calling folio_activate()
> once, and then __lru_cache_activate_folio() until the LRU was drained 
> (which can
> take a looong time).
>
> The old behavior was that folio_mark_accessed() would keep calling 
> folio_activate() until
> the LRU was drained simply because it was full of "all the same pages" 
> ?. Only *after*
> the LRU was drained, folio_mark_accessed() would actually not do 
> anything (desired behavior).
>
> So the overhead comes primarily from __lru_cache_activate_folio() 
> searching through
> the list "more" repeatedly because the LRU does get drained less 
> frequently; and
> it would never find it in there in this case.
>
> So ... it used to be suboptimal before, now it's more suboptimal I 
> guess?! :)
>
> We'd need a way to better identify "this folio is already queued for 
> activation". Searching
> that list as well would be one option, but the hole "search the list" 
> is nasty.

Sorry for the late reply. I tried searching the activate batch with the following diff, and
the performance regression disappears.

diff --git a/mm/swap.c b/mm/swap.c
index f81c8aa93e67..cfe484ff12e6 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -400,6 +400,21 @@ static void __lru_cache_activate_folio(struct folio *folio)
                 }
         }

+       fbatch = this_cpu_ptr(&cpu_fbatches.activate);
+       for (i = folio_batch_count(fbatch) - 1; i >= 0; i--) {
+               struct folio *batch_folio = fbatch->folios[i];
+
+               if (batch_folio == folio) {
+                       struct lruvec *lruvec;
+                       unsigned long flags;
+
+                       lruvec = folio_lruvec_lock_irqsave(folio, &flags);
+                       folio_activate_fn(lruvec, folio);
+                       unlock_page_lruvec_irqrestore(lruvec, flags);
+                       break;
+               }
+       }
+
         local_unlock(&cpu_fbatches.lock);
  }

>
> Maybe we can simply set the folio as active when staging it for 
> activation, after having
> cleared the LRU flag? We already do that during folio_add.
>
> As the LRU flag was cleared, nobody should be messing with that folio 
> until the cache was
> drained and the move was successful.
>
Yes, anyone who wants to delete the folio from the LRU list should test the LRU flag first.

I looked at some folio_test_active() callers; madvise_cold_or_pageout_pte_range() will
set the workingset flag if the active flag is set. It seems more reasonable to set the
active flag after adding to the activate batch, since adding to the activate batch already
means active.

>
> Pretty sure this doesn't work, but just to throw out an idea:

Thanks, I will try it ASAP.

>
> From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 1 Apr 2025 16:31:56 +0200
> Subject: [PATCH] test
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/swap.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index fc8281ef42415..bbf9aa76db87f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct 
> folio_batch *fbatch, move_fn_t move_fn)
>      folios_put(fbatch);
>  }
>
> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
> +
>  static void __folio_batch_add_and_move(struct folio_batch __percpu 
> *fbatch,
>          struct folio *folio, move_fn_t move_fn,
>          bool on_lru, bool disable_irq)
> @@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct 
> folio_batch __percpu *fbatch,
>      else
>          local_lock(&cpu_fbatches.lock);
>
> +    /* We'll only perform the actual list move deferred. */
> +    if (move_fn == lru_activate)
> +        folio_set_active(folio);
> +
>      if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || 
> folio_test_large(folio) ||
>          lru_cache_disabled())
>          folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
> @@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec, 
> struct folio *folio)
>  {
>      long nr_pages = folio_nr_pages(folio);
>
> -    if (folio_test_active(folio) || folio_test_unevictable(folio))
> -        return;
> -
> +    /*
> +     * We set the folio active after clearing the LRU flag, and set the
> +     * LRU flag only after moving it to the right list.
> +     */
> +    VM_WARN_ON_ONCE(!folio_test_active(folio));
> +    VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>
>      lruvec_del_folio(lruvec, folio);
> -    folio_set_active(folio);
>      lruvec_add_folio(lruvec, folio);
>      trace_mm_lru_activate(folio);
>
> @@ -342,7 +350,10 @@ void folio_activate(struct folio *folio)
>          return;
>
>      lruvec = folio_lruvec_lock_irq(folio);
> -    lru_activate(lruvec, folio);
> +    if (!folio_test_unevictable(folio)) {
> +        folio_set_active(folio);
> +        lru_activate(lruvec, folio);
> +    }
>      unlock_page_lruvec_irq(lruvec);
>      folio_set_lru(folio);
>  }



* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-04-01 14:33       ` David Hildenbrand
  2025-04-07  7:41         ` Jinjiang Tu
@ 2025-04-08  8:47         ` Jinjiang Tu
  2025-04-08 10:04           ` David Hildenbrand
  1 sibling, 1 reply; 12+ messages in thread
From: Jinjiang Tu @ 2025-04-08  8:47 UTC (permalink / raw)
  To: David Hildenbrand, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang


On 2025/4/1 22:33, David Hildenbrand wrote:
> On 27.03.25 12:16, Jinjiang Tu wrote:
>>
>> On 2025/3/26 20:46, David Hildenbrand wrote:
>>> On 26.03.25 13:42, Jinjiang Tu wrote:
>>>> Hi,
>>>>
>>>
>>> Hi!
>>>
>>>> We notiched a 12.3% performance regression for LibMicro pwrite
>>>> testcase due to
>>>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>>>> adding to LRU batch").
>>>>
>>>> The testcase is executed as follows, and the file is tmpfs file.
>>>>       pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>>>
>>> Do we know how much that reflects real workloads? (IOW, how much
>>> should we care)
>>
>> No, it's hard to say.
>>
>>>
>>>>
>>>> this testcase writes 1KB (only one page) to the tmpfs and repeats
>>>> this step for many times. The Flame
>>>> graph shows the performance regression comes from
>>>> folio_mark_accessed() and workingset_activation().
>>>>
>>>> folio_mark_accessed() is called for the same page for many times.
>>>> Before this patch, each call will
>>>> add the page to cpu_fbatches.activate. When the fbatch is full, the
>>>> fbatch is drained and the page
>>>> is promoted to active list. And then, folio_mark_accessed() does
>>>> nothing in later calls.
>>>>
>>>> But after this patch, the folio clear lru flags after it is added to
>>>> cpu_fbatches.activate. After then,
>>>> folio_mark_accessed will never call folio_activate() again due to the
>>>> page is without lru flag, and
>>>> the fbatch will not be full and the folio will not be marked active,
>>>> later folio_mark_accessed()
>>>> calls will always call workingset_activation(), leading to
>>>> performance regression.
>>>
>>> Would there be a good place to drain the LRU to effectively get that
>>> processed? (we can always try draining if the LRU flag is not set)
>>
>> Maybe we could drain the search the cpu_fbatches.activate of the 
>> local cpu in __lru_cache_activate_folio()? Drain other fbatches is 
>> meaningless .
>
> So the current behavior is that folio_mark_accessed() will end up 
> calling folio_activate()
> once, and then __lru_cache_activate_folio() until the LRU was drained 
> (which can
> take a looong time).
>
> The old behavior was that folio_mark_accessed() would keep calling 
> folio_activate() until
> the LRU was drained simply because it was full of "all the same pages" 
> ?. Only *after*
> the LRU was drained, folio_mark_accessed() would actually not do 
> anything (desired behavior).
>
> So the overhead comes primarily from __lru_cache_activate_folio() 
> searching through
> the list "more" repeatedly because the LRU does get drained less 
> frequently; and
> it would never find it in there in this case.
>
> So ... it used to be suboptimal before, now it's more suboptimal I 
> guess?! :)
>
> We'd need a way to better identify "this folio is already queued for 
> activation". Searching
> that list as well would be one option, but the hole "search the list" 
> is nasty.
>
> Maybe we can simply set the folio as active when staging it for 
> activation, after having
> cleared the LRU flag? We already do that during folio_add.
>
> As the LRU flag was cleared, nobody should be messing with that folio 
> until the cache was
> drained and the move was successful.
>
>
> Pretty sure this doesn't work, but just to throw out an idea:
>
> From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Tue, 1 Apr 2025 16:31:56 +0200
> Subject: [PATCH] test
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/swap.c | 21 ++++++++++++++++-----
>  1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index fc8281ef42415..bbf9aa76db87f 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct 
> folio_batch *fbatch, move_fn_t move_fn)
>      folios_put(fbatch);
>  }
>
> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
> +
>  static void __folio_batch_add_and_move(struct folio_batch __percpu 
> *fbatch,
>          struct folio *folio, move_fn_t move_fn,
>          bool on_lru, bool disable_irq)
> @@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct 
> folio_batch __percpu *fbatch,
>      else
>          local_lock(&cpu_fbatches.lock);
>
> +    /* We'll only perform the actual list move deferred. */
> +    if (move_fn == lru_activate)
> +        folio_set_active(folio);
> +
>      if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || 
> folio_test_large(folio) ||
>          lru_cache_disabled())
>          folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
> @@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec, 
> struct folio *folio)
>  {
>      long nr_pages = folio_nr_pages(folio);
>
> -    if (folio_test_active(folio) || folio_test_unevictable(folio))
> -        return;
> -
> +    /*
> +     * We set the folio active after clearing the LRU flag, and set the
> +     * LRU flag only after moving it to the right list.
> +     */
> +    VM_WARN_ON_ONCE(!folio_test_active(folio));
> +    VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>
>      lruvec_del_folio(lruvec, folio);
> -    folio_set_active(folio);
>      lruvec_add_folio(lruvec, folio);
>      trace_mm_lru_activate(folio);
>
> @@ -342,7 +350,10 @@ void folio_activate(struct folio *folio)
>          return;
>
>      lruvec = folio_lruvec_lock_irq(folio);
> -    lru_activate(lruvec, folio);
> +    if (!folio_test_unevictable(folio)) {
> +        folio_set_active(folio);
> +        lru_activate(lruvec, folio);
> +    }
>      unlock_page_lruvec_irq(lruvec);
>      folio_set_lru(folio);
>  }

I tested with the patch, and the performance regression disappears.

By the way, I find that folio_test_unevictable() is checked in lru_deactivate, lru_lazyfree, etc.
The unevictable flag is only set after the caller has cleared the LRU flag. IIUC, since commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
adding to LRU batch"), folios in an fbatch can't have the unevictable flag set, so there is no need to check the unevictable flag in lru_deactivate, lru_lazyfree, etc.?
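If that holds, the runtime checks could presumably be downgraded to debug
assertions, in the same spirit as the lru_activate() change in your patch
above. A hypothetical sketch only (not against a specific tree, function body
abbreviated):

static void lru_deactivate(struct lruvec *lruvec, struct folio *folio)
{
	/*
	 * Hypothetical cleanup: folios are only added to this batch after
	 * folio_test_clear_lru() succeeded, and the unevictable flag is only
	 * set by paths that have themselves cleared the LRU flag, so an
	 * unevictable folio should never show up here.
	 */
	VM_WARN_ON_ONCE(folio_test_unevictable(folio));

	/* ... rest of the deactivation (move to the inactive list) unchanged ... */
}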




* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-04-08  8:47         ` Jinjiang Tu
@ 2025-04-08 10:04           ` David Hildenbrand
  2025-04-08 12:01             ` Jinjiang Tu
  2025-04-10  8:06             ` Jinjiang Tu
  0 siblings, 2 replies; 12+ messages in thread
From: David Hildenbrand @ 2025-04-08 10:04 UTC (permalink / raw)
  To: Jinjiang Tu, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang

On 08.04.25 10:47, Jinjiang Tu wrote:
> 
> On 2025/4/1 22:33, David Hildenbrand wrote:
>> On 27.03.25 12:16, Jinjiang Tu wrote:
>>>
>>> On 2025/3/26 20:46, David Hildenbrand wrote:
>>>> On 26.03.25 13:42, Jinjiang Tu wrote:
>>>>> Hi,
>>>>>
>>>>
>>>> Hi!
>>>>
>>>>> We notiched a 12.3% performance regression for LibMicro pwrite
>>>>> testcase due to
>>>>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>>>>> adding to LRU batch").
>>>>>
>>>>> The testcase is executed as follows, and the file is tmpfs file.
>>>>>        pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>>>>
>>>> Do we know how much that reflects real workloads? (IOW, how much
>>>> should we care)
>>>
>>> No, it's hard to say.
>>>
>>>>
>>>>>
>>>>> this testcase writes 1KB (only one page) to the tmpfs and repeats
>>>>> this step for many times. The Flame
>>>>> graph shows the performance regression comes from
>>>>> folio_mark_accessed() and workingset_activation().
>>>>>
>>>>> folio_mark_accessed() is called for the same page for many times.
>>>>> Before this patch, each call will
>>>>> add the page to cpu_fbatches.activate. When the fbatch is full, the
>>>>> fbatch is drained and the page
>>>>> is promoted to active list. And then, folio_mark_accessed() does
>>>>> nothing in later calls.
>>>>>
>>>>> But after this patch, the folio clear lru flags after it is added to
>>>>> cpu_fbatches.activate. After then,
>>>>> folio_mark_accessed will never call folio_activate() again due to the
>>>>> page is without lru flag, and
>>>>> the fbatch will not be full and the folio will not be marked active,
>>>>> later folio_mark_accessed()
>>>>> calls will always call workingset_activation(), leading to
>>>>> performance regression.
>>>>
>>>> Would there be a good place to drain the LRU to effectively get that
>>>> processed? (we can always try draining if the LRU flag is not set)
>>>
>>> Maybe we could drain the search the cpu_fbatches.activate of the
>>> local cpu in __lru_cache_activate_folio()? Drain other fbatches is
>>> meaningless .
>>
>> So the current behavior is that folio_mark_accessed() will end up
>> calling folio_activate()
>> once, and then __lru_cache_activate_folio() until the LRU was drained
>> (which can
>> take a looong time).
>>
>> The old behavior was that folio_mark_accessed() would keep calling
>> folio_activate() until
>> the LRU was drained simply because it was full of "all the same pages"
>> ?. Only *after*
>> the LRU was drained, folio_mark_accessed() would actually not do
>> anything (desired behavior).
>>
>> So the overhead comes primarily from __lru_cache_activate_folio()
>> searching through
>> the list "more" repeatedly because the LRU does get drained less
>> frequently; and
>> it would never find it in there in this case.
>>
>> So ... it used to be suboptimal before, now it's more suboptimal I
>> guess?! :)
>>
>> We'd need a way to better identify "this folio is already queued for
>> activation". Searching
>> that list as well would be one option, but the hole "search the list"
>> is nasty.
>>
>> Maybe we can simply set the folio as active when staging it for
>> activation, after having
>> cleared the LRU flag? We already do that during folio_add.
>>
>> As the LRU flag was cleared, nobody should be messing with that folio
>> until the cache was
>> drained and the move was successful.
>>
>>
>> Pretty sure this doesn't work, but just to throw out an idea:
>>
>>  From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001
>> From: David Hildenbrand <david@redhat.com>
>> Date: Tue, 1 Apr 2025 16:31:56 +0200
>> Subject: [PATCH] test
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/swap.c | 21 ++++++++++++++++-----
>>   1 file changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index fc8281ef42415..bbf9aa76db87f 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct
>> folio_batch *fbatch, move_fn_t move_fn)
>>       folios_put(fbatch);
>>   }
>>
>> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
>> +
>>   static void __folio_batch_add_and_move(struct folio_batch __percpu
>> *fbatch,
>>           struct folio *folio, move_fn_t move_fn,
>>           bool on_lru, bool disable_irq)
>> @@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct
>> folio_batch __percpu *fbatch,
>>       else
>>           local_lock(&cpu_fbatches.lock);
>>
>> +    /* We'll only perform the actual list move deferred. */
>> +    if (move_fn == lru_activate)
>> +        folio_set_active(folio);
>> +
>>       if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
>> folio_test_large(folio) ||
>>           lru_cache_disabled())
>>           folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
>> @@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec,
>> struct folio *folio)
>>   {
>>       long nr_pages = folio_nr_pages(folio);
>>
>> -    if (folio_test_active(folio) || folio_test_unevictable(folio))
>> -        return;
>> -
>> +    /*
>> +     * We set the folio active after clearing the LRU flag, and set the
>> +     * LRU flag only after moving it to the right list.
>> +     */
>> +    VM_WARN_ON_ONCE(!folio_test_active(folio));
>> +    VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>>
>>       lruvec_del_folio(lruvec, folio);
>> -    folio_set_active(folio);
>>       lruvec_add_folio(lruvec, folio);
>>       trace_mm_lru_activate(folio);
>>
>> @@ -342,7 +350,10 @@ void folio_activate(struct folio *folio)
>>           return;
>>
>>       lruvec = folio_lruvec_lock_irq(folio);
>> -    lru_activate(lruvec, folio);
>> +    if (!folio_test_unevictable(folio)) {
>> +        folio_set_active(folio);
>> +        lru_activate(lruvec, folio);
>> +    }
>>       unlock_page_lruvec_irq(lruvec);
>>       folio_set_lru(folio);
>>   }
> 
> I test with the patch, and the performance regression disappears.
> 
> By the way, I find folio_test_unevictable() is called in lru_deactivate, lru_lazyfree, etc.
> unevictable flag is set when the caller clears lru flag. IIUC, since commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
> adding to LRU batch"), folios in fbatch can't be set unevictable flag, so there is no need to check unevictable flag in lru_deactivate, lru_lazyfree, etc?

I was asking myself the exact same question when crafting this patch. 
Sounds like a cleanup worth investigating! :)

Do you have capacity to look into that, and to turn my proposal into a 
proper patch? It might take me a bit until I get to it.

-- 
Cheers,

David / dhildenb




* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-04-08 10:04           ` David Hildenbrand
@ 2025-04-08 12:01             ` Jinjiang Tu
  2025-04-10  8:06             ` Jinjiang Tu
  1 sibling, 0 replies; 12+ messages in thread
From: Jinjiang Tu @ 2025-04-08 12:01 UTC (permalink / raw)
  To: David Hildenbrand, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang


On 2025/4/8 18:04, David Hildenbrand wrote:
> On 08.04.25 10:47, Jinjiang Tu wrote:
>>
>> On 2025/4/1 22:33, David Hildenbrand wrote:
>>> On 27.03.25 12:16, Jinjiang Tu wrote:
>>>>
>>>> On 2025/3/26 20:46, David Hildenbrand wrote:
>>>>> On 26.03.25 13:42, Jinjiang Tu wrote:
>>>>>> Hi,
>>>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>>> We notiched a 12.3% performance regression for LibMicro pwrite
>>>>>> testcase due to
>>>>>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>>>>>> adding to LRU batch").
>>>>>>
>>>>>> The testcase is executed as follows, and the file is tmpfs file.
>>>>>>        pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f 
>>>>>> $TFILE
>>>>>
>>>>> Do we know how much that reflects real workloads? (IOW, how much
>>>>> should we care)
>>>>
>>>> No, it's hard to say.
>>>>
>>>>>
>>>>>>
>>>>>> this testcase writes 1KB (only one page) to the tmpfs and repeats
>>>>>> this step for many times. The Flame
>>>>>> graph shows the performance regression comes from
>>>>>> folio_mark_accessed() and workingset_activation().
>>>>>>
>>>>>> folio_mark_accessed() is called for the same page for many times.
>>>>>> Before this patch, each call will
>>>>>> add the page to cpu_fbatches.activate. When the fbatch is full, the
>>>>>> fbatch is drained and the page
>>>>>> is promoted to active list. And then, folio_mark_accessed() does
>>>>>> nothing in later calls.
>>>>>>
>>>>>> But after this patch, the folio clear lru flags after it is added to
>>>>>> cpu_fbatches.activate. After then,
>>>>>> folio_mark_accessed will never call folio_activate() again due to 
>>>>>> the
>>>>>> page is without lru flag, and
>>>>>> the fbatch will not be full and the folio will not be marked active,
>>>>>> later folio_mark_accessed()
>>>>>> calls will always call workingset_activation(), leading to
>>>>>> performance regression.
>>>>>
>>>>> Would there be a good place to drain the LRU to effectively get that
>>>>> processed? (we can always try draining if the LRU flag is not set)
>>>>
>>>> Maybe we could drain the search the cpu_fbatches.activate of the
>>>> local cpu in __lru_cache_activate_folio()? Drain other fbatches is
>>>> meaningless .
>>>
>>> So the current behavior is that folio_mark_accessed() will end up
>>> calling folio_activate()
>>> once, and then __lru_cache_activate_folio() until the LRU was drained
>>> (which can
>>> take a looong time).
>>>
>>> The old behavior was that folio_mark_accessed() would keep calling
>>> folio_activate() until
>>> the LRU was drained simply because it was full of "all the same pages"
>>> ?. Only *after*
>>> the LRU was drained, folio_mark_accessed() would actually not do
>>> anything (desired behavior).
>>>
>>> So the overhead comes primarily from __lru_cache_activate_folio()
>>> searching through
>>> the list "more" repeatedly because the LRU does get drained less
>>> frequently; and
>>> it would never find it in there in this case.
>>>
>>> So ... it used to be suboptimal before, now it's more suboptimal I
>>> guess?! :)
>>>
>>> We'd need a way to better identify "this folio is already queued for
>>> activation". Searching
>>> that list as well would be one option, but the hole "search the list"
>>> is nasty.
>>>
>>> Maybe we can simply set the folio as active when staging it for
>>> activation, after having
>>> cleared the LRU flag? We already do that during folio_add.
>>>
>>> As the LRU flag was cleared, nobody should be messing with that folio
>>> until the cache was
>>> drained and the move was successful.
>>>
>>>
>>> Pretty sure this doesn't work, but just to throw out an idea:
>>>
>>>  From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001
>>> From: David Hildenbrand <david@redhat.com>
>>> Date: Tue, 1 Apr 2025 16:31:56 +0200
>>> Subject: [PATCH] test
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/swap.c | 21 ++++++++++++++++-----
>>>   1 file changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index fc8281ef42415..bbf9aa76db87f 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>>>       folios_put(fbatch);
>>>   }
>>>
>>> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
>>> +
>>>   static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>>>           struct folio *folio, move_fn_t move_fn,
>>>           bool on_lru, bool disable_irq)
>>> @@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>>>       else
>>>           local_lock(&cpu_fbatches.lock);
>>>
>>> +    /* We'll only perform the actual list move deferred. */
>>> +    if (move_fn == lru_activate)
>>> +        folio_set_active(folio);
>>> +
>>>       if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
>>>           lru_cache_disabled())
>>>           folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
>>> @@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
>>>   {
>>>       long nr_pages = folio_nr_pages(folio);
>>>
>>> -    if (folio_test_active(folio) || folio_test_unevictable(folio))
>>> -        return;
>>> -
>>> +    /*
>>> +     * We set the folio active after clearing the LRU flag, and set the
>>> +     * LRU flag only after moving it to the right list.
>>> +     */
>>> +    VM_WARN_ON_ONCE(!folio_test_active(folio));
>>> +    VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>>>
>>>       lruvec_del_folio(lruvec, folio);
>>> -    folio_set_active(folio);
>>>       lruvec_add_folio(lruvec, folio);
>>>       trace_mm_lru_activate(folio);
>>>
>>> @@ -342,7 +350,10 @@ void folio_activate(struct folio *folio)
>>>           return;
>>>
>>>       lruvec = folio_lruvec_lock_irq(folio);
>>> -    lru_activate(lruvec, folio);
>>> +    if (!folio_test_unevictable(folio)) {
>>> +        folio_set_active(folio);
>>> +        lru_activate(lruvec, folio);
>>> +    }
>>>       unlock_page_lruvec_irq(lruvec);
>>>       folio_set_lru(folio);
>>>   }
>>
>> I tested with the patch, and the performance regression disappears.
>>
>> By the way, I find that folio_test_unevictable() is called in
>> lru_deactivate, lru_lazyfree, etc. The unevictable flag is only set
>> while the caller has the lru flag cleared. IIUC, since
>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>> adding to LRU batch"), folios in an fbatch can't have the unevictable
>> flag set, so there is no need to check the unevictable flag in
>> lru_deactivate, lru_lazyfree, etc.?
>
> I was asking myself the exact same question when crafting this patch. 
> Sounds like a cleanup worth investigating! :)
>
> Do you have capacity to look into that, and to turn my proposal into a 
> proper patch? It might take me a bit until I get to it.

Sure, I will send a formal patch ASAP.




* Re: [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch
  2025-04-08 10:04           ` David Hildenbrand
  2025-04-08 12:01             ` Jinjiang Tu
@ 2025-04-10  8:06             ` Jinjiang Tu
  1 sibling, 0 replies; 12+ messages in thread
From: Jinjiang Tu @ 2025-04-10  8:06 UTC (permalink / raw)
  To: David Hildenbrand, yangge1116, akpm
  Cc: linux-mm, linux-kernel, stable, 21cnbao, baolin.wang,
	aneesh.kumar, liuzixing, Kefeng Wang


On 2025/4/8 18:04, David Hildenbrand wrote:
> On 08.04.25 10:47, Jinjiang Tu wrote:
>>
>> On 2025/4/1 22:33, David Hildenbrand wrote:
>>> On 27.03.25 12:16, Jinjiang Tu wrote:
>>>>
>>>> On 2025/3/26 20:46, David Hildenbrand wrote:
>>>>> On 26.03.25 13:42, Jinjiang Tu wrote:
>>>>>> Hi,
>>>>>>
>>>>>
>>>>> Hi!
>>>>>
>>>>>> We noticed a 12.3% performance regression for the LibMicro pwrite
>>>>>> testcase due to
>>>>>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>>>>>> adding to LRU batch").
>>>>>>
>>>>>> The testcase is executed as follows, and the file is a tmpfs file.
>>>>>>        pwrite -E -C 200 -L -S -W -N "pwrite_t1k" -s 1k -I 500 -f $TFILE
>>>>>
>>>>> Do we know how much that reflects real workloads? (IOW, how much
>>>>> should we care)
>>>>
>>>> No, it's hard to say.
>>>>
>>>>>
>>>>>>
>>>>>> This testcase writes 1KB (only one page) to the tmpfs file and
>>>>>> repeats this step many times. The flame graph shows the performance
>>>>>> regression comes from folio_mark_accessed() and
>>>>>> workingset_activation().
>>>>>>
>>>>>> folio_mark_accessed() is called for the same page many times.
>>>>>> Before this patch, each call would add the page to
>>>>>> cpu_fbatches.activate. When the fbatch was full, it was drained and
>>>>>> the page was promoted to the active list. From then on,
>>>>>> folio_mark_accessed() did nothing in later calls.
>>>>>>
>>>>>> But after this patch, the folio's lru flag is cleared when it is
>>>>>> added to cpu_fbatches.activate. From then on, folio_mark_accessed()
>>>>>> never calls folio_activate() again because the page lacks the lru
>>>>>> flag, so the fbatch never fills up and the folio is never marked
>>>>>> active; later folio_mark_accessed() calls always fall through to
>>>>>> workingset_activation(), leading to the performance regression.
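
A condensed sketch of the folio_mark_accessed() decision flow described
above (an illustration, not the verbatim mm/swap.c source; several cases
the real function handles are omitted):

        /*
         * Condensed illustration of the flow described above.  After
         * commit 33dfe9204f29 the folio sits in cpu_fbatches.activate
         * without the lru flag, so later calls keep falling back to
         * __lru_cache_activate_folio(), which only searches the local
         * lru_add batch and never finds the folio, followed by
         * workingset_activation().
         */
        void folio_mark_accessed_sketch(struct folio *folio)
        {
                if (!folio_test_referenced(folio)) {
                        folio_set_referenced(folio);
                } else if (!folio_test_active(folio)) {
                        if (folio_test_lru(folio))
                                folio_activate(folio);             /* queue on cpu_fbatches.activate */
                        else
                                __lru_cache_activate_folio(folio); /* search the local lru_add batch */
                        folio_clear_referenced(folio);
                        workingset_activation(folio);              /* hot in the flame graph above */
                }
        }
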
>>>>>
>>>>> Would there be a good place to drain the LRU to effectively get that
>>>>> processed? (we can always try draining if the LRU flag is not set)
>>>>
>>>> Maybe we could search the cpu_fbatches.activate of the
>>>> local cpu in __lru_cache_activate_folio()? Draining other fbatches is
>>>> meaningless.
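
A rough sketch of that per-CPU search idea; the helper below is
hypothetical and the locking around cpu_fbatches.activate is left out, so
treat it as an illustration of the approach rather than a patch.
__lru_cache_activate_folio() could run it over the local
cpu_fbatches.activate batch in addition to the lru_add batch it already
searches:

        /* Hypothetical helper: does this per-CPU batch hold the folio? */
        static bool fbatch_contains_folio(struct folio_batch *fbatch,
                                          struct folio *folio)
        {
                int i;

                /* Search backwards: a just-queued folio is likely near the end. */
                for (i = folio_batch_count(fbatch) - 1; i >= 0; i--)
                        if (fbatch->folios[i] == folio)
                                return true;
                return false;
        }
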
>>>
>>> So the current behavior is that folio_mark_accessed() will end up
>>> calling folio_activate()
>>> once, and then __lru_cache_activate_folio() until the LRU was drained
>>> (which can
>>> take a looong time).
>>>
>>> The old behavior was that folio_mark_accessed() would keep calling
>>> folio_activate() until
>>> the LRU was drained, simply because it was full of "all the same pages".
>>> Only *after*
>>> the LRU was drained would folio_mark_accessed() actually not do
>>> anything (desired behavior).
>>>
>>> So the overhead comes primarily from __lru_cache_activate_folio()
>>> searching through
>>> the list "more" repeatedly because the LRU does get drained less
>>> frequently; and
>>> it would never find it in there in this case.
>>>
>>> So ... it used to be suboptimal before, now it's more suboptimal I
>>> guess?! :)
>>>
>>> We'd need a way to better identify "this folio is already queued for
>>> activation". Searching
>>> that list as well would be one option, but the whole "search the list" approach
>>> is nasty.
>>>
>>> Maybe we can simply set the folio as active when staging it for
>>> activation, after having
>>> cleared the LRU flag? We already do that during folio_add.
>>>
>>> As the LRU flag was cleared, nobody should be messing with that folio
>>> until the cache was
>>> drained and the move was successful.
>>>
>>>
>>> Pretty sure this doesn't work, but just to throw out an idea:
>>>
>>>  From c26e1c0ceda6c818826e5b89dc7c7c9279138f80 Mon Sep 17 00:00:00 2001
>>> From: David Hildenbrand <david@redhat.com>
>>> Date: Tue, 1 Apr 2025 16:31:56 +0200
>>> Subject: [PATCH] test
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/swap.c | 21 ++++++++++++++++-----
>>>   1 file changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/swap.c b/mm/swap.c
>>> index fc8281ef42415..bbf9aa76db87f 100644
>>> --- a/mm/swap.c
>>> +++ b/mm/swap.c
>>> @@ -175,6 +175,8 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
>>>       folios_put(fbatch);
>>>   }
>>>
>>> +static void lru_activate(struct lruvec *lruvec, struct folio *folio);
>>> +
>>>   static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>>>           struct folio *folio, move_fn_t move_fn,
>>>           bool on_lru, bool disable_irq)
>>> @@ -191,6 +193,10 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
>>>       else
>>>           local_lock(&cpu_fbatches.lock);
>>>
>>> +    /* We'll only perform the actual list move deferred. */
>>> +    if (move_fn == lru_activate)
>>> +        folio_set_active(folio);
>>> +
>>>       if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
>>>           lru_cache_disabled())
>>>           folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
>>> @@ -299,12 +305,14 @@ static void lru_activate(struct lruvec *lruvec, struct folio *folio)
>>>   {
>>>       long nr_pages = folio_nr_pages(folio);
>>>
>>> -    if (folio_test_active(folio) || folio_test_unevictable(folio))
>>> -        return;
>>> -
>>> +    /*
>>> +     * We set the folio active after clearing the LRU flag, and set the
>>> +     * LRU flag only after moving it to the right list.
>>> +     */
>>> +    VM_WARN_ON_ONCE(!folio_test_active(folio));
>>> +    VM_WARN_ON_ONCE(folio_test_unevictable(folio));
>>>
>>>       lruvec_del_folio(lruvec, folio);
>>> -    folio_set_active(folio);
>>>       lruvec_add_folio(lruvec, folio);
>>>       trace_mm_lru_activate(folio);
>>>
>>> @@ -342,7 +350,10 @@ void folio_activate(struct folio *folio)
>>>           return;
>>>
>>>       lruvec = folio_lruvec_lock_irq(folio);
>>> -    lru_activate(lruvec, folio);
>>> +    if (!folio_test_unevictable(folio)) {
>>> +        folio_set_active(folio);
>>> +        lru_activate(lruvec, folio);
>>> +    }
>>>       unlock_page_lruvec_irq(lruvec);
>>>       folio_set_lru(folio);
>>>   }
>>
>> I tested with the patch, and the performance regression disappears.
>>
>> By the way, I find that folio_test_unevictable() is called in
>> lru_deactivate, lru_lazyfree, etc. The unevictable flag is only set
>> while the caller has the lru flag cleared. IIUC, since
>> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before
>> adding to LRU batch"), folios in an fbatch can't have the unevictable
>> flag set, so there is no need to check the unevictable flag in
>> lru_deactivate, lru_lazyfree, etc.?
>
> I was asking myself the exact same question when crafting this patch. 
> Sounds like a cleanup worth investigating! :)

folio_activate                 __mlock_folio
  folio_test_unevictable
                               folio_test_clear_lru
                               folio_set_unevictable
                               folio_set_lru
  folio_batch_add_and_move
    folio_test_clear_lru

In the above case, the unevictable flag will be set before the lru flag is
cleared in folio_activate(). So the folio may still be unevictable in
lru_activate().
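
One possible direction, sketched against the proposal quoted earlier:
keep treating unevictable folios as a no-op in lru_activate() instead of
warning about them. This is an untested illustration that ignores
mlock_count bookkeeping and the statistics updates; it is not a reviewed
fix.

        static void lru_activate(struct lruvec *lruvec, struct folio *folio)
        {
                /*
                 * A racing __mlock_folio() already moved the folio to the
                 * unevictable list after folio_activate() checked it; undo
                 * the deferred folio_set_active() and leave the folio where
                 * mlock put it.
                 */
                if (folio_test_unevictable(folio)) {
                        folio_clear_active(folio);
                        return;
                }

                VM_WARN_ON_ONCE(!folio_test_active(folio));

                lruvec_del_folio(lruvec, folio);
                lruvec_add_folio(lruvec, folio);
                trace_mm_lru_activate(folio);
                /* ... PGACTIVATE accounting as in the original function ... */
        }
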

>
> Do you have capacity to look into that, and to turn my proposal into a 
> proper patch? It might take me a bit until I get to it.
>



end of thread

Thread overview: 12+ messages:
2024-07-04  6:52 [PATCH V4] mm/gup: Clear the LRU flag of a page before adding to LRU batch yangge1116
2024-07-12  8:56 ` Hugh Dickins
2024-07-12 20:56   ` Andrew Morton
2025-03-26 12:42 ` Jinjiang Tu
2025-03-26 12:46   ` David Hildenbrand
2025-03-27 11:16     ` Jinjiang Tu
2025-04-01 14:33       ` David Hildenbrand
2025-04-07  7:41         ` Jinjiang Tu
2025-04-08  8:47         ` Jinjiang Tu
2025-04-08 10:04           ` David Hildenbrand
2025-04-08 12:01             ` Jinjiang Tu
2025-04-10  8:06             ` Jinjiang Tu
