[PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
@ 2023-07-07 10:32 Efly Young
  2023-07-07 19:09 ` Andrew Morton
  2023-07-11 15:28 ` Johannes Weiner
  0 siblings, 2 replies; 7+ messages in thread
From: Efly Young @ 2023-07-07 10:32 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel

With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
and swap floods"), proactive reclaim still seems inaccurate.

Our problematic scene also are almost anon pages. Request 1G
by writing memory.reclaim will reclaim 1.7G or other values
more than 1G by swapping.

This try to fix the inaccurate reclaim problem.

Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b..2aea8d9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 	unsigned long nr_to_scan;
 	enum lru_list lru;
 	unsigned long nr_reclaimed = 0;
-	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
+	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);
 	bool proportional_reclaim;
 	struct blk_plug plug;
 
-- 
1.8.3.1



^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
  2023-07-07 10:32 Efly Young
@ 2023-07-07 19:09 ` Andrew Morton
  2023-07-11 15:28 ` Johannes Weiner
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Morton @ 2023-07-07 19:09 UTC (permalink / raw)
  To: Efly Young; +Cc: linux-mm, linux-kernel, Johannes Weiner

(cc hannes)

On Fri, 7 Jul 2023 18:32:26 +0800 Efly Young <yangyifei03@kuaishou.com> wrote:

> With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
> and swap floods"), proactive reclaim still seems inaccurate.
> 
> Our problematic scene also are almost anon pages. Request 1G
> by writing memory.reclaim will reclaim 1.7G or other values
> more than 1G by swapping.
> 
> This try to fix the inaccurate reclaim problem.

It would be helpful to have some additional explanation of why you
believe the current code is incorrect?

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  	unsigned long nr_to_scan;
>  	enum lru_list lru;
>  	unsigned long nr_reclaimed = 0;
> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);
>  	bool proportional_reclaim;
>  	struct blk_plug plug;
>  



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
@ 2023-07-11  9:43 杨逸飞
  0 siblings, 0 replies; 7+ messages in thread
From: 杨逸飞 @ 2023-07-11  9:43 UTC (permalink / raw)
  To: akpm@linux-foundation.org
  Cc: hannes@cmpxchg.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, 杨逸飞

[-- Attachment #1: Type: text/plain, Size: 608 bytes --]

Thank you for your reply. shrink_lruvec() is nested in deep loop. Reclaimer may have already reclaimed part of requested memory in one loop, but before adjust sc->nr_to_reclaim in outer loop, call shrink_lruvec() again will still follow the current sc->nr_to_reclaim to work. It will eventually lead to overreclaim.
Problematic case is easy to be constructed. Allocate lots of anonymous memory(e.g. 20G) in a memcg, then swapping by writing memory.reclaim and there is a certain probability of overreclaim.
(Sorry to find out that there was a problem with the format of the previous reply. Thanks again.)

[-- Attachment #2: Type: text/html, Size: 2864 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
  2023-07-07 10:32 Efly Young
  2023-07-07 19:09 ` Andrew Morton
@ 2023-07-11 15:28 ` Johannes Weiner
  1 sibling, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2023-07-11 15:28 UTC (permalink / raw)
  To: Efly Young; +Cc: akpm, linux-mm, linux-kernel

On Fri, Jul 07, 2023 at 06:32:26PM +0800, Efly Young wrote:
> With commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim
> and swap floods"), proactive reclaim still seems inaccurate.
> 
> Our problematic scene also are almost anon pages. Request 1G
> by writing memory.reclaim will reclaim 1.7G or other values
> more than 1G by swapping.
> 
> This try to fix the inaccurate reclaim problem.

I can see how this happens. Direct and kswapd reclaim have much
smaller nr_to_reclaim targets, so it's less noticable when we loop a
few times. Proactive reclaim can come in with a rather large value.

What does the reproducer setup look like? Are you calling reclaim on a
higher level cgroup with several children? Or is the looping coming
from having multiple zones alone?

> Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9c1c5e8b..2aea8d9 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6208,7 +6208,7 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  	unsigned long nr_to_scan;
>  	enum lru_list lru;
>  	unsigned long nr_reclaimed = 0;
> -	unsigned long nr_to_reclaim = sc->nr_to_reclaim;
> +	unsigned long nr_to_reclaim = (sc->nr_to_reclaim - sc->nr_reclaimed);

This can underflow. shrink_list() eats SWAP_CLUSTER_MAX batches out of
lru_pages >> priority, and only checks reclaimed > to_reclaim
after. This will then disable the bailout mechanism entirely.

In general, I'm not sure this is the best spot to fix the problem:

- During reclaim/compaction, should_continue_reclaim() may decide that
  more reclaim is required before compaction can proceed. But the
  second cycle might not do anything now, since you remember the work
  done by the previous one.

- shrink_node_memcgs() might do the full batch against the first
  cgroup and not touch the second one anymore. This will result in
  super lopsided behavior when you target a tree of multiple groups.

There might be other spots that break, I haven't checked.

You could go through them one by one, of course. But the truth is,
larger reclaim targets are the rare exception. Trying to support them
at the risk of breaking all other reclaim users seems ill-advised.

A better approach might be to just say: "don't call reclaim with large
numbers". Have proactive reclaim code handle the batching into smaller
chunks:

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..4b016806dcc7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6696,7 +6696,7 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			lru_add_drain_all();

 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						nr_to_reclaim - nr_reclaimed,
+						min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
 						GFP_KERNEL, reclaim_options);

 		if (!reclaimed && !nr_retries--)

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
@ 2023-07-20  7:27 Efly Young
  2023-07-20 16:24 ` Johannes Weiner
  0 siblings, 1 reply; 7+ messages in thread
From: Efly Young @ 2023-07-20  7:27 UTC (permalink / raw)
  To: hannes; +Cc: cgroups, linux-mm, linux-kernel, bpf

Before commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim and
swap floods"), proactive reclaim will extreme overreclaim sometimes.
But proactive reclaim still inaccurate and some extent overreclaim.

Problematic case is easy to construct. Allocate lots of anonymous
memory (e.g., 20G) in a memcg, then swapping by writing memory.recalim
and there is a certain probability of overreclaim. For example, request
1G by writing memory.reclaim will eventually reclaim 1.7G or other
values more than 1G.

The reason is that reclaimer may have already reclaimed part of requested
memory in one loop, but before adjust sc->nr_to_reclaim in outer loop,
call shrink_lruvec() again will still follow the current sc->nr_to_reclaim
to work. It will eventually lead to overreclaim. In theory, the amount
of reclaimed would be in [request, 2 * request).

Reclaimer usually tends to reclaim more than request. But either direct
or kswapd reclaim have much smaller nr_to_reclaim targets, so it is
less noticeable and not have much impact.

Proactive reclaim can usually come in with a larger value, so the error
is difficult to ignore. Considering proactive reclaim is usually low
frequency, handle the batching into smaller chunks is a better approach.

Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e24..d36cf88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6741,8 +6741,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			lru_add_drain_all();

 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, reclaim_options);
+					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					GFP_KERNEL, reclaim_options);

 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
  2023-07-20  7:27 [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim Efly Young
@ 2023-07-20 16:24 ` Johannes Weiner
  0 siblings, 0 replies; 7+ messages in thread
From: Johannes Weiner @ 2023-07-20 16:24 UTC (permalink / raw)
  To: Efly Young
  Cc: cgroups, linux-mm, linux-kernel, bpf, Michal Hocko, Shakeel Butt,
	Yosry Ahmed

On Thu, Jul 20, 2023 at 03:27:08PM +0800, Efly Young wrote:
> Before commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim and
> swap floods"), proactive reclaim will extreme overreclaim sometimes.
> But proactive reclaim still inaccurate and some extent overreclaim.
> 
> Problematic case is easy to construct. Allocate lots of anonymous
> memory (e.g., 20G) in a memcg, then swapping by writing memory.recalim
> and there is a certain probability of overreclaim. For example, request
> 1G by writing memory.reclaim will eventually reclaim 1.7G or other
> values more than 1G.
> 
> The reason is that reclaimer may have already reclaimed part of requested
> memory in one loop, but before adjust sc->nr_to_reclaim in outer loop,
> call shrink_lruvec() again will still follow the current sc->nr_to_reclaim
> to work. It will eventually lead to overreclaim. In theory, the amount
> of reclaimed would be in [request, 2 * request).
> 
> Reclaimer usually tends to reclaim more than request. But either direct
> or kswapd reclaim have much smaller nr_to_reclaim targets, so it is
> less noticeable and not have much impact.
> 
> Proactive reclaim can usually come in with a larger value, so the error
> is difficult to ignore. Considering proactive reclaim is usually low
> frequency, handle the batching into smaller chunks is a better approach.
> 
> Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>

Hey, I didn't write the patch, you did :) Please change it to

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>

You can also add

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

[quoting remainder for new CCs]

> ---
>  mm/memcontrol.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4b27e24..d36cf88 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6741,8 +6741,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>  			lru_add_drain_all();
>  
>  		reclaimed = try_to_free_mem_cgroup_pages(memcg,
> -						nr_to_reclaim - nr_reclaimed,
> -						GFP_KERNEL, reclaim_options);
> +					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> +					GFP_KERNEL, reclaim_options);
>  
>  		if (!reclaimed && !nr_retries--)
>  			return -EAGAIN;
> -- 
> 1.8.3.1


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim
@ 2023-07-21  1:41 Efly Young
  0 siblings, 0 replies; 7+ messages in thread
From: Efly Young @ 2023-07-21  1:41 UTC (permalink / raw)
  To: hannes
  Cc: bpf, cgroups, linux-kernel, linux-mm, mhocko, shakeelb,
	yosryahmed, yangyifei03

Before commit f53af4285d77 ("mm: vmscan: fix extreme overreclaim and
swap floods"), proactive reclaim will extreme overreclaim sometimes.
But proactive reclaim still inaccurate and some extent overreclaim.

Problematic case is easy to construct. Allocate lots of anonymous
memory (e.g., 20G) in a memcg, then swapping by writing memory.recalim
and there is a certain probability of overreclaim. For example, request
1G by writing memory.reclaim will eventually reclaim 1.7G or other
values more than 1G.

The reason is that reclaimer may have already reclaimed part of requested
memory in one loop, but before adjust sc->nr_to_reclaim in outer loop,
call shrink_lruvec() again will still follow the current sc->nr_to_reclaim
to work. It will eventually lead to overreclaim. In theory, the amount
of reclaimed would be in [request, 2 * request).

Reclaimer usually tends to reclaim more than request. But either direct
or kswapd reclaim have much smaller nr_to_reclaim targets, so it is
less noticeable and not have much impact.

Proactive reclaim can usually come in with a larger value, so the error
is difficult to ignore. Considering proactive reclaim is usually low
frequency, handle the batching into smaller chunks is a better approach.

Signed-off-by: Efly Young <yangyifei03@kuaishou.com>
Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/memcontrol.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4b27e24..d36cf88 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6741,8 +6741,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			lru_add_drain_all();

 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						nr_to_reclaim - nr_reclaimed,
-						GFP_KERNEL, reclaim_options);
+					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
+					GFP_KERNEL, reclaim_options);

 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
-- 
1.8.3.1

^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2023-07-21  1:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-20  7:27 [PATCH] mm:vmscan: fix inaccurate reclaim during proactive reclaim Efly Young
2023-07-20 16:24 ` Johannes Weiner
  -- strict thread matches above, loose matches on Subject: below --
2023-07-21  1:41 Efly Young
2023-07-11  9:43 杨逸飞
2023-07-07 10:32 Efly Young
2023-07-07 19:09 ` Andrew Morton
2023-07-11 15:28 ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).