All of lore.kernel.org
 help / color / mirror / Atom feed
From: Michal Hocko <mhocko@suse.com>
To: Zhongkun He <hezhongkun.hzk@bytedance.com>
Cc: akpm@linux-foundation.org, mgorman@techsingularity.net,
	hannes@cmpxchg.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, lizefan.x@bytedance.com
Subject: Re: [PATCH V1] mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath()
Date: Thu, 22 Aug 2024 11:00:58 +0200	[thread overview]
Message-ID: <Zsb-SplgW_JizWdE@tiehlicka> (raw)
In-Reply-To: <20240822083842.3167137-1-hezhongkun.hzk@bytedance.com>

On Thu 22-08-24 16:38:42, Zhongkun He wrote:
> I found a problem in my test machine that should_reclaim_retry() do
> not get the right node if i set the cpuset.mems.
> The should_reclaim_retry() and try_to_compact_pages() are iterating
> nodes which are not allowed by cpusets and that makes the retry loop
> happening more than unnecessary.

I would update the problem description because from the above it is not
really clear what the actual problem is.

should_reclaim_retry is not ALLOC_CPUSET aware and that means that it
considers reclaimability of NUMA nodes which are outside of the cpuset.
If other nodes have a lot of reclaimable memory then should_reclaim_retry
would instruct page allocator to retry even though there is no memory
reclaimable on the cpuset nodemask. This is not really a huge problem
because the number of retries without any reclaim progress is bound but
it could be certainly improved. This is a cold path so this shouldn't
really have a measurable impact on performance on most workloads.

> 
> 1.Test step and the machines.
> ------------
> root@vm:/sys/fs/cgroup/test# numactl -H | grep size
> node 0 size: 9477 MB
> node 1 size: 10079 MB
> node 2 size: 10079 MB
> node 3 size: 10078 MB
> 
> root@vm:/sys/fs/cgroup/test# cat cpuset.mems
>     2
> 
> root@vm:/sys/fs/cgroup/test# stress --vm 1 --vm-bytes 12g  --vm-keep
> stress: info: [33430] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
> stress: FAIL: [33430] (425) <-- worker 33431 got signal 9
> stress: WARN: [33430] (427) now reaping child worker processes
> stress: FAIL: [33430] (461) failed run completed in 2s
> 
> 2. reclaim_retry_zone info:
> 
> We can only alloc pages from node=2, but the reclaim_retry_zone is
> node=0 and return true.
> 
> root@vm:/sys/kernel/debug/tracing# cat trace
> stress-33431   [001] ..... 13223.617311: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=1 wmark_check=1
> stress-33431   [001] ..... 13223.617682: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=2 wmark_check=1
> stress-33431   [001] ..... 13223.618103: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=3 wmark_check=1
> stress-33431   [001] ..... 13223.618454: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=4 wmark_check=1
> stress-33431   [001] ..... 13223.618770: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=5 wmark_check=1
> stress-33431   [001] ..... 13223.619150: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=6 wmark_check=1
> stress-33431   [001] ..... 13223.619510: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=7 wmark_check=1
> stress-33431   [001] ..... 13223.619850: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=8 wmark_check=1
> stress-33431   [001] ..... 13223.620171: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=9 wmark_check=1
> stress-33431   [001] ..... 13223.620533: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=10 wmark_check=1
> stress-33431   [001] ..... 13223.620894: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=11 wmark_check=1
> stress-33431   [001] ..... 13223.621224: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=12 wmark_check=1
> stress-33431   [001] ..... 13223.621551: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=13 wmark_check=1
> stress-33431   [001] ..... 13223.621847: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=14 wmark_check=1
> stress-33431   [001] ..... 13223.622200: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=15 wmark_check=1
> stress-33431   [001] ..... 13223.622580: reclaim_retry_zone: node=0 zone=Normal   order=0 reclaimable=4260 available=1772019 min_wmark=5962 no_progress_loops=16 wmark_check=1
> 

You can drop the following

> 3. Root cause:
> Nodemask usually comes from mempolicy in policy_nodemask(), which
> is always NULL unless the memory policy is bind or prefer_many.
> 
> nodemask = NULL
> __alloc_pages_noprof()
> 	prepare_alloc_pages
> 		ac->nodemask = &cpuset_current_mems_allowed;
> 
> 	get_page_from_freelist()
> 
> 	ac.nodemask = nodemask;  /*set  NULL*/
> 
> 	__alloc_pages_slowpath() {
> 		f (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
> 			ac->nodemask = NULL;
> 			ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
> 					ac->highest_zoneidx, ac->nodemask);
> 
> 		/* so ac.nodemask = NULL */
> 	}
> 
> According to the function flow above, we do not have the memory limit to
> follow cpuset.mems, so we need to add it.
> 
> Test result:
> Try 3 times with different cpuset.mems and alloc large memorys than that numa size.
> echo 1 > cpuset.mems
> stress --vm 1 --vm-bytes 12g --vm-hang 0
> ---------------
> echo 2 > cpuset.mems
> stress --vm 1 --vm-bytes 12g --vm-hang 0
> ---------------
> echo 3 > cpuset.mems
> stress --vm 1 --vm-bytes 12g --vm-hang 0
> 
> The retry trace look like:
> stress-2139    [003] .....   666.934104: reclaim_retry_zone: node=1 zone=Normal   order=0 reclaimable=7 available=7355 min_wmark=8598 no_progress_loops=1 wmark_check=0
> stress-2204    [010] .....   695.447393: reclaim_retry_zone: node=2 zone=Normal   order=0 reclaimable=2 available=6916 min_wmark=8598 no_progress_loops=1 wmark_check=0
> stress-2271    [008] .....   725.683058: reclaim_retry_zone: node=3 zone=Normal   order=0 reclaimable=17 available=8079 min_wmark=8597 no_progress_loops=1 wmark_check=0
> 

And only keep this

> With this patch, we can check the right node and get less retry in __alloc_pages_slowpath()
> because there is nothing to do.
> 
> V1:
> Do the same with the page allocator using __cpuset_zone_allowed().
> 
> Suggested-by: Michal Hocko <mhocko@suse.com>
> Signed-off-by: Zhongkun He <hezhongkun.hzk@bytedance.com>

With those changes you can add
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
>  mm/compaction.c | 6 ++++++
>  mm/page_alloc.c | 5 +++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index d1041fbce679..a2b16b08cbbf 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -23,6 +23,7 @@
>  #include <linux/freezer.h>
>  #include <linux/page_owner.h>
>  #include <linux/psi.h>
> +#include <linux/cpuset.h>
>  #include "internal.h"
>  
>  #ifdef CONFIG_COMPACTION
> @@ -2822,6 +2823,11 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  					ac->highest_zoneidx, ac->nodemask) {
>  		enum compact_result status;
>  
> +		if (cpusets_enabled() &&
> +			(alloc_flags & ALLOC_CPUSET) &&
> +			!__cpuset_zone_allowed(zone, gfp_mask))
> +				continue;
> +
>  		if (prio > MIN_COMPACT_PRIORITY
>  					&& compaction_deferred(zone, order)) {
>  			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 29608ca294cf..8a67d760b71a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4128,6 +4128,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
>  		unsigned long min_wmark = min_wmark_pages(zone);
>  		bool wmark;
>  
> +		if (cpusets_enabled() &&
> +			(alloc_flags & ALLOC_CPUSET) &&
> +			!__cpuset_zone_allowed(zone, gfp_mask))
> +				continue;
> +
>  		available = reclaimable = zone_reclaimable_pages(zone);
>  		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  
> -- 
> 2.20.1

-- 
Michal Hocko
SUSE Labs


  reply	other threads:[~2024-08-22  9:01 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-08-22  8:38 [PATCH V1] mm:page_alloc: fix the NULL ac->nodemask in __alloc_pages_slowpath() Zhongkun He
2024-08-22  9:00 ` Michal Hocko [this message]
2024-08-22  9:11   ` [External] " Zhongkun He

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zsb-SplgW_JizWdE@tiehlicka \
    --to=mhocko@suse.com \
    --cc=akpm@linux-foundation.org \
    --cc=hannes@cmpxchg.org \
    --cc=hezhongkun.hzk@bytedance.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan.x@bytedance.com \
    --cc=mgorman@techsingularity.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.