public inbox for linux-mm@kvack.org
From: Li Wang <liwang@redhat.com>
To: Waiman Long <longman@redhat.com>
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Shakeel Butt" <shakeel.butt@linux.dev>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
	"Shuah Khan" <shuah@kernel.org>,
	"Mike Rapoport" <rppt@kernel.org>,
	linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	"Sean Christopherson" <seanjc@google.com>,
	"James Houghton" <jthoughton@google.com>,
	"Sebastian Chlad" <sebastianchlad@gmail.com>,
	"Guopeng Zhang" <zhangguopeng@kylinos.cn>,
	"Li Wang" <liwan@redhat.com>
Subject: Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
Date: Mon, 23 Mar 2026 20:46:42 +0800	[thread overview]
Message-ID: <acE2MoIZ0pl7U7PX@redhat.com> (raw)
In-Reply-To: <20260320204241.1613861-2-longman@redhat.com>

On Fri, Mar 20, 2026 at 04:42:35PM -0400, Waiman Long wrote:
> The vmstats flush threshold currently increases linearly with the
> number of online CPUs. As the number of CPUs increases over time, it
> will become increasingly difficult to meet the threshold and update the
> vmstats data in a timely manner. These days, systems with hundreds of
> CPUs or even thousands of them are becoming more common.
> 
> For example, the test_memcg_sock test of test_memcontrol always fails
> when running on an arm64 system with 128 CPUs. That is because the
> threshold is now 64*128 = 8192. With a 4k page size, meeting it requires
> changes spanning 32 MB of memory. It will be even worse with a larger
> page size like 64k.
> 
> To make the output of memory.stat more accurate, it is better to scale
> up the threshold more slowly than linearly with the number of CPUs. The
> int_sqrt() function is a good compromise, as suggested by Li Wang [1].
> An extra 2 is added to make sure that we double the threshold for
> a 2-core system. The increase will be slower after that.
> 
> With the int_sqrt() scaling, we can use the possibly larger
> num_possible_cpus() instead of num_online_cpus(), which may change at
> run time.
> 
> Although there is supposed to be a periodic and asynchronous flush of
> vmstats every 2 seconds, the actual time lag between successive runs
> can vary quite a bit. In fact, I have seen time lags of up to tens of
> seconds in some cases. So we cannot rely too heavily on the assumption
> that there will be an asynchronous vmstats flush every 2 seconds. This
> may be something we need to look into.
> 
> [1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
> 
> Suggested-by: Li Wang <liwang@redhat.com>
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  mm/memcontrol.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 772bac21d155..cc1fc0f5aeea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -548,20 +548,20 @@ struct memcg_vmstats {
>   *    rstat update tree grow unbounded.
>   *
>   * 2) Flush the stats synchronously on reader side only when there are more than
> - *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
> - *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
> - *    only for 2 seconds due to (1).
> + *    (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. Though this
> + *    optimization will let stats be out of sync by up to that amount. This is
> + *    supposed to last for up to 2 seconds due to (1).
>   */
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static u64 flush_last_time;
> +static int vmstats_flush_threshold __ro_after_init;
>  
>  #define FLUSH_TIME (2UL*HZ)
>  
>  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
>  {
> -	return atomic_read(&vmstats->stats_updates) >
> -		MEMCG_CHARGE_BATCH * num_online_cpus();
> +	return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
>  }
>  
>  static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
> @@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
>  
>  	memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
>  				     SLAB_PANIC | SLAB_HWCACHE_ALIGN);
> +	/*
> +	 * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
> +	 * 2 constant is to make sure that the threshold is double for a 2-core
> +	 * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
> +	 * number of the CPUs reaches the next (2^n - 2) value.
> +	 */
> +	vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
> +				  (int_sqrt(num_possible_cpus() + 2));
>  
>  	return 0;
>  }

Reviewed-by: Li Wang <liwang@redhat.com>

-- 
Regards,
Li Wang


Thread overview: 24+ messages
2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
2026-03-23 12:46   ` Li Wang [this message]
2026-03-24  0:15     ` Yosry Ahmed
2026-03-25 16:47       ` Waiman Long
2026-03-25 17:23         ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-23 12:47   ` Li Wang
2026-03-24  0:17     ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-23  2:53   ` Li Wang
2026-03-23  2:56     ` Li Wang
2026-03-25  3:33     ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-23  8:01   ` Li Wang
2026-03-25 16:42     ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-23  8:24   ` Li Wang
2026-03-25  3:47     ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-23  8:53   ` Li Wang
2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-23  9:44   ` Li Wang
2026-03-21  1:16 ` [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton
