From: Waiman Long <longman@redhat.com>
To: "Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Shakeel Butt" <shakeel.butt@linux.dev>,
"Muchun Song" <muchun.song@linux.dev>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Tejun Heo" <tj@kernel.org>, "Michal Koutný" <mkoutny@suse.com>,
"Shuah Khan" <shuah@kernel.org>,
"Mike Rapoport" <rppt@kernel.org>
Cc: linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Sean Christopherson <seanjc@google.com>,
James Houghton <jthoughton@google.com>,
Sebastian Chlad <sebastianchlad@gmail.com>,
Guopeng Zhang <zhangguopeng@kylinos.cn>,
Li Wang <liwan@redhat.com>, Waiman Long <longman@redhat.com>,
Li Wang <liwang@redhat.com>
Subject: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
Date: Fri, 20 Mar 2026 16:42:35 -0400
Message-ID: <20260320204241.1613861-2-longman@redhat.com>
In-Reply-To: <20260320204241.1613861-1-longman@redhat.com>
The vmstats flush threshold currently scales linearly with the number
of online CPUs. As CPU counts keep growing, it becomes increasingly
difficult to reach the threshold and update the vmstats data in a
timely manner. Systems with hundreds or even thousands of CPUs are
becoming common these days.
For example, the test_memcg_sock test of test_memcontrol always fails
when running on an arm64 system with 128 CPUs, because the threshold
there is 64*128 = 8192 update events. With a 4k page size, that
corresponds to changes in 32 MB of memory; it is even worse with a
larger page size like 64k.
To make the output of memory.stat more accurate, it is better to scale
the threshold up more slowly than linearly with the number of CPUs. The
int_sqrt() function is a good compromise, as suggested by Li Wang [1].
An extra 2 is added to make sure that the threshold is doubled for a
2-core system. The increase is slower after that.
With the int_sqrt() scale, we can use the possibly larger
num_possible_cpus() instead of num_online_cpus(), which may change at
run time.
Although vmstats is supposed to be flushed periodically and
asynchronously every 2 seconds, the actual time lag between successive
runs can vary quite a bit. In fact, I have seen lags of tens of seconds
in some cases, so we cannot rely too much on an asynchronous vmstats
flush happening every 2 seconds. This may be something we need to look
into.
[1] https://lore.kernel.org/lkml/ab0kAE7mJkEL9kWb@redhat.com/
Suggested-by: Li Wang <liwang@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
mm/memcontrol.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 772bac21d155..cc1fc0f5aeea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -548,20 +548,20 @@ struct memcg_vmstats {
* rstat update tree grow unbounded.
*
* 2) Flush the stats synchronously on reader side only when there are more than
- * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
- * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
- * only for 2 seconds due to (1).
+ * (MEMCG_CHARGE_BATCH * int_sqrt(nr_cpus+2)) update events. Though this
+ * optimization will let stats be out of sync by up to that amount. This is
+ * supposed to last for up to 2 seconds due to (1).
*/
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
static u64 flush_last_time;
+static int vmstats_flush_threshold __ro_after_init;
#define FLUSH_TIME (2UL*HZ)
static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
{
- return atomic_read(&vmstats->stats_updates) >
- MEMCG_CHARGE_BATCH * num_online_cpus();
+ return atomic_read(&vmstats->stats_updates) > vmstats_flush_threshold;
}
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
@@ -5191,6 +5191,14 @@ int __init mem_cgroup_init(void)
memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
SLAB_PANIC | SLAB_HWCACHE_ALIGN);
+ /*
+ * Scale up vmstats flush threshold with int_sqrt(nr_cpus+2). The extra
+ * 2 constant is to make sure that the threshold is double for a 2-core
+ * system. After that, it will increase by MEMCG_CHARGE_BATCH when the
number of CPUs reaches the next (n^2 - 2) value.
+ */
+ vmstats_flush_threshold = MEMCG_CHARGE_BATCH *
+ (int_sqrt(num_possible_cpus() + 2));
return 0;
}
--
2.53.0
2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-20 20:42 ` Waiman Long [this message]
2026-03-23 12:46 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Li Wang
2026-03-24 0:15 ` Yosry Ahmed
2026-03-25 16:47 ` Waiman Long
2026-03-25 17:23 ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-23 12:47 ` Li Wang
2026-03-24 0:17 ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-23 2:53 ` Li Wang
2026-03-23 2:56 ` Li Wang
2026-03-25 3:33 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-23 8:01 ` Li Wang
2026-03-25 16:42 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-23 8:24 ` Li Wang
2026-03-25 3:47 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-23 8:53 ` Li Wang
2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-23 9:44 ` Li Wang
2026-03-21 1:16 ` [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton