From: Li Wang <liwang@redhat.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: Waiman Long <longman@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Tejun Heo <tj@kernel.org>, Shuah Khan <shuah@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Sean Christopherson <seanjc@google.com>,
James Houghton <jthoughton@google.com>,
Sebastian Chlad <sebastianchlad@gmail.com>,
Guopeng Zhang <zhangguopeng@kylinos.cn>,
Li Wang <liwan@redhat.com>
Subject: Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
Date: Thu, 2 Apr 2026 17:27:04 +0800
Message-ID: <ac42aIDVaGim5moO@redhat.com>
In-Reply-To: <n6mhkjsxsami3qmczkdh57eep4lmcgbtyl7ox3ajzveke44yf6@m4bjevvsr47k>
Hi Michal,
> Hello Waiman and Li.
> ...
> The explanation seems [1] to just pick a function because log seemed too
> slow.
>
> (We should add a BPF hook to calculate the threshold. Haha.)
>
> The threshold has twofold role: to bound error and to preserve some
> performance thanks to laziness and these two go against each other when
> determining the threshold. The reasoning for linear scaling is that
> _each_ CPU contributes some updates so that preserves the laziness.
> Whereas error capping would hint to no dependency on nr_cpus.
>
> My idea is that a job associated to a selected memcg doesn't necessarily
> run on _all_ CPUs of (such big) machines but effectively cause updates
> on J CPUs. (Either they're artificially constrained or they simply are
> not-so-parallel jobs.)
> Hence the threshold should be based on that J and not actual nr_cpus.
I completely agree on this point.
> Now the question is what is expected (CPU) size of a job and for that
> I'd consider a distribution like:
> - 1 job of size nr_cpus, // you'd overcommit your machine with bigger job
> - 2 jobs of size nr_cpus/2,
> - 3 jobs of size nr_cpus/3,
> - ...
> - nr_cpus jobs of size 1. // you'd underutilize the machine with fewer
>
> Note this is quite naïve and arbitrary deliberation of mine but it
> results in something like Pareto distribution which is IMO quite
> reasonable. With (only) that assumption, I can estimate the average size
> of jobs like
> nr_cpus / (log(nr_cpus) + 1)
> (it's natural logarithm from harmonic series and +1 is from that
> approximation too, it comes handy also on UP)
>
> log(x) = ilog2(x) * log(2)/log(e) ~ ilog2(x) * 0.69
> log(x) ~ 45426 * ilog2(x) / 65536
>
> or
> 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
>
>
> with kernel functions:
> var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
> var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
> var3 = roundup_pow_of_two(var2)
>
> I hope I don't need to present any more numbers at this moment because
> the parameter derivation is backed by solid theory ;-) [*]
> [*]
It is an elegant method, but it is still not based on the J CPUs.

You capture the core tension well: bounding error wants the threshold as
small as possible, while preserving laziness wants it as large as possible,
so any scheme is a compromise between the two.

But there are several practical issues:
The threshold formula is system-wide, while each memcg has its own counter:
they all evaluate against the same MEMCG_CHARGE_BATCH * f(nr_cpu_ids),
with no awareness of how many CPUs are actually active for that particular
memcg. Small tasks with J=2 coexist with large services where J approaches
nr_cpus, yet they all face the same threshold. The ln-harmonic formula
optimizes for the average J, but the workloads that most critically need an
accurate memory.stat are precisely those spanning many CPUs, well above
the average.
Moreover, the "average J" estimate assumes tasks are uniformly distributed
across CPUs, which rarely holds in practice with cpuset constraints, NUMA
affinity, and nested cgroup hierarchies. And even accepting that estimate,
the data shows that ln-harmonic still yields 237MB of error at 2048 CPUs
with 64K pages (the pre-rounding var1 is 237 there, and 237 * BATCH 16 *
64KB ~ 237MB) -- still large enough to cause selftest failures.
In short: the theoretical analysis is sound, but the conclusion conflates
the average case with the worst case. Under the constraint of a single
global threshold, sqrt remains the more robust choice.

In the future, if a per-memcg, J-aware threshold can be achieved, your
ln-harmonic method would be the ideal formula.
To compare the three methods (linear, sqrt, ln-harmonic):
4K page size (BATCH=64):
CPUs linear sqrt ln-var3
--------------------------------
1 256KB 256KB 256KB
2 512KB 512KB 512KB
4 1MB 512KB 512KB
8 2MB 768KB 1MB
16 4MB 1MB 2MB
32 8MB 1.25MB 2MB
64 16MB 2MB 4MB
128 32MB 2.75MB 8MB
256 64MB 4MB 16MB
512 128MB 5.5MB 32MB
1024 256MB 8MB 64MB
2048 512MB 11.25MB 64MB
64K page size (BATCH=16):
CPUs linear sqrt ln-var3
--------------------------------
1 1MB 1MB 1MB
2 2MB 2MB 2MB
4 4MB 2MB 2MB
8 8MB 3MB 4MB
16 16MB 4MB 8MB
32 32MB 5MB 8MB
64 64MB 8MB 16MB
128 128MB 11MB 32MB
256 256MB 16MB 64MB
512 512MB 22MB 128MB
1024 1GB 32MB 256MB
2048 2GB 45MB 256MB
--
Regards,
Li Wang
Thread overview: 27+ messages
2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
2026-03-23 12:46 ` Li Wang
2026-03-24 0:15 ` Yosry Ahmed
2026-03-25 16:47 ` Waiman Long
2026-03-25 17:23 ` Yosry Ahmed
2026-04-01 18:41 ` Michal Koutný
2026-04-02 9:27 ` Li Wang [this message]
2026-04-02 10:19 ` Li Wang
2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-23 12:47 ` Li Wang
2026-03-24 0:17 ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-23 2:53 ` Li Wang
2026-03-23 2:56 ` Li Wang
2026-03-25 3:33 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-23 8:01 ` Li Wang
2026-03-25 16:42 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-23 8:24 ` Li Wang
2026-03-25 3:47 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-23 8:53 ` Li Wang
2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-23 9:44 ` Li Wang
2026-03-21 1:16 ` [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton