From: Li Wang <liwang@redhat.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: Waiman Long <longman@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeel.butt@linux.dev>,
Muchun Song <muchun.song@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Tejun Heo <tj@kernel.org>, Shuah Khan <shuah@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
Sean Christopherson <seanjc@google.com>,
James Houghton <jthoughton@google.com>,
Sebastian Chlad <sebastianchlad@gmail.com>,
Guopeng Zhang <zhangguopeng@kylinos.cn>,
Li Wang <liwan@redhat.com>
Subject: Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
Date: Thu, 2 Apr 2026 17:27:04 +0800
Message-ID: <ac42aIDVaGim5moO@redhat.com>
In-Reply-To: <n6mhkjsxsami3qmczkdh57eep4lmcgbtyl7ox3ajzveke44yf6@m4bjevvsr47k>
Hi Michal,
> Hello Waiman and Li.
> ...
> The explanation seems [1] to just pick a function because log seemed too
> slow.
>
> (We should add a BPF hook to calculate the threshold. Haha.)
>
> The threshold has twofold role: to bound error and to preserve some
> performance thanks to laziness and these two go against each other when
> determining the threshold. The reasoning for linear scaling is that
> _each_ CPU contributes some updates so that preserves the laziness.
> Whereas error capping would hint to no dependency on nr_cpus.
>
> My idea is that a job associated to a selected memcg doesn't necessarily
> run on _all_ CPUs of (such big) machines but effectively cause updates
> on J CPUs. (Either they're artificially constrained or they simply are
> not-so-parallel jobs.)
> Hence the threshold should be based on that J and not actual nr_cpus.
I completely agree on this point.
> Now the question is what is expected (CPU) size of a job and for that
> I'd consider a distribution like:
> - 1 job of size nr_cpus, // you'd overcommit your machine with bigger job
> - 2 jobs of size nr_cpus/2,
> - 3 jobs of size nr_cpus/3,
> - ...
> - nr_cpus jobs of size 1. // you'd underutilize the machine with fewer
>
> Note this is quite naïve and arbitrary deliberation of mine but it
> results in something like Pareto distribution which is IMO quite
> reasonable. With (only) that assumption, I can estimate the average size
> of jobs like
> nr_cpus / (log(nr_cpus) + 1)
> (it's natural logarithm from harmonic series and +1 is from that
> approximation too, it comes handy also on UP)
>
> log(x) = ilog2(x) * log(2)/log(e) ~ ilog2(x) * 0.69
> log(x) ~ 45426 * ilog2(x) / 65536
>
> or
> 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
>
>
> with kernel functions:
> var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
> var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
> var3 = roundup_pow_of_two(var2)
>
> I hope I don't need to present any more numbers at this moment because
> the parameter derivation is backed by solid theory ;-) [*]
> [*]
It is an elegant method, but it is still not based on the J CPUs.

You capture the core tension well: bounding error wants the threshold as
small as possible, while preserving laziness wants it as large as possible,
so any scheme is a compromise between the two.

But there are several practical issues:
The threshold formula is system-wide, while each memcg has its own counter:
they all evaluate against the same MEMCG_CHARGE_BATCH * f(nr_cpu_ids),
with no awareness of how many CPUs are actually active for that particular
memcg. Small tasks with J=2 coexist with large services where J approaches
nr_cpus, yet they all face the same threshold. The ln-harmonic formula
optimizes for the average J, but the workloads that most critically need an
accurate memory.stat are precisely those spanning many CPUs, well above
the average.
Moreover, the "average J" estimate assumes tasks are uniformly distributed
across CPUs, which rarely holds in practice with cpuset constraints, NUMA
affinity, and nested cgroup hierarchies. And even accepting that estimate,
the data shows that ln-harmonic still yields 237MB of error at 2048 CPUs
with 64K pages (the pre-rounding var1 is 237 there, and 237 * BATCH 16 *
64KB ~ 237MB) -- still large enough to cause selftest failures.
In short: the theoretical analysis is sound, but the conclusion conflates
the average case with the worst case. Under the constraint of a single
global threshold, sqrt remains the more robust choice.

In the future, if a per-memcg, J-aware threshold can be achieved, your
ln-harmonic method would be the ideal formula.
To compare the three methods (linear, sqrt, ln-harmonic):
4K page size (BATCH=64):
CPUs linear sqrt ln-var3
--------------------------------
1 256KB 256KB 256KB
2 512KB 512KB 512KB
4 1MB 512KB 512KB
8 2MB 768KB 1MB
16 4MB 1MB 2MB
32 8MB 1.25MB 2MB
64 16MB 2MB 4MB
128 32MB 2.75MB 8MB
256 64MB 4MB 16MB
512 128MB 5.5MB 32MB
1024 256MB 8MB 64MB
2048 512MB 11.25MB 64MB
64K page size (BATCH=16):
CPUs linear sqrt ln-var3
--------------------------------
1 1MB 1MB 1MB
2 2MB 2MB 2MB
4 4MB 2MB 2MB
8 8MB 3MB 4MB
16 16MB 4MB 8MB
32 32MB 5MB 8MB
64 64MB 8MB 16MB
128 128MB 11MB 32MB
256 256MB 16MB 64MB
512 512MB 22MB 128MB
1024 1GB 32MB 256MB
2048 2GB 45MB 256MB
--
Regards,
Li Wang
Thread overview: 27+ messages
2026-03-20 20:42 [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Waiman Long
2026-03-20 20:42 ` [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2) Waiman Long
2026-03-23 12:46 ` Li Wang
2026-03-24 0:15 ` Yosry Ahmed
2026-03-25 16:47 ` Waiman Long
2026-03-25 17:23 ` Yosry Ahmed
2026-04-01 18:41 ` Michal Koutný
2026-04-02 9:27 ` Li Wang [this message]
2026-04-02 10:19 ` Li Wang
2026-03-20 20:42 ` [PATCH v2 2/7] memcg: Scale down MEMCG_CHARGE_BATCH with increase in PAGE_SIZE Waiman Long
2026-03-23 12:47 ` Li Wang
2026-03-24 0:17 ` Yosry Ahmed
2026-03-20 20:42 ` [PATCH v2 3/7] selftests: memcg: Iterate pages based on the actual page size Waiman Long
2026-03-23 2:53 ` Li Wang
2026-03-23 2:56 ` Li Wang
2026-03-25 3:33 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 4/7] selftests: memcg: Increase error tolerance in accordance with " Waiman Long
2026-03-23 8:01 ` Li Wang
2026-03-25 16:42 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 5/7] selftests: memcg: Reduce the expected swap.peak with larger " Waiman Long
2026-03-23 8:24 ` Li Wang
2026-03-25 3:47 ` Waiman Long
2026-03-20 20:42 ` [PATCH v2 6/7] selftests: memcg: Don't call reclaim_until() if already in target Waiman Long
2026-03-23 8:53 ` Li Wang
2026-03-20 20:42 ` [PATCH v2 7/7] selftests: memcg: Treat failure for zeroing sock in test_memcg_sock as XFAIL Waiman Long
2026-03-23 9:44 ` Li Wang
2026-03-21 1:16 ` [PATCH v2 0/7] selftests: memcg: Fix test_memcontrol test failures with large page sizes Andrew Morton