From: Michal Hocko <mhocko@suse.com>
To: "Michal Koutný" <mkoutny@suse.com>
Cc: Leon Huang Fu <leon.huangfu@shopee.com>,
linux-mm@kvack.org, tj@kernel.org, hannes@cmpxchg.org,
roman.gushchin@linux.dev, shakeel.butt@linux.dev,
muchun.song@linux.dev, akpm@linux-foundation.org,
joel.granados@kernel.org, jack@suse.cz, laoar.shao@gmail.com,
mclapinski@google.com, kyle.meyer@hpe.com, corbet@lwn.net,
lance.yang@linux.dev, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
Subject: Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing
Date: Tue, 11 Nov 2025 09:10:38 +0100 [thread overview]
Message-ID: <aRLvfoMKcVEZGSym@tiehlicka> (raw)
In-Reply-To: <ewcsz3553cd6ooslgzwbubnbaxwmpd23d2k7pw5s4ckfvbb7sp@dffffjvohz5b>

On Mon 10-11-25 14:50:11, Michal Koutny wrote:
> Hello Leon.
>
> On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu <leon.huangfu@shopee.com> wrote:
> > Memory cgroup statistics are updated asynchronously with periodic
> > flushing to reduce overhead. The current implementation uses a flush
> > threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for
> > determining when to aggregate per-CPU memory cgroup statistics. On
> > systems with high core counts, this threshold can become very large
> > (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale
> > statistics when userspace reads memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data
> > that is thousands of updates out of date.
> >
> > Introduce a new write-only file, memory.stat_refresh, that allows
> > userspace to explicitly trigger an immediate flush of memory statistics.
>
> I think it's worth thinking twice when introducing a new file like
> this...
>
> > Writing any value to this file forces a synchronous flush via
> > __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its
> > descendants, ensuring that subsequent reads of memory.stat and
> > memory.numa_stat reflect current data.
> >
> > This approach follows the pattern established by /proc/sys/vm/stat_refresh
> > and memory.peak, where the written value is ignored, keeping the
> > interface simple and consistent with existing kernel APIs.
> >
> > Usage example:
> > echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh
> > cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > The feature is available in both cgroup v1 and v2 for consistency.
>
> First, I find the motivation by the testcase (not real world) weak when
> considering such an API change (e.g. real world would be confined to
> fewer CPUs or there'd be other "traffic" causing flushes making this a
> non-issue, we don't know here).

I do agree that the current justification is rather weak.

> Second, this is open to everyone (non-root) who mkdir's their cgroups.
> Then why not make it the default memory.stat behavior? (Tongue-in-cheek,
> but [*].)
>
> With this change, we admit the implementation (async flushing) and leak
> it to the users which is hard to take back. Why should we continue doing
> any implicit in-kernel flushing afterwards?

In theory you are correct, but I think it is also good to recognize the
reality. Keeping accurate stats is _expensive_, and we are always
struggling to strike a balance between accuracy and runtime overhead. Yet
there will always be those couple of special cases that would like to have
precision we do not want to pay for in the general case.

We have already recognized that in the /proc/vmstat case without much
added maintenance burden. This seems a very similar case. If there is a
general consensus that we want to outsource all those special cases into
BPF, then fine (I guess), but I believe the BPF approach is fighting a
completely different problem (data formatting overhead rather than
accuracy).

All that being said, I do agree that we should have a more realistic use
case than an LTP test to justify a new interface. I am personally not
convinced that a BPF-only approach addresses this fundamental
precision-vs-overhead battle.
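
[Editorial aside: the staleness bound under discussion can be reproduced
with simple shell arithmetic. This is only a sketch; MEMCG_CHARGE_BATCH
being 64 and the 256-core figure both come from the quoted changelog.]

```shell
# Flush threshold = MEMCG_CHARGE_BATCH * num_online_cpus()
# Values below are the quoted changelog's 256-core example.
batch=64
cpus=256
echo "threshold: $((batch * cpus)) pending updates"
```

That many per-CPU updates may accumulate before a periodic flush
aggregates them; a write to the proposed memory.stat_refresh would force
the flush immediately.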
--
Michal Hocko
SUSE Labs
Thread overview: 21+ messages
2025-11-10 10:19 [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing Leon Huang Fu
2025-11-10 11:28 ` Michal Hocko
2025-11-11 6:12 ` Leon Huang Fu
2025-11-10 11:52 ` Harry Yoo
2025-11-11 6:12 ` Leon Huang Fu
2025-11-10 13:50 ` Michal Koutný
2025-11-10 16:04 ` Tejun Heo
2025-11-11 6:27 ` Leon Huang Fu
2025-11-11 1:00 ` Chen Ridong
2025-11-11 6:44 ` Leon Huang Fu
2025-11-12 0:56 ` Chen Ridong
2025-11-12 14:02 ` Michal Koutný
2025-11-11 6:13 ` Leon Huang Fu
2025-11-11 18:52 ` Tejun Heo
2025-11-11 19:01 ` Michal Koutný
2025-11-11 8:10 ` Michal Hocko [this message]
2025-11-11 19:10 ` Waiman Long
2025-11-11 19:47 ` Michal Hocko
2025-11-11 20:44 ` Waiman Long
2025-11-11 21:01 ` Michal Hocko
2025-11-12 14:02 ` Michal Koutný