From: JP Kobryn <inwardvessel@gmail.com>
To: Leon Huang Fu <leon.huangfu@shopee.com>
Cc: akpm@linux-foundation.org, cgroups@vger.kernel.org,
corbet@lwn.net, hannes@cmpxchg.org, jack@suse.cz,
joel.granados@kernel.org, kyle.meyer@hpe.com,
lance.yang@linux.dev, laoar.shao@gmail.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, mclapinski@google.com, mhocko@kernel.org,
muchun.song@linux.dev, roman.gushchin@linux.dev,
shakeel.butt@linux.dev
Subject: Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
Date: Mon, 10 Nov 2025 11:24:05 -0800
Message-ID: <51f1f343-c29f-49b5-8016-bbda4bc778a2@gmail.com>
In-Reply-To: <20251110062053.83754-1-leon.huangfu@shopee.com>
On 11/9/25 10:20 PM, Leon Huang Fu wrote:
> On Fri, Nov 7, 2025 at 1:02 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> On 11/4/25 11:49 PM, Leon Huang Fu wrote:
>>> On high-core count systems, memory cgroup statistics can become stale
>>> due to per-CPU caching and deferred aggregation. Monitoring tools and
>>> management applications sometimes need guaranteed up-to-date statistics
>>> at specific points in time to make accurate decisions.
>>>
>>> This patch adds write handlers to both memory.stat and memory.numa_stat
>>> files to allow userspace to explicitly force an immediate flush of
>>> memory statistics. When "1" is written to either file, it triggers
>>> __mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
>>> all pending statistics for the cgroup and its descendants.
>>>
>>> The write operation validates the input and only accepts the value "1",
>>> returning -EINVAL for any other input.
>>>
>>> Usage example:
>>> # Force immediate flush before reading critical statistics
>>> echo 1 > /sys/fs/cgroup/mygroup/memory.stat
>>> cat /sys/fs/cgroup/mygroup/memory.stat
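For tools that would rather not shell out, the echo/cat sequence above can be sketched in C roughly as follows. This is only an illustration of the interface as proposed in the patch: flush_then_read() is a hypothetical helper name, and the write step assumes a kernel that carries the proposed write handler.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): write "1" to the stat file,
 * which is the flush request in the proposed interface, then read the
 * file's contents back into buf as a NUL-terminated string.
 * Returns the number of bytes read, or -1 on any error.
 */
static ssize_t flush_then_read(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	if (len == 0)
		return -1;

	/* Step 1: request the flush. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, "1", 1);
	close(fd);
	if (n != 1)
		return -1;

	/* Step 2: read the stats, leaving room for the terminator. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return n;
}
```

A caller would pass something like "/sys/fs/cgroup/mygroup/memory.stat" and a buffer large enough for the stat output. On a kernel without the proposed write handler, the open or write step is expected to fail, in which case the sketch returns -1 without reading.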
>>>
>>> This provides several benefits:
>>>
>>> 1. On-demand accuracy: Tools can flush only when needed, avoiding
>>>    continuous overhead
>>>
>>> 2. Targeted flushing: Allows flushing specific cgroups when precision
>>>    is required for particular workloads
>>
>> I'm curious about your use case. Since you mention required precision,
>> are you planning on manually flushing before every read?
>>
>
> Yes, for our use case, manual flushing before critical reads is necessary.
> We're going to run on high-core count servers (224-256 cores), where the
> per-CPU batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus) can
> accumulate up to 16,384 events (on 256 cores) before an automatic flush is
> triggered. This means memory statistics are likely to be stale, often beyond
> the tolerance acceptable for critical memory management decisions.
>
> Our monitoring tools don't need to flush on every read - only when making
> critical decisions like OOM adjustments, container placement, or resource
> limit enforcement. The opt-in nature of this mechanism allows us to pay the
> flush cost only when precision is truly required.
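To make the arithmetic in the paragraph above concrete, here is a minimal sketch of the threshold calculation. It assumes MEMCG_CHARGE_BATCH is 64, which matches the 16,384 figure quoted for 256 cores; flush_threshold() is a hypothetical name used only for illustration, not a kernel function.

```c
#include <stdio.h>

/* Assumption: mirrors the kernel's MEMCG_CHARGE_BATCH value of 64. */
#define MEMCG_CHARGE_BATCH 64U

/*
 * Hypothetical helper: number of pending stat updates that can accumulate
 * across all online CPUs before a non-forced flush is triggered, using the
 * MEMCG_CHARGE_BATCH * num_online_cpus formula quoted above.
 */
static unsigned int flush_threshold(unsigned int online_cpus)
{
	return MEMCG_CHARGE_BATCH * online_cpus;
}
```

Under these assumptions, flush_threshold(256) evaluates to 16384 and flush_threshold(224) to 14336, which is the range of pending updates the 224-256 core servers mentioned above could accumulate.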
>
>>>
>>> 3. Integration flexibility: Monitoring scripts can decide when to pay
>>>    the flush cost based on their specific accuracy requirements
>>
>> [...]
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index c34029e92bab..d6a5d872fbcb 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
>>> 	return 0;
>>> }
>>>
>>> +int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
>>> +{
>>> +	if (val != 1)
>>> +		return -EINVAL;
>>> +
>>> +	if (css)
>>> +		css_rstat_flush(css);
>>
>> This is a kfunc. You can do this right now from a bpf program without
>> any kernel changes.
>>
>
> While css_rstat_flush() is indeed available as a BPF kfunc, the practical
> challenge is determining when to call it. The natural hook point would be
> memory_stat_show() using fentry, but this runs into a BPF verifier
> limitation: the function's 'struct seq_file *' argument doesn't provide a
> trusted path to obtain the 'struct cgroup_subsys_state *css' pointer
> required by css_rstat_flush().
Ok, I see this would only work on the css for base stats.

SEC("iter.s/cgroup")
int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;
	struct cgroup_subsys_state *css;

	if (!cgrp)
		return 1;

	/* example of flushing css for base cpu stats
	 * css = container_of(cgrp, struct cgroup_subsys_state, cgroup);
	 * if (!css)
	 *	return 1;
	 * css_rstat_flush(css);
	 */

	/* get css for memcg stats */
	css = cgrp->subsys[memory_cgrp_id];
	if (!css)
		return 1;
	css_rstat_flush(css); <- confirm untrusted pointer arg error
	...
>
> I attempted to implement this via BPF (code below), but it fails
> verification because deriving the css pointer through
> seq->private->kn->parent->priv results in an untrusted scalar that the
> verifier rejects for the kfunc call:
>
> R1 invalid mem access 'scalar'
>
> The verifier error occurs because:
> 1. seq->private is rdonly_untrusted_mem
> 2. Dereferencing through kernfs_node internals produces untracked pointers
> 3. css_rstat_flush() requires a trusted css pointer per its kfunc definition
>
> A direct userspace interface (memory.stat_refresh) avoids these verifier
> limitations and provides a cleaner, more maintainable solution that doesn't
> require BPF expertise or complex workarounds.
This is subjective. Now that I've heard more about your use case, and
since you mention making critical decisions, you should have a look at
the work being done on BPF OOM [0][1]. I think you would benefit from
this series. Specifically for your case, it provides the ability to flush
a memcg on demand and also to fetch its stats.
[0] https://lore.kernel.org/all/20251027231727.472628-1-roman.gushchin@linux.dev/
[1] https://lore.kernel.org/all/20251027232206.473085-2-roman.gushchin@linux.dev/