From: JP Kobryn <inwardvessel@gmail.com>
To: Leon Huang Fu <leon.huangfu@shopee.com>
Cc: akpm@linux-foundation.org, cgroups@vger.kernel.org,
corbet@lwn.net, hannes@cmpxchg.org, jack@suse.cz,
joel.granados@kernel.org, kyle.meyer@hpe.com,
lance.yang@linux.dev, laoar.shao@gmail.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, mclapinski@google.com, mhocko@kernel.org,
muchun.song@linux.dev, roman.gushchin@linux.dev,
shakeel.butt@linux.dev
Subject: Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
Date: Mon, 10 Nov 2025 11:24:05 -0800
Message-ID: <51f1f343-c29f-49b5-8016-bbda4bc778a2@gmail.com>
In-Reply-To: <20251110062053.83754-1-leon.huangfu@shopee.com>
On 11/9/25 10:20 PM, Leon Huang Fu wrote:
> On Fri, Nov 7, 2025 at 1:02 AM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
>> On 11/4/25 11:49 PM, Leon Huang Fu wrote:
>>> On high-core count systems, memory cgroup statistics can become stale
>>> due to per-CPU caching and deferred aggregation. Monitoring tools and
>>> management applications sometimes need guaranteed up-to-date statistics
>>> at specific points in time to make accurate decisions.
>>>
>>> This patch adds write handlers to both memory.stat and memory.numa_stat
>>> files to allow userspace to explicitly force an immediate flush of
>>> memory statistics. When "1" is written to either file, it triggers
>>> __mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
>>> all pending statistics for the cgroup and its descendants.
>>>
>>> The write operation validates the input and only accepts the value "1",
>>> returning -EINVAL for any other input.
>>>
>>> Usage example:
>>> # Force immediate flush before reading critical statistics
>>> echo 1 > /sys/fs/cgroup/mygroup/memory.stat
>>> cat /sys/fs/cgroup/mygroup/memory.stat
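For illustration, the flush-then-read sequence in the usage example above could be driven from a C monitoring tool roughly as follows. This is only a sketch under the assumption that the write handler proposed by the patch exists; flush_then_read() is a hypothetical name, and writing to memory.stat is the patch's proposal, not a merged interface.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: write "1" to a stat file to request a flush (the
 * behaviour proposed by the patch), then read the file back into buf.
 * Uses a single read(), so a large stat file may be truncated to len - 1
 * bytes. Returns the number of bytes read, or -1 on error.
 */
static ssize_t flush_then_read(const char *path, char *buf, size_t len)
{
	ssize_t n;
	int fd;

	if (len == 0)
		return -1;

	/* Step 1: request the flush, mirroring `echo 1 > memory.stat`. */
	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	n = write(fd, "1\n", 2);
	close(fd);
	if (n != 2)
		return -1;

	/* Step 2: read the (now flushed) statistics. */
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	n = read(fd, buf, len - 1);
	close(fd);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return n;
}
```

A caller would pass something like "/sys/fs/cgroup/mygroup/memory.stat" and a buffer large enough for the stat output. On a kernel without the patch, the open for writing is expected to fail and the helper returns -1.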
>>>
>>> This provides several benefits:
>>>
>>> 1. On-demand accuracy: Tools can flush only when needed, avoiding
>>> continuous overhead
>>>
>>> 2. Targeted flushing: Allows flushing specific cgroups when precision
>>> is required for particular workloads
>>
>> I'm curious about your use case. Since you mention required precision,
>> are you planning on manually flushing before every read?
>>
>
> Yes, for our use case, manual flushing before critical reads is necessary.
> We're going to run on high-core count servers (224-256 cores), where the
> per-CPU batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus) can
> accumulate up to 16,384 events (on 256 cores) before an automatic flush is
> triggered. This means memory statistics are likely to be stale, often beyond
> the tolerance acceptable for critical memory management decisions.
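As a quick check of the arithmetic above, here is a small sketch of the threshold calculation. It assumes MEMCG_CHARGE_BATCH is 64, which matches include/linux/memcontrol.h as far as I know and is consistent with the 16,384 figure quoted for 256 cores; memcg_flush_threshold() is a made-up helper, not a kernel function.

```c
/*
 * Assumption: MEMCG_CHARGE_BATCH is 64, as in include/linux/memcontrol.h
 * at the time of this thread. Other trees may differ.
 */
#define MEMCG_CHARGE_BATCH 64U

/*
 * Hypothetical helper: number of pending stat updates that can accumulate
 * before the threshold MEMCG_CHARGE_BATCH * num_online_cpus is reached.
 */
static unsigned int memcg_flush_threshold(unsigned int nr_online_cpus)
{
	return MEMCG_CHARGE_BATCH * nr_online_cpus;
}
```

With these assumptions, 256 CPUs give 64 * 256 = 16384 and 224 CPUs give 64 * 224 = 14336, so the whole 224-256 core range mentioned above sits in the same order of magnitude.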
>
> Our monitoring tools don't need to flush on every read - only when making
> critical decisions like OOM adjustments, container placement, or resource
> limit enforcement. The opt-in nature of this mechanism allows us to pay the
> flush cost only when precision is truly required.
>
>>>
>>> 3. Integration flexibility: Monitoring scripts can decide when to pay
>>> the flush cost based on their specific accuracy requirements
>>
>> [...]
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index c34029e92bab..d6a5d872fbcb 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
>>> return 0;
>>> }
>>>
>>> +int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
>>> +{
>>> + if (val != 1)
>>> + return -EINVAL;
>>> +
>>> + if (css)
>>> + css_rstat_flush(css);
>>
>> This is a kfunc. You can do this right now from a bpf program without
>> any kernel changes.
>>
>
> While css_rstat_flush() is indeed available as a BPF kfunc, the practical
> challenge is determining when to call it. The natural hook point would be
> memory_stat_show() using fentry, but this runs into a BPF verifier
> limitation: the function's 'struct seq_file *' argument doesn't provide a
> trusted path to obtain the 'struct cgroup_subsys_state *css' pointer
> required by css_rstat_flush().

Ok, I see this would only work on the css for base stats.

SEC("iter.s/cgroup")
int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;
	struct cgroup_subsys_state *css;

	if (!cgrp)
		return 1;

	/* example of flushing css for base cpu stats
	 * css = container_of(cgrp, struct cgroup_subsys_state, cgroup);
	 * if (!css)
	 *	return 1;
	 * css_rstat_flush(css);
	 */

	/* get css for memcg stats */
	css = cgrp->subsys[memory_cgrp_id];
	if (!css)
		return 1;

	css_rstat_flush(css); <- confirm untrusted pointer arg error
	...

>
> I attempted to implement this via BPF (code below), but it fails
> verification because deriving the css pointer through
> seq->private->kn->parent->priv results in an untrusted scalar that the
> verifier rejects for the kfunc call:
>
> R1 invalid mem access 'scalar'
>
> The verifier error occurs because:
> 1. seq->private is rdonly_untrusted_mem
> 2. Dereferencing through kernfs_node internals produces untracked pointers
> 3. css_rstat_flush() requires a trusted css pointer per its kfunc definition
>
> A direct userspace interface (memory.stat_refresh) avoids these verifier
> limitations and provides a cleaner, more maintainable solution that doesn't
> require BPF expertise or complex workarounds.

That is subjective. Given your use case, and since you mention making
critical decisions, you should have a look at the work being done on
BPF OOM [0][1]. I think you would benefit from that series. For your
case specifically, it provides the ability to flush memcg stats on
demand and also to fetch them.

[0] https://lore.kernel.org/all/20251027231727.472628-1-roman.gushchin@linux.dev/
[1] https://lore.kernel.org/all/20251027232206.473085-2-roman.gushchin@linux.dev/