From: JP Kobryn <inwardvessel@gmail.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: mkoutny@suse.com, yosryahmed@google.com, hannes@cmpxchg.org,
tj@kernel.org, akpm@linux-foundation.org,
linux-kernel@vger.kernel.org, cgroups@vger.kernel.org,
kernel-team@meta.com, linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats
Date: Tue, 23 Sep 2025 11:02:48 -0700 [thread overview]
Message-ID: <01fdd968-8b82-4777-88c3-e1dc0c81e9bc@gmail.com> (raw)
In-Reply-To: <ky2yjg6qrqf6hqych7v3usphpcgpcemsmfrb5ephc7bdzxo57b@6cxnzxap3bsc>
On 9/19/25 10:17 PM, Shakeel Butt wrote:
> +linux-mm, bpf
>
> Hi JP,
>
> On Fri, Sep 19, 2025 at 06:55:26PM -0700, JP Kobryn wrote:
>> The kernel has to perform a significant amount of the work when a user mode
>> program reads the memory.stat file of a cgroup. Aside from flushing stats,
>> there is overhead in the string formatting that is done for each stat. Some
>> perf data is shown below from a program that reads memory.stat 1M times:
>>
>> 26.75% a.out [kernel.kallsyms] [k] vsnprintf
>> 19.88% a.out [kernel.kallsyms] [k] format_decode
>> 12.11% a.out [kernel.kallsyms] [k] number
>> 11.72% a.out [kernel.kallsyms] [k] string
>> 8.46% a.out [kernel.kallsyms] [k] strlen
>> 4.22% a.out [kernel.kallsyms] [k] seq_buf_printf
>> 2.79% a.out [kernel.kallsyms] [k] memory_stat_format
>> 1.49% a.out [kernel.kallsyms] [k] put_dec_trunc8
>> 1.45% a.out [kernel.kallsyms] [k] widen_string
>> 1.01% a.out [kernel.kallsyms] [k] memcpy_orig
>>
>> As an alternative to reading memory.stat, introduce new kfuncs to allow
>> fetching specific memcg stats from within bpf iter/cgroup-based programs.
>> Reading stats in this manner avoids the overhead of the string formatting
>> shown above.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>
> Thanks for this but I feel like you are drastically under-selling the
> potential of this work. This will not just reduce the cost of reading
> stats but will also provide a lot of flexibility.
>
> Large infra owners which use cgroup, spent a lot of compute on reading
> stats (I know about Google & Meta) and even small optimizations becomes
> significant at the fleet level.
>
> Your perf profile is focusing only on kernel but I can see similar
> operation in the userspace (i.e. from string to binary format) would be
> happening in the real world workloads. I imagine with bpf we can
> directly pass binary data to userspace or we can do custom serialization
> (like protobuf or thrift) in the bpf program directly.
>
> Beside string formatting, I think you should have seen open()/close() as
> well in your perf profile. In your microbenchmark, did you read
> memory.stat 1M times with the same fd and use lseek(0) between the reads
> or did you open(), read() & close(). If you had done later one, then
> open/close would be visible in the perf data as well. I know Google
> implemented fd caching in their userspacecontainer library to reduce
> their open/close cost. I imagine with this approach, we can avoid this
> cost as well.
In the test program, I opened once and used lseek() at the end of each
iteration. It's a good point though about user programs typically
opening and closing. I'll adjust the test program to resemble that
action.
>
> In terms of flexibility, I can see userspace can get the stats which it
> needs rather than getting all the stats. In addition, userspace can
> avoid flushing stats based on the fact that system is flushing the stats
> every 2 seconds.
That's true. The kfunc for flushing is made available but not required.
>
> In your next version, please also include the sample bpf which uses
> these kfuncs and also include the performance comparison between this
> approach and the traditional reading memory.stat approach.
Thanks for the good input. Will do.
prev parent reply other threads:[~2025-09-23 18:02 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-20 1:55 [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
2025-09-20 5:17 ` Shakeel Butt
2025-09-23 18:02 ` JP Kobryn [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=01fdd968-8b82-4777-88c3-e1dc0c81e9bc@gmail.com \
--to=inwardvessel@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=bpf@vger.kernel.org \
--cc=cgroups@vger.kernel.org \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mkoutny@suse.com \
--cc=shakeel.butt@linux.dev \
--cc=tj@kernel.org \
--cc=yosryahmed@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.