* [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats
@ 2025-09-20 1:55 JP Kobryn
2025-09-20 5:17 ` Shakeel Butt
0 siblings, 1 reply; 3+ messages in thread
From: JP Kobryn @ 2025-09-20 1:55 UTC (permalink / raw)
To: shakeel.butt, mkoutny, yosryahmed, hannes, tj, akpm
Cc: linux-kernel, cgroups, kernel-team
The kernel has to perform a significant amount of work when a user mode
program reads the memory.stat file of a cgroup. Aside from flushing stats,
there is overhead in the string formatting that is done for each stat. Some
perf data is shown below from a program that reads memory.stat 1M times:
26.75% a.out [kernel.kallsyms] [k] vsnprintf
19.88% a.out [kernel.kallsyms] [k] format_decode
12.11% a.out [kernel.kallsyms] [k] number
11.72% a.out [kernel.kallsyms] [k] string
8.46% a.out [kernel.kallsyms] [k] strlen
4.22% a.out [kernel.kallsyms] [k] seq_buf_printf
2.79% a.out [kernel.kallsyms] [k] memory_stat_format
1.49% a.out [kernel.kallsyms] [k] put_dec_trunc8
1.45% a.out [kernel.kallsyms] [k] widen_string
1.01% a.out [kernel.kallsyms] [k] memcpy_orig
As an alternative to reading memory.stat, introduce new kfuncs to allow
fetching specific memcg stats from within bpf iter/cgroup-based programs.
Reading stats in this manner avoids the overhead of the string formatting
shown above.
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8dd7fbed5a94..aa22dc6f47ee 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -870,6 +870,73 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
}
#endif
+static inline struct mem_cgroup *mem_cgroup_from_cgroup(struct cgroup *cgrp)
+{
+ return cgrp ? mem_cgroup_from_css(cgrp->subsys[memory_cgrp_id]) : NULL;
+}
+
+__bpf_kfunc static void cgroup_flush_memcg_stats(struct cgroup *cgrp)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
+
+ if (!memcg)
+ return;
+
+ mem_cgroup_flush_stats(memcg);
+}
+
+__bpf_kfunc static unsigned long node_stat_fetch(struct cgroup *cgrp,
+ enum node_stat_item item)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
+
+ if (!memcg)
+ return 0;
+
+ return memcg_page_state_output(memcg, item);
+}
+
+__bpf_kfunc static unsigned long memcg_stat_fetch(struct cgroup *cgrp,
+ enum memcg_stat_item item)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
+
+ if (!memcg)
+ return 0;
+
+ return memcg_page_state_output(memcg, item);
+}
+
+__bpf_kfunc static unsigned long vm_event_fetch(struct cgroup *cgrp,
+ enum vm_event_item item)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cgroup(cgrp);
+
+ if (!memcg)
+ return 0;
+
+ return memcg_events(memcg, item);
+}
+
+BTF_KFUNCS_START(bpf_memcontrol_kfunc_ids)
+BTF_ID_FLAGS(func, cgroup_flush_memcg_stats)
+BTF_ID_FLAGS(func, node_stat_fetch)
+BTF_ID_FLAGS(func, memcg_stat_fetch)
+BTF_ID_FLAGS(func, vm_event_fetch)
+BTF_KFUNCS_END(bpf_memcontrol_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_memcontrol_kfunc_ids,
+};
+
+static int __init bpf_memcontrol_kfunc_init(void)
+{
+ return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
+ &bpf_memcontrol_kfunc_set);
+}
+late_initcall(bpf_memcontrol_kfunc_init);
+
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
{
/*
--
2.47.3
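For illustration, a consumer of these kfuncs might look like the sketch below: a sleepable cgroup iterator program in the style of the bpf selftests. The kfunc prototypes mirror the patch above, but the surrounding scaffolding (section name, global variables read via the skeleton, choice of stat items) is hypothetical and not a tested program:

```c
/* SPDX-License-Identifier: GPL-2.0 */
/* Illustrative sketch only: a cgroup iterator program reading memcg
 * stats via the proposed kfuncs. Prototypes mirror the patch; the rest
 * follows the conventions of tools/testing/selftests/bpf. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

void cgroup_flush_memcg_stats(struct cgroup *cgrp) __ksym;
unsigned long node_stat_fetch(struct cgroup *cgrp,
			      enum node_stat_item item) __ksym;
unsigned long vm_event_fetch(struct cgroup *cgrp,
			     enum vm_event_item item) __ksym;

/* Read by userspace through the skeleton's bss, as binary data --
 * no string formatting on either side. */
unsigned long anon_bytes;
unsigned long pgfaults;

SEC("iter.s/cgroup")	/* sleepable: the flush below may sleep */
int dump_memcg_stats(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;

	if (!cgrp)
		return 0;

	/* Optional: skip this call to rely on the periodic flush. */
	cgroup_flush_memcg_stats(cgrp);

	anon_bytes = node_stat_fetch(cgrp, NR_ANON_MAPPED);
	pgfaults = vm_event_fetch(cgrp, PGFAULT);
	return 0;
}

char _license[] SEC("license") = "GPL";
```

Userspace would pin or attach this iterator to a cgroup and read the two globals directly, avoiding both the memory.stat formatting path and the string parsing on the consumer side.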
* Re: [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats
2025-09-20 1:55 [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
@ 2025-09-20 5:17 ` Shakeel Butt
2025-09-23 18:02 ` JP Kobryn
0 siblings, 1 reply; 3+ messages in thread
From: Shakeel Butt @ 2025-09-20 5:17 UTC (permalink / raw)
To: JP Kobryn
Cc: mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel, cgroups,
kernel-team, linux-mm, bpf
+linux-mm, bpf
Hi JP,
On Fri, Sep 19, 2025 at 06:55:26PM -0700, JP Kobryn wrote:
> The kernel has to perform a significant amount of work when a user mode
> program reads the memory.stat file of a cgroup. Aside from flushing stats,
> there is overhead in the string formatting that is done for each stat. Some
> perf data is shown below from a program that reads memory.stat 1M times:
>
> 26.75% a.out [kernel.kallsyms] [k] vsnprintf
> 19.88% a.out [kernel.kallsyms] [k] format_decode
> 12.11% a.out [kernel.kallsyms] [k] number
> 11.72% a.out [kernel.kallsyms] [k] string
> 8.46% a.out [kernel.kallsyms] [k] strlen
> 4.22% a.out [kernel.kallsyms] [k] seq_buf_printf
> 2.79% a.out [kernel.kallsyms] [k] memory_stat_format
> 1.49% a.out [kernel.kallsyms] [k] put_dec_trunc8
> 1.45% a.out [kernel.kallsyms] [k] widen_string
> 1.01% a.out [kernel.kallsyms] [k] memcpy_orig
>
> As an alternative to reading memory.stat, introduce new kfuncs to allow
> fetching specific memcg stats from within bpf iter/cgroup-based programs.
> Reading stats in this manner avoids the overhead of the string formatting
> shown above.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Thanks for this, but I feel like you are drastically under-selling the
potential of this work. This will not just reduce the cost of reading
stats but will also provide a lot of flexibility.
Large infra owners that use cgroups spend a lot of compute on reading
stats (I know about Google & Meta), and even small optimizations become
significant at fleet level.
Your perf profile focuses only on the kernel, but I expect a similar
operation in userspace (i.e. converting from string to binary format)
happens in real-world workloads. I imagine with bpf we can pass binary
data directly to userspace, or do custom serialization (like protobuf
or thrift) in the bpf program directly.
Besides string formatting, I think you should have seen open()/close() as
well in your perf profile. In your microbenchmark, did you read
memory.stat 1M times with the same fd and use lseek(0) between the reads,
or did you open(), read() & close() each time? If you did the latter, then
open/close would be visible in the perf data as well. I know Google
implemented fd caching in their userspace container library to reduce
their open/close cost. I imagine with this approach, we can avoid that
cost as well.
In terms of flexibility, userspace can get just the stats it needs
rather than all of them. In addition, userspace can avoid flushing
stats, given that the system already flushes them every 2 seconds.
In your next version, please also include a sample bpf program which uses
these kfuncs, along with a performance comparison between this approach
and the traditional approach of reading memory.stat.
thanks,
Shakeel
* Re: [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats
2025-09-20 5:17 ` Shakeel Butt
@ 2025-09-23 18:02 ` JP Kobryn
0 siblings, 0 replies; 3+ messages in thread
From: JP Kobryn @ 2025-09-23 18:02 UTC (permalink / raw)
To: Shakeel Butt
Cc: mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel, cgroups,
kernel-team, linux-mm, bpf
On 9/19/25 10:17 PM, Shakeel Butt wrote:
> +linux-mm, bpf
>
> Hi JP,
>
> On Fri, Sep 19, 2025 at 06:55:26PM -0700, JP Kobryn wrote:
>> The kernel has to perform a significant amount of work when a user mode
>> program reads the memory.stat file of a cgroup. Aside from flushing stats,
>> there is overhead in the string formatting that is done for each stat. Some
>> perf data is shown below from a program that reads memory.stat 1M times:
>>
>> 26.75% a.out [kernel.kallsyms] [k] vsnprintf
>> 19.88% a.out [kernel.kallsyms] [k] format_decode
>> 12.11% a.out [kernel.kallsyms] [k] number
>> 11.72% a.out [kernel.kallsyms] [k] string
>> 8.46% a.out [kernel.kallsyms] [k] strlen
>> 4.22% a.out [kernel.kallsyms] [k] seq_buf_printf
>> 2.79% a.out [kernel.kallsyms] [k] memory_stat_format
>> 1.49% a.out [kernel.kallsyms] [k] put_dec_trunc8
>> 1.45% a.out [kernel.kallsyms] [k] widen_string
>> 1.01% a.out [kernel.kallsyms] [k] memcpy_orig
>>
>> As an alternative to reading memory.stat, introduce new kfuncs to allow
>> fetching specific memcg stats from within bpf iter/cgroup-based programs.
>> Reading stats in this manner avoids the overhead of the string formatting
>> shown above.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>
> Thanks for this, but I feel like you are drastically under-selling the
> potential of this work. This will not just reduce the cost of reading
> stats but will also provide a lot of flexibility.
>
> Large infra owners that use cgroups spend a lot of compute on reading
> stats (I know about Google & Meta), and even small optimizations become
> significant at fleet level.
>
> Your perf profile focuses only on the kernel, but I expect a similar
> operation in userspace (i.e. converting from string to binary format)
> happens in real-world workloads. I imagine with bpf we can pass binary
> data directly to userspace, or do custom serialization (like protobuf
> or thrift) in the bpf program directly.
>
> Besides string formatting, I think you should have seen open()/close() as
> well in your perf profile. In your microbenchmark, did you read
> memory.stat 1M times with the same fd and use lseek(0) between the reads,
> or did you open(), read() & close() each time? If you did the latter, then
> open/close would be visible in the perf data as well. I know Google
> implemented fd caching in their userspace container library to reduce
> their open/close cost. I imagine with this approach, we can avoid that
> cost as well.
In the test program, I opened once and used lseek() at the end of each
iteration. It's a good point though about user programs typically
opening and closing. I'll adjust the test program to resemble that
action.
>
> In terms of flexibility, userspace can get just the stats it needs
> rather than all of them. In addition, userspace can avoid flushing
> stats, given that the system already flushes them every 2 seconds.
That's true. The kfunc for flushing is made available but not required.
>
> In your next version, please also include a sample bpf program which uses
> these kfuncs, along with a performance comparison between this approach
> and the traditional approach of reading memory.stat.
Thanks for the good input. Will do.