* [PATCH v2 0/2] memcg: reading memcg stats more efficiently
@ 2025-10-15 19:08 JP Kobryn
  2025-10-15 19:08 ` [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: JP Kobryn @ 2025-10-15 19:08 UTC (permalink / raw)
  To: shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm
  Cc: linux-kernel, cgroups, linux-mm, bpf, kernel-team

When reading cgroup memory.stat files there is significant kernel overhead
in the formatting and encoding of numeric data into a string buffer. Beyond
that, the given user mode program must decode this data and possibly
perform filtering to obtain the desired stats. This process can be
expensive for programs that periodically sample this data over a large
enough fleet.

As an alternative to reading memory.stat, introduce new kfuncs that allow
fetching specific memcg stats from within cgroup iterator based bpf
programs. This approach allows for numeric values to be transferred
directly from the kernel to user mode via the mapped memory of the bpf
program's elf data section. Reading stats this way effectively eliminates
the numeric conversion work needed to be performed in both kernel and user
mode. It also eliminates the need for filtering in a user mode program.
i.e. where reading memory.stat returns all stats, this new approach allows
returning only select stats.
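
For illustration, here is a trimmed-down sketch of the bpf program from the
selftests in the second patch (the CO-RE enum-existence checks used in the
actual test are omitted and only two of the fetched stats are shown):

	struct memcg_query memcg_query SEC(".data.query");

	SEC("iter.s/cgroup")
	int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
	{
		struct cgroup *cgrp = ctx->cgroup;

		if (!cgrp)
			return 1;

		/* values land directly in the program's data section */
		memcg_query.nr_anon_mapped =
			memcg_node_stat_fetch(cgrp, NR_ANON_MAPPED);
		memcg_query.pgfault = memcg_vm_event_fetch(cgrp, PGFAULT);

		return 0;
	}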

An experiment was set up to compare the performance of a program using these
new kfuncs vs a program that uses the traditional method of reading
memory.stat. On the experimental side, a libbpf based program was written
which sets up a link to the bpf program once in advance and then reuses
this link to create and read from a bpf iterator program for 1M iterations.
Meanwhile on the control side, a program was written to open the root
memory.stat file and repeatedly read 1M times from the associated file
descriptor (while seeking back to zero before each subsequent read). Note
that the program does not bother to decode or filter any data in user mode.
The reason for this is because the experimental program completely removes
the need for this work.
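
Roughly, the control program amounts to the following sketch (the benchmark
source is not part of this series; the path assumes the default cgroup2 mount
and the buffer size is arbitrary):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[8192];
		int fd = open("/sys/fs/cgroup/memory.stat", O_RDONLY);

		for (int i = 0; i < 1000000; i++) {
			lseek(fd, 0, SEEK_SET);
			/* formatted text is read but never decoded or filtered */
			read(fd, buf, sizeof(buf));
		}

		close(fd);
		return 0;
	}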

The results showed a significant perf benefit on the experimental side,
outperforming the control side by a margin of 80% elapsed time in kernel
mode. The kernel overhead of numeric conversion on the control side is
eliminated on the experimental side since the values are read directly
through mapped memory of the bpf program. The experiment data is shown
here:

control: elapsed time
real    0m13.062s
user    0m0.147s
sys     0m12.876s

experiment: elapsed time
real    0m2.717s
user    0m0.175s
sys     0m2.451s

control: perf data
22.23% a.out [kernel.kallsyms] [k] vsnprintf
18.83% a.out [kernel.kallsyms] [k] format_decode
12.05% a.out [kernel.kallsyms] [k] string
11.56% a.out [kernel.kallsyms] [k] number
 7.71% a.out [kernel.kallsyms] [k] strlen
 4.80% a.out [kernel.kallsyms] [k] memcpy_orig
 4.67% a.out [kernel.kallsyms] [k] memory_stat_format
 4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
 2.22% a.out [kernel.kallsyms] [k] widen_string
 1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
 0.95% a.out [kernel.kallsyms] [k] put_dec_full8
 0.69% a.out [kernel.kallsyms] [k] put_dec
 0.69% a.out [kernel.kallsyms] [k] memcpy

experiment: perf data
10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
 7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
 4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
 3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
 2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
 2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
 2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
 2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
 2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
 2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist

Aside from the perf gain, the kfunc/bpf approach provides flexibility in
how memcg data can be delivered to a user mode program. As seen in the
second patch which contains the selftests, it is possible to use a struct
with select memory stat fields. But it is completely up to the programmer
on how to lay out the data.
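
As a reference for the flow described above, here is a sketch of the user mode
side based on the patch 2 skeleton (error handling omitted; cgroup_fd is an
open fd for the cgroup of interest):

	struct cgroup_iter_memcg *skel = cgroup_iter_memcg__open_and_load();
	union bpf_iter_link_info linfo = {
		.cgroup.cgroup_fd = cgroup_fd,
		.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY,
	};
	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
	struct bpf_link *link;
	int iter_fd;

	opts.link_info = &linfo;
	opts.link_info_len = sizeof(linfo);

	/* the link is set up once... */
	link = bpf_program__attach_iter(skel->progs.cgroup_memcg_query, &opts);

	/* ...and reused each time a fresh sample is wanted */
	iter_fd = bpf_iter_create(bpf_link__fd(link));
	read(iter_fd, NULL, 0);	/* runs the bpf program */
	close(iter_fd);

	/* the values are then available through the mapped data section */
	printf("anon mapped: %ld\n",
	       skel->data_query->memcg_query.nr_anon_mapped);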

JP Kobryn (2):
  memcg: introduce kfuncs for fetching memcg stats
  memcg: selftests for memcg stat kfuncs

 mm/memcontrol.c                               |  67 ++++
 .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
 .../bpf/prog_tests/cgroup_iter_memcg.c        | 294 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_iter_memcg.c   |  61 ++++
 4 files changed, 440 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c

-- 
2.47.3



* [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats
  2025-10-15 19:08 [PATCH v2 0/2] memcg: reading memcg stats more efficiently JP Kobryn
@ 2025-10-15 19:08 ` JP Kobryn
  2025-10-15 20:48   ` Shakeel Butt
                     ` (2 more replies)
  2025-10-15 19:08 ` [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs JP Kobryn
  2025-10-15 20:46 ` [PATCH v2 0/2] memcg: reading memcg stats more efficiently Shakeel Butt
  2 siblings, 3 replies; 16+ messages in thread
From: JP Kobryn @ 2025-10-15 19:08 UTC (permalink / raw)
  To: shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm
  Cc: linux-kernel, cgroups, linux-mm, bpf, kernel-team

Reading from the memory.stat file can be expensive because of the string
encoding/decoding and text filtering involved. Introduce three kfuncs for
fetching each type of memcg stat from a bpf program. This allows data to be
transferred directly to userspace, eliminating the need for string
encoding/decoding. It also removes the need for text filtering since it
allows for fetching specific stats.

The patch also includes a kfunc for flushing stats in order to read the
latest values. Note that this is not required for fetching stats, since the
kernel periodically flushes memcg stats. It is left up to the programmer
whether they want more recent stats or not.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..6547c27d4430 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -871,6 +871,73 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 }
 #endif
 
+static inline struct mem_cgroup *memcg_from_cgroup(struct cgroup *cgrp)
+{
+	return cgrp ? mem_cgroup_from_css(cgrp->subsys[memory_cgrp_id]) : NULL;
+}
+
+__bpf_kfunc static void memcg_flush_stats(struct cgroup *cgrp)
+{
+	struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+	if (!memcg)
+		return;
+
+	mem_cgroup_flush_stats(memcg);
+}
+
+__bpf_kfunc static unsigned long memcg_stat_fetch(struct cgroup *cgrp,
+		enum memcg_stat_item item)
+{
+	struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+	if (!memcg)
+		return 0;
+
+	return memcg_page_state_output(memcg, item);
+}
+
+__bpf_kfunc static unsigned long memcg_node_stat_fetch(struct cgroup *cgrp,
+		enum node_stat_item item)
+{
+	struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+	if (!memcg)
+		return 0;
+
+	return memcg_page_state_output(memcg, item);
+}
+
+__bpf_kfunc static unsigned long memcg_vm_event_fetch(struct cgroup *cgrp,
+		enum vm_event_item item)
+{
+	struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
+
+	if (!memcg)
+		return 0;
+
+	return memcg_events(memcg, item);
+}
+
+BTF_KFUNCS_START(bpf_memcontrol_kfunc_ids)
+BTF_ID_FLAGS(func, memcg_flush_stats, KF_TRUSTED_ARGS | KF_SLEEPABLE)
+BTF_ID_FLAGS(func, memcg_stat_fetch, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, memcg_node_stat_fetch, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, memcg_vm_event_fetch, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_memcontrol_kfunc_ids)
+
+static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_memcontrol_kfunc_ids,
+};
+
+static int __init bpf_memcontrol_kfunc_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
+					 &bpf_memcontrol_kfunc_set);
+}
+late_initcall(bpf_memcontrol_kfunc_init);
+
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
 	/*
-- 
2.47.3



* [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs
  2025-10-15 19:08 [PATCH v2 0/2] memcg: reading memcg stats more efficiently JP Kobryn
  2025-10-15 19:08 ` [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
@ 2025-10-15 19:08 ` JP Kobryn
  2025-10-15 23:17   ` Shakeel Butt
  2025-10-16  5:04   ` Yonghong Song
  2025-10-15 20:46 ` [PATCH v2 0/2] memcg: reading memcg stats more efficiently Shakeel Butt
  2 siblings, 2 replies; 16+ messages in thread
From: JP Kobryn @ 2025-10-15 19:08 UTC (permalink / raw)
  To: shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm
  Cc: linux-kernel, cgroups, linux-mm, bpf, kernel-team

Add test coverage for the kfuncs that fetch memcg stats. Using some common
stats, test before and after scenarios ensuring that the given stat
increases by some arbitrary amount. The stats selected cover the three
categories represented by the enums: node_stat_item, memcg_stat_item,
vm_event_item.

Since only a subset of all stats are queried, use a static struct made up
of fields for each stat. Write to the struct with the fetched values when
the bpf program is invoked and read the fields in the user mode program for
verification.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
 .../bpf/prog_tests/cgroup_iter_memcg.c        | 295 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_iter_memcg.c   |  61 ++++
 3 files changed, 374 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c

diff --git a/tools/testing/selftests/bpf/cgroup_iter_memcg.h b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
new file mode 100644
index 000000000000..5f4c6502d9f1
--- /dev/null
+++ b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#ifndef __CGROUP_ITER_MEMCG_H
+#define __CGROUP_ITER_MEMCG_H
+
+struct memcg_query {
+	/* some node_stat_item's */
+	long nr_anon_mapped;
+	long nr_shmem;
+	long nr_file_pages;
+	long nr_file_mapped;
+	/* some memcg_stat_item */
+	long memcg_kmem;
+	/* some vm_event_item */
+	long pgfault;
+};
+
+#endif /* __CGROUP_ITER_MEMCG_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
new file mode 100644
index 000000000000..264dc3c9ec30
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/btf.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include "cgroup_helpers.h"
+#include "cgroup_iter_memcg.h"
+#include "cgroup_iter_memcg.skel.h"
+
+int read_stats(struct bpf_link *link)
+{
+	int fd, ret = 0;
+	ssize_t bytes;
+
+	fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_OK_FD(fd, "bpf_iter_create"))
+		return 1;
+
+	/*
+	 * Invoke iter program by reading from its fd. We're not expecting any
+	 * data to be written by the bpf program so the result should be zero.
+	 * Results will be read directly through the custom data section
+	 * accessible through skel->data_query.memcg_query.
+	 */
+	bytes = read(fd, NULL, 0);
+	if (!ASSERT_EQ(bytes, 0, "read fd"))
+		ret = 1;
+
+	close(fd);
+	return ret;
+}
+
+static void test_anon(struct bpf_link *link,
+		struct memcg_query *memcg_query)
+{
+	void *map;
+	size_t len;
+	long val;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		return;
+
+	val = memcg_query->nr_anon_mapped;
+	if (!ASSERT_GE(val, 0, "initial anon mapped val"))
+		return;
+
+	/*
+	 * Increase memcg anon usage by mapping and writing
+	 * to a new anon region.
+	 */
+	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
+			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap anon"))
+		return;
+
+	memset(map, 1, len);
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup;
+
+	ASSERT_GT(memcg_query->nr_anon_mapped, val, "final anon mapped val");
+
+cleanup:
+	munmap(map, len);
+}
+
+static void test_file(struct bpf_link *link,
+		struct memcg_query *memcg_query)
+{
+	void *map;
+	size_t len;
+	long val_pages, val_mapped;
+	FILE *f;
+	int fd;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		return;
+
+	val_pages = memcg_query->nr_file_pages;
+	if (!ASSERT_GE(val_pages, 0, "initial file val"))
+		return;
+	val_mapped = memcg_query->nr_file_mapped;
+	if (!ASSERT_GE(val_mapped, 0, "initial file mapped val"))
+		return;
+
+	/*
+	 * Increase memcg file usage by creating and writing
+	 * to a temoprary mapped file.
+	 */
+	f = tmpfile();
+	if (!ASSERT_OK_PTR(f, "tmpfile"))
+		return;
+	fd = fileno(f);
+	if (!ASSERT_OK_FD(fd, "open fd"))
+		return;
+	if (!ASSERT_OK(ftruncate(fd, len), "ftruncate"))
+		goto cleanup_fd;
+
+	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
+			MAP_SHARED, fd, 0);
+	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap file"))
+		goto cleanup_fd;
+
+	memset(map, 1, len);
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup_map;
+
+	ASSERT_GT(memcg_query->nr_file_pages, val_pages, "final file value");
+	ASSERT_GT(memcg_query->nr_file_mapped, val_mapped,
+			"final file mapped value");
+
+cleanup_map:
+	munmap(map, len);
+cleanup_fd:
+	close(fd);
+}
+
+static void test_shmem(struct bpf_link *link,
+		struct memcg_query *memcg_query)
+{
+	size_t len;
+	int fd;
+	void *map;
+	long val;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		return;
+
+	val = memcg_query->nr_shmem;
+	if (!ASSERT_GE(val, 0, "init shmem val"))
+		return;
+
+	/*
+	 * Increase memcg shmem usage by creating and writing
+	 * to a shmem object.
+	 */
+	fd = shm_open("/tmp_shmem", O_CREAT | O_RDWR, 0644);
+	if (!ASSERT_OK_FD(fd, "shm_open"))
+		return;
+
+	if (!ASSERT_OK(ftruncate(fd, len), "ftruncate"))
+		goto cleanup_fd;
+
+	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
+			MAP_SHARED, fd, 0);
+	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap shmem"))
+		goto cleanup_fd;
+
+	memset(map, 1, len);
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup_map;
+
+	ASSERT_GT(memcg_query->nr_shmem, val, "final shmem value");
+
+cleanup_map:
+	munmap(map, len);
+cleanup_fd:
+	close(fd);
+	shm_unlink("/tmp_shmem");
+}
+
+static void test_kmem(struct bpf_link *link,
+		struct memcg_query *memcg_query)
+{
+	int fds[2];
+	int err;
+	ssize_t bytes;
+	size_t len;
+	char *buf;
+	long val;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		return;
+
+	val = memcg_query->memcg_kmem;
+	if (!ASSERT_GE(val, 0, "initial kmem val"))
+		return;
+
+	err = pipe2(fds, O_NONBLOCK);
+	if (!ASSERT_OK(err, "pipe"))
+		return;
+
+	buf = malloc(len);
+	memset(buf, 1, len);
+	bytes = write(fds[1], buf, len);
+	if (!ASSERT_GT(bytes, 0, "write"))
+		goto cleanup;
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup;
+
+	ASSERT_GT(memcg_query->memcg_kmem, val, "kmem value");
+
+cleanup:
+	free(buf);
+	close(fds[0]);
+	close(fds[1]);
+}
+
+static void test_pgfault(struct bpf_link *link,
+		struct memcg_query *memcg_query)
+{
+	void *map;
+	size_t len;
+	long val;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		return;
+
+	val = memcg_query->pgfault;
+	if (!ASSERT_GE(val, 0, "initial pgfault val"))
+		return;
+
+	/* Create region to use for triggering a page fault. */
+	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
+			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap anon"))
+		return;
+
+	/* Trigger page fault. */
+	memset(map, 1, len);
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup;
+
+	ASSERT_GT(memcg_query->pgfault, val, "final pgfault val");
+
+cleanup:
+	munmap(map, len);
+}
+
+void test_cgroup_iter_memcg(void)
+{
+	char *cgroup_rel_path = "/cgroup_iter_memcg_test";
+	struct cgroup_iter_memcg *skel;
+	struct bpf_link *link;
+	int cgroup_fd, err;
+
+	cgroup_fd = cgroup_setup_and_join(cgroup_rel_path);
+	if (!ASSERT_OK_FD(cgroup_fd, "cgroup_setup_and_join"))
+		return;
+
+	skel = cgroup_iter_memcg__open();
+	if (!ASSERT_OK_PTR(skel, "cgroup_iter_memcg__open"))
+		goto cleanup_cgroup_fd;
+
+	err = cgroup_iter_memcg__load(skel);
+	if (!ASSERT_OK(err, "cgroup_iter_memcg__load"))
+		goto cleanup_skel;
+
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {
+		.cgroup.cgroup_fd = cgroup_fd,
+		.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY,
+	};
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(skel->progs.cgroup_memcg_query, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
+		goto cleanup_cgroup_fd;
+
+	if (test__start_subtest("cgroup_iter_memcg__anon"))
+		test_anon(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__shmem"))
+		test_shmem(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__file"))
+		test_file(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__kmem"))
+		test_kmem(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__pgfault"))
+		test_pgfault(link, &skel->data_query->memcg_query);
+
+	bpf_link__destroy(link);
+cleanup_skel:
+	cgroup_iter_memcg__destroy(skel);
+cleanup_cgroup_fd:
+	close(cgroup_fd);
+	cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
new file mode 100644
index 000000000000..0d913d72b68d
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_core_read.h>
+#include "cgroup_iter_memcg.h"
+
+char _license[] SEC("license") = "GPL";
+
+extern void memcg_flush_stats(struct cgroup *cgrp) __ksym;
+extern unsigned long memcg_stat_fetch(struct cgroup *cgrp,
+		enum memcg_stat_item item) __ksym;
+extern unsigned long memcg_node_stat_fetch(struct cgroup *cgrp,
+		enum node_stat_item item) __ksym;
+extern unsigned long memcg_vm_event_fetch(struct cgroup *cgrp,
+		enum vm_event_item item) __ksym;
+
+/* The latest values read are stored here. */
+struct memcg_query memcg_query SEC(".data.query");
+
+/*
+ * Helpers for fetching any of the three different types of memcg stats.
+ * BPF core macros are used to ensure an enumerator is present in the given
+ * kernel. Falling back on -1 indicates its absence.
+ */
+#define node_stat_fetch_if_exists(cgrp, item) \
+	bpf_core_enum_value_exists(enum node_stat_item, item) ? \
+		memcg_node_stat_fetch((cgrp), bpf_core_enum_value( \
+					 enum node_stat_item, item)) : -1
+
+#define memcg_stat_fetch_if_exists(cgrp, item) \
+	bpf_core_enum_value_exists(enum memcg_stat_item, item) ? \
+		memcg_stat_fetch((cgrp), bpf_core_enum_value( \
+					 enum memcg_stat_item, item)) : -1
+
+#define vm_event_fetch_if_exists(cgrp, item) \
+	bpf_core_enum_value_exists(enum vm_event_item, item) ? \
+		memcg_vm_event_fetch((cgrp), bpf_core_enum_value( \
+					 enum vm_event_item, item)) : -1
+
+SEC("iter.s/cgroup")
+int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
+{
+	struct cgroup *cgrp = ctx->cgroup;
+
+	if (!cgrp)
+		return 1;
+
+	memcg_flush_stats(cgrp);
+
+	memcg_query.nr_anon_mapped = node_stat_fetch_if_exists(cgrp,
+			NR_ANON_MAPPED);
+	memcg_query.nr_shmem = node_stat_fetch_if_exists(cgrp, NR_SHMEM);
+	memcg_query.nr_file_pages = node_stat_fetch_if_exists(cgrp,
+			NR_FILE_PAGES);
+	memcg_query.nr_file_mapped = node_stat_fetch_if_exists(cgrp,
+			NR_FILE_MAPPED);
+	memcg_query.memcg_kmem = memcg_stat_fetch_if_exists(cgrp, MEMCG_KMEM);
+	memcg_query.pgfault = vm_event_fetch_if_exists(cgrp, PGFAULT);
+
+	return 0;
+}
-- 
2.47.3



* Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
  2025-10-15 19:08 [PATCH v2 0/2] memcg: reading memcg stats more efficiently JP Kobryn
  2025-10-15 19:08 ` [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
  2025-10-15 19:08 ` [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs JP Kobryn
@ 2025-10-15 20:46 ` Shakeel Butt
  2025-10-16  0:21   ` JP Kobryn
  2 siblings, 1 reply; 16+ messages in thread
From: Shakeel Butt @ 2025-10-15 20:46 UTC (permalink / raw)
  To: JP Kobryn
  Cc: andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel,
	cgroups, linux-mm, bpf, kernel-team, mhocko, roman.gushchin,
	muchun.song

Cc memcg maintainers.

On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
> When reading cgroup memory.stat files there is significant kernel overhead
> in the formatting and encoding of numeric data into a string buffer. Beyond
> that, the given user mode program must decode this data and possibly
> perform filtering to obtain the desired stats. This process can be
> expensive for programs that periodically sample this data over a large
> enough fleet.
> 
> As an alternative to reading memory.stat, introduce new kfuncs that allow
> fetching specific memcg stats from within cgroup iterator based bpf
> programs. This approach allows for numeric values to be transferred
> directly from the kernel to user mode via the mapped memory of the bpf
> program's elf data section. Reading stats this way effectively eliminates
> the numeric conversion work needed to be performed in both kernel and user
> mode. It also eliminates the need for filtering in a user mode program.
> i.e. where reading memory.stat returns all stats, this new approach allows
> returning only select stats.
> 
> An experiment was set up to compare the performance of a program using these
> new kfuncs vs a program that uses the traditional method of reading
> memory.stat. On the experimental side, a libbpf based program was written
> which sets up a link to the bpf program once in advance and then reuses
> this link to create and read from a bpf iterator program for 1M iterations.

I am getting a bit confused on the terminology. You mentioned libbpf
program, bpf program, link. Can you describe each of them? Think of
explaining this to someone with no bpf background.

(BTW Yonghong already explained to me these details but I wanted the
commit message to be self explanatory).

> Meanwhile on the control side, a program was written to open the root
> memory.stat file

How much activity was on the system? I imagine none because I don't see
flushing in the perf profile. This experiment focuses on the
non-flushing part of the memcg stats which is fine.

> and repeatedly read 1M times from the associated file
> descriptor (while seeking back to zero before each subsequent read). Note
> that the program does not bother to decode or filter any data in user mode.
> The reason for this is because the experimental program completely removes
> the need for this work.

Hmm in your experiment is the control program doing the decode and/or
filter or no? The last sentence in above para is confusing. Yes, the
experiment program does not need to do the parsing or decoding in
userspace but the control program needs to do that. If your control
program is not doing it then you are under-selling your work.

> 
> The results showed a significant perf benefit on the experimental side,
> outperforming the control side by a margin of 80% elapsed time in kernel
> mode. The kernel overhead of numeric conversion on the control side is
> eliminated on the experimental side since the values are read directly
> through mapped memory of the bpf program. The experiment data is shown
> here:
> 
> control: elapsed time
> real    0m13.062s
> user    0m0.147s
> sys     0m12.876s
> 
> experiment: elapsed time
> real    0m2.717s
> user    0m0.175s
> sys     0m2.451s

These numbers are really awesome.

> 
> control: perf data
> 22.23% a.out [kernel.kallsyms] [k] vsnprintf
> 18.83% a.out [kernel.kallsyms] [k] format_decode
> 12.05% a.out [kernel.kallsyms] [k] string
> 11.56% a.out [kernel.kallsyms] [k] number
>  7.71% a.out [kernel.kallsyms] [k] strlen
>  4.80% a.out [kernel.kallsyms] [k] memcpy_orig
>  4.67% a.out [kernel.kallsyms] [k] memory_stat_format
>  4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
>  2.22% a.out [kernel.kallsyms] [k] widen_string
>  1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
>  0.95% a.out [kernel.kallsyms] [k] put_dec_full8
>  0.69% a.out [kernel.kallsyms] [k] put_dec
>  0.69% a.out [kernel.kallsyms] [k] memcpy
> 
> experiment: perf data
> 10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
>  7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
>  4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
>  3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
>  2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
>  2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
>  2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
>  2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
>  2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
>  2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
> 
> Aside from the perf gain, the kfunc/bpf approach provides flexibility in
> how memcg data can be delivered to a user mode program. As seen in the
> second patch which contains the selftests, it is possible to use a struct
> with select memory stat fields. But it is completely up to the programmer
> on how to lay out the data.

I remember you plan to convert couple of open source program to use this
new feature. I think below [1] and oomd [2]. Adding that information
would further make your case strong. cAdvisor[3] is another open source
tool which can take benefit from this work.

[1] https://github.com/facebookincubator/below
[2] https://github.com/facebookincubator/oomd
[3] https://github.com/google/cadvisor



* Re: [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats
  2025-10-15 19:08 ` [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
@ 2025-10-15 20:48   ` Shakeel Butt
  2025-10-15 23:12   ` Song Liu
  2025-10-16 22:28   ` kernel test robot
  2 siblings, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2025-10-15 20:48 UTC (permalink / raw)
  To: JP Kobryn
  Cc: andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel,
	cgroups, linux-mm, bpf, kernel-team

On Wed, Oct 15, 2025 at 12:08:12PM -0700, JP Kobryn wrote:
> Reading from the memory.stat file can be expensive because of the string
> encoding/decoding and text filtering involved. Introduce three kfuncs for
> fetching each type of memcg stat from a bpf program. This allows data to be
> transferred directly to userspace, eliminating the need for string
> encoding/decoding. It also removes the need for text filtering since it
> allows for fetching specific stats.
> 
> The patch also includes a kfunc for flushing stats in order to read the
> latest values. Note that this is not required for fetching stats, since the
> kernel periodically flushes memcg stats. It is left up to the programmer
> whether they want more recent stats or not.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


* Re: [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats
  2025-10-15 19:08 ` [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
  2025-10-15 20:48   ` Shakeel Butt
@ 2025-10-15 23:12   ` Song Liu
  2025-10-16  4:18     ` Yonghong Song
  2025-10-16 20:28     ` JP Kobryn
  2025-10-16 22:28   ` kernel test robot
  2 siblings, 2 replies; 16+ messages in thread
From: Song Liu @ 2025-10-15 23:12 UTC (permalink / raw)
  To: JP Kobryn
  Cc: shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm,
	linux-kernel, cgroups, linux-mm, bpf, kernel-team

On Wed, Oct 15, 2025 at 12:08 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>
[...]
> ---
>  mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4deda33625f4..6547c27d4430 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -871,6 +871,73 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>  }
>  #endif
>
> +static inline struct mem_cgroup *memcg_from_cgroup(struct cgroup *cgrp)
> +{
> +       return cgrp ? mem_cgroup_from_css(cgrp->subsys[memory_cgrp_id]) : NULL;
> +}
> +

We should add __bpf_kfunc_start_defs() here, and __bpf_kfunc_end_defs()
after all the kfuncs.

> +__bpf_kfunc static void memcg_flush_stats(struct cgroup *cgrp)

We mostly do not make kfunc static, but it seems to also work.

> +{
> +       struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
> +
> +       if (!memcg)
> +               return;

Maybe we can let memcg_flush_stats return int, and return -EINVAL
on memcg == NULL cases?
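
For example, just as a sketch (reusing the memcg_from_cgroup() helper from
this patch):

	__bpf_kfunc int memcg_flush_stats(struct cgroup *cgrp)
	{
		struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);

		if (!memcg)
			return -EINVAL;

		mem_cgroup_flush_stats(memcg);
		return 0;
	}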

> +
> +       mem_cgroup_flush_stats(memcg);
> +}
> +
[...]


* Re: [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs
  2025-10-15 19:08 ` [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs JP Kobryn
@ 2025-10-15 23:17   ` Shakeel Butt
  2025-10-16  5:04   ` Yonghong Song
  1 sibling, 0 replies; 16+ messages in thread
From: Shakeel Butt @ 2025-10-15 23:17 UTC (permalink / raw)
  To: JP Kobryn
  Cc: andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel,
	cgroups, linux-mm, bpf, kernel-team

On Wed, Oct 15, 2025 at 12:08:13PM -0700, JP Kobryn wrote:
> Add test coverage for the kfuncs that fetch memcg stats. Using some common
> stats, test before and after scenarios ensuring that the given stat
> increases by some arbitrary amount. The stats selected cover the three
> categories represented by the enums: node_stat_item, memcg_stat_item,
> vm_event_item.
> 
> Since only a subset of all stats are queried, use a static struct made up
> of fields for each stat. Write to the struct with the fetched values when
> the bpf program is invoked and read the fields in the user mode program for
> verification.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>  .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
>  .../bpf/prog_tests/cgroup_iter_memcg.c        | 295 ++++++++++++++++++
>  .../selftests/bpf/progs/cgroup_iter_memcg.c   |  61 ++++
>  3 files changed, 374 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
>  create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
> 
> diff --git a/tools/testing/selftests/bpf/cgroup_iter_memcg.h b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
> new file mode 100644
> index 000000000000..5f4c6502d9f1
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#ifndef __CGROUP_ITER_MEMCG_H
> +#define __CGROUP_ITER_MEMCG_H
> +
> +struct memcg_query {
> +	/* some node_stat_item's */
> +	long nr_anon_mapped;
> +	long nr_shmem;
> +	long nr_file_pages;
> +	long nr_file_mapped;
> +	/* some memcg_stat_item */
> +	long memcg_kmem;
> +	/* some vm_event_item */
> +	long pgfault;
> +};
> +
> +#endif /* __CGROUP_ITER_MEMCG_H */
> diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
> new file mode 100644
> index 000000000000..264dc3c9ec30
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
> @@ -0,0 +1,295 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#include <test_progs.h>
> +#include <bpf/libbpf.h>
> +#include <bpf/btf.h>
> +#include <fcntl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include "cgroup_helpers.h"
> +#include "cgroup_iter_memcg.h"
> +#include "cgroup_iter_memcg.skel.h"
> +
> +int read_stats(struct bpf_link *link)
> +{
> +	int fd, ret = 0;
> +	ssize_t bytes;
> +
> +	fd = bpf_iter_create(bpf_link__fd(link));

Do we need to create a new fd for every new stat read? Is there lseek()
like approach possible here?

> +	if (!ASSERT_OK_FD(fd, "bpf_iter_create"))
> +		return 1;
> +
> +	/*
> +	 * Invoke iter program by reading from its fd. We're not expecting any
> +	 * data to be written by the bpf program so the result should be zero.
> +	 * Results will be read directly through the custom data section
> +	 * accessible through skel->data_query.memcg_query.
> +	 */
> +	bytes = read(fd, NULL, 0);
> +	if (!ASSERT_EQ(bytes, 0, "read fd"))
> +		ret = 1;
> +
> +	close(fd);
> +	return ret;
> +}
> +

[...]

> +static void test_shmem(struct bpf_link *link,
> +		struct memcg_query *memcg_query)
> +{
> +	size_t len;
> +	int fd;
> +	void *map;
> +	long val;
> +
> +	len = sysconf(_SC_PAGESIZE) * 1024;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		return;
> +
> +	val = memcg_query->nr_shmem;
> +	if (!ASSERT_GE(val, 0, "init shmem val"))
> +		return;
> +
> +	/*
> +	 * Increase memcg shmem usage by creating and writing
> +	 * to a shmem object.
> +	 */
> +	fd = shm_open("/tmp_shmem", O_CREAT | O_RDWR, 0644);
> +	if (!ASSERT_OK_FD(fd, "shm_open"))
> +		return;
> +
> +	if (!ASSERT_OK(ftruncate(fd, len), "ftruncate"))
> +		goto cleanup_fd;
> +
> +	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
> +			MAP_SHARED, fd, 0);

You don't need to mmap(), you can just fallocate() or simply write to
increase shmem stats.
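
For example (untested), the mmap()/memset() here could be replaced with
something like:

	if (!ASSERT_OK(fallocate(fd, 0, 0, len), "fallocate"))
		goto cleanup_fd;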

> +	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap shmem"))
> +		goto cleanup_fd;
> +
> +	memset(map, 1, len);
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		goto cleanup_map;
> +
> +	ASSERT_GT(memcg_query->nr_shmem, val, "final shmem value");
> +
> +cleanup_map:
> +	munmap(map, len);
> +cleanup_fd:
> +	close(fd);
> +	shm_unlink("/tmp_shmem");
> +}
> +
> +static void test_kmem(struct bpf_link *link,
> +		struct memcg_query *memcg_query)
> +{
> +	int fds[2];
> +	int err;
> +	ssize_t bytes;
> +	size_t len;
> +	char *buf;
> +	long val;
> +
> +	len = sysconf(_SC_PAGESIZE) * 1024;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		return;
> +
> +	val = memcg_query->memcg_kmem;
> +	if (!ASSERT_GE(val, 0, "initial kmem val"))
> +		return;
> +
> +	err = pipe2(fds, O_NONBLOCK);

You will need to create a few more of these to cause the kernel memory
stats to be updated, since the per-cpu memcg stock caches kmem stat updates.

> +	if (!ASSERT_OK(err, "pipe"))
> +		return;
> +
> +	buf = malloc(len);
> +	memset(buf, 1, len);
> +	bytes = write(fds[1], buf, len);
> +	if (!ASSERT_GT(bytes, 0, "write"))
> +		goto cleanup;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		goto cleanup;
> +
> +	ASSERT_GT(memcg_query->memcg_kmem, val, "kmem value");
> +
> +cleanup:
> +	free(buf);
> +	close(fds[0]);
> +	close(fds[1]);
> +}
> +


* Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
  2025-10-15 20:46 ` [PATCH v2 0/2] memcg: reading memcg stats more efficiently Shakeel Butt
@ 2025-10-16  0:21   ` JP Kobryn
  2025-10-16  1:10     ` Roman Gushchin
  0 siblings, 1 reply; 16+ messages in thread
From: JP Kobryn @ 2025-10-16  0:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel,
	cgroups, linux-mm, bpf, kernel-team, mhocko, roman.gushchin,
	muchun.song

On 10/15/25 1:46 PM, Shakeel Butt wrote:
> Cc memcg maintainers.
> 
> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>> When reading cgroup memory.stat files there is significant kernel overhead
>> in the formatting and encoding of numeric data into a string buffer. Beyond
>> that, the given user mode program must decode this data and possibly
>> perform filtering to obtain the desired stats. This process can be
>> expensive for programs that periodically sample this data over a large
>> enough fleet.
>>
>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>> fetching specific memcg stats from within cgroup iterator based bpf
>> programs. This approach allows for numeric values to be transferred
>> directly from the kernel to user mode via the mapped memory of the bpf
>> program's elf data section. Reading stats this way effectively eliminates
>> the numeric conversion work needed to be performed in both kernel and user
>> mode. It also eliminates the need for filtering in a user mode program.
>> i.e. where reading memory.stat returns all stats, this new approach allows
>> returning only select stats.
>>
>> An experiment was set up to compare the performance of a program using these
>> new kfuncs vs a program that uses the traditional method of reading
>> memory.stat. On the experimental side, a libbpf based program was written
>> which sets up a link to the bpf program once in advance and then reuses
>> this link to create and read from a bpf iterator program for 1M iterations.
> 
> I am getting a bit confused on the terminology. You mentioned libbpf
> program, bpf program, link. Can you describe each of them? Think of
> explaining this to someone with no bpf background.
> 
> (BTW Yonghong already explained to me these details but I wanted the
> commit message to be self explanatory).

No problem. I'll try to expand on those terms in v3.

> 
>> Meanwhile on the control side, a program was written to open the root
>> memory.stat file
> 
> How much activity was on the system? I imagine none because I don't see
> flushing in the perf profile. This experiment focuses on the
> non-flushing part of the memcg stats which is fine.

Right, at the time there was no custom workload running alongside the
tests.

> 
>> and repeatedly read 1M times from the associated file
>> descriptor (while seeking back to zero before each subsequent read). Note
>> that the program does not bother to decode or filter any data in user mode.
>> The reason for this is because the experimental program completely removes
>> the need for this work.
> 
> Hmm in your experiment is the control program doing the decode and/or
> filter or no? The last sentence in above para is confusing. Yes, the
> experiment program does not need to do the parsing or decoding in
> userspace but the control program needs to do that. If your control
> program is not doing it then you are under-selling your work.

The control does not perform decoding. But it's a good point. Let me add
decoding to the control side in v3.

> 
>>
>> The results showed a significant perf benefit on the experimental side,
>> outperforming the control side by a margin of 80% elapsed time in kernel
>> mode. The kernel overhead of numeric conversion on the control side is
>> eliminated on the experimental side since the values are read directly
>> through mapped memory of the bpf program. The experiment data is shown
>> here:
>>
>> control: elapsed time
>> real    0m13.062s
>> user    0m0.147s
>> sys     0m12.876s
>>
>> experiment: elapsed time
>> real    0m2.717s
>> user    0m0.175s
>> sys     0m2.451s
> 
> These numbers are really awesome.

:)

> 
>>
>> control: perf data
>> 22.23% a.out [kernel.kallsyms] [k] vsnprintf
>> 18.83% a.out [kernel.kallsyms] [k] format_decode
>> 12.05% a.out [kernel.kallsyms] [k] string
>> 11.56% a.out [kernel.kallsyms] [k] number
>>   7.71% a.out [kernel.kallsyms] [k] strlen
>>   4.80% a.out [kernel.kallsyms] [k] memcpy_orig
>>   4.67% a.out [kernel.kallsyms] [k] memory_stat_format
>>   4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
>>   2.22% a.out [kernel.kallsyms] [k] widen_string
>>   1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
>>   0.95% a.out [kernel.kallsyms] [k] put_dec_full8
>>   0.69% a.out [kernel.kallsyms] [k] put_dec
>>   0.69% a.out [kernel.kallsyms] [k] memcpy
>>
>> experiment: perf data
>> 10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
>>   7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
>>   4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
>>   3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
>>   2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
>>   2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
>>   2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
>>   2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
>>   2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
>>   2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
>>
>> Aside from the perf gain, the kfunc/bpf approach provides flexibility in
>> how memcg data can be delivered to a user mode program. As seen in the
>> second patch which contains the selftests, it is possible to use a struct
>> with select memory stat fields. But it is completely up to the programmer
>> on how to lay out the data.
> 
> I remember you plan to convert couple of open source program to use this
> new feature. I think below [1] and oomd [2]. Adding that information
> would further make your case strong. cAdvisor[3] is another open source
> tool which can take benefit from this work.

That is accurate, thanks. Will include in v3.

> 
> [1] https://github.com/facebookincubator/below
> [2] https://github.com/facebookincubator/oomd
> [3] https://github.com/google/cadvisor
> 



* Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
  2025-10-16  0:21   ` JP Kobryn
@ 2025-10-16  1:10     ` Roman Gushchin
  2025-10-16 20:26       ` JP Kobryn
  0 siblings, 1 reply; 16+ messages in thread
From: Roman Gushchin @ 2025-10-16  1:10 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Shakeel Butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm,
	linux-kernel, cgroups, linux-mm, bpf, kernel-team, mhocko,
	muchun.song

JP Kobryn <inwardvessel@gmail.com> writes:

> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>> Cc memcg maintainers.
>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>> When reading cgroup memory.stat files there is significant kernel overhead
>>> in the formatting and encoding of numeric data into a string buffer. Beyond
>>> that, the given user mode program must decode this data and possibly
>>> perform filtering to obtain the desired stats. This process can be
>>> expensive for programs that periodically sample this data over a large
>>> enough fleet.
>>>
>>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>>> fetching specific memcg stats from within cgroup iterator based bpf
>>> programs. This approach allows for numeric values to be transferred
>>> directly from the kernel to user mode via the mapped memory of the bpf
>>> program's elf data section. Reading stats this way effectively eliminates
>>> the numeric conversion work needed to be performed in both kernel and user
>>> mode. It also eliminates the need for filtering in a user mode program.
>>> i.e. where reading memory.stat returns all stats, this new approach allows
>>> returning only select stats.

It seems like I have most of these functions already implemented as part of
bpfoom: https://lkml.org/lkml/2025/8/18/1403

So I definitely find them useful. Would be nice to merge our efforts.

Thanks!


* Re: [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats
  2025-10-15 23:12   ` Song Liu
@ 2025-10-16  4:18     ` Yonghong Song
  2025-10-16 20:28     ` JP Kobryn
  1 sibling, 0 replies; 16+ messages in thread
From: Yonghong Song @ 2025-10-16  4:18 UTC (permalink / raw)
  To: Song Liu, JP Kobryn
  Cc: shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm,
	linux-kernel, cgroups, linux-mm, bpf, kernel-team



On 10/15/25 4:12 PM, Song Liu wrote:
> On Wed, Oct 15, 2025 at 12:08 PM JP Kobryn <inwardvessel@gmail.com> wrote:
> [...]
>> ---
>>   mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 67 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4deda33625f4..6547c27d4430 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -871,6 +871,73 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>>   }
>>   #endif
>>
>> +static inline struct mem_cgroup *memcg_from_cgroup(struct cgroup *cgrp)
>> +{
>> +       return cgrp ? mem_cgroup_from_css(cgrp->subsys[memory_cgrp_id]) : NULL;
>> +}
>> +
> We should add __bpf_kfunc_start_defs() here, and __bpf_kfunc_end_defs()
> after all the kfuncs.
>
>> +__bpf_kfunc static void memcg_flush_stats(struct cgroup *cgrp)
> We mostly do not make kfunc static, but it seems to also work.

Let us remove 'static' in __bpf_kfunc functions in order to be consistent
with other existing kfuncs.


The __bpf_kfunc macro is
    linux/btf.h:#define __bpf_kfunc __used __retain __noclone noinline

__used and __retain attributes ensure the function won't be removed
by compiler/linker.

>
>> +{
>> +       struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
>> +
>> +       if (!memcg)
>> +               return;
> Maybe we can let memcg_flush_stats return int, and return -EINVAL
> on memcg == NULL cases?
>
>> +
>> +       mem_cgroup_flush_stats(memcg);
>> +}
>> +
> [...]
>



* Re: [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs
  2025-10-15 19:08 ` [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs JP Kobryn
  2025-10-15 23:17   ` Shakeel Butt
@ 2025-10-16  5:04   ` Yonghong Song
  2025-10-16 20:45     ` JP Kobryn
  1 sibling, 1 reply; 16+ messages in thread
From: Yonghong Song @ 2025-10-16  5:04 UTC (permalink / raw)
  To: JP Kobryn, shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes,
	tj, akpm
  Cc: linux-kernel, cgroups, linux-mm, bpf, kernel-team



On 10/15/25 12:08 PM, JP Kobryn wrote:
> Add test coverage for the kfuncs that fetch memcg stats. Using some common
> stats, test before and after scenarios ensuring that the given stat
> increases by some arbitrary amount. The stats selected cover the three
> categories represented by the enums: node_stat_item, memcg_stat_item,
> vm_event_item.
>
> Since only a subset of all stats are queried, use a static struct made up
> of fields for each stat. Write to the struct with the fetched values when
> the bpf program is invoked and read the fields in the user mode program for
> verification.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
> ---
>   .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
>   .../bpf/prog_tests/cgroup_iter_memcg.c        | 295 ++++++++++++++++++
>   .../selftests/bpf/progs/cgroup_iter_memcg.c   |  61 ++++
>   3 files changed, 374 insertions(+)
>   create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
>   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
>   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
>
> diff --git a/tools/testing/selftests/bpf/cgroup_iter_memcg.h b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
> new file mode 100644
> index 000000000000..5f4c6502d9f1
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#ifndef __CGROUP_ITER_MEMCG_H
> +#define __CGROUP_ITER_MEMCG_H
> +
> +struct memcg_query {
> +	/* some node_stat_item's */
> +	long nr_anon_mapped;
> +	long nr_shmem;
> +	long nr_file_pages;
> +	long nr_file_mapped;
> +	/* some memcg_stat_item */
> +	long memcg_kmem;
> +	/* some vm_event_item */
> +	long pgfault;
> +};
> +
> +#endif /* __CGROUP_ITER_MEMCG_H */
> diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
> new file mode 100644
> index 000000000000..264dc3c9ec30
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
> @@ -0,0 +1,295 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#include <test_progs.h>
> +#include <bpf/libbpf.h>
> +#include <bpf/btf.h>
> +#include <fcntl.h>
> +#include <sys/mman.h>
> +#include <unistd.h>
> +#include "cgroup_helpers.h"
> +#include "cgroup_iter_memcg.h"
> +#include "cgroup_iter_memcg.skel.h"
> +
> +int read_stats(struct bpf_link *link)

static int read_stats(...)

> +{
> +	int fd, ret = 0;
> +	ssize_t bytes;
> +
> +	fd = bpf_iter_create(bpf_link__fd(link));
> +	if (!ASSERT_OK_FD(fd, "bpf_iter_create"))
> +		return 1;
> +
> +	/*
> +	 * Invoke iter program by reading from its fd. We're not expecting any
> +	 * data to be written by the bpf program so the result should be zero.
> +	 * Results will be read directly through the custom data section
> +	 * accessible through skel->data_query.memcg_query.
> +	 */
> +	bytes = read(fd, NULL, 0);
> +	if (!ASSERT_EQ(bytes, 0, "read fd"))
> +		ret = 1;
> +
> +	close(fd);
> +	return ret;
> +}
> +
> +static void test_anon(struct bpf_link *link,
> +		struct memcg_query *memcg_query)

Alignment between arguments? Actually two arguments can be in the same line.

> +{
> +	void *map;
> +	size_t len;
> +	long val;
> +
> +	len = sysconf(_SC_PAGESIZE) * 1024;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		return;
> +
> +	val = memcg_query->nr_anon_mapped;
> +	if (!ASSERT_GE(val, 0, "initial anon mapped val"))
> +		return;
> +
> +	/*
> +	 * Increase memcg anon usage by mapping and writing
> +	 * to a new anon region.
> +	 */
> +	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
> +			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

All arguments can be in the same line.

> +	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap anon"))
> +		return;
> +
> +	memset(map, 1, len);
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		goto cleanup;
> +
> +	ASSERT_GT(memcg_query->nr_anon_mapped, val, "final anon mapped val");
> +
> +cleanup:
> +	munmap(map, len);
> +}
> +
> +static void test_file(struct bpf_link *link,
> +		struct memcg_query *memcg_query)

Arguments can be in the same line. Some other examples below.

> +{
> +	void *map;
> +	size_t len;
> +	long val_pages, val_mapped;
> +	FILE *f;
> +	int fd;
> +
> +	len = sysconf(_SC_PAGESIZE) * 1024;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		return;
> +
> +	val_pages = memcg_query->nr_file_pages;
> +	if (!ASSERT_GE(val_pages, 0, "initial file val"))
> +		return;
> +	val_mapped = memcg_query->nr_file_mapped;
> +	if (!ASSERT_GE(val_mapped, 0, "initial file mapped val"))
> +		return;
> +
> +	/*
> +	 * Increase memcg file usage by creating and writing
> +	 * to a temporary mapped file.
> +	 */
> +	f = tmpfile();
> +	if (!ASSERT_OK_PTR(f, "tmpfile"))
> +		return;
> +	fd = fileno(f);
> +	if (!ASSERT_OK_FD(fd, "open fd"))
> +		return;
> +	if (!ASSERT_OK(ftruncate(fd, len), "ftruncate"))
> +		goto cleanup_fd;
> +
> +	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
> +			MAP_SHARED, fd, 0);

ditto.

> +	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap file"))
> +		goto cleanup_fd;
> +
> +	memset(map, 1, len);
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		goto cleanup_map;
> +
> +	ASSERT_GT(memcg_query->nr_file_pages, val_pages, "final file value");
> +	ASSERT_GT(memcg_query->nr_file_mapped, val_mapped,
> +			"final file mapped value");

ditto.

> +
> +cleanup_map:
> +	munmap(map, len);
> +cleanup_fd:
> +	close(fd);
> +}
> +
> +static void test_shmem(struct bpf_link *link,
> +		struct memcg_query *memcg_query)

ditto.

> +{
> +	size_t len;
> +	int fd;
> +	void *map;
> +	long val;
> +
> +	len = sysconf(_SC_PAGESIZE) * 1024;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		return;
> +
> +	val = memcg_query->nr_shmem;
> +	if (!ASSERT_GE(val, 0, "init shmem val"))
> +		return;
> +
> +	/*
> +	 * Increase memcg shmem usage by creating and writing
> +	 * to a shmem object.
> +	 */
> +	fd = shm_open("/tmp_shmem", O_CREAT | O_RDWR, 0644);
> +	if (!ASSERT_OK_FD(fd, "shm_open"))
> +		return;
> +
> +	if (!ASSERT_OK(ftruncate(fd, len), "ftruncate"))
> +		goto cleanup_fd;
> +
> +	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
> +			MAP_SHARED, fd, 0);

ditto.

> +	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap shmem"))
> +		goto cleanup_fd;
> +
> +	memset(map, 1, len);
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		goto cleanup_map;
> +
> +	ASSERT_GT(memcg_query->nr_shmem, val, "final shmem value");
> +
> +cleanup_map:
> +	munmap(map, len);
> +cleanup_fd:
> +	close(fd);
> +	shm_unlink("/tmp_shmem");
> +}
> +
> +static void test_kmem(struct bpf_link *link,
> +		struct memcg_query *memcg_query)

ditto.

> +{
> +	int fds[2];
> +	int err;
> +	ssize_t bytes;
> +	size_t len;
> +	char *buf;
> +	long val;
> +
> +	len = sysconf(_SC_PAGESIZE) * 1024;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		return;
> +
> +	val = memcg_query->memcg_kmem;
> +	if (!ASSERT_GE(val, 0, "initial kmem val"))
> +		return;
> +
> +	err = pipe2(fds, O_NONBLOCK);
> +	if (!ASSERT_OK(err, "pipe"))
> +		return;
> +
> +	buf = malloc(len);

buf could be NULL?

> +	memset(buf, 1, len);
> +	bytes = write(fds[1], buf, len);
> +	if (!ASSERT_GT(bytes, 0, "write"))
> +		goto cleanup;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		goto cleanup;
> +
> +	ASSERT_GT(memcg_query->memcg_kmem, val, "kmem value");
> +
> +cleanup:
> +	free(buf);
> +	close(fds[0]);
> +	close(fds[1]);
> +}
> +
> +static void test_pgfault(struct bpf_link *link,
> +		struct memcg_query *memcg_query)

ditto.

> +{
> +	void *map;
> +	size_t len;
> +	long val;
> +
> +	len = sysconf(_SC_PAGESIZE) * 1024;
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		return;
> +
> +	val = memcg_query->pgfault;
> +	if (!ASSERT_GE(val, 0, "initial pgfault val"))
> +		return;
> +
> +	/* Create region to use for triggering a page fault. */
> +	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
> +			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> +	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap anon"))
> +		return;
> +
> +	/* Trigger page fault. */
> +	memset(map, 1, len);
> +
> +	if (!ASSERT_OK(read_stats(link), "read stats"))
> +		goto cleanup;
> +
> +	ASSERT_GT(memcg_query->pgfault, val, "final pgfault val");
> +
> +cleanup:
> +	munmap(map, len);
> +}
> +
> +void test_cgroup_iter_memcg(void)
> +{
> +	char *cgroup_rel_path = "/cgroup_iter_memcg_test";
> +	struct cgroup_iter_memcg *skel;
> +	struct bpf_link *link;
> +	int cgroup_fd, err;
> +
> +	cgroup_fd = cgroup_setup_and_join(cgroup_rel_path);
> +	if (!ASSERT_OK_FD(cgroup_fd, "cgroup_setup_and_join"))
> +		return;
> +
> +	skel = cgroup_iter_memcg__open();
> +	if (!ASSERT_OK_PTR(skel, "cgroup_iter_memcg__open"))
> +		goto cleanup_cgroup_fd;
> +
> +	err = cgroup_iter_memcg__load(skel);
> +	if (!ASSERT_OK(err, "cgroup_iter_memcg__load"))
> +		goto cleanup_skel;

The above two can be combined with cgroup_iter_memcg__open_and_load().

> +
> +	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
> +	union bpf_iter_link_info linfo = {
> +		.cgroup.cgroup_fd = cgroup_fd,
> +		.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY,
> +	};
> +	opts.link_info = &linfo;
> +	opts.link_info_len = sizeof(linfo);
> +
> +	link = bpf_program__attach_iter(skel->progs.cgroup_memcg_query, &opts);
> +	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
> +		goto cleanup_cgroup_fd;

goto cleanup_skel;

> +
> +	if (test__start_subtest("cgroup_iter_memcg__anon"))
> +		test_anon(link, &skel->data_query->memcg_query);
> +	if (test__start_subtest("cgroup_iter_memcg__shmem"))
> +		test_shmem(link, &skel->data_query->memcg_query);
> +	if (test__start_subtest("cgroup_iter_memcg__file"))
> +		test_file(link, &skel->data_query->memcg_query);
> +	if (test__start_subtest("cgroup_iter_memcg__kmem"))
> +		test_kmem(link, &skel->data_query->memcg_query);
> +	if (test__start_subtest("cgroup_iter_memcg__pgfault"))
> +		test_pgfault(link, &skel->data_query->memcg_query);
> +
> +	bpf_link__destroy(link);
> +cleanup_skel:
> +	cgroup_iter_memcg__destroy(skel);
> +cleanup_cgroup_fd:
> +	close(cgroup_fd);
> +	cleanup_cgroup_environment();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
> new file mode 100644
> index 000000000000..0d913d72b68d
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
> @@ -0,0 +1,61 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
> +#include <vmlinux.h>
> +#include <bpf/bpf_core_read.h>
> +#include "cgroup_iter_memcg.h"
> +
> +char _license[] SEC("license") = "GPL";
> +
> +extern void memcg_flush_stats(struct cgroup *cgrp) __ksym;
> +extern unsigned long memcg_stat_fetch(struct cgroup *cgrp,
> +		enum memcg_stat_item item) __ksym;
> +extern unsigned long memcg_node_stat_fetch(struct cgroup *cgrp,
> +		enum node_stat_item item) __ksym;
> +extern unsigned long memcg_vm_event_fetch(struct cgroup *cgrp,
> +		enum vm_event_item item) __ksym;

The above four extern functions are not needed. They should be included
in vmlinux.h if the latest pahole version (1.30) is used.

> +
> +/* The latest values read are stored here. */
> +struct memcg_query memcg_query SEC(".data.query");
> +
> +/*
> + * Helpers for fetching any of the three different types of memcg stats.
> + * BPF core macros are used to ensure an enumerator is present in the given
> + * kernel. Falling back on -1 indicates its absence.
> + */
> +#define node_stat_fetch_if_exists(cgrp, item) \
> +	bpf_core_enum_value_exists(enum node_stat_item, item) ? \
> +		memcg_node_stat_fetch((cgrp), bpf_core_enum_value( \
> +					 enum node_stat_item, item)) : -1
> +
> +#define memcg_stat_fetch_if_exists(cgrp, item) \
> +	bpf_core_enum_value_exists(enum memcg_stat_item, item) ? \
> +		memcg_stat_fetch((cgrp), bpf_core_enum_value( \
> +					 enum memcg_stat_item, item)) : -1
> +
> +#define vm_event_fetch_if_exists(cgrp, item) \
> +	bpf_core_enum_value_exists(enum vm_event_item, item) ? \
> +		memcg_vm_event_fetch((cgrp), bpf_core_enum_value( \
> +					 enum vm_event_item, item)) : -1
> +
> +SEC("iter.s/cgroup")
> +int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
> +{
> +	struct cgroup *cgrp = ctx->cgroup;
> +
> +	if (!cgrp)
> +		return 1;
> +
> +	memcg_flush_stats(cgrp);
> +
> +	memcg_query.nr_anon_mapped = node_stat_fetch_if_exists(cgrp,
> +			NR_ANON_MAPPED);
> +	memcg_query.nr_shmem = node_stat_fetch_if_exists(cgrp, NR_SHMEM);
> +	memcg_query.nr_file_pages = node_stat_fetch_if_exists(cgrp,
> +			NR_FILE_PAGES);
> +	memcg_query.nr_file_mapped = node_stat_fetch_if_exists(cgrp,
> +			NR_FILE_MAPPED);
> +	memcg_query.memcg_kmem = memcg_stat_fetch_if_exists(cgrp, MEMCG_KMEM);
> +	memcg_query.pgfault = vm_event_fetch_if_exists(cgrp, PGFAULT);

There is a type mismatch:

+struct memcg_query {
+	/* some node_stat_item's */
+	long nr_anon_mapped;
+	long nr_shmem;
+	long nr_file_pages;
+	long nr_file_mapped;
+	/* some memcg_stat_item */
+	long memcg_kmem;
+	/* some vm_event_item */
+	long pgfault;
+};

memcg_query.nr_anon_mapped is long, but the node_stat_fetch_if_exists(...)
return value type is unsigned long. It would be good if the two types
were the same.
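
For illustration only, one way to make the types agree (note the -1 "absent"
fallback would then need rethinking, since it wraps to ULONG_MAX with unsigned
fields) is to mirror the kfunc return type in the struct:

struct memcg_query {
	/* some node_stat_item's */
	unsigned long nr_anon_mapped;
	unsigned long nr_shmem;
	unsigned long nr_file_pages;
	unsigned long nr_file_mapped;
	/* some memcg_stat_item */
	unsigned long memcg_kmem;
	/* some vm_event_item */
	unsigned long pgfault;
};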

> +
> +	return 0;
> +}



* Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
  2025-10-16  1:10     ` Roman Gushchin
@ 2025-10-16 20:26       ` JP Kobryn
  2025-10-16 23:00         ` Roman Gushchin
  0 siblings, 1 reply; 16+ messages in thread
From: JP Kobryn @ 2025-10-16 20:26 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm,
	linux-kernel, cgroups, linux-mm, bpf, kernel-team, mhocko,
	muchun.song

On 10/15/25 6:10 PM, Roman Gushchin wrote:
> JP Kobryn <inwardvessel@gmail.com> writes:
> 
>> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>>> Cc memcg maintainers.
>>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>>> When reading cgroup memory.stat files there is significant kernel overhead
>>>> in the formatting and encoding of numeric data into a string buffer. Beyond
>>>> that, the given user mode program must decode this data and possibly
>>>> perform filtering to obtain the desired stats. This process can be
>>>> expensive for programs that periodically sample this data over a large
>>>> enough fleet.
>>>>
>>>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>>>> fetching specific memcg stats from within cgroup iterator based bpf
>>>> programs. This approach allows for numeric values to be transferred
>>>> directly from the kernel to user mode via the mapped memory of the bpf
>>>> program's elf data section. Reading stats this way effectively eliminates
>>>> the numeric conversion work needed to be performed in both kernel and user
>>>> mode. It also eliminates the need for filtering in a user mode program.
>>>> i.e. where reading memory.stat returns all stats, this new approach allows
>>>> returning only select stats.
> 
> It seems like I've most of these functions implemented as part of
> bpfoom: https://lkml.org/lkml/2025/8/18/1403
> 
> So I definitely find them useful. Would be nice to merge our efforts.

Sounds great. I see in your series that you allow the kfuncs to accept
integers as item numbers. Would my approach of using typed enums work
for you? I wanted to take advantage of libbpf CO-RE so that the bpf
program could gracefully handle cases where a given enumerator is not
present in a given kernel version. I made use of this in the selftests.
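
For context, the guard pattern in question looks roughly like this
(illustrative, mirroring the selftest macros):

	long val;

	if (bpf_core_enum_value_exists(enum node_stat_item, NR_SHMEM))
		val = memcg_node_stat_fetch(cgrp,
				bpf_core_enum_value(enum node_stat_item, NR_SHMEM));
	else
		val = -1;	/* enumerator absent in this kernel */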

I'm planning on sending out a v3 so let me know if you would like to see
any alterations that would align with bpfoom.


* Re: [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats
  2025-10-15 23:12   ` Song Liu
  2025-10-16  4:18     ` Yonghong Song
@ 2025-10-16 20:28     ` JP Kobryn
  1 sibling, 0 replies; 16+ messages in thread
From: JP Kobryn @ 2025-10-16 20:28 UTC (permalink / raw)
  To: Song Liu
  Cc: shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm,
	linux-kernel, cgroups, linux-mm, bpf, kernel-team

On 10/15/25 4:12 PM, Song Liu wrote:
> On Wed, Oct 15, 2025 at 12:08 PM JP Kobryn <inwardvessel@gmail.com> wrote:
>>
> [...]
>> ---
>>   mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 67 insertions(+)
>>
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 4deda33625f4..6547c27d4430 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -871,6 +871,73 @@ unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>>   }
>>   #endif
>>
>> +static inline struct mem_cgroup *memcg_from_cgroup(struct cgroup *cgrp)
>> +{
>> +       return cgrp ? mem_cgroup_from_css(cgrp->subsys[memory_cgrp_id]) : NULL;
>> +}
>> +
> 
> We should add __bpf_kfunc_start_defs() here, and __bpf_kfunc_end_defs()
> after all the kfuncs.

Good call.

> 
>> +__bpf_kfunc static void memcg_flush_stats(struct cgroup *cgrp)
> 
> We mostly do not make kfuncs static, but it seems to work as well.
> 
>> +{
>> +       struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);
>> +
>> +       if (!memcg)
>> +               return;
> 
> Maybe we can let memcg_flush_stats return int, and return -EINVAL
> on memcg == NULL cases?

Sure, I'll do that in v3.
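
Something along these lines, as a rough sketch of what v3 could look like (not
the final form), also picking up the __bpf_kfunc_start_defs() point above:

__bpf_kfunc_start_defs();

__bpf_kfunc int memcg_flush_stats(struct cgroup *cgrp)
{
	struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);

	if (!memcg)
		return -EINVAL;

	mem_cgroup_flush_stats(memcg);
	return 0;
}

/* ... remaining kfuncs ... */

__bpf_kfunc_end_defs();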

> 
>> +
>> +       mem_cgroup_flush_stats(memcg);
>> +}
>> +
> [...]



* Re: [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs
  2025-10-16  5:04   ` Yonghong Song
@ 2025-10-16 20:45     ` JP Kobryn
  0 siblings, 0 replies; 16+ messages in thread
From: JP Kobryn @ 2025-10-16 20:45 UTC (permalink / raw)
  To: Yonghong Song, shakeel.butt, andrii, ast, mkoutny, yosryahmed,
	hannes, tj, akpm
  Cc: linux-kernel, cgroups, linux-mm, bpf, kernel-team

On 10/15/25 10:04 PM, Yonghong Song wrote:
> 
> 
> On 10/15/25 12:08 PM, JP Kobryn wrote:
>> Add test coverage for the kfuncs that fetch memcg stats. Using some common
>> stats, test before and after scenarios ensuring that the given stat
>> increases by some arbitrary amount. The stats selected cover the three
>> categories represented by the enums: node_stat_item, memcg_stat_item,
>> vm_event_item.
>>
>> Since only a subset of all stats are queried, use a static struct made up
>> of fields for each stat. Write to the struct with the fetched values when
>> the bpf program is invoked and read the fields in the user mode program
>> for verification.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>> ---
>>   .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
>>   .../bpf/prog_tests/cgroup_iter_memcg.c        | 295 ++++++++++++++++++
>>   .../selftests/bpf/progs/cgroup_iter_memcg.c   |  61 ++++
>>   3 files changed, 374 insertions(+)
>>   create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
>>   create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
>>   create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
>>
>> diff --git a/tools/testing/selftests/bpf/cgroup_iter_memcg.h b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
>> new file mode 100644
>> index 000000000000..5f4c6502d9f1
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
>> @@ -0,0 +1,18 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
>> +#ifndef __CGROUP_ITER_MEMCG_H
>> +#define __CGROUP_ITER_MEMCG_H
>> +
>> +struct memcg_query {
>> +    /* some node_stat_item's */
>> +    long nr_anon_mapped;
>> +    long nr_shmem;
>> +    long nr_file_pages;
>> +    long nr_file_mapped;
>> +    /* some memcg_stat_item */
>> +    long memcg_kmem;
>> +    /* some vm_event_item */
>> +    long pgfault;
>> +};
>> +
>> +#endif /* __CGROUP_ITER_MEMCG_H */
>> diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
>> new file mode 100644
>> index 000000000000..264dc3c9ec30
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
>> @@ -0,0 +1,295 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
>> +#include <test_progs.h>
>> +#include <bpf/libbpf.h>
>> +#include <bpf/btf.h>
>> +#include <fcntl.h>
>> +#include <sys/mman.h>
>> +#include <unistd.h>
>> +#include "cgroup_helpers.h"
>> +#include "cgroup_iter_memcg.h"
>> +#include "cgroup_iter_memcg.skel.h"
>> +
>> +int read_stats(struct bpf_link *link)
> 
> static int read_stats(...)
> 
>> +{
>> +    int fd, ret = 0;
>> +    ssize_t bytes;
>> +
>> +    fd = bpf_iter_create(bpf_link__fd(link));
>> +    if (!ASSERT_OK_FD(fd, "bpf_iter_create"))
>> +        return 1;
>> +
>> +    /*
>> +     * Invoke iter program by reading from its fd. We're not expecting any
>> +     * data to be written by the bpf program so the result should be zero.
>> +     * Results will be read directly through the custom data section
>> +     * accessible through skel->data_query.memcg_query.
>> +     */
>> +    bytes = read(fd, NULL, 0);
>> +    if (!ASSERT_EQ(bytes, 0, "read fd"))
>> +        ret = 1;
>> +
>> +    close(fd);
>> +    return ret;
>> +}
>> +
>> +static void test_anon(struct bpf_link *link,
>> +        struct memcg_query *memcg_query)
> 
> Alignment between arguments? Actually the two arguments can fit on the same line.

Thanks. I was using a limit of 75 chars but will increase to 100 where
applicable.
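
e.g. a single line fits within 100 columns:

static void test_anon(struct bpf_link *link, struct memcg_query *memcg_query)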

[...]
>> +{
>> +    int fds[2];
>> +    int err;
>> +    ssize_t bytes;
>> +    size_t len;
>> +    char *buf;
>> +    long val;
>> +
>> +    len = sysconf(_SC_PAGESIZE) * 1024;
>> +
>> +    if (!ASSERT_OK(read_stats(link), "read stats"))
>> +        return;
>> +
>> +    val = memcg_query->memcg_kmem;
>> +    if (!ASSERT_GE(val, 0, "initial kmem val"))
>> +        return;
>> +
>> +    err = pipe2(fds, O_NONBLOCK);
>> +    if (!ASSERT_OK(err, "pipe"))
>> +        return;
>> +
>> +    buf = malloc(len);
> 
> buf could be NULL?

Thanks. I'll add the check, unless this test is simplified for v3 and a buffer
is no longer needed.

[...]
>> +
>> +    skel = cgroup_iter_memcg__open();
>> +    if (!ASSERT_OK_PTR(skel, "cgroup_iter_memcg__open"))
>> +        goto cleanup_cgroup_fd;
>> +
>> +    err = cgroup_iter_memcg__load(skel);
>> +    if (!ASSERT_OK(err, "cgroup_iter_memcg__load"))
>> +        goto cleanup_skel;
> 
> The above two can be combined with cgroup_iter_memcg__open_and_load().

I'll do that in v3.

> 
>> +
>> +    DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
>> +    union bpf_iter_link_info linfo = {
>> +        .cgroup.cgroup_fd = cgroup_fd,
>> +        .cgroup.order = BPF_CGROUP_ITER_SELF_ONLY,
>> +    };
>> +    opts.link_info = &linfo;
>> +    opts.link_info_len = sizeof(linfo);
>> +
>> +    link = bpf_program__attach_iter(skel->progs.cgroup_memcg_query, &opts);
>> +    if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
>> +        goto cleanup_cgroup_fd;
> 
> goto cleanup_skel;

Good catch.

> 
>> +
>> +    if (test__start_subtest("cgroup_iter_memcg__anon"))
>> +        test_anon(link, &skel->data_query->memcg_query);
>> +    if (test__start_subtest("cgroup_iter_memcg__shmem"))
>> +        test_shmem(link, &skel->data_query->memcg_query);
>> +    if (test__start_subtest("cgroup_iter_memcg__file"))
>> +        test_file(link, &skel->data_query->memcg_query);
>> +    if (test__start_subtest("cgroup_iter_memcg__kmem"))
>> +        test_kmem(link, &skel->data_query->memcg_query);
>> +    if (test__start_subtest("cgroup_iter_memcg__pgfault"))
>> +        test_pgfault(link, &skel->data_query->memcg_query);
>> +
>> +    bpf_link__destroy(link);
>> +cleanup_skel:
>> +    cgroup_iter_memcg__destroy(skel);
>> +cleanup_cgroup_fd:
>> +    close(cgroup_fd);
>> +    cleanup_cgroup_environment();
>> +}
>> diff --git a/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
>> new file mode 100644
>> index 000000000000..0d913d72b68d
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
>> @@ -0,0 +1,61 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
>> +#include <vmlinux.h>
>> +#include <bpf/bpf_core_read.h>
>> +#include "cgroup_iter_memcg.h"
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +extern void memcg_flush_stats(struct cgroup *cgrp) __ksym;
>> +extern unsigned long memcg_stat_fetch(struct cgroup *cgrp,
>> +        enum memcg_stat_item item) __ksym;
>> +extern unsigned long memcg_node_stat_fetch(struct cgroup *cgrp,
>> +        enum node_stat_item item) __ksym;
>> +extern unsigned long memcg_vm_event_fetch(struct cgroup *cgrp,
>> +        enum vm_event_item item) __ksym;
> 
> The above four extern functions are not needed. They should be included
> in vmlinux.h if the latest pahole version (1.30) is used.

Thanks, was not aware of that. Will remove.

> 
>> +
>> +/* The latest values read are stored here. */
>> +struct memcg_query memcg_query SEC(".data.query");
>> +
>> +/*
>> + * Helpers for fetching any of the three different types of memcg stats.
>> + * BPF core macros are used to ensure an enumerator is present in the given
>> + * kernel. Falling back on -1 indicates its absence.
>> + */
>> +#define node_stat_fetch_if_exists(cgrp, item) \
>> +    bpf_core_enum_value_exists(enum node_stat_item, item) ? \
>> +        memcg_node_stat_fetch((cgrp), bpf_core_enum_value( \
>> +                     enum node_stat_item, item)) : -1
>> +
>> +#define memcg_stat_fetch_if_exists(cgrp, item) \
>> +    bpf_core_enum_value_exists(enum memcg_stat_item, item) ? \
>> +        memcg_stat_fetch((cgrp), bpf_core_enum_value( \
>> +                     enum memcg_stat_item, item)) : -1
>> +
>> +#define vm_event_fetch_if_exists(cgrp, item) \
>> +    bpf_core_enum_value_exists(enum vm_event_item, item) ? \
>> +        memcg_vm_event_fetch((cgrp), bpf_core_enum_value( \
>> +                     enum vm_event_item, item)) : -1
>> +
>> +SEC("iter.s/cgroup")
>> +int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
>> +{
>> +    struct cgroup *cgrp = ctx->cgroup;
>> +
>> +    if (!cgrp)
>> +        return 1;
>> +
>> +    memcg_flush_stats(cgrp);
>> +
>> +    memcg_query.nr_anon_mapped = node_stat_fetch_if_exists(cgrp,
>> +            NR_ANON_MAPPED);
>> +    memcg_query.nr_shmem = node_stat_fetch_if_exists(cgrp, NR_SHMEM);
>> +    memcg_query.nr_file_pages = node_stat_fetch_if_exists(cgrp,
>> +            NR_FILE_PAGES);
>> +    memcg_query.nr_file_mapped = node_stat_fetch_if_exists(cgrp,
>> +            NR_FILE_MAPPED);
>> +    memcg_query.memcg_kmem = memcg_stat_fetch_if_exists(cgrp, MEMCG_KMEM);
>> +    memcg_query.pgfault = vm_event_fetch_if_exists(cgrp, PGFAULT);
> 
> There is a type mismatch:
> 
> +struct memcg_query {
> +    /* some node_stat_item's */
> +    long nr_anon_mapped;
> +    long nr_shmem;
> +    long nr_file_pages;
> +    long nr_file_mapped;
> +    /* some memcg_stat_item */
> +    long memcg_kmem;
> +    /* some vm_event_item */
> +    long pgfault;
> +};
> 
> memcg_query.nr_anon_mapped is long, but the node_stat_fetch_if_exists(...)
> return value type is unsigned long. It would be good if the two types
> were the same.

Good call, will do.


* Re: [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats
  2025-10-15 19:08 ` [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
  2025-10-15 20:48   ` Shakeel Butt
  2025-10-15 23:12   ` Song Liu
@ 2025-10-16 22:28   ` kernel test robot
  2 siblings, 0 replies; 16+ messages in thread
From: kernel test robot @ 2025-10-16 22:28 UTC (permalink / raw)
  To: JP Kobryn, shakeel.butt, andrii, ast, mkoutny, yosryahmed, hannes,
	tj, akpm
  Cc: oe-kbuild-all, linux-kernel, cgroups, linux-mm, bpf, kernel-team

Hi JP,

kernel test robot noticed the following build warnings:

[auto build test WARNING on bpf-next/net]
[also build test WARNING on bpf-next/master bpf/master akpm-mm/mm-everything linus/master v6.18-rc1 next-20251016]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/JP-Kobryn/memcg-introduce-kfuncs-for-fetching-memcg-stats/20251016-030920
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git net
patch link:    https://lore.kernel.org/r/20251015190813.80163-2-inwardvessel%40gmail.com
patch subject: [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats
config: x86_64-randconfig-121-20251016 (https://download.01.org/0day-ci/archive/20251017/202510170654.s2j4GuCs-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251017/202510170654.s2j4GuCs-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202510170654.s2j4GuCs-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   mm/memcontrol.c:4236:52: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:4236:52: sparse:    struct task_struct [noderef] __rcu *
   mm/memcontrol.c:4236:52: sparse:    struct task_struct *
>> mm/memcontrol.c:876:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct cgroup_subsys_state *css @@     got struct cgroup_subsys_state [noderef] __rcu * @@
   mm/memcontrol.c:876:55: sparse:     expected struct cgroup_subsys_state *css
   mm/memcontrol.c:876:55: sparse:     got struct cgroup_subsys_state [noderef] __rcu *
>> mm/memcontrol.c:876:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct cgroup_subsys_state *css @@     got struct cgroup_subsys_state [noderef] __rcu * @@
   mm/memcontrol.c:876:55: sparse:     expected struct cgroup_subsys_state *css
   mm/memcontrol.c:876:55: sparse:     got struct cgroup_subsys_state [noderef] __rcu *
>> mm/memcontrol.c:876:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct cgroup_subsys_state *css @@     got struct cgroup_subsys_state [noderef] __rcu * @@
   mm/memcontrol.c:876:55: sparse:     expected struct cgroup_subsys_state *css
   mm/memcontrol.c:876:55: sparse:     got struct cgroup_subsys_state [noderef] __rcu *
>> mm/memcontrol.c:876:55: sparse: sparse: incorrect type in argument 1 (different address spaces) @@     expected struct cgroup_subsys_state *css @@     got struct cgroup_subsys_state [noderef] __rcu * @@
   mm/memcontrol.c:876:55: sparse:     expected struct cgroup_subsys_state *css
   mm/memcontrol.c:876:55: sparse:     got struct cgroup_subsys_state [noderef] __rcu *
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:729:9: sparse: sparse: context imbalance in 'folio_lruvec_lock' - wrong count at exit
   include/linux/memcontrol.h:729:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irq' - wrong count at exit
   include/linux/memcontrol.h:729:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irqsave' - wrong count at exit

vim +876 mm/memcontrol.c

   873	
   874	static inline struct mem_cgroup *memcg_from_cgroup(struct cgroup *cgrp)
   875	{
 > 876		return cgrp ? mem_cgroup_from_css(cgrp->subsys[memory_cgrp_id]) : NULL;
   877	}
   878	
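
One way sparse warnings of this kind are commonly addressed (a sketch only; the
actual fix in a later revision may differ) is to make the __rcu access explicit
when dereferencing the subsys pointer:

static inline struct mem_cgroup *memcg_from_cgroup(struct cgroup *cgrp)
{
	struct cgroup_subsys_state *css;

	if (!cgrp)
		return NULL;

	/* cgrp->subsys[] is __rcu-annotated; make the access explicit for sparse. */
	css = rcu_dereference_check(cgrp->subsys[memory_cgrp_id], true);
	return css ? mem_cgroup_from_css(css) : NULL;
}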

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
  2025-10-16 20:26       ` JP Kobryn
@ 2025-10-16 23:00         ` Roman Gushchin
  0 siblings, 0 replies; 16+ messages in thread
From: Roman Gushchin @ 2025-10-16 23:00 UTC (permalink / raw)
  To: JP Kobryn
  Cc: Shakeel Butt, andrii, ast, mkoutny, yosryahmed, hannes, tj, akpm,
	linux-kernel, cgroups, linux-mm, bpf, kernel-team, mhocko,
	muchun.song

JP Kobryn <inwardvessel@gmail.com> writes:

> On 10/15/25 6:10 PM, Roman Gushchin wrote:
>> JP Kobryn <inwardvessel@gmail.com> writes:
>> 
>>> On 10/15/25 1:46 PM, Shakeel Butt wrote:
>>>> Cc memcg maintainers.
>>>> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>>>>> When reading cgroup memory.stat files there is significant kernel overhead
>>>>> in the formatting and encoding of numeric data into a string buffer. Beyond
>>>>> that, the given user mode program must decode this data and possibly
>>>>> perform filtering to obtain the desired stats. This process can be
>>>>> expensive for programs that periodically sample this data over a large
>>>>> enough fleet.
>>>>>
>>>>> As an alternative to reading memory.stat, introduce new kfuncs that allow
>>>>> fetching specific memcg stats from within cgroup iterator based bpf
>>>>> programs. This approach allows for numeric values to be transferred
>>>>> directly from the kernel to user mode via the mapped memory of the bpf
>>>>> program's elf data section. Reading stats this way effectively eliminates
>>>>> the numeric conversion work needed to be performed in both kernel and user
>>>>> mode. It also eliminates the need for filtering in a user mode program.
>>>>> i.e. where reading memory.stat returns all stats, this new approach allows
>>>>> returning only select stats.
>> It seems like I've most of these functions implemented as part of
>> bpfoom: https://lkml.org/lkml/2025/8/18/1403
>> So I definitely find them useful. Would be nice to merge our
>> efforts.
>
> Sounds great. I see in your series that you allow the kfuncs to accept
> integers as item numbers. Would my approach of using typed enums work
> for you? I wanted to take advantage of libbpf CO-RE so that the bpf
> program could gracefully handle cases where a given enumerator is not
> present in a given kernel version. I made use of this in the
> selftests.

Good point, I'm going to change it in the next version, which I'm about
to send out: tomorrow or early next week.

> I'm planning on sending out a v3 so let me know if you would like to see
> any alterations that would align with bpfoom.

I kinda prefer my version regarding taking a memcg argument instead of a cgroup,
and also regarding naming. I also think it's safer to expose the rate-limited
version of the stats flushing function. But my version does lack the node-level
statistics (which I don't need).
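
For reference, the rate-limited variant would look roughly like this (a sketch,
assuming the kfunc keeps taking a cgroup and the current
mem_cgroup_flush_stats_ratelimited() signature):

__bpf_kfunc int memcg_flush_stats(struct cgroup *cgrp)
{
	struct mem_cgroup *memcg = memcg_from_cgroup(cgrp);

	if (!memcg)
		return -EINVAL;

	/* Skips the flush entirely if stats were flushed recently enough. */
	mem_cgroup_flush_stats_ratelimited(memcg);
	return 0;
}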

If it's ok with you, maybe you can rebase your patches on top of my v2
and I can include your patches in the series?

Thanks!


Thread overview: 16+ messages
2025-10-15 19:08 [PATCH v2 0/2] memcg: reading memcg stats more efficiently JP Kobryn
2025-10-15 19:08 ` [PATCH v2 1/2] memcg: introduce kfuncs for fetching memcg stats JP Kobryn
2025-10-15 20:48   ` Shakeel Butt
2025-10-15 23:12   ` Song Liu
2025-10-16  4:18     ` Yonghong Song
2025-10-16 20:28     ` JP Kobryn
2025-10-16 22:28   ` kernel test robot
2025-10-15 19:08 ` [PATCH v2 2/2] memcg: selftests for memcg stat kfuncs JP Kobryn
2025-10-15 23:17   ` Shakeel Butt
2025-10-16  5:04   ` Yonghong Song
2025-10-16 20:45     ` JP Kobryn
2025-10-15 20:46 ` [PATCH v2 0/2] memcg: reading memcg stats more efficiently Shakeel Butt
2025-10-16  0:21   ` JP Kobryn
2025-10-16  1:10     ` Roman Gushchin
2025-10-16 20:26       ` JP Kobryn
2025-10-16 23:00         ` Roman Gushchin
