linux-mm.kvack.org archive mirror
* [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events
@ 2025-10-27 23:21 Roman Gushchin
  2025-10-27 23:21 ` [PATCH v2 12/23] bpf: selftests: selftests for memcg stat kfuncs Roman Gushchin
                   ` (11 more replies)
  0 siblings, 12 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo

From: JP Kobryn <inwardvessel@gmail.com>

Introduce a BPF kfunc to access memory events, e.g. MEMCG_LOW,
MEMCG_MAX, MEMCG_OOM, MEMCG_OOM_KILL, etc.
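
For illustration, a minimal sketch of how the kfunc can be used from a
sleepable cgroup iterator program (modeled on the selftest added in the next
patch; the usual vmlinux.h/bpf_helpers.h includes and BTF-provided kfunc
declarations are assumed):

  SEC("iter.s/cgroup")
  int dump_memcg_events(struct bpf_iter__cgroup *ctx)
  {
  	struct cgroup *cgrp = ctx->cgroup;
  	struct cgroup_subsys_state *css;
  	struct mem_cgroup *memcg;
  	unsigned long oom_kills;

  	if (!cgrp)
  		return 1;

  	css = container_of(cgrp, struct cgroup_subsys_state, cgroup);
  	memcg = bpf_get_mem_cgroup(css);
  	if (!memcg)
  		return 1;

  	/* read the MEMCG_OOM_KILL event counter for this cgroup */
  	oom_kills = bpf_mem_cgroup_memory_events(memcg, MEMCG_OOM_KILL);
  	bpf_printk("memcg oom_kill events: %lu", oom_kills);

  	bpf_put_mem_cgroup(memcg);
  	return 0;
  }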

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 mm/bpf_memcontrol.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 387255b8ab88..458ad022b036 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -99,6 +99,23 @@ __bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
 	return page_counter_read(&memcg->memory);
 }
 
+/**
+ * bpf_mem_cgroup_memory_events - Read memory cgroup's memory event value
+ * @memcg: memory cgroup
+ * @event: memory event id
+ *
+ * Returns current memory event count.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_memory_events(struct mem_cgroup *memcg,
+						enum memcg_memory_event event)
+{
+	if (event >= MEMCG_NR_MEMORY_EVENTS)
+		return (unsigned long)-1;
+
+	return atomic_long_read(&memcg->memory_events[event]);
+}
+
 /**
  * bpf_mem_cgroup_page_state - Read memory cgroup's page state counter
  * @memcg: memory cgroup
@@ -133,6 +150,7 @@ BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
 BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
 
 BTF_ID_FLAGS(func, bpf_mem_cgroup_vm_events, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_memory_events, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_mem_cgroup_usage, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state, KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_TRUSTED_ARGS | KF_SLEEPABLE)
-- 
2.51.0




* [PATCH v2 12/23] bpf: selftests: selftests for memcg stat kfuncs
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
@ 2025-10-27 23:21 ` Roman Gushchin
  2025-10-27 23:21 ` [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo

From: JP Kobryn <inwardvessel@gmail.com>

Add test coverage for the kfuncs that fetch memcg stats. Using some common
stats, the test scenarios ensure that each given stat increases by some
arbitrary amount. The stats selected cover the three categories represented
by the enums node_stat_item, memcg_stat_item and vm_event_item.

Since only a subset of all stats is queried, use a static struct made up
of fields for each stat. Write the fetched values to the struct when the
bpf program is invoked and read the fields in the user mode program for
verification.

Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 .../testing/selftests/bpf/cgroup_iter_memcg.h |  18 ++
 .../bpf/prog_tests/cgroup_iter_memcg.c        | 223 ++++++++++++++++++
 .../selftests/bpf/progs/cgroup_iter_memcg.c   |  42 ++++
 3 files changed, 283 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
 create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c

diff --git a/tools/testing/selftests/bpf/cgroup_iter_memcg.h b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
new file mode 100644
index 000000000000..3f59b127943b
--- /dev/null
+++ b/tools/testing/selftests/bpf/cgroup_iter_memcg.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#ifndef __CGROUP_ITER_MEMCG_H
+#define __CGROUP_ITER_MEMCG_H
+
+struct memcg_query {
+	/* some node_stat_item's */
+	unsigned long nr_anon_mapped;
+	unsigned long nr_shmem;
+	unsigned long nr_file_pages;
+	unsigned long nr_file_mapped;
+	/* some memcg_stat_item */
+	unsigned long memcg_kmem;
+	/* some vm_event_item */
+	unsigned long pgfault;
+};
+
+#endif /* __CGROUP_ITER_MEMCG_H */
diff --git a/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
new file mode 100644
index 000000000000..215e4c98c76f
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
@@ -0,0 +1,223 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <test_progs.h>
+#include <bpf/libbpf.h>
+#include <bpf/btf.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+#include <unistd.h>
+#include "cgroup_helpers.h"
+#include "cgroup_iter_memcg.h"
+#include "cgroup_iter_memcg.skel.h"
+
+static int read_stats(struct bpf_link *link)
+{
+	int fd, ret = 0;
+	ssize_t bytes;
+
+	fd = bpf_iter_create(bpf_link__fd(link));
+	if (!ASSERT_OK_FD(fd, "bpf_iter_create"))
+		return 1;
+
+	/*
+	 * Invoke iter program by reading from its fd. We're not expecting any
+	 * data to be written by the bpf program so the result should be zero.
+	 * Results will be read directly through the custom data section
+	 * accessible through skel->data_query.memcg_query.
+	 */
+	bytes = read(fd, NULL, 0);
+	if (!ASSERT_EQ(bytes, 0, "read fd"))
+		ret = 1;
+
+	close(fd);
+	return ret;
+}
+
+static void test_anon(struct bpf_link *link, struct memcg_query *memcg_query)
+{
+	void *map;
+	size_t len;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	/*
+	 * Increase memcg anon usage by mapping and writing
+	 * to a new anon region.
+	 */
+	map = mmap(NULL, len, PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap anon"))
+		return;
+
+	memset(map, 1, len);
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup;
+
+	ASSERT_GT(memcg_query->nr_anon_mapped, 0, "final anon mapped val");
+
+cleanup:
+	munmap(map, len);
+}
+
+static void test_file(struct bpf_link *link, struct memcg_query *memcg_query)
+{
+	void *map;
+	size_t len;
+	char *path;
+	int fd;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+	path = "/tmp/test_cgroup_iter_memcg";
+
+	/*
+	 * Increase memcg file usage by creating and writing
+	 * to a mapped file.
+	 */
+	fd = open(path, O_CREAT | O_RDWR, 0644);
+	if (!ASSERT_OK_FD(fd, "open fd"))
+		return;
+	if (!ASSERT_OK(ftruncate(fd, len), "ftruncate"))
+		goto cleanup_fd;
+
+	map = mmap(NULL, len, PROT_WRITE, MAP_SHARED, fd, 0);
+	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap file"))
+		goto cleanup_fd;
+
+	memset(map, 1, len);
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup_map;
+
+	ASSERT_GT(memcg_query->nr_file_pages, 0, "final file value");
+	ASSERT_GT(memcg_query->nr_file_mapped, 0, "final file mapped value");
+
+cleanup_map:
+	munmap(map, len);
+cleanup_fd:
+	close(fd);
+	unlink(path);
+}
+
+static void test_shmem(struct bpf_link *link, struct memcg_query *memcg_query)
+{
+	size_t len;
+	int fd;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	/*
+	 * Increase memcg shmem usage by creating and writing
+	 * to a shmem object.
+	 */
+	fd = shm_open("/tmp_shmem", O_CREAT | O_RDWR, 0644);
+	if (!ASSERT_OK_FD(fd, "shm_open"))
+		return;
+
+	if (!ASSERT_OK(fallocate(fd, 0, 0, len), "fallocate"))
+		goto cleanup;
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup;
+
+	ASSERT_GT(memcg_query->nr_shmem, 0, "final shmem value");
+
+cleanup:
+	close(fd);
+	shm_unlink("/tmp_shmem");
+}
+
+#define NR_PIPES 64
+static void test_kmem(struct bpf_link *link, struct memcg_query *memcg_query)
+{
+	int fds[NR_PIPES][2], i;
+
+	/*
+	 * Increase kmem value by creating pipes which will allocate some
+	 * kernel buffers.
+	 */
+	for (i = 0; i < NR_PIPES; i++) {
+		if (!ASSERT_OK(pipe(fds[i]), "pipe"))
+			goto cleanup;
+	}
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup;
+
+	ASSERT_GT(memcg_query->memcg_kmem, 0, "kmem value");
+
+cleanup:
+	for (i = 0; i < NR_PIPES; i++) {
+		close(fds[i][0]);
+		close(fds[i][1]);
+	}
+}
+
+static void test_pgfault(struct bpf_link *link, struct memcg_query *memcg_query)
+{
+	void *map;
+	size_t len;
+
+	len = sysconf(_SC_PAGESIZE) * 1024;
+
+	/* Create region to use for triggering a page fault. */
+	map = mmap(NULL, len, PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
+	if (!ASSERT_NEQ(map, MAP_FAILED, "mmap anon"))
+		return;
+
+	/* Trigger page fault. */
+	memset(map, 1, len);
+
+	if (!ASSERT_OK(read_stats(link), "read stats"))
+		goto cleanup;
+
+	ASSERT_GT(memcg_query->pgfault, 0, "final pgfault val");
+
+cleanup:
+	munmap(map, len);
+}
+
+void test_cgroup_iter_memcg(void)
+{
+	char *cgroup_rel_path = "/cgroup_iter_memcg_test";
+	struct cgroup_iter_memcg *skel;
+	struct bpf_link *link;
+	int cgroup_fd;
+
+	cgroup_fd = cgroup_setup_and_join(cgroup_rel_path);
+	if (!ASSERT_OK_FD(cgroup_fd, "cgroup_setup_and_join"))
+		return;
+
+	skel = cgroup_iter_memcg__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "cgroup_iter_memcg__open_and_load"))
+		goto cleanup_cgroup_fd;
+
+	DECLARE_LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+	union bpf_iter_link_info linfo = {
+		.cgroup.cgroup_fd = cgroup_fd,
+		.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY,
+	};
+	opts.link_info = &linfo;
+	opts.link_info_len = sizeof(linfo);
+
+	link = bpf_program__attach_iter(skel->progs.cgroup_memcg_query, &opts);
+	if (!ASSERT_OK_PTR(link, "bpf_program__attach_iter"))
+		goto cleanup_skel;
+
+	if (test__start_subtest("cgroup_iter_memcg__anon"))
+		test_anon(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__shmem"))
+		test_shmem(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__file"))
+		test_file(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__kmem"))
+		test_kmem(link, &skel->data_query->memcg_query);
+	if (test__start_subtest("cgroup_iter_memcg__pgfault"))
+		test_pgfault(link, &skel->data_query->memcg_query);
+
+	bpf_link__destroy(link);
+cleanup_skel:
+	cgroup_iter_memcg__destroy(skel);
+cleanup_cgroup_fd:
+	close(cgroup_fd);
+	cleanup_cgroup_environment();
+}
diff --git a/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
new file mode 100644
index 000000000000..92db5fd11391
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2025 Meta Platforms, Inc. and affiliates. */
+#include <vmlinux.h>
+#include <bpf/bpf_core_read.h>
+#include "cgroup_iter_memcg.h"
+
+char _license[] SEC("license") = "GPL";
+
+/* The latest values read are stored here. */
+struct memcg_query memcg_query SEC(".data.query");
+
+SEC("iter.s/cgroup")
+int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
+{
+	struct cgroup *cgrp = ctx->cgroup;
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg;
+
+	if (!cgrp)
+		return 1;
+
+	css = container_of(cgrp, struct cgroup_subsys_state, cgroup);
+	if (!css)
+		return 1;
+
+	memcg = bpf_get_mem_cgroup(css);
+	if (!memcg)
+		return 1;
+
+	bpf_mem_cgroup_flush_stats(memcg);
+
+	memcg_query.nr_anon_mapped = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED);
+	memcg_query.nr_shmem = bpf_mem_cgroup_page_state(memcg, NR_SHMEM);
+	memcg_query.nr_file_pages = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
+	memcg_query.nr_file_mapped = bpf_mem_cgroup_page_state(memcg, NR_FILE_MAPPED);
+	memcg_query.memcg_kmem = bpf_mem_cgroup_page_state(memcg, MEMCG_KMEM);
+	memcg_query.pgfault = bpf_mem_cgroup_vm_events(memcg, PGFAULT);
+
+	bpf_put_mem_cgroup(memcg);
+
+	return 0;
+}
-- 
2.51.0




* [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
  2025-10-27 23:21 ` [PATCH v2 12/23] bpf: selftests: selftests for memcg stat kfuncs Roman Gushchin
@ 2025-10-27 23:21 ` Roman Gushchin
  2025-10-27 23:57   ` bot+bpf-ci
  2025-11-10  9:46   ` Michal Hocko
  2025-10-27 23:21 ` [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers Roman Gushchin
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Introduce the bpf_out_of_memory() BPF kfunc, which allows a BPF program
to declare an out-of-memory event and trigger the corresponding kernel
OOM handling mechanism.

It takes a trusted memcg pointer (or NULL for system-wide OOMs)
as an argument, as well as the page order.

If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
can be declared and handled in the system at once, so if the function
is called in parallel with another OOM handling, it bails out with -EBUSY.
This mode is suited for global OOMs: any concurrent OOM will likely
do the job and release some memory. In the blocking mode (which is
suited for memcg OOMs) the execution waits on the oom_lock mutex.

The function is declared as sleepable, which guarantees that it won't
be called from an atomic context. This is required by the OOM handling
code, which shouldn't be called from a non-blocking context.

Handling of a memcg OOM almost always requires taking the css_set_lock
spinlock. The fact that bpf_out_of_memory() is sleepable also guarantees
that it can't be called with css_set_lock acquired, so the kernel can't
deadlock on it.

Please note that this function is not accessible yet.
Calling bpf_out_of_memory() from a random context is dangerous
because e.g. it's easy to deadlock the system on oom_lock.
The following commit in the series provides one safe context
where this kfunc can be used.
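
As a sketch (not part of this patch), a sleepable BPF handler holding a
trusted memcg pointer could declare a memcg OOM like this:

  /* wait on oom_lock, as advised for memcg-scoped OOMs */
  ret = bpf_out_of_memory(memcg, 0, BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK);
  if (ret < 0) {
  	/* -EBUSY, -EINTR etc.: no forward progress was made */
  } else if (ret == 1) {
  	/* the OOM killer freed some memory */
  }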

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/oom.h |  5 ++++
 mm/oom_kill.c       | 63 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 721087952d04..3cbdcd013274 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -21,6 +21,11 @@ enum oom_constraint {
 	CONSTRAINT_MEMCG,
 };
 
+enum bpf_oom_flags {
+	BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK = 1 << 0,
+	BPF_OOM_FLAGS_LAST = 1 << 1,
+};
+
 /*
  * Details of the page allocation that triggered the oom killer that are used to
  * determine what should be killed.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 3c86cd755371..d7fca4bf575b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1330,15 +1330,78 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
 	return 0;
 }
 
+/**
+ * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
+ * @memcg__nullable: memcg or NULL for system-wide OOMs
+ * @order: order of page which wasn't allocated
+ * @flags: flags
+ * @constraint_text__nullable: custom constraint description for the OOM report
+ *
+ * Declares the Out Of Memory state and invokes the OOM killer.
+ *
+ * OOM handlers are synchronized using the oom_lock mutex. If the
+ * BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is set, the function will wait on it.
+ * Otherwise it bails out with -EBUSY if oom_lock is contended.
+ *
+ * Generally it's advised to not set BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK for
+ * global OOMs and to set it for memcg-scoped OOMs.
+ *
+ * Returns 1 if forward progress was achieved and some memory was freed.
+ * Returns a negative value if an error occurred.
+ */
+__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
+				  int order, u64 flags)
+{
+	struct oom_control oc = {
+		.memcg = memcg__nullable,
+		.order = order,
+	};
+	int ret;
+
+	if (flags & ~(BPF_OOM_FLAGS_LAST - 1))
+		return -EINVAL;
+
+	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
+		return -EINVAL;
+
+	if (flags & BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK) {
+		ret = mutex_lock_killable(&oom_lock);
+		if (ret)
+			return ret;
+	} else if (!mutex_trylock(&oom_lock))
+		return -EBUSY;
+
+	ret = out_of_memory(&oc);
+
+	mutex_unlock(&oom_lock);
+	return ret;
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_oom_kfuncs)
 
+BTF_SET_START(bpf_oom_declare_oom_kfuncs)
+BTF_ID(func, bpf_out_of_memory)
+BTF_SET_END(bpf_oom_declare_oom_kfuncs)
+
+extern struct bpf_struct_ops bpf_psi_bpf_ops;
+
+static int bpf_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	if (!btf_id_set_contains(&bpf_oom_declare_oom_kfuncs, kfunc_id))
+		return 0;
+
+	return -EACCES;
+}
+
 static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
 	.owner          = THIS_MODULE,
 	.set            = &bpf_oom_kfuncs,
+	.filter         = bpf_oom_kfunc_filter,
 };
 
 static int __init bpf_oom_init(void)
-- 
2.51.0




* [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
  2025-10-27 23:21 ` [PATCH v2 12/23] bpf: selftests: selftests for memcg stat kfuncs Roman Gushchin
  2025-10-27 23:21 ` [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
@ 2025-10-27 23:21 ` Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
  2025-11-10  9:31   ` Michal Hocko
  2025-10-27 23:21 ` [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Currently there is a hard-coded list of possible oom constraints:
NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF.
Also, add the ability to specify a custom constraint name
when calling bpf_out_of_memory(). If no custom name is provided,
CONSTRAINT_BPF is displayed.

The resulting output in dmesg will look like this:

[  315.224875] kworker/u17:0 invoked oom-killer: gfp_mask=0x0(), order=0, oom_score_adj=0
               oom_policy=default
[  315.226532] CPU: 1 UID: 0 PID: 74 Comm: kworker/u17:0 Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[  315.226534] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[  315.226536] Workqueue: bpf_psi_wq bpf_psi_handle_event_fn
[  315.226542] Call Trace:
[  315.226545]  <TASK>
[  315.226548]  dump_stack_lvl+0x4d/0x70
[  315.226555]  dump_header+0x59/0x1c6
[  315.226561]  oom_kill_process.cold+0x8/0xef
[  315.226565]  out_of_memory+0x111/0x5c0
[  315.226577]  bpf_out_of_memory+0x6f/0xd0
[  315.226580]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226589]  bpf_prog_3018b0cf55d2c6bb_handle_psi_event+0x5d/0x76
[  315.226594]  bpf__bpf_psi_ops_handle_psi_event+0x47/0xa7
[  315.226599]  bpf_psi_handle_event_fn+0x63/0xb0
[  315.226604]  process_one_work+0x1fc/0x580
[  315.226616]  ? srso_alias_return_thunk+0x5/0xfbef5
[  315.226624]  worker_thread+0x1d9/0x3b0
[  315.226629]  ? __pfx_worker_thread+0x10/0x10
[  315.226632]  kthread+0x128/0x270
[  315.226637]  ? lock_release+0xd4/0x2d0
[  315.226645]  ? __pfx_kthread+0x10/0x10
[  315.226649]  ret_from_fork+0x81/0xd0
[  315.226652]  ? __pfx_kthread+0x10/0x10
[  315.226655]  ret_from_fork_asm+0x1a/0x30
[  315.226667]  </TASK>
[  315.239745] memory: usage 42240kB, limit 9007199254740988kB, failcnt 0
[  315.240231] swap: usage 0kB, limit 0kB, failcnt 0
[  315.240585] Memory cgroup stats for /cgroup-test-work-dir673/oom_test/cg2:
[  315.240603] anon 42897408
[  315.241317] file 0
[  315.241493] kernel 98304
...
[  315.255946] Tasks state (memory values in pages):
[  315.256292] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[  315.257107] [    675]     0   675   162013    10969    10712      257         0   155648        0             0 test_progs
[  315.257927] oom-kill:constraint=CONSTRAINT_BPF_PSI_MEM,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/cgroup-test-work-dir673/oom_test/cg2,task_memcg=/cgroup-test-work-dir673/oom_test/cg2,task=test_progs,pid=675,uid=0
[  315.259371] Memory cgroup out of memory: Killed process 675 (test_progs) total-vm:648052kB, anon-rss:42848kB, file-rss:1028kB, shmem-rss:0kB, UID:0 pgtables:152kB oom_score_adj:0
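
The report above corresponds to a call of roughly this shape (a sketch; the
constraint string is whatever the BPF program passes):

  bpf_out_of_memory(memcg, 0, BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK,
  		    "CONSTRAINT_BPF_PSI_MEM");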

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/oom.h |  4 ++++
 mm/oom_kill.c       | 38 +++++++++++++++++++++++++++++---------
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 3cbdcd013274..704fc0e786c6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -19,6 +19,7 @@ enum oom_constraint {
 	CONSTRAINT_CPUSET,
 	CONSTRAINT_MEMORY_POLICY,
 	CONSTRAINT_MEMCG,
+	CONSTRAINT_BPF,
 };
 
 enum bpf_oom_flags {
@@ -63,6 +64,9 @@ struct oom_control {
 
 	/* Policy name */
 	const char *bpf_policy_name;
+
+	/* BPF-specific constraint name */
+	const char *bpf_constraint;
 #endif
 };
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7fca4bf575b..72a346261c79 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -240,13 +240,6 @@ long oom_badness(struct task_struct *p, unsigned long totalpages)
 	return points;
 }
 
-static const char * const oom_constraint_text[] = {
-	[CONSTRAINT_NONE] = "CONSTRAINT_NONE",
-	[CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET",
-	[CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY",
-	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
-};
-
 static const char *oom_policy_name(struct oom_control *oc)
 {
 #ifdef CONFIG_BPF_SYSCALL
@@ -256,6 +249,27 @@ static const char *oom_policy_name(struct oom_control *oc)
 	return "default";
 }
 
+static const char *oom_constraint_text(struct oom_control *oc)
+{
+	switch (oc->constraint) {
+	case CONSTRAINT_NONE:
+		return "CONSTRAINT_NONE";
+	case CONSTRAINT_CPUSET:
+		return "CONSTRAINT_CPUSET";
+	case CONSTRAINT_MEMORY_POLICY:
+		return "CONSTRAINT_MEMORY_POLICY";
+	case CONSTRAINT_MEMCG:
+		return "CONSTRAINT_MEMCG";
+#ifdef CONFIG_BPF_SYSCALL
+	case CONSTRAINT_BPF:
+		return oc->bpf_constraint ? : "CONSTRAINT_BPF";
+#endif
+	default:
+		WARN_ON_ONCE(1);
+		return "";
+	}
+}
+
 /*
  * Determine the type of allocation constraint.
  */
@@ -267,6 +281,9 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	bool cpuset_limited = false;
 	int nid;
 
+	if (oc->constraint == CONSTRAINT_BPF)
+		return CONSTRAINT_BPF;
+
 	if (is_memcg_oom(oc)) {
 		oc->totalpages = mem_cgroup_get_max(oc->memcg) ?: 1;
 		return CONSTRAINT_MEMCG;
@@ -458,7 +475,7 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
 {
 	/* one line summary of the oom killer context. */
 	pr_info("oom-kill:constraint=%s,nodemask=%*pbl",
-			oom_constraint_text[oc->constraint],
+			oom_constraint_text(oc),
 			nodemask_pr_args(oc->nodemask));
 	cpuset_print_current_mems_allowed();
 	mem_cgroup_print_oom_context(oc->memcg, victim);
@@ -1350,11 +1367,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
  * Returns a negative value if an error occurred.
  */
 __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
-				  int order, u64 flags)
+				  int order, u64 flags,
+				  const char *constraint_text__nullable)
 {
 	struct oom_control oc = {
 		.memcg = memcg__nullable,
 		.order = order,
+		.constraint = CONSTRAINT_BPF,
+		.bpf_constraint = constraint_text__nullable,
 	};
 	int ret;
 
-- 
2.51.0




* [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (2 preceding siblings ...)
  2025-10-27 23:21 ` [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers Roman Gushchin
@ 2025-10-27 23:21 ` Roman Gushchin
  2025-10-28 17:32   ` Tejun Heo
  2025-10-27 23:21 ` [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Export the tsk_is_oom_victim() helper as a BPF kfunc.
It's useful for avoiding redundant OOM kills.
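
A typical use in an OOM handler (mirroring the selftest later in the series):

  /* if a previously selected victim is still exiting, don't kill again */
  if (bpf_task_is_oom_victim(task))
  	return 1;	/* memory should be freed soon */

  bpf_oom_kill_process(oc, task, "bpf oom policy");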

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/oom_kill.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 72a346261c79..90bb86dee3cf 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1397,11 +1397,25 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
 	return ret;
 }
 
+/**
+ * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
+ * @task: task to check
+ *
+ * Returns true if the task has been previously selected by the OOM killer
+ * to be killed. It's expected that the task will be destroyed soon and some
+ * memory will be freed, so additional actions may not be required.
+ */
+__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
+{
+	return tsk_is_oom_victim(task);
+}
+
 __bpf_kfunc_end_defs();
 
 BTF_KFUNCS_START(bpf_oom_kfuncs)
 BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
 BTF_ID_FLAGS(func, bpf_out_of_memory, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_task_is_oom_victim, KF_TRUSTED_ARGS)
 BTF_KFUNCS_END(bpf_oom_kfuncs)
 
 BTF_SET_START(bpf_oom_declare_oom_kfuncs)
-- 
2.51.0




* [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts()
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (3 preceding siblings ...)
  2025-10-27 23:21 ` [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
@ 2025-10-27 23:21 ` Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
  2025-10-27 23:22 ` [PATCH v2 17/23] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Introduce bpf_map__attach_struct_ops_opts(), an extended version of
bpf_map__attach_struct_ops(), which takes an additional struct
bpf_struct_ops_opts argument.

struct bpf_struct_ops_opts has a relative_fd member, which allows
passing an additional file descriptor argument. It can be used to
attach struct_ops maps to cgroups.
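
Usage mirrors the OOM selftest added later in the series:

  DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);

  /* attach the struct_ops map to a specific cgroup */
  opts.relative_fd = cgroup_fd;
  link = bpf_map__attach_struct_ops_opts(skel->maps.test_bpf_oom, &opts);

  /* relative_fd == 0 attaches the map system-wide */
  opts.relative_fd = 0;
  link = bpf_map__attach_struct_ops_opts(skel->maps.test_bpf_oom, &opts);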

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/lib/bpf/bpf.c      |  8 ++++++++
 tools/lib/bpf/libbpf.c   | 18 ++++++++++++++++--
 tools/lib/bpf/libbpf.h   | 14 ++++++++++++++
 tools/lib/bpf/libbpf.map |  1 +
 4 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 339b19797237..4c8944f8d6ba 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -883,6 +883,14 @@ int bpf_link_create(int prog_fd, int target_fd,
 		if (!OPTS_ZEROED(opts, cgroup))
 			return libbpf_err(-EINVAL);
 		break;
+	case BPF_STRUCT_OPS:
+		relative_fd = OPTS_GET(opts, cgroup.relative_fd, 0);
+		attr.link_create.cgroup.relative_fd = relative_fd;
+		attr.link_create.cgroup.expected_revision =
+			OPTS_GET(opts, cgroup.expected_revision, 0);
+		if (!OPTS_ZEROED(opts, cgroup))
+			return libbpf_err(-EINVAL);
+		break;
 	default:
 		if (!OPTS_ZEROED(opts, flags))
 			return libbpf_err(-EINVAL);
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index b90574f39d1c..be56a5dee505 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -13196,12 +13196,19 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
 	return close(link->fd);
 }
 
-struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+						 const struct bpf_struct_ops_opts *opts)
 {
+	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
 	struct bpf_link_struct_ops *link;
 	__u32 zero = 0;
 	int err, fd;
 
+	if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
+		pr_warn("map '%s': invalid opts\n", map->name);
+		return libbpf_err_ptr(-EINVAL);
+	}
+
 	if (!bpf_map__is_struct_ops(map)) {
 		pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
 		return libbpf_err_ptr(-EINVAL);
@@ -13237,7 +13244,9 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 		return &link->link;
 	}
 
-	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
+	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
+
+	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);
 	if (fd < 0) {
 		free(link);
 		return libbpf_err_ptr(fd);
@@ -13249,6 +13258,11 @@ struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
 	return &link->link;
 }
 
+struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
+{
+	return bpf_map__attach_struct_ops_opts(map, NULL);
+}
+
 /*
  * Swap the back struct_ops of a link with a new struct_ops map.
  */
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 5118d0a90e24..dc84898715cf 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -922,6 +922,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
 struct bpf_map;
 
 LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
+
+struct bpf_struct_ops_opts {
+	/* size of this struct, for forward/backward compatibility */
+	size_t sz;
+	__u32 flags;
+	__u32 relative_fd;
+	__u64 expected_revision;
+	size_t :0;
+};
+#define bpf_struct_ops_opts__last_field expected_revision
+
+LIBBPF_API struct bpf_link *
+bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
+				const struct bpf_struct_ops_opts *opts);
 LIBBPF_API int bpf_link__update_map(struct bpf_link *link, const struct bpf_map *map);
 
 struct bpf_iter_attach_opts {
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 8ed8749907d4..bc00089343ce 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -451,4 +451,5 @@ LIBBPF_1.7.0 {
 	global:
 		bpf_map__set_exclusive_program;
 		bpf_map__exclusive_program;
+		bpf_map__attach_struct_ops_opts;
 } LIBBPF_1.6.0;
-- 
2.51.0




* [PATCH v2 17/23] bpf: selftests: introduce read_cgroup_file() helper
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (4 preceding siblings ...)
  2025-10-27 23:21 ` [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
@ 2025-10-27 23:22 ` Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
  2025-10-27 23:22 ` [PATCH v2 18/23] bpf: selftests: BPF OOM handler test Roman Gushchin
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Implement the read_cgroup_file() helper to read from cgroup control files,
e.g. statistics.
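
For example, the OOM selftest added later in the series uses it to check
memcg OOM statistics (a sketch; the cgroup path is the one the test creates):

  char buf[4096] = {};
  size_t size;

  size = read_cgroup_file("/oom_test/cg3/cg4", "memory.events",
  			  buf, sizeof(buf));
  if (size > 0 && strstr(buf, "oom_kill 1"))
  	printf("exactly one task was oom-killed in this cgroup\n");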

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 tools/testing/selftests/bpf/cgroup_helpers.c | 39 ++++++++++++++++++++
 tools/testing/selftests/bpf/cgroup_helpers.h |  2 +
 2 files changed, 41 insertions(+)

diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
index 20cede4db3ce..8fb02fe4c4aa 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.c
+++ b/tools/testing/selftests/bpf/cgroup_helpers.c
@@ -126,6 +126,45 @@ int enable_controllers(const char *relative_path, const char *controllers)
 	return __enable_controllers(cgroup_path, controllers);
 }
 
+static size_t __read_cgroup_file(const char *cgroup_path, const char *file,
+				 char *buf, size_t size)
+{
+	char file_path[PATH_MAX + 1];
+	size_t ret;
+	int fd;
+
+	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
+	fd = open(file_path, O_RDONLY);
+	if (fd < 0) {
+		log_err("Opening %s", file_path);
+		return -1;
+	}
+
+	ret = read(fd, buf, size);
+	close(fd);
+	return ret;
+}
+
+/**
+ * read_cgroup_file() - Read from a cgroup file
+ * @relative_path: The cgroup path, relative to the workdir
+ * @file: The name of the file in cgroupfs to read from
+ * @buf: Buffer to store the data read from the file
+ * @size: Size of the buffer
+ *
+ * Read from a file in the given cgroup's directory.
+ *
+ * If successful, the number of bytes read is returned.
+ */
+size_t read_cgroup_file(const char *relative_path, const char *file,
+			char *buf, size_t size)
+{
+	char cgroup_path[PATH_MAX - 24];
+
+	format_cgroup_path(cgroup_path, relative_path);
+	return __read_cgroup_file(cgroup_path, file, buf, size);
+}
+
 static int __write_cgroup_file(const char *cgroup_path, const char *file,
 			       const char *buf)
 {
diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
index 3857304be874..9f9bb6b5d992 100644
--- a/tools/testing/selftests/bpf/cgroup_helpers.h
+++ b/tools/testing/selftests/bpf/cgroup_helpers.h
@@ -11,6 +11,8 @@
 
 /* cgroupv2 related */
 int enable_controllers(const char *relative_path, const char *controllers);
+size_t read_cgroup_file(const char *relative_path, const char *file,
+			char *buf, size_t size);
 int write_cgroup_file(const char *relative_path, const char *file,
 		      const char *buf);
 int write_cgroup_file_parent(const char *relative_path, const char *file,
-- 
2.51.0




* [PATCH v2 18/23] bpf: selftests: BPF OOM handler test
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (5 preceding siblings ...)
  2025-10-27 23:22 ` [PATCH v2 17/23] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
@ 2025-10-27 23:22 ` Roman Gushchin
  2025-10-27 23:22 ` [PATCH v2 19/23] sched: psi: refactor psi_trigger_create() Roman Gushchin
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Implement a kselftest for the OOM handling functionality.

The OOM handling policy implemented in BPF is to kill all tasks
belonging to the biggest leaf cgroup that doesn't contain unkillable
tasks (tasks with oom_score_adj set to -1000). Pagecache size is
excluded from the accounting.

The test creates a hierarchy of memory cgroups, causes an
OOM at the top level, checks that the expected process will be
killed and checks memcg's oom statistics.
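
The hierarchy used by the test (from the cgroups[] table in the patch):

  /oom_test            memory.max = 80M
    /cg1               allocates 10M, oom_score_adj = 1000
    /cg2               allocates 40M, oom_score_adj = -1000 (unkillable)
    /cg3
      /cg4             allocates 30M  <- expected victim (largest killable leaf)
      /cg5             allocates 20M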

Please note that the same BPF OOM policy is attached both to a memory
cgroup and system-wide. In the first case the program does nothing
and returns false, so it's executed the second time, when it properly
handles the OOM.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 .../selftests/bpf/prog_tests/test_oom.c       | 249 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_oom.c  | 118 +++++++++
 2 files changed, 367 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_oom.c b/tools/testing/selftests/bpf/prog_tests/test_oom.c
new file mode 100644
index 000000000000..6126d961aba3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_oom.c
@@ -0,0 +1,249 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "test_oom.skel.h"
+
+struct cgroup_desc {
+	const char *path;
+	int fd;
+	unsigned long long id;
+	int pid;
+	size_t target;
+	size_t max;
+	int oom_score_adj;
+	bool victim;
+};
+
+#define MB (1024 * 1024)
+#define OOM_SCORE_ADJ_MIN	(-1000)
+#define OOM_SCORE_ADJ_MAX	1000
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/oom_test", .max = 80 * MB},
+	{ .path = "/oom_test/cg1", .target = 10 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MAX },
+	{ .path = "/oom_test/cg2", .target = 40 * MB,
+	  .oom_score_adj = OOM_SCORE_ADJ_MIN },
+	{ .path = "/oom_test/cg3" },
+	{ .path = "/oom_test/cg3/cg4", .target = 30 * MB,
+	  .victim = true },
+	{ .path = "/oom_test/cg3/cg5", .target = 20 * MB },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	if (desc->oom_score_adj) {
+		char buf[64];
+		int fd = open("/proc/self/oom_score_adj", O_WRONLY);
+
+		if (fd < 0)
+			return -1;
+
+		snprintf(buf, sizeof(buf), "%d", desc->oom_score_adj);
+		write(fd, buf, sizeof(buf));
+		close(fd);
+	}
+
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		/* Freeze the top-level cgroup */
+		if (i == 0) {
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+		}
+
+		/* Recursively enable the memory controller */
+		if (!cgroups[i].target) {
+
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		/* Set memory.max */
+		if (cgroups[i].max) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].max);
+			err = write_cgroup_file(cgroups[i].path, "memory.max", buf);
+			if (!ASSERT_OK(err, "set memory.max"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+
+		}
+
+		/* Spawn tasks creating memory pressure */
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	size_t size;
+
+	/* Unfreeze the top-level cgroup */
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "freeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+
+		first = false;
+
+		/* Check which process was terminated first */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			/* Check the memcg oom counter */
+			size = read_cgroup_file(cgroups[i].path,
+						"memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		/* Kill all remaining tasks */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_oom(void)
+{
+	DECLARE_LIBBPF_OPTS(bpf_struct_ops_opts, opts);
+	struct test_oom *skel;
+	struct bpf_link *link1, *link2;
+	int err = 0;
+
+	setup_environment();
+
+	skel = test_oom__open_and_load();
+	if (!skel) {
+		err = -errno;
+		CHECK_FAIL(err);
+		goto cleanup;
+	}
+
+	opts.relative_fd = cgroups[0].fd;
+	link1 = bpf_map__attach_struct_ops_opts(skel->maps.test_bpf_oom, &opts);
+	if (!link1) {
+		err = -errno;
+		CHECK_FAIL(err);
+		goto cleanup;
+	}
+
+	opts.relative_fd = 0; /* attach system-wide */
+	link2 = bpf_map__attach_struct_ops_opts(skel->maps.test_bpf_oom, &opts);
+	if (!link2) {
+		err = -errno;
+		CHECK_FAIL(err);
+		goto cleanup;
+	}
+
+	/* Unfreeze all child tasks and create the memory pressure */
+	err = run_and_wait_for_oom();
+	CHECK_FAIL(err);
+
+cleanup:
+	cleanup_cgroup_environment();
+	test_oom__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_oom.c b/tools/testing/selftests/bpf/progs/test_oom.c
new file mode 100644
index 000000000000..ded5c0375df7
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_oom.c
@@ -0,0 +1,118 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define OOM_SCORE_ADJ_MIN	(-1000)
+
+static bool mem_cgroup_killable(struct mem_cgroup *memcg)
+{
+	struct task_struct *task;
+	bool ret = true;
+
+	bpf_for_each(css_task, task, &memcg->css, CSS_TASK_ITER_PROCS)
+		if (task->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+			return false;
+
+	return ret;
+}
+
+/*
+ * Find the largest leaf cgroup (ignoring page cache) without unkillable tasks
+ * and kill all belonging tasks.
+ */
+SEC("struct_ops.s/handle_out_of_memory")
+int BPF_PROG(test_out_of_memory, struct bpf_oom_ctx *exec_ctx, struct oom_control *oc)
+{
+	struct task_struct *task;
+	struct mem_cgroup *root_memcg = oc->memcg;
+	struct mem_cgroup *memcg, *victim = NULL;
+	struct cgroup_subsys_state *css_pos;
+	unsigned long usage, max_usage = 0;
+	unsigned long pagecache = 0;
+	int ret = 0;
+
+	/* Pass to the system-level bpf_oom ops */
+	if (exec_ctx->cgroup_id)
+		return 0;
+
+	if (root_memcg)
+		root_memcg = bpf_get_mem_cgroup(&root_memcg->css);
+	else
+		root_memcg = bpf_get_root_mem_cgroup();
+
+	if (!root_memcg)
+		return 0;
+
+	bpf_rcu_read_lock();
+	bpf_for_each(css, css_pos, &root_memcg->css, BPF_CGROUP_ITER_DESCENDANTS_POST) {
+		if (css_pos->cgroup->nr_descendants + css_pos->cgroup->nr_dying_descendants)
+			continue;
+
+		memcg = bpf_get_mem_cgroup(css_pos);
+		if (!memcg)
+			continue;
+
+		usage = bpf_mem_cgroup_usage(memcg);
+		pagecache = bpf_mem_cgroup_page_state(memcg, NR_FILE_PAGES);
+
+		if (usage > pagecache)
+			usage -= pagecache;
+		else
+			usage = 0;
+
+		if ((usage > max_usage) && mem_cgroup_killable(memcg)) {
+			max_usage = usage;
+			if (victim)
+				bpf_put_mem_cgroup(victim);
+			victim = bpf_get_mem_cgroup(&memcg->css);
+		}
+
+		bpf_put_mem_cgroup(memcg);
+	}
+	bpf_rcu_read_unlock();
+
+	if (!victim)
+		goto exit;
+
+	bpf_for_each(css_task, task, &victim->css, CSS_TASK_ITER_PROCS) {
+		struct task_struct *t = bpf_task_acquire(task);
+
+		if (t) {
+			/*
+			 * If the task is already an OOM victim, it will
+			 * quit soon and release some memory.
+			 */
+			if (bpf_task_is_oom_victim(task)) {
+				bpf_task_release(t);
+				ret = 1;
+				break;
+			}
+
+			bpf_oom_kill_process(oc, task, "bpf oom test");
+			bpf_task_release(t);
+			ret = 1;
+		}
+	}
+
+	bpf_put_mem_cgroup(victim);
+exit:
+	bpf_put_mem_cgroup(root_memcg);
+
+	return ret;
+}
+
+SEC("struct_ops.s/handle_cgroup_offline")
+int BPF_PROG(test_cgroup_offline, struct bpf_oom_ctx *exec_ctx, u64 cgroup_id)
+{
+	return 0;
+}
+
+SEC(".struct_ops.link")
+struct bpf_oom_ops test_bpf_oom = {
+	.name = "bpf_test_policy",
+	.handle_out_of_memory = (void *)test_out_of_memory,
+	.handle_cgroup_offline = (void *)test_cgroup_offline,
+};
-- 
2.51.0




* [PATCH v2 19/23] sched: psi: refactor psi_trigger_create()
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (6 preceding siblings ...)
  2025-10-27 23:22 ` [PATCH v2 18/23] bpf: selftests: BPF OOM handler test Roman Gushchin
@ 2025-10-27 23:22 ` Roman Gushchin
  2025-10-27 23:22 ` [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops Roman Gushchin
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Currently psi_trigger_create() does a lot of things:
it parses the user text input, allocates and initializes
the psi_trigger structure and turns on the trigger.
It does this slightly differently for the two existing
types of psi_triggers: system-wide and cgroup-wide.

In order to support a new type of PSI trigger, which
will be owned by a BPF program and won't have a user's
text description, let's refactor psi_trigger_create().

1. Introduce psi_trigger_type enum:
   currently PSI_SYSTEM and PSI_CGROUP are valid values.
2. Introduce psi_trigger_params structure to avoid passing
   a large number of parameters to psi_trigger_create().
3. Move out the user's input parsing into the new
   psi_trigger_parse() helper.
4. Move out the capabilities check into the new
   psi_file_privileged() helper.
5. Stop relying on t->of for detecting trigger type.

This commit is a pure refactoring and doesn't bring any
functional changes.
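
After the refactoring, a cgroup-side caller follows this pattern (taken from
the pressure_write() changes below):

  struct psi_trigger_params params;
  int err;

  err = psi_trigger_parse(&params, buf);	/* e.g. "some 150000 1000000" */
  if (err)
  	return err;

  params.type = PSI_CGROUP;
  params.res = res;
  params.privileged = psi_file_privileged(of->file);
  params.of = of;

  new = psi_trigger_create(psi, &params);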

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/psi.h       | 15 +++++--
 include/linux/psi_types.h | 33 ++++++++++++++-
 kernel/cgroup/cgroup.c    | 14 ++++++-
 kernel/sched/psi.c        | 88 +++++++++++++++++++++++++--------------
 4 files changed, 113 insertions(+), 37 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index e0745873e3f2..8178e998d94b 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -23,14 +23,23 @@ void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);
 
 int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
-struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
-				       enum psi_res res, struct file *file,
-				       struct kernfs_open_file *of);
+int psi_trigger_parse(struct psi_trigger_params *params, const char *buf);
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+				const struct psi_trigger_params *param);
 void psi_trigger_destroy(struct psi_trigger *t);
 
 __poll_t psi_trigger_poll(void **trigger_ptr, struct file *file,
 			poll_table *wait);
 
+static inline bool psi_file_privileged(struct file *file)
+{
+	/*
+	 * Checking the privilege here on file->f_cred implies that a privileged user
+	 * could open the file and delegate the write to an unprivileged one.
+	 */
+	return cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
+}
+
 #ifdef CONFIG_CGROUPS
 static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
 {
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index dd10c22299ab..aa5ed39592cb 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -119,7 +119,38 @@ struct psi_window {
 	u64 prev_growth;
 };
 
+enum psi_trigger_type {
+	PSI_SYSTEM,
+	PSI_CGROUP,
+};
+
+struct psi_trigger_params {
+	/* Trigger type */
+	enum psi_trigger_type type;
+
+	/* Resource to be monitored */
+	enum psi_res res;
+
+	/* True if all threads should be stalled to trigger */
+	bool full;
+
+	/* Threshold in us */
+	u32 threshold_us;
+
+	/* Window in us */
+	u32 window_us;
+
+	/* Privileged triggers are treated differently */
+	bool privileged;
+
+	/* Link to kernfs open file, only for PSI_CGROUP */
+	struct kernfs_open_file *of;
+};
+
 struct psi_trigger {
+	/* Trigger type */
+	enum psi_trigger_type type;
+
 	/* PSI state being monitored by the trigger */
 	enum psi_states state;
 
@@ -135,7 +166,7 @@ struct psi_trigger {
 	/* Wait queue for polling */
 	wait_queue_head_t event_wait;
 
-	/* Kernfs file for cgroup triggers */
+	/* Kernfs file for PSI_CGROUP triggers */
 	struct kernfs_open_file *of;
 
 	/* Pending event flag */
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6ae5f48cf64e..836b28676abc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4000,6 +4000,12 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
 	struct psi_trigger *new;
 	struct cgroup *cgrp;
 	struct psi_group *psi;
+	struct psi_trigger_params params;
+	int err;
+
+	err = psi_trigger_parse(&params, buf);
+	if (err)
+		return err;
 
 	cgrp = cgroup_kn_lock_live(of->kn, false);
 	if (!cgrp)
@@ -4015,7 +4021,13 @@ static ssize_t pressure_write(struct kernfs_open_file *of, char *buf,
 	}
 
 	psi = cgroup_psi(cgrp);
-	new = psi_trigger_create(psi, buf, res, of->file, of);
+
+	params.type = PSI_CGROUP;
+	params.res = res;
+	params.privileged = psi_file_privileged(of->file);
+	params.of = of;
+
+	new = psi_trigger_create(psi, &params);
 	if (IS_ERR(new)) {
 		cgroup_put(cgrp);
 		return PTR_ERR(new);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 59fdb7ebbf22..73fdc79b5602 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -511,7 +511,7 @@ static void update_triggers(struct psi_group *group, u64 now,
 
 		/* Generate an event */
 		if (cmpxchg(&t->event, 0, 1) == 0) {
-			if (t->of)
+			if (t->type == PSI_CGROUP)
 				kernfs_notify(t->of->kn);
 			else
 				wake_up_interruptible(&t->event_wait);
@@ -1292,74 +1292,88 @@ int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 	return 0;
 }
 
-struct psi_trigger *psi_trigger_create(struct psi_group *group, char *buf,
-				       enum psi_res res, struct file *file,
-				       struct kernfs_open_file *of)
+int psi_trigger_parse(struct psi_trigger_params *params, const char *buf)
 {
-	struct psi_trigger *t;
-	enum psi_states state;
-	u32 threshold_us;
-	bool privileged;
-	u32 window_us;
+	u32 threshold_us, window_us;
 
 	if (static_branch_likely(&psi_disabled))
-		return ERR_PTR(-EOPNOTSUPP);
-
-	/*
-	 * Checking the privilege here on file->f_cred implies that a privileged user
-	 * could open the file and delegate the write to an unprivileged one.
-	 */
-	privileged = cap_raised(file->f_cred->cap_effective, CAP_SYS_RESOURCE);
+		return -EOPNOTSUPP;
 
 	if (sscanf(buf, "some %u %u", &threshold_us, &window_us) == 2)
-		state = PSI_IO_SOME + res * 2;
+		params->full = false;
 	else if (sscanf(buf, "full %u %u", &threshold_us, &window_us) == 2)
-		state = PSI_IO_FULL + res * 2;
+		params->full = true;
 	else
-		return ERR_PTR(-EINVAL);
+		return -EINVAL;
+
+	params->threshold_us = threshold_us;
+	params->window_us = window_us;
+	return 0;
+}
+
+struct psi_trigger *psi_trigger_create(struct psi_group *group,
+				       const struct psi_trigger_params *params)
+{
+	struct psi_trigger *t;
+	enum psi_states state;
+
+	if (static_branch_likely(&psi_disabled))
+		return ERR_PTR(-EOPNOTSUPP);
+
+	state = params->full ? PSI_IO_FULL : PSI_IO_SOME;
+	state += params->res * 2;
 
 #ifdef CONFIG_IRQ_TIME_ACCOUNTING
-	if (res == PSI_IRQ && --state != PSI_IRQ_FULL)
+	if (params->res == PSI_IRQ && --state != PSI_IRQ_FULL)
 		return ERR_PTR(-EINVAL);
 #endif
 
 	if (state >= PSI_NONIDLE)
 		return ERR_PTR(-EINVAL);
 
-	if (window_us == 0 || window_us > WINDOW_MAX_US)
+	if (params->window_us == 0 || params->window_us > WINDOW_MAX_US)
 		return ERR_PTR(-EINVAL);
 
 	/*
 	 * Unprivileged users can only use 2s windows so that averages aggregation
 	 * work is used, and no RT threads need to be spawned.
 	 */
-	if (!privileged && window_us % 2000000)
+	if (!params->privileged && params->window_us % 2000000)
 		return ERR_PTR(-EINVAL);
 
 	/* Check threshold */
-	if (threshold_us == 0 || threshold_us > window_us)
+	if (params->threshold_us == 0 || params->threshold_us > params->window_us)
 		return ERR_PTR(-EINVAL);
 
 	t = kmalloc(sizeof(*t), GFP_KERNEL);
 	if (!t)
 		return ERR_PTR(-ENOMEM);
 
+	t->type = params->type;
 	t->group = group;
 	t->state = state;
-	t->threshold = threshold_us * NSEC_PER_USEC;
-	t->win.size = window_us * NSEC_PER_USEC;
+	t->threshold = params->threshold_us * NSEC_PER_USEC;
+	t->win.size = params->window_us * NSEC_PER_USEC;
 	window_reset(&t->win, sched_clock(),
 			group->total[PSI_POLL][t->state], 0);
 
 	t->event = 0;
 	t->last_event_time = 0;
-	t->of = of;
-	if (!of)
+
+	switch (params->type) {
+	case PSI_SYSTEM:
 		init_waitqueue_head(&t->event_wait);
+		t->of = NULL;
+		break;
+	case PSI_CGROUP:
+		t->of = params->of;
+		break;
+	}
+
 	t->pending_event = false;
-	t->aggregator = privileged ? PSI_POLL : PSI_AVGS;
+	t->aggregator = params->privileged ? PSI_POLL : PSI_AVGS;
 
-	if (privileged) {
+	if (params->privileged) {
 		mutex_lock(&group->rtpoll_trigger_lock);
 
 		if (!rcu_access_pointer(group->rtpoll_task)) {
@@ -1412,7 +1426,7 @@ void psi_trigger_destroy(struct psi_trigger *t)
 	 * being accessed later. Can happen if cgroup is deleted from under a
 	 * polling process.
 	 */
-	if (t->of)
+	if (t->type == PSI_CGROUP)
 		kernfs_notify(t->of->kn);
 	else
 		wake_up_interruptible(&t->event_wait);
@@ -1492,7 +1506,7 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
 	if (!t)
 		return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
 
-	if (t->of)
+	if (t->type == PSI_CGROUP)
 		kernfs_generic_poll(t->of, wait);
 	else
 		poll_wait(file, &t->event_wait, wait);
@@ -1541,6 +1555,8 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 	size_t buf_size;
 	struct seq_file *seq;
 	struct psi_trigger *new;
+	struct psi_trigger_params params;
+	int err;
 
 	if (static_branch_likely(&psi_disabled))
 		return -EOPNOTSUPP;
@@ -1554,6 +1570,10 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 
 	buf[buf_size - 1] = '\0';
 
+	err = psi_trigger_parse(&params, buf);
+	if (err)
+		return err;
+
 	seq = file->private_data;
 
 	/* Take seq->lock to protect seq->private from concurrent writes */
@@ -1565,7 +1585,11 @@ static ssize_t psi_write(struct file *file, const char __user *user_buf,
 		return -EBUSY;
 	}
 
-	new = psi_trigger_create(&psi_system, buf, res, file, NULL);
+	params.type = PSI_SYSTEM;
+	params.res = res;
+	params.privileged = psi_file_privileged(file);
+
+	new = psi_trigger_create(&psi_system, &params);
 	if (IS_ERR(new)) {
 		mutex_unlock(&seq->lock);
 		return PTR_ERR(new);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (7 preceding siblings ...)
  2025-10-27 23:22 ` [PATCH v2 19/23] sched: psi: refactor psi_trigger_create() Roman Gushchin
@ 2025-10-27 23:22 ` Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
  2025-10-28 17:40   ` Tejun Heo
  2025-10-27 23:22 ` [PATCH v2 21/23] sched: psi: implement bpf_psi_create_trigger() kfunc Roman Gushchin
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

This patch implements a BPF struct ops-based mechanism to create
PSI triggers, attach them to cgroups or system-wide, and handle
PSI events in BPF.

The struct ops provides 4 callbacks:
  - init() called once at load, handy for creating PSI triggers
  - handle_psi_event() called every time a PSI trigger fires
  - handle_cgroup_online() called when a new cgroup is created
  - handle_cgroup_offline() called if a cgroup with an attached
    trigger is deleted

A single struct ops can create a number of PSI triggers, both
cgroup-scoped and system-wide.

All 4 struct ops callbacks can be sleepable. handle_psi_event()
handlers are executed on a separate workqueue, so they won't
affect the latency of other PSI triggers.
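
For illustration, here is a minimal sketch of a BPF program using this
struct ops. The section names follow the selftest added later in the
series; the struct and callback names below are illustrative, and
creating triggers from init() relies on the bpf_psi_create_trigger()
kfunc introduced in a later patch:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  SEC("struct_ops.s/init")
  int BPF_PROG(psi_init, struct bpf_psi *bpf_psi)
  {
  	/* A good place to create PSI triggers. */
  	return 0;
  }

  SEC("struct_ops.s/handle_psi_event")
  void BPF_PROG(handle_psi_event, struct bpf_psi *bpf_psi,
  	      struct psi_trigger *t)
  {
  	/* React to the PSI event. */
  }

  SEC(".struct_ops.link")
  struct bpf_psi_ops psi_ops = {
  	.init			= (void *)psi_init,
  	.handle_psi_event	= (void *)handle_psi_event,
  };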

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/bpf_psi.h      |  87 ++++++++++
 include/linux/psi_types.h    |  43 ++++-
 kernel/bpf/cgroup.c          |   3 +
 kernel/sched/bpf_psi.c       | 302 +++++++++++++++++++++++++++++++++++
 kernel/sched/build_utility.c |   4 +
 kernel/sched/psi.c           |  48 ++++--
 mm/oom_kill.c                |   3 +
 7 files changed, 478 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/bpf_psi.h
 create mode 100644 kernel/sched/bpf_psi.c

diff --git a/include/linux/bpf_psi.h b/include/linux/bpf_psi.h
new file mode 100644
index 000000000000..023bef0595ee
--- /dev/null
+++ b/include/linux/bpf_psi.h
@@ -0,0 +1,87 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_PSI_H
+#define __BPF_PSI_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/srcu.h>
+#include <linux/psi_types.h>
+
+struct cgroup;
+struct bpf_psi;
+struct psi_trigger;
+struct psi_trigger_params;
+
+#define BPF_PSI_FULL 0x80000000
+
+struct bpf_psi_ops {
+	/**
+	 * @init: Initialization callback, suited for creating psi triggers.
+	 * @bpf_psi: bpf_psi pointer, can be passed to bpf_psi_create_trigger().
+	 *
+	 * A non-zero return value means the initialization has failed.
+	 */
+	int (*init)(struct bpf_psi *bpf_psi);
+
+	/**
+	 * @handle_psi_event: PSI event callback
+	 * @t: psi_trigger pointer
+	 */
+	void (*handle_psi_event)(struct bpf_psi *bpf_psi, struct psi_trigger *t);
+
+	/**
+	 * @handle_cgroup_online: Cgroup online callback
+	 * @cgroup_id: Id of the new cgroup
+	 *
+	 * Called every time a new cgroup is created. Can be used
+	 * to create new psi triggers.
+	 */
+	void (*handle_cgroup_online)(struct bpf_psi *bpf_psi, u64 cgroup_id);
+
+	/**
+	 * @handle_cgroup_offline: Cgroup offline callback
+	 * @cgroup_id: Id of offlined cgroup
+	 *
+	 * Called every time a cgroup with an attached bpf psi trigger is
+	 * offlined.
+	 */
+	void (*handle_cgroup_offline)(struct bpf_psi *bpf_psi, u64 cgroup_id);
+
+	/* private */
+	struct bpf_psi *bpf_psi;
+};
+
+struct bpf_psi {
+	spinlock_t lock;
+	struct list_head triggers;
+	struct bpf_psi_ops *ops;
+	struct srcu_struct srcu;
+	struct list_head node; /* Protected by bpf_psi_lock */
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+void bpf_psi_add_trigger(struct psi_trigger *t,
+			 const struct psi_trigger_params *params);
+void bpf_psi_remove_trigger(struct psi_trigger *t);
+void bpf_psi_handle_event(struct psi_trigger *t);
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline void bpf_psi_add_trigger(struct psi_trigger *t,
+			const struct psi_trigger_params *params) {}
+static inline void bpf_psi_remove_trigger(struct psi_trigger *t) {}
+static inline void bpf_psi_handle_event(struct psi_trigger *t) {}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#if (defined(CONFIG_CGROUPS) && defined(CONFIG_PSI) && defined(CONFIG_BPF_SYSCALL))
+void bpf_psi_cgroup_online(struct cgroup *cgroup);
+void bpf_psi_cgroup_offline(struct cgroup *cgroup);
+
+#else /* CONFIG_CGROUPS && CONFIG_PSI && CONFIG_BPF_SYSCALL */
+static inline void bpf_psi_cgroup_online(struct cgroup *cgroup) {}
+static inline void bpf_psi_cgroup_offline(struct cgroup *cgroup) {}
+
+#endif /* CONFIG_CGROUPS && CONFIG_PSI && CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_PSI_H */
diff --git a/include/linux/psi_types.h b/include/linux/psi_types.h
index aa5ed39592cb..e551df9d6336 100644
--- a/include/linux/psi_types.h
+++ b/include/linux/psi_types.h
@@ -122,6 +122,7 @@ struct psi_window {
 enum psi_trigger_type {
 	PSI_SYSTEM,
 	PSI_CGROUP,
+	PSI_BPF,
 };
 
 struct psi_trigger_params {
@@ -143,8 +144,15 @@ struct psi_trigger_params {
 	/* Privileged triggers are treated differently */
 	bool privileged;
 
-	/* Link to kernfs open file, only for PSI_CGROUP */
-	struct kernfs_open_file *of;
+	union {
+		/* Link to kernfs open file, only for PSI_CGROUP */
+		struct kernfs_open_file *of;
+
+#ifdef CONFIG_BPF_SYSCALL
+		/* Link to bpf_psi structure, only for BPF_PSI */
+		struct bpf_psi *bpf_psi;
+#endif
+	};
 };
 
 struct psi_trigger {
@@ -186,6 +194,31 @@ struct psi_trigger {
 
 	/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
 	enum psi_aggregators aggregator;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* Fields specific to PSI_BPF triggers */
+
+	/* Bpf psi structure for events handling */
+	struct bpf_psi *bpf_psi;
+
+	/* List node inside bpf_psi->triggers list */
+	struct list_head bpf_psi_node;
+
+	/* List node inside group->bpf_triggers list */
+	struct list_head bpf_group_node;
+
+	/* Work structure, used to execute event handlers */
+	struct work_struct bpf_work;
+
+	/*
+	 * Whether the trigger is being pinned in memory.
+	 * Protected by group->bpf_triggers_lock.
+	 */
+	bool pinned;
+
+	/* Cgroup Id */
+	u64 cgroup_id;
+#endif
 };
 
 struct psi_group {
@@ -234,6 +267,12 @@ struct psi_group {
 	u64 rtpoll_total[NR_PSI_STATES - 1];
 	u64 rtpoll_next_update;
 	u64 rtpoll_until;
+
+#ifdef CONFIG_BPF_SYSCALL
+	/* List of triggers owned by bpf and corresponding lock */
+	spinlock_t bpf_triggers_lock;
+	struct list_head bpf_triggers;
+#endif
 };
 
 #else /* CONFIG_PSI */
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 248f517d66d0..4df4c49ba179 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -15,6 +15,7 @@
 #include <linux/bpf.h>
 #include <linux/bpf-cgroup.h>
 #include <linux/bpf_lsm.h>
+#include <linux/bpf_psi.h>
 #include <linux/bpf_verifier.h>
 #include <net/sock.h>
 #include <net/bpf_sk_storage.h>
@@ -557,9 +558,11 @@ static int cgroup_bpf_lifetime_notify(struct notifier_block *nb,
 
 	switch (action) {
 	case CGROUP_LIFETIME_ONLINE:
+		bpf_psi_cgroup_online(cgrp);
 		ret = cgroup_bpf_inherit(cgrp);
 		break;
 	case CGROUP_LIFETIME_OFFLINE:
+		bpf_psi_cgroup_offline(cgrp);
 		cgroup_bpf_offline(cgrp);
 		break;
 	}
diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
new file mode 100644
index 000000000000..c383a20119a6
--- /dev/null
+++ b/kernel/sched/bpf_psi.c
@@ -0,0 +1,302 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF PSI event handlers
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/bpf_psi.h>
+#include <linux/cgroup-defs.h>
+
+static struct workqueue_struct *bpf_psi_wq;
+
+static DEFINE_MUTEX(bpf_psi_lock);
+static LIST_HEAD(bpf_psi_notify_list);
+static DEFINE_STATIC_KEY_FALSE(bpf_psi_notify_key);
+
+static struct bpf_psi *bpf_psi_create(struct bpf_psi_ops *ops)
+{
+	struct bpf_psi *bpf_psi;
+
+	bpf_psi = kzalloc(sizeof(*bpf_psi), GFP_KERNEL);
+	if (!bpf_psi)
+		return NULL;
+
+	if (init_srcu_struct(&bpf_psi->srcu)) {
+		kfree(bpf_psi);
+		return NULL;
+	}
+
+	spin_lock_init(&bpf_psi->lock);
+	bpf_psi->ops = ops;
+	INIT_LIST_HEAD(&bpf_psi->triggers);
+	ops->bpf_psi = bpf_psi;
+
+	if (ops->handle_cgroup_online) {
+		mutex_lock(&bpf_psi_lock);
+		list_add(&bpf_psi->node, &bpf_psi_notify_list);
+		mutex_unlock(&bpf_psi_lock);
+		static_branch_inc(&bpf_psi_notify_key);
+	} else {
+		INIT_LIST_HEAD(&bpf_psi->node);
+	}
+
+	return bpf_psi;
+}
+
+static void bpf_psi_handle_event_fn(struct work_struct *work)
+{
+	struct psi_trigger *t;
+	struct bpf_psi *bpf_psi;
+	int idx;
+
+	t = container_of(work, struct psi_trigger, bpf_work);
+	bpf_psi = READ_ONCE(t->bpf_psi);
+
+	if (likely(bpf_psi)) {
+		idx = srcu_read_lock(&bpf_psi->srcu);
+		bpf_psi->ops->handle_psi_event(bpf_psi, t);
+		srcu_read_unlock(&bpf_psi->srcu, idx);
+	}
+}
+
+void bpf_psi_add_trigger(struct psi_trigger *t,
+			 const struct psi_trigger_params *params)
+{
+	t->bpf_psi = params->bpf_psi;
+	t->pinned = false;
+	INIT_WORK(&t->bpf_work, bpf_psi_handle_event_fn);
+
+	spin_lock(&t->bpf_psi->lock);
+	list_add(&t->bpf_psi_node, &t->bpf_psi->triggers);
+	spin_unlock(&t->bpf_psi->lock);
+
+	spin_lock(&t->group->bpf_triggers_lock);
+	list_add(&t->bpf_group_node, &t->group->bpf_triggers);
+	spin_unlock(&t->group->bpf_triggers_lock);
+}
+
+void bpf_psi_remove_trigger(struct psi_trigger *t)
+{
+	spin_lock(&t->group->bpf_triggers_lock);
+	list_del(&t->bpf_group_node);
+	spin_unlock(&t->group->bpf_triggers_lock);
+
+	spin_lock(&t->bpf_psi->lock);
+	list_del(&t->bpf_psi_node);
+	spin_unlock(&t->bpf_psi->lock);
+}
+
+#ifdef CONFIG_CGROUPS
+void bpf_psi_cgroup_online(struct cgroup *cgroup)
+{
+	struct bpf_psi *bpf_psi;
+	int idx;
+
+	if (!static_branch_likely(&bpf_psi_notify_key))
+		return;
+
+	mutex_lock(&bpf_psi_lock);
+	list_for_each_entry(bpf_psi, &bpf_psi_notify_list, node) {
+		idx = srcu_read_lock(&bpf_psi->srcu);
+		if (bpf_psi->ops->handle_cgroup_online)
+			bpf_psi->ops->handle_cgroup_online(bpf_psi,
+						cgroup_id(cgroup));
+		srcu_read_unlock(&bpf_psi->srcu, idx);
+	}
+	mutex_unlock(&bpf_psi_lock);
+}
+
+void bpf_psi_cgroup_offline(struct cgroup *cgroup)
+{
+	struct psi_group *group = cgroup->psi;
+	u64 cgrp_id = cgroup_id(cgroup);
+	struct psi_trigger *t, *p;
+	struct bpf_psi *bpf_psi;
+	LIST_HEAD(to_destroy);
+	int idx;
+
+	if (!group)
+		return;
+
+	spin_lock(&group->bpf_triggers_lock);
+	list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
+		if (!t->pinned) {
+			t->pinned = true;
+			list_move(&t->bpf_group_node, &to_destroy);
+		}
+	}
+	spin_unlock(&group->bpf_triggers_lock);
+
+	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
+		bpf_psi = READ_ONCE(t->bpf_psi);
+
+		idx = srcu_read_lock(&bpf_psi->srcu);
+		if (bpf_psi->ops->handle_cgroup_offline)
+			bpf_psi->ops->handle_cgroup_offline(bpf_psi, cgrp_id);
+		srcu_read_unlock(&bpf_psi->srcu, idx);
+
+		spin_lock(&bpf_psi->lock);
+		list_del(&t->bpf_psi_node);
+		spin_unlock(&bpf_psi->lock);
+
+		WRITE_ONCE(t->bpf_psi, NULL);
+		flush_workqueue(bpf_psi_wq);
+		synchronize_srcu(&bpf_psi->srcu);
+		psi_trigger_destroy(t);
+	}
+}
+#endif
+
+void bpf_psi_handle_event(struct psi_trigger *t)
+{
+	queue_work(bpf_psi_wq, &t->bpf_work);
+}
+
+/* BPF struct ops */
+
+static int __bpf_psi_init(struct bpf_psi *bpf_psi) { return 0; }
+static void __bpf_psi_handle_psi_event(struct bpf_psi *bpf_psi, struct psi_trigger *t) {}
+static void __bpf_psi_handle_cgroup_online(struct bpf_psi *bpf_psi, u64 cgroup_id) {}
+static void __bpf_psi_handle_cgroup_offline(struct bpf_psi *bpf_psi, u64 cgroup_id) {}
+
+static struct bpf_psi_ops __bpf_psi_ops = {
+	.init = __bpf_psi_init,
+	.handle_psi_event = __bpf_psi_handle_psi_event,
+	.handle_cgroup_online = __bpf_psi_handle_cgroup_online,
+	.handle_cgroup_offline = __bpf_psi_handle_cgroup_offline,
+};
+
+static const struct bpf_func_proto *
+bpf_psi_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+	return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_psi_ops_is_valid_access(int off, int size,
+					enum bpf_access_type type,
+					const struct bpf_prog *prog,
+					struct bpf_insn_access_aux *info)
+{
+	return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
+	.get_func_proto = bpf_psi_func_proto,
+	.is_valid_access = bpf_psi_ops_is_valid_access,
+};
+
+static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_psi_ops *ops = kdata;
+	struct bpf_psi *bpf_psi;
+
+	bpf_psi = bpf_psi_create(ops);
+	if (!bpf_psi)
+		return -ENOMEM;
+
+	return ops->init(bpf_psi);
+}
+
+static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
+{
+	struct bpf_psi_ops *ops = kdata;
+	struct bpf_psi *bpf_psi = ops->bpf_psi;
+	struct psi_trigger *t, *p;
+	LIST_HEAD(to_destroy);
+
+	spin_lock(&bpf_psi->lock);
+	list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
+		spin_lock(&t->group->bpf_triggers_lock);
+		if (!t->pinned) {
+			t->pinned = true;
+			list_move(&t->bpf_group_node, &to_destroy);
+			list_del(&t->bpf_psi_node);
+
+			WRITE_ONCE(t->bpf_psi, NULL);
+		}
+		spin_unlock(&t->group->bpf_triggers_lock);
+	}
+	spin_unlock(&bpf_psi->lock);
+
+	flush_workqueue(bpf_psi_wq);
+	synchronize_srcu(&bpf_psi->srcu);
+
+	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
+		psi_trigger_destroy(t);
+
+	if (!list_empty(&bpf_psi->node)) {
+		mutex_lock(&bpf_psi_lock);
+		list_del(&bpf_psi->node);
+		mutex_unlock(&bpf_psi_lock);
+		static_branch_dec(&bpf_psi_notify_key);
+	}
+
+	cleanup_srcu_struct(&bpf_psi->srcu);
+	kfree(bpf_psi);
+}
+
+static int bpf_psi_ops_check_member(const struct btf_type *t,
+				    const struct btf_member *member,
+				    const struct bpf_prog *prog)
+{
+	u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+	switch (moff) {
+	case offsetof(struct bpf_psi_ops, init):
+		fallthrough;
+	case offsetof(struct bpf_psi_ops, handle_psi_event):
+		if (!prog)
+			return -EINVAL;
+		break;
+	}
+
+	return 0;
+}
+
+static int bpf_psi_ops_init_member(const struct btf_type *t,
+				   const struct btf_member *member,
+				   void *kdata, const void *udata)
+{
+	return 0;
+}
+
+static int bpf_psi_ops_init(struct btf *btf)
+{
+	return 0;
+}
+
+struct bpf_struct_ops bpf_psi_bpf_ops = {
+	.verifier_ops = &bpf_psi_verifier_ops,
+	.reg = bpf_psi_ops_reg,
+	.unreg = bpf_psi_ops_unreg,
+	.check_member = bpf_psi_ops_check_member,
+	.init_member = bpf_psi_ops_init_member,
+	.init = bpf_psi_ops_init,
+	.name = "bpf_psi_ops",
+	.owner = THIS_MODULE,
+	.cfi_stubs = &__bpf_psi_ops
+};
+
+static int __init bpf_psi_struct_ops_init(void)
+{
+	int wq_flags = WQ_MEM_RECLAIM | WQ_UNBOUND | WQ_HIGHPRI;
+	int err;
+
+	bpf_psi_wq = alloc_workqueue("bpf_psi_wq", wq_flags, 0);
+	if (!bpf_psi_wq)
+		return -ENOMEM;
+
+	err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
+	if (err) {
+		pr_warn("error while registering bpf psi struct ops: %d", err);
+		goto err;
+	}
+
+	return 0;
+
+err:
+	destroy_workqueue(bpf_psi_wq);
+	return err;
+}
+late_initcall(bpf_psi_struct_ops_init);
diff --git a/kernel/sched/build_utility.c b/kernel/sched/build_utility.c
index e2cf3b08d4e9..1f90781781a1 100644
--- a/kernel/sched/build_utility.c
+++ b/kernel/sched/build_utility.c
@@ -19,6 +19,7 @@
 #include <linux/sched/rseq_api.h>
 #include <linux/sched/task_stack.h>
 
+#include <linux/bpf_psi.h>
 #include <linux/cpufreq.h>
 #include <linux/cpumask_api.h>
 #include <linux/cpuset.h>
@@ -91,6 +92,9 @@
 
 #ifdef CONFIG_PSI
 # include "psi.c"
+# ifdef CONFIG_BPF_SYSCALL
+#  include "bpf_psi.c"
+# endif
 #endif
 
 #ifdef CONFIG_MEMBARRIER
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 73fdc79b5602..26de772750e8 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -223,6 +223,10 @@ static void group_init(struct psi_group *group)
 	init_waitqueue_head(&group->rtpoll_wait);
 	timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
 	rcu_assign_pointer(group->rtpoll_task, NULL);
+#ifdef CONFIG_BPF_SYSCALL
+	spin_lock_init(&group->bpf_triggers_lock);
+	INIT_LIST_HEAD(&group->bpf_triggers);
+#endif
 }
 
 void __init psi_init(void)
@@ -511,10 +515,17 @@ static void update_triggers(struct psi_group *group, u64 now,
 
 		/* Generate an event */
 		if (cmpxchg(&t->event, 0, 1) == 0) {
-			if (t->type == PSI_CGROUP)
-				kernfs_notify(t->of->kn);
-			else
+			switch (t->type) {
+			case PSI_SYSTEM:
 				wake_up_interruptible(&t->event_wait);
+				break;
+			case PSI_CGROUP:
+				kernfs_notify(t->of->kn);
+				break;
+			case PSI_BPF:
+				bpf_psi_handle_event(t);
+				break;
+			}
 		}
 		t->last_event_time = now;
 		/* Reset threshold breach flag once event got generated */
@@ -1368,6 +1379,9 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 	case PSI_CGROUP:
 		t->of = params->of;
 		break;
+	case PSI_BPF:
+		bpf_psi_add_trigger(t, params);
+		break;
 	}
 
 	t->pending_event = false;
@@ -1381,8 +1395,10 @@ struct psi_trigger *psi_trigger_create(struct psi_group *group,
 
 			task = kthread_create(psi_rtpoll_worker, group, "psimon");
 			if (IS_ERR(task)) {
-				kfree(t);
 				mutex_unlock(&group->rtpoll_trigger_lock);
+				if (t->type == PSI_BPF)
+					bpf_psi_remove_trigger(t);
+				kfree(t);
 				return ERR_CAST(task);
 			}
 			atomic_set(&group->rtpoll_wakeup, 0);
@@ -1426,10 +1442,16 @@ void psi_trigger_destroy(struct psi_trigger *t)
 	 * being accessed later. Can happen if cgroup is deleted from under a
 	 * polling process.
 	 */
-	if (t->type == PSI_CGROUP)
-		kernfs_notify(t->of->kn);
-	else
+	switch (t->type) {
+	case PSI_SYSTEM:
 		wake_up_interruptible(&t->event_wait);
+		break;
+	case PSI_CGROUP:
+		kernfs_notify(t->of->kn);
+		break;
+	case PSI_BPF:
+		break;
+	}
 
 	if (t->aggregator == PSI_AVGS) {
 		mutex_lock(&group->avgs_lock);
@@ -1506,10 +1528,16 @@ __poll_t psi_trigger_poll(void **trigger_ptr,
 	if (!t)
 		return DEFAULT_POLLMASK | EPOLLERR | EPOLLPRI;
 
-	if (t->type == PSI_CGROUP)
-		kernfs_generic_poll(t->of, wait);
-	else
+	switch (t->type) {
+	case PSI_SYSTEM:
 		poll_wait(file, &t->event_wait, wait);
+		break;
+	case PSI_CGROUP:
+		kernfs_generic_poll(t->of, wait);
+		break;
+	case PSI_BPF:
+		break;
+	}
 
 	if (cmpxchg(&t->event, 1, 0) == 1)
 		ret |= EPOLLPRI;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 90bb86dee3cf..65a3b4c1fc72 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1429,6 +1429,9 @@ static int bpf_oom_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
 	if (!btf_id_set_contains(&bpf_oom_declare_oom_kfuncs, kfunc_id))
 		return 0;
 
+	if (IS_ENABLED(CONFIG_PSI) && prog->aux->st_ops == &bpf_psi_bpf_ops)
+		return 0;
+
 	return -EACCES;
 }
 
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 21/23] sched: psi: implement bpf_psi_create_trigger() kfunc
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (8 preceding siblings ...)
  2025-10-27 23:22 ` [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops Roman Gushchin
@ 2025-10-27 23:22 ` Roman Gushchin
  2025-10-27 23:22 ` [PATCH v2 22/23] bpf: selftests: add config for psi Roman Gushchin
  2025-10-27 23:22 ` [PATCH v2 23/23] bpf: selftests: PSI struct ops test Roman Gushchin
  11 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Implement a new bpf_psi_create_trigger() BPF kfunc, which allows
BPF programs to create new PSI triggers and attach them to cgroups
or make them system-wide.

Created triggers exist until the struct ops is unloaded or, if they
are attached to a cgroup, until the cgroup is deleted.

Due to the 5-argument limit on kfuncs, the resource type and the
"full" bit are packed into a single u32.
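
For example, a trigger watching full memory pressure on a cgroup can
be created like this (a sketch mirroring the selftest later in the
series; err, bpf_psi and cgroup_id are assumed to be in scope, and the
threshold/window values are arbitrary):

  /* BPF_PSI_FULL (0x80000000) is the "full" bit of the resource argument */
  err = bpf_psi_create_trigger(bpf_psi, cgroup_id,
  			       PSI_MEM | BPF_PSI_FULL,
  			       100000 /* threshold, us */,
  			       1000000 /* window, us */);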

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/cgroup.h |  4 ++
 include/linux/psi.h    |  6 +++
 kernel/sched/bpf_psi.c | 94 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 104 insertions(+)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 6ed477338b16..1a99da44999e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -707,6 +707,10 @@ static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 
 static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
 {}
+static inline struct cgroup *cgroup_get_from_id(u64 id)
+{
+	return NULL;
+}
 #endif /* !CONFIG_CGROUPS */
 
 #ifdef CONFIG_CGROUPS
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 8178e998d94b..8ffe84cd8571 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -50,6 +50,12 @@ int psi_cgroup_alloc(struct cgroup *cgrp);
 void psi_cgroup_free(struct cgroup *cgrp);
 void cgroup_move_task(struct task_struct *p, struct css_set *to);
 void psi_cgroup_restart(struct psi_group *group);
+
+#else
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return &psi_system;
+}
 #endif
 
 #else /* CONFIG_PSI */
diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
index c383a20119a6..7974de56594f 100644
--- a/kernel/sched/bpf_psi.c
+++ b/kernel/sched/bpf_psi.c
@@ -8,6 +8,7 @@
 #include <linux/bpf_psi.h>
 #include <linux/cgroup-defs.h>
 
+struct bpf_struct_ops bpf_psi_bpf_ops;
 static struct workqueue_struct *bpf_psi_wq;
 
 static DEFINE_MUTEX(bpf_psi_lock);
@@ -186,6 +187,92 @@ static const struct bpf_verifier_ops bpf_psi_verifier_ops = {
 	.is_valid_access = bpf_psi_ops_is_valid_access,
 };
 
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_psi_create_trigger - Create a PSI trigger
+ * @bpf_psi: bpf_psi struct to attach the trigger to
+ * @cgroup_id: cgroup id to attach the trigger to; 0 for system-wide scope
+ * @resource: resource to monitor (PSI_MEM, PSI_IO, etc) and the full bit.
+ * @threshold_us: threshold in us
+ * @window_us: window in us
+ *
+ * Creates a PSI trigger and attaches it to bpf_psi. The trigger will
+ * remain active until the bpf struct ops is unloaded or the
+ * corresponding cgroup is deleted.
+ *
+ * Resource's most significant bit encodes whether "some" or "full"
+ * PSI state should be tracked.
+ *
+ * Returns 0 on success and the error code on failure.
+ */
+__bpf_kfunc int bpf_psi_create_trigger(struct bpf_psi *bpf_psi,
+				       u64 cgroup_id, u32 resource,
+				       u32 threshold_us, u32 window_us)
+{
+	enum psi_res res = resource & ~BPF_PSI_FULL;
+	bool full = resource & BPF_PSI_FULL;
+	struct psi_trigger_params params;
+	struct cgroup *cgroup __maybe_unused = NULL;
+	struct psi_group *group;
+	struct psi_trigger *t;
+	int ret = 0;
+
+	if (res >= NR_PSI_RESOURCES)
+		return -EINVAL;
+
+	if (IS_ENABLED(CONFIG_CGROUPS) && cgroup_id) {
+		cgroup = cgroup_get_from_id(cgroup_id);
+		if (IS_ERR_OR_NULL(cgroup))
+			return PTR_ERR(cgroup);
+
+		group = cgroup_psi(cgroup);
+	} else {
+		group = &psi_system;
+	}
+
+	params.type = PSI_BPF;
+	params.bpf_psi = bpf_psi;
+	params.privileged = capable(CAP_SYS_RESOURCE);
+	params.res = res;
+	params.full = full;
+	params.threshold_us = threshold_us;
+	params.window_us = window_us;
+
+	t = psi_trigger_create(group, &params);
+	if (IS_ERR(t))
+		ret = PTR_ERR(t);
+	else
+		t->cgroup_id = cgroup_id;
+
+#ifdef CONFIG_CGROUPS
+	if (cgroup)
+		cgroup_put(cgroup);
+#endif
+
+	return ret;
+}
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_psi_kfuncs)
+BTF_ID_FLAGS(func, bpf_psi_create_trigger, KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_psi_kfuncs)
+
+static int bpf_psi_kfunc_filter(const struct bpf_prog *prog, u32 kfunc_id)
+{
+	if (btf_id_set8_contains(&bpf_psi_kfuncs, kfunc_id) &&
+	    prog->aux->st_ops != &bpf_psi_bpf_ops)
+		return -EACCES;
+
+	return 0;
+}
+
+static const struct btf_kfunc_id_set bpf_psi_kfunc_set = {
+	.owner          = THIS_MODULE,
+	.set            = &bpf_psi_kfuncs,
+	.filter         = bpf_psi_kfunc_filter,
+};
+
 static int bpf_psi_ops_reg(void *kdata, struct bpf_link *link)
 {
 	struct bpf_psi_ops *ops = kdata;
@@ -287,6 +374,13 @@ static int __init bpf_psi_struct_ops_init(void)
 	if (!bpf_psi_wq)
 		return -ENOMEM;
 
+	err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+					&bpf_psi_kfunc_set);
+	if (err) {
+		pr_warn("error while registering bpf psi kfuncs: %d", err);
+		goto err;
+	}
+
 	err = register_bpf_struct_ops(&bpf_psi_bpf_ops, bpf_psi_ops);
 	if (err) {
 		pr_warn("error while registering bpf psi struct ops: %d", err);
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 22/23] bpf: selftests: add config for psi
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (9 preceding siblings ...)
  2025-10-27 23:22 ` [PATCH v2 21/23] sched: psi: implement bpf_psi_create_trigger() kfunc Roman Gushchin
@ 2025-10-27 23:22 ` Roman Gushchin
  2025-10-27 23:22 ` [PATCH v2 23/23] bpf: selftests: PSI struct ops test Roman Gushchin
  11 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo

From: JP Kobryn <inwardvessel@gmail.com>

Include CONFIG_PSI to allow dependent tests to build.

Suggested-by: Song Liu <song@kernel.org>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
 tools/testing/selftests/bpf/config | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/config b/tools/testing/selftests/bpf/config
index 70b28c1e653e..178c840c844b 100644
--- a/tools/testing/selftests/bpf/config
+++ b/tools/testing/selftests/bpf/config
@@ -110,6 +110,7 @@ CONFIG_IP6_NF_IPTABLES=y
 CONFIG_IP6_NF_FILTER=y
 CONFIG_NF_NAT=y
 CONFIG_PACKET=y
+CONFIG_PSI=y
 CONFIG_RC_CORE=y
 CONFIG_SECURITY=y
 CONFIG_SECURITYFS=y
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [PATCH v2 23/23] bpf: selftests: PSI struct ops test
  2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
                   ` (10 preceding siblings ...)
  2025-10-27 23:22 ` [PATCH v2 22/23] bpf: selftests: add config for psi Roman Gushchin
@ 2025-10-27 23:22 ` Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
  2025-11-10  9:48   ` Michal Hocko
  11 siblings, 2 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
	Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
	JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin

Add a PSI struct ops test.

The test creates a cgroup with two child sub-cgroups, sets up
memory.high for one of them and puts a memory-hungry process there
(initially frozen).

Then it creates 2 PSI triggers from within the init() BPF callback and
attaches them to these cgroups. Then it deletes the first cgroup,
creates another one and runs the memory-hungry task. From the cgroup
creation callback the test creates another trigger.

The memory-hungry task creates high memory pressure in one memory
cgroup, which triggers a PSI event. The PSI BPF handler declares a
memcg OOM in the corresponding cgroup. Finally, the test checks that
both the handle_cgroup_offline() and handle_psi_event() handlers were
executed, that the correct process was killed and that the oom
counters were updated.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 .../selftests/bpf/prog_tests/test_psi.c       | 238 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_psi.c  |  82 ++++++
 2 files changed, 320 insertions(+)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c

diff --git a/tools/testing/selftests/bpf/prog_tests/test_psi.c b/tools/testing/selftests/bpf/prog_tests/test_psi.c
new file mode 100644
index 000000000000..b294cea0a6fe
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/test_psi.c
@@ -0,0 +1,238 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include <test_progs.h>
+#include <bpf/btf.h>
+#include <bpf/bpf.h>
+
+#include "cgroup_helpers.h"
+#include "test_psi.skel.h"
+
+enum psi_res {
+	PSI_IO,
+	PSI_MEM,
+	PSI_CPU,
+	PSI_IRQ,
+	NR_PSI_RESOURCES,
+};
+
+struct cgroup_desc {
+	const char *path;
+	unsigned long long id;
+	int pid;
+	int fd;
+	size_t target;
+	size_t high;
+	bool victim;
+};
+
+#define MB (1024 * 1024)
+
+static struct cgroup_desc cgroups[] = {
+	{ .path = "/psi_test" },
+	{ .path = "/psi_test/cg1" },
+	{ .path = "/psi_test/cg2", .target = 500 * MB,
+	  .high = 40 * MB, .victim = true },
+};
+
+static int spawn_task(struct cgroup_desc *desc)
+{
+	char *ptr;
+	int pid;
+
+	pid = fork();
+	if (pid < 0)
+		return pid;
+
+	if (pid > 0) {
+		/* parent */
+		desc->pid = pid;
+		return 0;
+	}
+
+	/* child */
+	ptr = (char *)malloc(desc->target);
+	if (!ptr)
+		return -ENOMEM;
+
+	memset(ptr, 'a', desc->target);
+
+	while (1)
+		sleep(1000);
+
+	return 0;
+}
+
+static void setup_environment(void)
+{
+	int i, err;
+
+	err = setup_cgroup_environment();
+	if (!ASSERT_OK(err, "setup_cgroup_environment"))
+		goto cleanup;
+
+	for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+		cgroups[i].fd = create_and_get_cgroup(cgroups[i].path);
+		if (!ASSERT_GE(cgroups[i].fd, 0, "create_and_get_cgroup"))
+			goto cleanup;
+
+		cgroups[i].id = get_cgroup_id(cgroups[i].path);
+		if (!ASSERT_GT(cgroups[i].id, 0, "get_cgroup_id"))
+			goto cleanup;
+
+		/* Freeze the top-level cgroup and enable the memory controller */
+		if (i == 0) {
+			err = write_cgroup_file(cgroups[i].path, "cgroup.freeze", "1");
+			if (!ASSERT_OK(err, "freeze cgroup"))
+				goto cleanup;
+
+			err = write_cgroup_file(cgroups[i].path, "cgroup.subtree_control",
+						"+memory");
+			if (!ASSERT_OK(err, "enable memory controller"))
+				goto cleanup;
+		}
+
+		/* Set memory.high */
+		if (cgroups[i].high) {
+			char buf[256];
+
+			snprintf(buf, sizeof(buf), "%lu", cgroups[i].high);
+			err = write_cgroup_file(cgroups[i].path, "memory.high", buf);
+			if (!ASSERT_OK(err, "set memory.high"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "0");
+			write_cgroup_file(cgroups[i].path, "memory.swap.max", buf);
+		}
+
+		/* Spawn tasks creating memory pressure */
+		if (cgroups[i].target) {
+			char buf[256];
+
+			err = spawn_task(&cgroups[i]);
+			if (!ASSERT_OK(err, "spawn task"))
+				goto cleanup;
+
+			snprintf(buf, sizeof(buf), "%d", cgroups[i].pid);
+			err = write_cgroup_file(cgroups[i].path, "cgroup.procs", buf);
+			if (!ASSERT_OK(err, "put child into a cgroup"))
+				goto cleanup;
+		}
+	}
+
+	return;
+
+cleanup:
+	cleanup_cgroup_environment();
+}
+
+static int run_and_wait_for_oom(void)
+{
+	int ret = -1;
+	bool first = true;
+	char buf[4096] = {};
+	size_t size;
+
+	/* Unfreeze the top-level cgroup */
+	ret = write_cgroup_file(cgroups[0].path, "cgroup.freeze", "0");
+	if (!ASSERT_OK(ret, "unfreeze cgroup"))
+		return -1;
+
+	for (;;) {
+		int i, status;
+		pid_t pid = wait(&status);
+
+		if (pid == -1) {
+			if (errno == EINTR)
+				continue;
+			/* ECHILD */
+			break;
+		}
+
+		if (!first)
+			continue;
+		first = false;
+
+		/* Check which process was terminated first */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++) {
+			if (!ASSERT_OK(cgroups[i].victim !=
+				       (pid == cgroups[i].pid),
+				       "correct process was killed")) {
+				ret = -1;
+				break;
+			}
+
+			if (!cgroups[i].victim)
+				continue;
+
+			/* Check the memcg oom counter */
+			size = read_cgroup_file(cgroups[i].path, "memory.events",
+						buf, sizeof(buf));
+			if (!ASSERT_OK(size <= 0, "read memory.events")) {
+				ret = -1;
+				break;
+			}
+
+			if (!ASSERT_OK(strstr(buf, "oom_kill 1") == NULL,
+				       "oom_kill count check")) {
+				ret = -1;
+				break;
+			}
+		}
+
+		/* Kill all remaining tasks */
+		for (i = 0; i < ARRAY_SIZE(cgroups); i++)
+			if (cgroups[i].pid && cgroups[i].pid != pid)
+				kill(cgroups[i].pid, SIGKILL);
+	}
+
+	return ret;
+}
+
+void test_psi(void)
+{
+	struct test_psi *skel;
+	u64 deleted_cgroup_id;
+	int new_cgroup_fd;
+	u64 new_cgroup_id;
+	int err;
+
+	setup_environment();
+
+	skel = test_psi__open_and_load();
+	err = libbpf_get_error(skel);
+	if (CHECK_FAIL(err))
+		goto cleanup;
+
+	skel->bss->deleted_cgroup_id = cgroups[1].id;
+	skel->bss->high_pressure_cgroup_id = cgroups[2].id;
+
+	err = test_psi__attach(skel);
+	if (CHECK_FAIL(err))
+		goto cleanup;
+
+	/* Delete the first cgroup, it should trigger handle_cgroup_offline() */
+	remove_cgroup(cgroups[1].path);
+
+	new_cgroup_fd = create_and_get_cgroup("/psi_test_new");
+	if (!ASSERT_GE(new_cgroup_fd, 0, "create_and_get_cgroup"))
+		goto cleanup;
+
+	new_cgroup_id = get_cgroup_id("/psi_test_new");
+	if (!ASSERT_GT(new_cgroup_id, 0, "get_cgroup_id"))
+		goto cleanup;
+
+	/* Unfreeze all child tasks and create the memory pressure */
+	err = run_and_wait_for_oom();
+	CHECK_FAIL(err);
+
+	/* Check the result of the handle_cgroup_offline() handler */
+	deleted_cgroup_id = skel->bss->deleted_cgroup_id;
+	ASSERT_EQ(deleted_cgroup_id, cgroups[1].id, "deleted cgroup id");
+
+	/* Check the result of the handle_cgroup_online() handler */
+	ASSERT_EQ(skel->bss->new_cgroup_id, new_cgroup_id,
+		  "new cgroup id");
+
+cleanup:
+	cleanup_cgroup_environment();
+	test_psi__destroy(skel);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_psi.c b/tools/testing/selftests/bpf/progs/test_psi.c
new file mode 100644
index 000000000000..4ddec7ec3eda
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_psi.c
@@ -0,0 +1,82 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "vmlinux.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+
+char _license[] SEC("license") = "GPL";
+
+#define PSI_FULL 0x80000000
+
+/* cgroup which will experience the high memory pressure */
+u64 high_pressure_cgroup_id;
+
+/* cgroup which will be deleted */
+u64 deleted_cgroup_id;
+
+/* cgroup which will be created */
+u64 new_cgroup_id;
+
+/* cgroup which was deleted */
+u64 deleted_cgroup_id;
+
+char constraint_name[] = "CONSTRAINT_BPF_PSI_MEM";
+
+SEC("struct_ops.s/init")
+int BPF_PROG(psi_init, struct bpf_psi *bpf_psi)
+{
+	int ret;
+
+	ret = bpf_psi_create_trigger(bpf_psi, high_pressure_cgroup_id,
+				     PSI_MEM | PSI_FULL, 100000, 1000000);
+	if (ret)
+		return ret;
+
+	return bpf_psi_create_trigger(bpf_psi, deleted_cgroup_id,
+				      PSI_IO, 100000, 1000000);
+}
+
+SEC("struct_ops.s/handle_psi_event")
+void BPF_PROG(handle_psi_event, struct bpf_psi *bpf_psi, struct psi_trigger *t)
+{
+	u64 cgroup_id = t->cgroup_id;
+	struct mem_cgroup *memcg;
+	struct cgroup *cgroup;
+
+	cgroup = bpf_cgroup_from_id(cgroup_id);
+	if (!cgroup)
+		return;
+
+	memcg = bpf_get_mem_cgroup(&cgroup->self);
+	if (!memcg) {
+		bpf_cgroup_release(cgroup);
+		return;
+	}
+
+	bpf_out_of_memory(memcg, 0, BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK,
+			  constraint_name);
+
+	bpf_put_mem_cgroup(memcg);
+	bpf_cgroup_release(cgroup);
+}
+
+SEC("struct_ops.s/handle_cgroup_online")
+void BPF_PROG(handle_cgroup_online, struct bpf_psi *bpf_psi, u64 cgroup_id)
+{
+	new_cgroup_id = cgroup_id;
+
+	bpf_psi_create_trigger(bpf_psi, cgroup_id, PSI_IO, 100000, 1000000);
+}
+
+SEC("struct_ops.s/handle_cgroup_offline")
+void BPF_PROG(handle_cgroup_offline, struct bpf_psi *bpf_psi, u64 cgroup_id)
+{
+	deleted_cgroup_id = cgroup_id;
+}
+
+SEC(".struct_ops.link")
+struct bpf_psi_ops test_bpf_psi = {
+	.init = (void *)psi_init,
+	.handle_psi_event = (void *)handle_psi_event,
+	.handle_cgroup_online = (void *)handle_cgroup_online,
+	.handle_cgroup_offline = (void *)handle_cgroup_offline,
+};
-- 
2.51.0



^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops
  2025-10-27 23:22 ` [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops Roman Gushchin
@ 2025-10-27 23:48   ` bot+bpf-ci
  2025-10-28 17:40   ` Tejun Heo
  1 sibling, 0 replies; 43+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
  To: roman.gushchin, akpm
  Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
	inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
	tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 4274 bytes --]


```
commit 264db34cd79e7c07b4b1e58e2efae3d632df9cb2
Author: Roman Gushchin <roman.gushchin@linux.dev>

sched: psi: implement bpf_psi struct ops

This patch implements a BPF struct ops-based mechanism to create PSI
triggers, attach them to cgroups or system wide and handle PSI events
in BPF.

> diff --git a/kernel/sched/bpf_psi.c b/kernel/sched/bpf_psi.c
> new file mode 100644
> index 000000000..c383a2011
> --- /dev/null
> +++ b/kernel/sched/bpf_psi.c

[ ... ]

> +void bpf_psi_cgroup_offline(struct cgroup *cgroup)
> +{
> +	struct psi_group *group = cgroup->psi;
> +	u64 cgrp_id = cgroup_id(cgroup);
> +	struct psi_trigger *t, *p;
> +	struct bpf_psi *bpf_psi;
> +	LIST_HEAD(to_destroy);
> +	int idx;
> +
> +	if (!group)
> +		return;
> +
> +	spin_lock(&group->bpf_triggers_lock);
> +	list_for_each_entry_safe(t, p, &group->bpf_triggers, bpf_group_node) {
> +		if (!t->pinned) {
> +			t->pinned = true;
> +			list_move(&t->bpf_group_node, &to_destroy);
> +		}
> +	}
> +	spin_unlock(&group->bpf_triggers_lock);
> +
> +	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node) {
> +		bpf_psi = READ_ONCE(t->bpf_psi);
> +
> +		idx = srcu_read_lock(&bpf_psi->srcu);
> +		if (bpf_psi->ops->handle_cgroup_offline)
> +			bpf_psi->ops->handle_cgroup_offline(bpf_psi, cgrp_id);
> +		srcu_read_unlock(&bpf_psi->srcu, idx);
> +
> +		spin_lock(&bpf_psi->lock);
> +		list_del(&t->bpf_psi_node);
> +		spin_unlock(&bpf_psi->lock);
> +
> +		WRITE_ONCE(t->bpf_psi, NULL);
> +		flush_workqueue(bpf_psi_wq);
> +		synchronize_srcu(&bpf_psi->srcu);
> +		psi_trigger_destroy(t);
> +	}
> +}

Can bpf_psi_cgroup_offline() race with bpf_psi_ops_unreg() and access
freed memory?

Consider this sequence:

Thread A (bpf_psi_cgroup_offline):
  - Pins triggers under group->bpf_triggers_lock
  - Reads bpf_psi = READ_ONCE(t->bpf_psi) outside any lock

Thread B (bpf_psi_ops_unreg):
  - Iterates bpf_psi->triggers, skips pinned triggers
  - Calls synchronize_srcu(&bpf_psi->srcu)
  - Calls kfree(bpf_psi)

Thread A:
  - Tries srcu_read_lock(&bpf_psi->srcu) on freed memory

The pinning mechanism prevents Thread B from NULLing t->bpf_psi when
Thread A has pinned the trigger. However, Thread B unconditionally
frees the bpf_psi object itself even when it didn't process any
triggers because they were all pinned by Thread A.

After Thread A pins all triggers and releases group->bpf_triggers_lock,
Thread B can complete bpf_psi_ops_unreg(), call synchronize_srcu() (no
readers yet), and free bpf_psi. Then Thread A accesses the freed
bpf_psi via srcu_read_lock() and spin_lock().

> +static void bpf_psi_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> +	struct bpf_psi_ops *ops = kdata;
> +	struct bpf_psi *bpf_psi = ops->bpf_psi;
> +	struct psi_trigger *t, *p;
> +	LIST_HEAD(to_destroy);
> +
> +	spin_lock(&bpf_psi->lock);
> +	list_for_each_entry_safe(t, p, &bpf_psi->triggers, bpf_psi_node) {
> +		spin_lock(&t->group->bpf_triggers_lock);
> +		if (!t->pinned) {
> +			t->pinned = true;
> +			list_move(&t->bpf_group_node, &to_destroy);
> +			list_del(&t->bpf_psi_node);
> +
> +			WRITE_ONCE(t->bpf_psi, NULL);
> +		}
> +		spin_unlock(&t->group->bpf_triggers_lock);
> +	}
> +	spin_unlock(&bpf_psi->lock);
> +
> +	flush_workqueue(bpf_psi_wq);
> +	synchronize_srcu(&bpf_psi->srcu);
> +
> +	list_for_each_entry_safe(t, p, &to_destroy, bpf_group_node)
> +		psi_trigger_destroy(t);
> +
> +	if (!list_empty(&bpf_psi->node)) {
> +		mutex_lock(&bpf_psi_lock);
> +		list_del(&bpf_psi->node);
> +		mutex_unlock(&bpf_psi_lock);
> +		static_branch_dec(&bpf_psi_notify_key);
> +	}
> +
> +	cleanup_srcu_struct(&bpf_psi->srcu);
> +	kfree(bpf_psi);
> +}

When bpf_psi_ops_unreg() skips all triggers because they're pinned
(by bpf_psi_cgroup_offline()), does it still free bpf_psi too early?

The to_destroy list will be empty if all triggers were pinned, but
the function still proceeds to cleanup_srcu_struct() and kfree(bpf_psi)
while bpf_psi_cgroup_offline() may be using that bpf_psi pointer.


```

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `sched: psi: implement bpf_psi struct ops`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts()
  2025-10-27 23:21 ` [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
@ 2025-10-27 23:48   ` bot+bpf-ci
  2025-10-28 17:07     ` Roman Gushchin
  0 siblings, 1 reply; 43+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
  To: roman.gushchin, akpm
  Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
	inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
	tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3308 bytes --]


```
commit b58d54ed56054e3a550e02c324982feb20a2671e
Author: Roman Gushchin <roman.gushchin@linux.dev>

libbpf: introduce bpf_map__attach_struct_ops_opts()

This commit introduces a new API function that extends the existing
bpf_map__attach_struct_ops() to accept additional options, specifically
allowing attachment of struct_ops maps to cgroups via a relative_fd.

> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index b90574f39..be56a5dee 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -13196,12 +13196,19 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
>  	return close(link->fd);
>  }
>
> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> +struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> +						 const struct bpf_struct_ops_opts *opts)
>  {
> +	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
>  	struct bpf_link_struct_ops *link;
>  	__u32 zero = 0;
>  	int err, fd;
>
> +	if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
> +		pr_warn("map '%s': invalid opts\n", map->name);
> +		return libbpf_err_ptr(-EINVAL);
> +	}
> +
>  	if (!bpf_map__is_struct_ops(map)) {
>  		pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
>  		return libbpf_err_ptr(-EINVAL);

[ ... ]

>  	link = calloc(1, sizeof(*link));
>  	if (!link)
>  		return libbpf_err_ptr(-EINVAL);

Should this return -ENOMEM instead of -EINVAL? All other calloc failures
in libbpf.c return -ENOMEM (see lines 11006, 11802, 12288, 12746, 12829,
12891, 13089, 13129 in the same file).

[ ... ]

> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> +	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
> +
> +	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);

Is the expected_revision field being propagated correctly? The new
bpf_struct_ops_opts structure includes an expected_revision field
(see libbpf.h), and bpf_link_create() with BPF_STRUCT_OPS reads
opts->cgroup.expected_revision (see bpf.c line 889), but this code only
sets relative_fd in link_opts. Should it also include:

    link_opts.cgroup.expected_revision = OPTS_GET(opts, expected_revision, 0);

> diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
> index 5118d0a90..dc8489871 100644
> --- a/tools/lib/bpf/libbpf.h
> +++ b/tools/lib/bpf/libbpf.h
> @@ -922,6 +922,20 @@ bpf_program__attach_cgroup_opts(const struct bpf_program *prog, int cgroup_fd,
>  struct bpf_map;
>
>  LIBBPF_API struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map);
> +
> +struct bpf_struct_ops_opts {
> +	/* size of this struct, for forward/backward compatibility */
> +	size_t sz;
> +	__u32 flags;
> +	__u32 relative_fd;
> +	__u64 expected_revision;
> +	size_t :0;
> +};
> +#define bpf_struct_ops_opts__last_field expected_revision
> +
> +LIBBPF_API struct bpf_link *
> +bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> +				const struct bpf_struct_ops_opts *opts);

[ ... ]


```

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `libbpf: introduce bpf_map__attach_struct_ops_opts()`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 17/23] bpf: selftests: introduce read_cgroup_file() helper
  2025-10-27 23:22 ` [PATCH v2 17/23] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
@ 2025-10-27 23:48   ` bot+bpf-ci
  2025-10-28 16:31     ` Roman Gushchin
  0 siblings, 1 reply; 43+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
  To: roman.gushchin, akpm
  Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
	inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
	tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3586 bytes --]


```
commit 712137918bba1b0a9c7c670bab9200b0f0d6a7e8
Author: Roman Gushchin <roman.gushchin@linux.dev>

bpf: selftests: introduce read_cgroup_file() helper

This patch adds a new helper function read_cgroup_file() to read from
cgroup control files in BPF selftests. The function opens a file,
reads into a buffer, and returns the result.

> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
> index 20cede4db..8fb02fe4c 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.c
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.c
> @@ -126,6 +126,45 @@ int enable_controllers(const char *relative_path, const char *controllers)
>  	return __enable_controllers(cgroup_path, controllers);
>  }
>
> +static size_t __read_cgroup_file(const char *cgroup_path, const char *file,
> +				 char *buf, size_t size)
> +{
> +	char file_path[PATH_MAX + 1];
> +	size_t ret;
> +	int fd;
> +
> +	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
> +	fd = open(file_path, O_RDONLY);
> +	if (fd < 0) {
> +		log_err("Opening %s", file_path);
> +		return -1;

Can __read_cgroup_file() return -1 when its return type is size_t?

The function signature declares size_t (unsigned), but returns -1 here.
When -1 is implicitly converted to size_t, it becomes SIZE_MAX
(18446744073709551615 on 64-bit systems). Callers checking for errors
with "if (ret < 0)" will fail because size_t is unsigned, and callers
checking "if (ret == -1)" will also fail after the cast.

> +	}
> +
> +	ret = read(fd, buf, size);
> +	close(fd);
> +	return ret;

Does this handle read() errors correctly?

The read() syscall returns ssize_t which can be negative on error.
Assigning this to size_t (line 143) converts negative error codes into
large positive numbers. For example, read() returning -1 becomes SIZE_MAX
when stored in ret.

Should the return type be ssize_t instead of size_t to properly propagate
both success (byte count) and error (negative values) to callers?

> +}
> +
> +/**
> + * read_cgroup_file() - Read to a cgroup file
> + * @relative_path: The cgroup path, relative to the workdir
> + * @file: The name of the file in cgroupfs to read to
> + * @buf: Buffer to read from the file
> + * @size: Size of the buffer
> + *
> + * Read to a file in the given cgroup's directory.
> + *
> + * If successful, the number of read bytes is returned.
> + */
> +size_t read_cgroup_file(const char *relative_path, const char *file,
> +			char *buf, size_t size)
> +{
> +	char cgroup_path[PATH_MAX - 24];
> +
> +	format_cgroup_path(cgroup_path, relative_path);
> +	return __read_cgroup_file(cgroup_path, file, buf, size);
> +}
> +

[ ... ]

> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.h b/tools/testing/selftests/bpf/cgroup_helpers.h
> index 3857304be..9f9bb6b5d 100644
> --- a/tools/testing/selftests/bpf/cgroup_helpers.h
> +++ b/tools/testing/selftests/bpf/cgroup_helpers.h
> @@ -11,6 +11,8 @@
>
>  /* cgroupv2 related */
>  int enable_controllers(const char *relative_path, const char *controllers);
> +size_t read_cgroup_file(const char *relative_path, const char *file,
> +			char *buf, size_t size);
>  int write_cgroup_file(const char *relative_path, const char *file,
>  		      const char *buf);

[ ... ]


```

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `bpf: selftests: introduce read_cgroup_file() helper`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 23/23] bpf: selftests: PSI struct ops test
  2025-10-27 23:22 ` [PATCH v2 23/23] bpf: selftests: PSI struct ops test Roman Gushchin
@ 2025-10-27 23:48   ` bot+bpf-ci
  2025-10-28 17:13     ` Roman Gushchin
  2025-11-10  9:48   ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
  To: roman.gushchin, akpm
  Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
	inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
	tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 1916 bytes --]


```
commit 70a4d534559f5708a811241190cd158e83451a42
Author: Roman Gushchin <roman.gushchin@linux.dev>

bpf: selftests: PSI struct ops test

This patch adds a BPF selftest for PSI struct ops functionality,
including cgroup lifecycle callbacks, PSI trigger creation, and
memory pressure event handling.

> diff --git a/tools/testing/selftests/bpf/prog_tests/test_psi.c b/tools/testing/selftests/bpf/prog_tests/test_psi.c
> new file mode 100644
> index 000000000..b294cea0a
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_psi.c

[ ... ]

> diff --git a/tools/testing/selftests/bpf/progs/test_psi.c b/tools/testing/selftests/bpf/progs/test_psi.c
> new file mode 100644
> index 000000000..4ddec7ec3
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_psi.c
> @@ -0,0 +1,82 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#include "vmlinux.h"
> +#include <bpf/bpf_helpers.h>
> +#include <bpf/bpf_tracing.h>
> +
> +char _license[] SEC("license") = "GPL";
> +
> +#define PSI_FULL 0x80000000
> +
> +/* cgroup which will experience the high memory pressure */
> +u64 high_pressure_cgroup_id;
> +
> +/* cgroup which will be deleted */
> +u64 deleted_cgroup_id;
> +
> +/* cgroup which will be created */
> +u64 new_cgroup_id;
> +
> +/* cgroup which was deleted */
> +u64 deleted_cgroup_id;
>    ^^^^^^^^^^^^^^^^^^

Is deleted_cgroup_id intentionally declared twice here? This appears
to be a duplicate global variable declaration - the same variable is
declared at line 13 with comment "cgroup which will be deleted" and
again at line 19 with comment "cgroup which was deleted".

[ ... ]


```

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `bpf: selftests: PSI struct ops test`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-10-27 23:21 ` [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers Roman Gushchin
@ 2025-10-27 23:48   ` bot+bpf-ci
  2025-10-28 15:58     ` Chris Mason
  2025-11-10  9:31   ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
  To: roman.gushchin, akpm
  Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
	inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
	tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 2807 bytes --]


```
commit 5cd8d912032d2c2105c04660338cf4ad47d24697
Author: Roman Gushchin <roman.gushchin@linux.dev>

mm: allow specifying custom oom constraint for BPF triggers

This patch adds a new OOM constraint type CONSTRAINT_BPF and allows
BPF programs to specify a custom constraint name when calling
bpf_out_of_memory().

> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 3cbdcd013..704fc0e78 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -19,6 +19,7 @@ enum oom_constraint {
>  	CONSTRAINT_CPUSET,
>  	CONSTRAINT_MEMORY_POLICY,
>  	CONSTRAINT_MEMCG,
> +	CONSTRAINT_BPF,
>  };
>
>  enum bpf_oom_flags {

[ ... ]

> @@ -63,6 +64,9 @@ struct oom_control {
>
>  	/* Policy name */
>  	const char *bpf_policy_name;
> +
> +	/* BPF-specific constraint name */
> +	const char *bpf_constraint;
>  #endif
>  };

[ ... ]

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index d7fca4bf5..72a346261 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c

[ ... ]

> @@ -267,6 +281,9 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
>  	bool cpuset_limited = false;
>  	int nid;
>
> +	if (oc->constraint == CONSTRAINT_BPF)
> +		return CONSTRAINT_BPF;
> +
>  	if (is_memcg_oom(oc)) {
>  		oc->totalpages = mem_cgroup_get_max(oc->memcg) ?: 1;
>  		return CONSTRAINT_MEMCG;

[ ... ]

> @@ -1350,11 +1367,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>   * Returns a negative value if an error occurred.
>   */
>  __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> -				  int order, u64 flags)
> +				  int order, u64 flags,
> +				  const char *constraint_text__nullable)
>  {
>  	struct oom_control oc = {
>  		.memcg = memcg__nullable,
>  		.order = order,
> +		.constraint = CONSTRAINT_BPF,
> +		.bpf_constraint = constraint_text__nullable,
>  	};
>  	int ret;

When CONSTRAINT_BPF is set in bpf_out_of_memory(), the early return in
constrained_alloc() prevents oc->totalpages from being initialized.  This
leaves totalpages at zero (from the designated initializer).

Later in the call chain out_of_memory()->select_bad_process()->
oom_evaluate_task()->oom_badness(), the code performs division by
totalpages at line 237:

    adj *= totalpages / 1000;

Can this cause a division by zero?  The path is reachable when a BPF
program calls bpf_out_of_memory() and either no BPF OOM handler is
registered or the handler fails to free memory, causing execution to fall
through to select_bad_process().


```

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `mm: allow specifying custom oom constraint for BPF triggers`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc
  2025-10-27 23:21 ` [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
@ 2025-10-27 23:57   ` bot+bpf-ci
  2025-10-28 16:43     ` Roman Gushchin
  2025-11-10  9:46   ` Michal Hocko
  1 sibling, 1 reply; 43+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:57 UTC (permalink / raw)
  To: roman.gushchin, akpm
  Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
	inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
	tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai

[-- Attachment #1: Type: text/plain, Size: 3628 bytes --]


```
commit bce57b9b96e05731a14c7e10a364fdd9b733cda2
Author: Roman Gushchin <roman.gushchin@linux.dev>

mm: introduce bpf_out_of_memory() BPF kfunc

This patch introduces bpf_out_of_memory() BPF kfunc to allow BPF programs
to declare OOM events and trigger the kernel OOM handling mechanism. The
function accepts a nullable memcg pointer for system-wide or memcg-scoped
OOMs.

> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 721087952..3cbdcd013 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -21,6 +21,11 @@ enum oom_constraint {
>  	CONSTRAINT_MEMCG,
>  };
>
> +enum bpf_oom_flags {
> +	BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK = 1 << 0,
> +	BPF_OOM_FLAGS_LAST = 1 << 1,
> +};
> +
>  /*
>   * Details of the page allocation that triggered the oom killer that are used to
>   * determine what should be killed.
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 3c86cd755..d7fca4bf5 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c

[ ... ]

> +/**
> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
> + * @memcg__nullable: memcg or NULL for system-wide OOMs
> + * @order: order of page which wasn't allocated
> + * @flags: flags
> + * @constraint_text__nullable: custom constraint description for the OOM report
> + *
> + * Declares the Out Of Memory state and invokes the OOM killer.
> + *
> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
> + * if oom_lock is contended.
> + *
> + * Generally it's advised to pass wait_on_oom_lock=false for global OOMs
> + * and wait_on_oom_lock=true for memcg-scoped OOMs.
> + *
> + * Returns 1 if the forward progress was achieved and some memory was freed.
> + * Returns a negative value if an error occurred.
> + */
> +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
> +				  int order, u64 flags)
> +{
> +	struct oom_control oc = {
> +		.memcg = memcg__nullable,
> +		.order = order,
> +	};

Can bpf_out_of_memory() work correctly for system-wide OOMs with
uninitialized gfp_mask?

Looking at out_of_memory():

    if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
        return true;

With gfp_mask=0 and memcg=NULL (system-wide OOM), this evaluates to:
!(0 & __GFP_FS) && !false = true, causing an early return without
invoking the OOM killer. The comment in out_of_memory() explains this
check is about allocation context: "The OOM killer does not compensate
for IO-less reclaim."

All other callers of out_of_memory() initialize gfp_mask:
- mm/page_alloc.c: .gfp_mask = gfp_mask
- mm/memcontrol.c: .gfp_mask = gfp_mask

For memcg OOMs, the code works because is_memcg_oom(oc) is true,
bypassing the check. But for system-wide OOMs (NULL memcg), won't this
cause the function to return immediately without killing any process?

> +	int ret;
> +
> +	if (flags & ~(BPF_OOM_FLAGS_LAST - 1))
> +		return -EINVAL;
> +
> +	if (oc.order < 0 || oc.order > MAX_PAGE_ORDER)
> +		return -EINVAL;
> +
> +	if (flags & BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK) {
> +		ret = mutex_lock_killable(&oom_lock);
> +		if (ret)
> +			return ret;
> +	} else if (!mutex_trylock(&oom_lock))
> +		return -EBUSY;
> +
> +	ret = out_of_memory(&oc);
> +
> +	mutex_unlock(&oom_lock);
> +	return ret;
> +}

[ ... ]


```

---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

In-Reply-To-Subject: `mm: introduce bpf_out_of_memory() BPF kfunc`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-10-27 23:48   ` bot+bpf-ci
@ 2025-10-28 15:58     ` Chris Mason
  2025-10-28 16:20       ` Roman Gushchin
  0 siblings, 1 reply; 43+ messages in thread
From: Chris Mason @ 2025-10-28 15:58 UTC (permalink / raw)
  To: bot+bpf-ci, roman.gushchin, akpm
  Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
	inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
	tj, daniel, eddyz87, yonghong.song, ihor.solodrai

On 10/27/25 7:48 PM, bot+bpf-ci@kernel.org wrote:
>> @@ -1350,11 +1367,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>>   * Returns a negative value if an error occurred.
>>   */
>>  __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>> -				  int order, u64 flags)
>> +				  int order, u64 flags,
>> +				  const char *constraint_text__nullable)
>>  {
>>  	struct oom_control oc = {
>>  		.memcg = memcg__nullable,
>>  		.order = order,
>> +		.constraint = CONSTRAINT_BPF,
>> +		.bpf_constraint = constraint_text__nullable,
>>  	};
>>  	int ret;
> 
> When CONSTRAINT_BPF is set in bpf_out_of_memory(), the early return in
> constrained_alloc() prevents oc->totalpages from being initialized.  This
> leaves totalpages at zero (from the designated initializer).
> 
> Later in the call chain out_of_memory()->select_bad_process()->
> oom_evaluate_task()->oom_badness(), the code performs division by
> totalpages at line 237:
> 
>     adj *= totalpages / 1000;
> 
> Can this cause a division by zero?  The path is reachable when a BPF
> program calls bpf_out_of_memory() and either no BPF OOM handler is
> registered or the handler fails to free memory, causing execution to fall
> through to select_bad_process().

Looks like the AI got a little excited about finding the uninit variable
chain and forgot what dividing by zero really means.  I'll add a false
positive check for this.

-chris


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-10-28 15:58     ` Chris Mason
@ 2025-10-28 16:20       ` Roman Gushchin
  2025-10-28 16:35         ` Chris Mason
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 16:20 UTC (permalink / raw)
  To: Chris Mason
  Cc: bot+bpf-ci, akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt,
	hannes, andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau,
	song, memxor, tj, daniel, eddyz87, yonghong.song, ihor.solodrai

Chris Mason <clm@meta.com> writes:

> On 10/27/25 7:48 PM, bot+bpf-ci@kernel.org wrote:
>>> @@ -1350,11 +1367,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>>>   * Returns a negative value if an error occurred.
>>>   */
>>>  __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>>> -				  int order, u64 flags)
>>> +				  int order, u64 flags,
>>> +				  const char *constraint_text__nullable)
>>>  {
>>>  	struct oom_control oc = {
>>>  		.memcg = memcg__nullable,
>>>  		.order = order,
>>> +		.constraint = CONSTRAINT_BPF,
>>> +		.bpf_constraint = constraint_text__nullable,
>>>  	};
>>>  	int ret;
>> 
>> When CONSTRAINT_BPF is set in bpf_out_of_memory(), the early return in
>> constrained_alloc() prevents oc->totalpages from being initialized.  This
>> leaves totalpages at zero (from the designated initializer).
>> 
>> Later in the call chain out_of_memory()->select_bad_process()->
>> oom_evaluate_task()->oom_badness(), the code performs division by
>> totalpages at line 237:
>> 
>>     adj *= totalpages / 1000;
>> 
>> Can this cause a division by zero?  The path is reachable when a BPF
>> program calls bpf_out_of_memory() and either no BPF OOM handler is
>> registered or the handler fails to free memory, causing execution to fall
>> through to select_bad_process().
>
> Looks like the AI got a little excited about finding the uninit variable
> chain and forgot what dividing by zero really means.  I'll add a false
> positive check for this.

Yup, it was *almost* correct :)

But overall I'm really impressed: it found a few legit bugs as well.
The only thing: I wish I could run it privately before posting to
public mailing lists...

Thanks, Chris!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 17/23] bpf: selftests: introduce read_cgroup_file() helper
  2025-10-27 23:48   ` bot+bpf-ci
@ 2025-10-28 16:31     ` Roman Gushchin
  0 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 16:31 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes,
	andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau, song,
	memxor, tj, daniel, eddyz87, yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

> ```
> commit 712137918bba1b0a9c7c670bab9200b0f0d6a7e8
> Author: Roman Gushchin <roman.gushchin@linux.dev>
>
> bpf: selftests: introduce read_cgroup_file() helper
>
> This patch adds a new helper function read_cgroup_file() to read from
> cgroup control files in BPF selftests. The function opens a file,
> reads into a buffer, and returns the result.
>
>> diff --git a/tools/testing/selftests/bpf/cgroup_helpers.c b/tools/testing/selftests/bpf/cgroup_helpers.c
>> index 20cede4db..8fb02fe4c 100644
>> --- a/tools/testing/selftests/bpf/cgroup_helpers.c
>> +++ b/tools/testing/selftests/bpf/cgroup_helpers.c
>> @@ -126,6 +126,45 @@ int enable_controllers(const char *relative_path, const char *controllers)
>>  	return __enable_controllers(cgroup_path, controllers);
>>  }
>>
>> +static size_t __read_cgroup_file(const char *cgroup_path, const char *file,
>> +				 char *buf, size_t size)
>> +{
>> +	char file_path[PATH_MAX + 1];
>> +	size_t ret;
>> +	int fd;
>> +
>> +	snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
>> +	fd = open(file_path, O_RDONLY);
>> +	if (fd < 0) {
>> +		log_err("Opening %s", file_path);
>> +		return -1;
>
> Can __read_cgroup_file() return -1 when its return type is size_t?
>
> The function signature declares size_t (unsigned), but returns -1 here.
> When -1 is implicitly converted to size_t, it becomes SIZE_MAX
> (18446744073709551615 on 64-bit systems). Callers checking for errors
> with "if (ret < 0)" will fail because size_t is unsigned, and callers
> checking "if (ret == -1)" will also fail after the cast.
>
>> +	}
>> +
>> +	ret = read(fd, buf, size);
>> +	close(fd);
>> +	return ret;
>
> Does this handle read() errors correctly?
>
> The read() syscall returns ssize_t which can be negative on error.
> Assigning this to size_t (line 143) converts negative error codes into
> large positive numbers. For example, read() returning -1 becomes SIZE_MAX
> when stored in ret.
>
> Should the return type be ssize_t instead of size_t to properly propagate
> both success (byte count) and error (negative values) to callers?

Correct, fixed to ssize_t. Thanks!
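
I.e., with that change the helper looks roughly like this (sketch; same
body as the original, only the return and ret types adjusted):

	static ssize_t __read_cgroup_file(const char *cgroup_path, const char *file,
					  char *buf, size_t size)
	{
		char file_path[PATH_MAX + 1];
		ssize_t ret;
		int fd;

		snprintf(file_path, sizeof(file_path), "%s/%s", cgroup_path, file);
		fd = open(file_path, O_RDONLY);
		if (fd < 0) {
			log_err("Opening %s", file_path);
			return -1;
		}

		/* read() returns ssize_t, so errors propagate as negative values */
		ret = read(fd, buf, size);
		close(fd);
		return ret;
	}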


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-10-28 16:20       ` Roman Gushchin
@ 2025-10-28 16:35         ` Chris Mason
  0 siblings, 0 replies; 43+ messages in thread
From: Chris Mason @ 2025-10-28 16:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bot+bpf-ci, akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt,
	hannes, andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau,
	song, memxor, tj, daniel, eddyz87, yonghong.song, ihor.solodrai



On 10/28/25 12:20 PM, Roman Gushchin wrote:
> Chris Mason <clm@meta.com> writes:
> 
>> On 10/27/25 7:48 PM, bot+bpf-ci@kernel.org wrote:
>>>> @@ -1350,11 +1367,14 @@ __bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
>>>>   * Returns a negative value if an error occurred.
>>>>   */
>>>>  __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>>>> -				  int order, u64 flags)
>>>> +				  int order, u64 flags,
>>>> +				  const char *constraint_text__nullable)
>>>>  {
>>>>  	struct oom_control oc = {
>>>>  		.memcg = memcg__nullable,
>>>>  		.order = order,
>>>> +		.constraint = CONSTRAINT_BPF,
>>>> +		.bpf_constraint = constraint_text__nullable,
>>>>  	};
>>>>  	int ret;
>>>
>>> When CONSTRAINT_BPF is set in bpf_out_of_memory(), the early return in
>>> constrained_alloc() prevents oc->totalpages from being initialized.  This
>>> leaves totalpages at zero (from the designated initializer).
>>>
>>> Later in the call chain out_of_memory()->select_bad_process()->
>>> oom_evaluate_task()->oom_badness(), the code performs division by
>>> totalpages at line 237:
>>>
>>>     adj *= totalpages / 1000;
>>>
>>> Can this cause a division by zero?  The path is reachable when a BPF
>>> program calls bpf_out_of_memory() and either no BPF OOM handler is
>>> registered or the handler fails to free memory, causing execution to fall
>>> through to select_bad_process().
>>
>> Looks like the AI got a little excited about finding the uninit variable
>> chain and forgot what dividing by zero really means.  I'll add a false
>> positive check for this.
> 
> Yup, it was *almost* correct :)
> 
> But overall I'm really impressed: it found few legit bugs as well.
> The only thing: I wish I could run it privately before posting to
> public mailing lists...

I'm pretty happy with the false positive rate, and definitely appreciate
people engaging with the AI reviews to help improve things.

The BPF CI is directly running my review prompts github
(https://github.com/masoncl/review-prompts), so it's possible to run
locally with claude-code, and I'm assuming any of the other agents.

I've been refining the prompts against claude, but welcome patches to
make it work well with any of the others.

-chris



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc
  2025-10-27 23:57   ` bot+bpf-ci
@ 2025-10-28 16:43     ` Roman Gushchin
  0 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 16:43 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes,
	andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau, song,
	memxor, tj, daniel, eddyz87, yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

> ```
> commit bce57b9b96e05731a14c7e10a364fdd9b733cda2
> Author: Roman Gushchin <roman.gushchin@linux.dev>
>
> mm: introduce bpf_out_of_memory() BPF kfunc
>
> This patch introduces bpf_out_of_memory() BPF kfunc to allow BPF programs
> to declare OOM events and trigger the kernel OOM handling mechanism. The
> function accepts a nullable memcg pointer for system-wide or memcg-scoped
> OOMs.
>
>> diff --git a/include/linux/oom.h b/include/linux/oom.h
>> index 721087952..3cbdcd013 100644
>> --- a/include/linux/oom.h
>> +++ b/include/linux/oom.h
>> @@ -21,6 +21,11 @@ enum oom_constraint {
>>  	CONSTRAINT_MEMCG,
>>  };
>>
>> +enum bpf_oom_flags {
>> +	BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK = 1 << 0,
>> +	BPF_OOM_FLAGS_LAST = 1 << 1,
>> +};
>> +
>>  /*
>>   * Details of the page allocation that triggered the oom killer that are used to
>>   * determine what should be killed.
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 3c86cd755..d7fca4bf5 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>
> [ ... ]
>
>> +/**
>> + * bpf_out_of_memory - declare Out Of Memory state and invoke OOM killer
>> + * @memcg__nullable: memcg or NULL for system-wide OOMs
>> + * @order: order of page which wasn't allocated
>> + * @flags: flags
>> + * @constraint_text__nullable: custom constraint description for the OOM report
>> + *
>> + * Declares the Out Of Memory state and invokes the OOM killer.
>> + *
>> + * OOM handlers are synchronized using the oom_lock mutex. If wait_on_oom_lock
>> + * is true, the function will wait on it. Otherwise it bails out with -EBUSY
>> + * if oom_lock is contended.
>> + *
>> + * Generally it's advised to pass wait_on_oom_lock=false for global OOMs
>> + * and wait_on_oom_lock=true for memcg-scoped OOMs.
>> + *
>> + * Returns 1 if the forward progress was achieved and some memory was freed.
>> + * Returns a negative value if an error occurred.
>> + */
>> +__bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>> +				  int order, u64 flags)
>> +{
>> +	struct oom_control oc = {
>> +		.memcg = memcg__nullable,
>> +		.order = order,
>> +	};
>
> Can bpf_out_of_memory() work correctly for system-wide OOMs with
> uninitialized gfp_mask?
>
> Looking at out_of_memory():
>
>     if (!(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
>         return true;
>
> With gfp_mask=0 and memcg=NULL (system-wide OOM), this evaluates to:
> !(0 & __GFP_FS) && !false = true, causing an early return without
> invoking the OOM killer. The comment in out_of_memory() explains this
> check is about allocation context: "The OOM killer does not compensate
> for IO-less reclaim."
>
> All other callers of out_of_memory() initialize gfp_mask:
> - mm/page_alloc.c: .gfp_mask = gfp_mask
> - mm/memcontrol.c: .gfp_mask = gfp_mask
>
> For memcg OOMs, the code works because is_memcg_oom(oc) is true,
> bypassing the check. But for system-wide OOMs (NULL memcg), won't this
> cause the function to return immediately without killing any process?

This is a good catch! It must be .gfp_mask = GFP_KERNEL.

Fixed.
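
For clarity, the initializer would now read roughly (sketch):

	struct oom_control oc = {
		.memcg = memcg__nullable,
		.gfp_mask = GFP_KERNEL,
		.order = order,
	};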

Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts()
  2025-10-27 23:48   ` bot+bpf-ci
@ 2025-10-28 17:07     ` Roman Gushchin
  2025-10-28 17:24       ` Andrii Nakryiko
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 17:07 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes,
	andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau, song,
	memxor, tj, daniel, eddyz87, yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

> ```
> commit b58d54ed56054e3a550e02c324982feb20a2671e
> Author: Roman Gushchin <roman.gushchin@linux.dev>
>
> libbpf: introduce bpf_map__attach_struct_ops_opts()
>
> This commit introduces a new API function that extends the existing
> bpf_map__attach_struct_ops() to accept additional options, specifically
> allowing attachment of struct_ops maps to cgroups via a relative_fd.
>
>> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
>> index b90574f39..be56a5dee 100644
>> --- a/tools/lib/bpf/libbpf.c
>> +++ b/tools/lib/bpf/libbpf.c
>> @@ -13196,12 +13196,19 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
>>  	return close(link->fd);
>>  }
>>
>> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
>> +struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
>> +						 const struct bpf_struct_ops_opts *opts)
>>  {
>> +	DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
>>  	struct bpf_link_struct_ops *link;
>>  	__u32 zero = 0;
>>  	int err, fd;
>>
>> +	if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
>> +		pr_warn("map '%s': invalid opts\n", map->name);
>> +		return libbpf_err_ptr(-EINVAL);
>> +	}
>> +
>>  	if (!bpf_map__is_struct_ops(map)) {
>>  		pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
>>  		return libbpf_err_ptr(-EINVAL);
>
> [ ... ]
>
>>  	link = calloc(1, sizeof(*link));
>>  	if (!link)
>>  		return libbpf_err_ptr(-EINVAL);
>
> Should this return -ENOMEM instead of -EINVAL? All other calloc failures
> in libbpf.c return -ENOMEM (see lines 11006, 11802, 12288, 12746, 12829,
> 12891, 13089, 13129 in the same file).

This code was introduced by e9fc3ce99b34 ("libbpf: Streamline error
reporting for high-level APIs"), so it's not new. I agree that ENOMEM is
correct, however it might be too late to fix. I'll let the libbpf
maintainers decide.

>
> [ ... ]
>
>> -	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
>> +	link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
>> +
>> +	fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);
>
> Is the expected_revision field being propagated correctly? The new
> bpf_struct_ops_opts structure includes an expected_revision field
> (see libbpf.h), and bpf_link_create() with BPF_STRUCT_OPS reads
> opts->cgroup.expected_revision (see bpf.c line 889), but this code only
> sets relative_fd in link_opts. Should it also include:
>
>     link_opts.cgroup.expected_revision = OPTS_GET(opts,
>     expected_revision, 0);

Correct, fixed.

Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 23/23] bpf: selftests: PSI struct ops test
  2025-10-27 23:48   ` bot+bpf-ci
@ 2025-10-28 17:13     ` Roman Gushchin
  2025-10-28 17:30       ` Alexei Starovoitov
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 17:13 UTC (permalink / raw)
  To: bot+bpf-ci
  Cc: akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes,
	andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau, song,
	memxor, tj, daniel, eddyz87, yonghong.song, clm, ihor.solodrai

bot+bpf-ci@kernel.org writes:

> ```
> commit 70a4d534559f5708a811241190cd158e83451a42
> Author: Roman Gushchin <roman.gushchin@linux.dev>
>
> bpf: selftests: PSI struct ops test
>
> This patch adds a BPF selftest for PSI struct ops functionality,
> including cgroup lifecycle callbacks, PSI trigger creation, and
> memory pressure event handling.
>
>> diff --git a/tools/testing/selftests/bpf/prog_tests/test_psi.c b/tools/testing/selftests/bpf/prog_tests/test_psi.c
>> new file mode 100644
>> index 000000000..b294cea0a
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/prog_tests/test_psi.c
>
> [ ... ]
>
>> diff --git a/tools/testing/selftests/bpf/progs/test_psi.c b/tools/testing/selftests/bpf/progs/test_psi.c
>> new file mode 100644
>> index 000000000..4ddec7ec3
>> --- /dev/null
>> +++ b/tools/testing/selftests/bpf/progs/test_psi.c
>> @@ -0,0 +1,82 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +#include "vmlinux.h"
>> +#include <bpf/bpf_helpers.h>
>> +#include <bpf/bpf_tracing.h>
>> +
>> +char _license[] SEC("license") = "GPL";
>> +
>> +#define PSI_FULL 0x80000000
>> +
>> +/* cgroup which will experience the high memory pressure */
>> +u64 high_pressure_cgroup_id;
>> +
>> +/* cgroup which will be deleted */
>> +u64 deleted_cgroup_id;
>> +
>> +/* cgroup which will be created */
>> +u64 new_cgroup_id;
>> +
>> +/* cgroup which was deleted */
>> +u64 deleted_cgroup_id;
>>    ^^^^^^^^^^^^^^^^^^
>
> Is deleted_cgroup_id intentionally declared twice here? This appears
> to be a duplicate global variable declaration - the same variable is
> declared at line 13 with comment "cgroup which will be deleted" and
> again at line 19 with comment "cgroup which was deleted".

Correct, fixed.

Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts()
  2025-10-28 17:07     ` Roman Gushchin
@ 2025-10-28 17:24       ` Andrii Nakryiko
  0 siblings, 0 replies; 43+ messages in thread
From: Andrii Nakryiko @ 2025-10-28 17:24 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bot+bpf-ci, akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt,
	hannes, andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau,
	song, memxor, tj, daniel, eddyz87, yonghong.song, clm,
	ihor.solodrai

On Tue, Oct 28, 2025 at 10:07 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> bot+bpf-ci@kernel.org writes:
>
> > ```
> > commit b58d54ed56054e3a550e02c324982feb20a2671e
> > Author: Roman Gushchin <roman.gushchin@linux.dev>
> >
> > libbpf: introduce bpf_map__attach_struct_ops_opts()
> >
> > This commit introduces a new API function that extends the existing
> > bpf_map__attach_struct_ops() to accept additional options, specifically
> > allowing attachment of struct_ops maps to cgroups via a relative_fd.
> >
> >> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> >> index b90574f39..be56a5dee 100644
> >> --- a/tools/lib/bpf/libbpf.c
> >> +++ b/tools/lib/bpf/libbpf.c
> >> @@ -13196,12 +13196,19 @@ static int bpf_link__detach_struct_ops(struct bpf_link *link)
> >>      return close(link->fd);
> >>  }
> >>
> >> -struct bpf_link *bpf_map__attach_struct_ops(const struct bpf_map *map)
> >> +struct bpf_link *bpf_map__attach_struct_ops_opts(const struct bpf_map *map,
> >> +                                             const struct bpf_struct_ops_opts *opts)
> >>  {
> >> +    DECLARE_LIBBPF_OPTS(bpf_link_create_opts, link_opts);
> >>      struct bpf_link_struct_ops *link;
> >>      __u32 zero = 0;
> >>      int err, fd;
> >>
> >> +    if (!OPTS_VALID(opts, bpf_struct_ops_opts)) {
> >> +            pr_warn("map '%s': invalid opts\n", map->name);
> >> +            return libbpf_err_ptr(-EINVAL);
> >> +    }
> >> +
> >>      if (!bpf_map__is_struct_ops(map)) {
> >>              pr_warn("map '%s': can't attach non-struct_ops map\n", map->name);
> >>              return libbpf_err_ptr(-EINVAL);
> >
> > [ ... ]
> >
> >>      link = calloc(1, sizeof(*link));
> >>      if (!link)
> >>              return libbpf_err_ptr(-EINVAL);
> >
> > Should this return -ENOMEM instead of -EINVAL? All other calloc failures
> > in libbpf.c return -ENOMEM (see lines 11006, 11802, 12288, 12746, 12829,
> > 12891, 13089, 13129 in the same file).
>
> This code was introduced by e9fc3ce99b34 ("libbpf: Streamline error
> reporting for high-level APIs"), so it's new. I agree that ENOMEM is
> correct, however it might be too late to fix. I'll let libbpf
> maintainers to decide.

yeah, let's fix this to return -ENOMEM

>
> >
> > [ ... ]
> >
> >> -    fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, NULL);
> >> +    link_opts.cgroup.relative_fd = OPTS_GET(opts, relative_fd, 0);
> >> +
> >> +    fd = bpf_link_create(map->fd, 0, BPF_STRUCT_OPS, &link_opts);
> >
> > Is the expected_revision field being propagated correctly? The new
> > bpf_struct_ops_opts structure includes an expected_revision field
> > (see libbpf.h), and bpf_link_create() with BPF_STRUCT_OPS reads
> > opts->cgroup.expected_revision (see bpf.c line 889), but this code only
> > sets relative_fd in link_opts. Should it also include:
> >
> >     link_opts.cgroup.expected_revision = OPTS_GET(opts,
> >     expected_revision, 0);
>
> Correct, fixed.

I haven't looked at the rest of the patches, but this use of relative_fd
seems wrong. relative_fd/relative_id and expected_revision are there
for ordering of programs within the same attach target (e.g., same
cgroup). If you just want to specify the cgroup to attach to, I think you
should use attr.link_create.target_fd (which is already handled a bit
lower generically)
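
I.e. something along these lines (sketch; the cgroup_fd opts field name is
just for illustration, not part of the proposed API):

	/* pass the cgroup to attach to as the link target, not via relative_fd */
	fd = bpf_link_create(map->fd, OPTS_GET(opts, cgroup_fd, 0),
			     BPF_STRUCT_OPS, &link_opts);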

>
> Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 23/23] bpf: selftests: PSI struct ops test
  2025-10-28 17:13     ` Roman Gushchin
@ 2025-10-28 17:30       ` Alexei Starovoitov
  0 siblings, 0 replies; 43+ messages in thread
From: Alexei Starovoitov @ 2025-10-28 17:30 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: bot+bpf-ci, Andrew Morton, LKML, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, inwardvessel, linux-mm,
	open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau, Song Liu,
	Kumar Kartikeya Dwivedi, Tejun Heo, Daniel Borkmann, Eduard,
	Yonghong Song, Chris Mason, Ihor Solodrai

On Tue, Oct 28, 2025 at 10:13 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
> >> +
> >> +/* cgroup which will be deleted */
> >> +u64 deleted_cgroup_id;
> >> +
> >> +/* cgroup which will be created */
> >> +u64 new_cgroup_id;
> >> +
> >> +/* cgroup which was deleted */
> >> +u64 deleted_cgroup_id;
> >>    ^^^^^^^^^^^^^^^^^^
> >
> > Is deleted_cgroup_id intentionally declared twice here? This appears
> > to be a duplicate global variable declaration - the same variable is
> > declared at line 13 with comment "cgroup which will be deleted" and
> > again at line 19 with comment "cgroup which was deleted".
>
> Correct, fixed.

wow. TIL.

I didn't know that C allows such double variable definition
outside of function bodies.
Even with -Wall gcc/clang don't warn about it.
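
(Apparently it's the C "tentative definition" rule: at file scope

	u64 deleted_cgroup_id;	/* tentative definition */
	u64 deleted_cgroup_id;	/* allowed; merged into the same object */

both lines refer to the same object, so the compiler accepts it silently.)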


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc
  2025-10-27 23:21 ` [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
@ 2025-10-28 17:32   ` Tejun Heo
  2025-10-28 18:09     ` Roman Gushchin
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2025-10-28 17:32 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi

On Mon, Oct 27, 2025 at 04:21:58PM -0700, Roman Gushchin wrote:
> Export tsk_is_oom_victim() helper as a BPF kfunc.
> It's very useful to avoid redundant oom kills.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  mm/oom_kill.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 72a346261c79..90bb86dee3cf 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -1397,11 +1397,25 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>  	return ret;
>  }
>  
> +/**
> + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
> + * @task: task to check
> + *
> + * Returns true if the task has been previously selected by the OOM killer
> + * to be killed. It's expected that the task will be destroyed soon and some
> + * memory will be freed, so maybe no additional actions required.
> + */
> +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
> +{
> +	return tsk_is_oom_victim(task);
> +}

In general, I'm not sure it's a good idea to add kfuncs for things which are
trivially accessible. Why can't things like this be provided as BPF helpers?
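
E.g. (just a sketch) a BPF program can already open-code the same check
with CO-RE, since an OOM victim is marked by signal->oom_mm:

	/* BPF_CORE_READ() from <bpf/bpf_core_read.h> */
	static bool task_is_oom_victim(struct task_struct *p)
	{
		return BPF_CORE_READ(p, signal, oom_mm) != NULL;
	}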

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops
  2025-10-27 23:22 ` [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
@ 2025-10-28 17:40   ` Tejun Heo
  2025-10-28 18:29     ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2025-10-28 17:40 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi

Hello,

On Mon, Oct 27, 2025 at 04:22:03PM -0700, Roman Gushchin wrote:
> This patch implements a BPF struct ops-based mechanism to create
> PSI triggers, attach them to cgroups or system wide and handle
> PSI events in BPF.
> 
> The struct ops provides 4 callbacks:
>   - init() called once at load, handy for creating PSI triggers
>   - handle_psi_event() called every time a PSI trigger fires
>   - handle_cgroup_online() called when a new cgroup is created
>   - handle_cgroup_offline() called if a cgroup with an attached
>     trigger is deleted
> 
> A single struct ops can create a number of PSI triggers, both
> cgroup-scoped and system-wide.
> 
> All 4 struct ops callbacks can be sleepable. handle_psi_event()
> handlers are executed using a separate workqueue, so it won't
> affect the latency of other PSI triggers.

Here, too, I wonder whether it's necessary to build a hard-coded
infrastructure to hook into PSI's triggers. psi_avgs_work() is what triggers
these events and it's not that hot. Wouldn't a fexit attachment to that
function that reads the updated values be enough? We can also easily add a
TP there if a more structured access is desirable.
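
Roughly something like this (just a sketch, assuming psi_avgs_work() is
visible in BTF):

	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	char _license[] SEC("license") = "GPL";

	SEC("fexit/psi_avgs_work")
	int BPF_PROG(on_psi_avgs_work, struct work_struct *work)
	{
		/*
		 * The averages have just been refreshed; the psi_group can
		 * be recovered from @work (container_of through the embedded
		 * delayed_work) and its avg[] read with CO-RE, instead of
		 * going through dedicated trigger callbacks.
		 */
		return 0;
	}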

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc
  2025-10-28 17:32   ` Tejun Heo
@ 2025-10-28 18:09     ` Roman Gushchin
  2025-10-28 18:31       ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 18:09 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi

Tejun Heo <tj@kernel.org> writes:

> On Mon, Oct 27, 2025 at 04:21:58PM -0700, Roman Gushchin wrote:
>> Export tsk_is_oom_victim() helper as a BPF kfunc.
>> It's very useful to avoid redundant oom kills.
>> 
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>>  mm/oom_kill.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>> 
>> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
>> index 72a346261c79..90bb86dee3cf 100644
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1397,11 +1397,25 @@ __bpf_kfunc int bpf_out_of_memory(struct mem_cgroup *memcg__nullable,
>>  	return ret;
>>  }
>>  
>> +/**
>> + * bpf_task_is_oom_victim - Check if the task has been marked as an OOM victim
>> + * @task: task to check
>> + *
>> + * Returns true if the task has been previously selected by the OOM killer
>> + * to be killed. It's expected that the task will be destroyed soon and some
>> + * memory will be freed, so maybe no additional actions required.
>> + */
>> +__bpf_kfunc bool bpf_task_is_oom_victim(struct task_struct *task)
>> +{
>> +	return tsk_is_oom_victim(task);
>> +}
>
> In general, I'm not sure it's a good idea to add kfuncs for things which are
> trivially accessible. Why can't things like this be provided as BPF
> helpers?

I agree that this one might be too trivial, but I added it based on the
request from Michal Hocko. But with other helpers (e.g. for accessing
memcg stats) the idea is to provide a relatively stable interface for
bpf programs, one that does not depend on implementation details. This
will simplify the maintenance of bpf programs across multiple kernel
versions.

Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops
  2025-10-28 17:40   ` Tejun Heo
@ 2025-10-28 18:29     ` Roman Gushchin
  2025-10-28 18:35       ` Tejun Heo
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 18:29 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi

Tejun Heo <tj@kernel.org> writes:

> Hello,
>
> On Mon, Oct 27, 2025 at 04:22:03PM -0700, Roman Gushchin wrote:
>> This patch implements a BPF struct ops-based mechanism to create
>> PSI triggers, attach them to cgroups or system wide and handle
>> PSI events in BPF.
>> 
>> The struct ops provides 4 callbacks:
>>   - init() called once at load, handy for creating PSI triggers
>>   - handle_psi_event() called every time a PSI trigger fires
>>   - handle_cgroup_online() called when a new cgroup is created
>>   - handle_cgroup_offline() called if a cgroup with an attached
>>     trigger is deleted
>> 
>> A single struct ops can create a number of PSI triggers, both
>> cgroup-scoped and system-wide.
>> 
>> All 4 struct ops callbacks can be sleepable. handle_psi_event()
>> handlers are executed using a separate workqueue, so it won't
>> affect the latency of other PSI triggers.
>
> Here, too, I wonder whether it's necessary to build a hard-coded
> infrastructure to hook into PSI's triggers. psi_avgs_work() is what triggers
> these events and it's not that hot. Wouldn't a fexit attachment to that
> function that reads the updated values be enough? We can also easily add a
> TP there if a more structured access is desirable.

Idk, it would require re-implementing parts of the kernel PSI trigger code
in BPF, without clear benefits.

Handling PSI in BPF might be quite useful outside of the OOM handling,
e.g. it can be used for scheduling decisions, networking throttling,
memory tiering, etc. So maybe I'm biased (and I obviously am here), but
I'm not too concerned about adding infrastructure which won't be used.

But I understand your point. I personally feel that the added complexity of
the infrastructure makes writing and maintaining BPF PSI programs
simpler, but I'm open to other opinions here.

Thanks


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc
  2025-10-28 18:09     ` Roman Gushchin
@ 2025-10-28 18:31       ` Tejun Heo
  0 siblings, 0 replies; 43+ messages in thread
From: Tejun Heo @ 2025-10-28 18:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi

Hello,

On Tue, Oct 28, 2025 at 11:09:28AM -0700, Roman Gushchin wrote:
> > In general, I'm not sure it's a good idea to add kfuncs for things which are
> > trivially accessible. Why can't things like this be provided as BPF
> > helpers?
> 
> I agree that this one might be too trivial, but I added it based on the
> request from Michal Hocko. But with other helpers (e.g. for accessing
> memcg stats) the idea is to provide a relatively stable interface for
> bpf programs, which is not dependent on the implementation details. This
> will simplify the maintenance of bpf programs across multiple kernel
> versions.

This is an abstract subject and thus a bit difficult to argue concretely.
I'll just share my take on it based on my experience w/ sched_ext.

The main problem with "I'll add enough interfaces to keep the BPF programs
stable" is that it's really difficult to foresee how BPF programs will
actually use them. You may have certain ideas on what information they would
consume and how but other people may have completely different ideas. After
all, that's why we want this to be BPF defined.

Projecting to the future, there's a pretty good chance that some programs
will be using a mix of these hard coded interfaces and other generic BPF hooks
and mechanisms. At that point, when you want to add access to something new,
the decision becomes awkward. Adding a new hard coded interface doesn't
really enable anything that isn't possible otherwise while creating
compatibility problems for older kernels.

The other side of that coin is that BPF has a lot of mechanisms that support
backward and forward binary compatibility such as CO-RE relocations,
polymorphic struct_ops and kfunc matching, type-aware symbol existence
testing and so on. It really is not that difficult to maintain a pretty
large sliding window of compatibility using these mechanisms and I believe
concerns over interface stability are overblown.
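
For example (sketch), a program can probe for one of the proposed kfuncs
and fall back to CO-RE when running on a kernel that doesn't have it:

	extern unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg) __ksym __weak;

	unsigned long usage;
	...
	if (bpf_ksym_exists(bpf_mem_cgroup_usage))
		usage = bpf_mem_cgroup_usage(memcg);
	else
		usage = BPF_CORE_READ(memcg, memory.usage.counter);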

So, my take is that trying to bolster interface stability doesn't really
solve serious enough problems that justify the downsides of adding those
hard coded interface. There are just better and more flexible ways to deal
with them.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops
  2025-10-28 18:29     ` Roman Gushchin
@ 2025-10-28 18:35       ` Tejun Heo
  2025-10-28 19:54         ` Roman Gushchin
  0 siblings, 1 reply; 43+ messages in thread
From: Tejun Heo @ 2025-10-28 18:35 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi

Hello,

On Tue, Oct 28, 2025 at 11:29:31AM -0700, Roman Gushchin wrote:
> > Here, too, I wonder whether it's necessary to build a hard-coded
> > infrastructure to hook into PSI's triggers. psi_avgs_work() is what triggers
> > these events and it's not that hot. Wouldn't a fexit attachment to that
> > function that reads the updated values be enough? We can also easily add a
> > TP there if a more structured access is desirable.
> 
> Idk, it would require re-implementing parts of the kernel PSI trigger code
> in BPF, without clear benefits.
> 
> Handling PSI in BPF might be quite useful outside of the OOM handling,
> e.g. it can be used for scheduling decisions, networking throttling,
> memory tiering, etc. So maybe I'm biased (and I'm obviously am here), but
> I'm not too concerned about adding infrastructure which won't be used.
> 
> But I understand your point. I personally feel that the added complexity of
> the infrastructure makes writing and maintaining BPF PSI programs
> simpler, but I'm open to other opinions here.

Yeah, I mean, I'm not necessarily against adding infrastructure if the need
is justified - i.e. it enables new things which aren't reasonably feasible
otherwise. However, it's also a good idea to start small, iterate and build
up. It's always easier to add new things than to remove stuff which is
already out there. Wouldn't it make more sense to add the minimum mechanism,
see how things develop and add what's identified as missing in the process?

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops
  2025-10-28 18:35       ` Tejun Heo
@ 2025-10-28 19:54         ` Roman Gushchin
  0 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-10-28 19:54 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi

Tejun Heo <tj@kernel.org> writes:

> Hello,
>
> On Tue, Oct 28, 2025 at 11:29:31AM -0700, Roman Gushchin wrote:
>> > Here, too, I wonder whether it's necessary to build a hard-coded
>> > infrastructure to hook into PSI's triggers. psi_avgs_work() is what triggers
>> > these events and it's not that hot. Wouldn't a fexit attachment to that
>> > function that reads the updated values be enough? We can also easily add a
>> > TP there if a more structured access is desirable.
>> 
>> Idk, it would require re-implementing parts of the kernel PSI trigger code
>> in BPF, without clear benefits.
>> 
>> Handling PSI in BPF might be quite useful outside of the OOM handling,
>> e.g. it can be used for scheduling decisions, networking throttling,
>> memory tiering, etc. So maybe I'm biased (and I'm obviously am here), but
>> I'm not too concerned about adding infrastructure which won't be used.
>> 
>> But I understand your point. I personally feel that the added complexity of
>> the infrastructure makes writing and maintaining BPF PSI programs
>> simpler, but I'm open to other opinions here.
>
> Yeah, I mean, I'm not necessarily against adding infrastructure if the need
> is justified - ie. it enables new things which isn't reasonably feasible
> otherwise. However, it's also a good idea to start small, iterate and build
> up. It's always easier to add new things than to remove stuff which is
> already out there. Wouldn't it make more sense to add the minimum mechanism,
> see how things develop and add what's identified as missing in the
> process?

Ok, let me try the TP approach and see how it looks.
If I don't see any significant downsides, I'll drop the BPF PSI trigger
infrastructure.

Thanks!


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-10-27 23:21 ` [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
@ 2025-11-10  9:31   ` Michal Hocko
  2025-11-11 19:17     ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2025-11-10  9:31 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

On Mon 27-10-25 16:21:57, Roman Gushchin wrote:
> Currently there is a hard-coded list of possible oom constraints:
> NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF.
> Also, add an ability to specify a custom constraint name
> when calling bpf_out_of_memory(). If an empty string is passed
> as an argument, CONSTRAINT_BPF is displayed.

Constraint is meant to define the scope of the oom handler, but to me it
seems like you want to specify the oom handler and are (ab)using the scope
for that. In other words, it still makes sense to distinguish memcg, global,
and mempolicy-wide OOMs with a global vs. bpf handler, right?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc
  2025-10-27 23:21 ` [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
  2025-10-27 23:57   ` bot+bpf-ci
@ 2025-11-10  9:46   ` Michal Hocko
  2025-11-11 19:13     ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2025-11-10  9:46 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

On Mon 27-10-25 16:21:56, Roman Gushchin wrote:
> Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> an out of memory events and trigger the corresponding kernel OOM
> handling mechanism.
> 
> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> as an argument, as well as the page order.
> 
> If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
> can be declared and handled in the system at once, so if the function
> is called in parallel to another OOM handling, it bails out with -EBUSY.
> This mode is suited for global OOM's: any concurrent OOMs will likely
> do the job and release some memory. In a blocking mode (which is
> suited for memcg OOMs) the execution will wait on the oom_lock mutex.

Rather than relying on BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK would it make
sense to take the oom_lock based on the oc->memcg so that this is
completely transparent to specific oom bpf handlers?
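
I.e. something like (sketch):

	if (is_memcg_oom(&oc)) {
		ret = mutex_lock_killable(&oom_lock);
		if (ret)
			return ret;
	} else if (!mutex_trylock(&oom_lock))
		return -EBUSY;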
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 23/23] bpf: selftests: PSI struct ops test
  2025-10-27 23:22 ` [PATCH v2 23/23] bpf: selftests: PSI struct ops test Roman Gushchin
  2025-10-27 23:48   ` bot+bpf-ci
@ 2025-11-10  9:48   ` Michal Hocko
  2025-11-11 19:03     ` Roman Gushchin
  1 sibling, 1 reply; 43+ messages in thread
From: Michal Hocko @ 2025-11-10  9:48 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

On Mon 27-10-25 16:22:06, Roman Gushchin wrote:
> Add a PSI struct ops test.
> 
> The test creates a cgroup with two child sub-cgroups, sets up
> memory.high for one of those and puts there a memory hungry
> process (initially frozen).
> 
> Then it creates 2 PSI triggers from within a init() BPF callback and
> attaches them to these cgroups.  Then it deletes the first cgroup,
> creates another one and runs the memory hungry task. From the cgroup
> creation callback the test is creating another trigger.
> 
> The memory hungry task is creating a high memory pressure in one
> memory cgroup, which triggers a PSI event. The PSI BPF handler
> declares a memcg oom in the corresponding cgroup. Finally the checks
> that both handle_cgroup_free() and handle_psi_event() handlers were
> executed, the correct process was killed and oom counters were
> updated.

I might be just dense but what is behind that deleted cgroup
(deleted_cgroup_id etc) dance?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 23/23] bpf: selftests: PSI struct ops test
  2025-11-10  9:48   ` Michal Hocko
@ 2025-11-11 19:03     ` Roman Gushchin
  0 siblings, 0 replies; 43+ messages in thread
From: Roman Gushchin @ 2025-11-11 19:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

Michal Hocko <mhocko@suse.com> writes:

> On Mon 27-10-25 16:22:06, Roman Gushchin wrote:
>> Add a PSI struct ops test.
>> 
>> The test creates a cgroup with two child sub-cgroups, sets up
>> memory.high for one of those and puts there a memory hungry
>> process (initially frozen).
>> 
>> Then it creates 2 PSI triggers from within a init() BPF callback and
>> attaches them to these cgroups.  Then it deletes the first cgroup,
>> creates another one and runs the memory hungry task. From the cgroup
>> creation callback the test is creating another trigger.
>> 
>> The memory hungry task is creating a high memory pressure in one
>> memory cgroup, which triggers a PSI event. The PSI BPF handler
>> declares a memcg oom in the corresponding cgroup. Finally the checks
>> that both handle_cgroup_free() and handle_psi_event() handlers were
>> executed, the correct process was killed and oom counters were
>> updated.
>
> I might be just dense but what is behind that deleted cgroup
> (deleted_cgroup_id etc) dance?

It was a way to test the handle_cgroup_free() callback, which might
go away in the next version. If it stays, I'll add more comments
around it.

Thanks


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc
  2025-11-10  9:46   ` Michal Hocko
@ 2025-11-11 19:13     ` Roman Gushchin
  2025-11-12  7:50       ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-11-11 19:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

Michal Hocko <mhocko@suse.com> writes:

> On Mon 27-10-25 16:21:56, Roman Gushchin wrote:
>> Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
>> an out of memory events and trigger the corresponding kernel OOM
>> handling mechanism.
>> 
>> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
>> as an argument, as well as the page order.
>> 
>> If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
>> can be declared and handled in the system at once, so if the function
>> is called in parallel to another OOM handling, it bails out with -EBUSY.
>> This mode is suited for global OOM's: any concurrent OOMs will likely
>> do the job and release some memory. In a blocking mode (which is
>> suited for memcg OOMs) the execution will wait on the oom_lock mutex.
>
> Rather than relying on BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK would it make
> sense to take the oom_lock based on the oc->memcg so that this is
> completely transparent to specific oom bpf handlers?

Idk, I don't have a super-strong opinion here, but giving the user the
flexibility seems to be more future-proof. E.g. if we split the oom lock
so that we can have competing OOMs in different parts of the memcg tree,
will we change the behavior?


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-11-10  9:31   ` Michal Hocko
@ 2025-11-11 19:17     ` Roman Gushchin
  2025-11-12  7:52       ` Michal Hocko
  0 siblings, 1 reply; 43+ messages in thread
From: Roman Gushchin @ 2025-11-11 19:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

Michal Hocko <mhocko@suse.com> writes:

> On Mon 27-10-25 16:21:57, Roman Gushchin wrote:
>> Currently there is a hard-coded list of possible oom constraints:
>> NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF.
>> Also, add an ability to specify a custom constraint name
>> when calling bpf_out_of_memory(). If an empty string is passed
>> as an argument, CONSTRAINT_BPF is displayed.
>
> Constrain is meant to define the scope of the oom handler but to me it
> seems like you want to specify the oom handler and (ab)using scope for
> that. In other words it still makes sense to distinguesh memcg, global,
> mempolicy wide OOMs with global vs. bpf handler, right?

I use the word "constraint" as the "reason" why an OOM was declared (in
other words which constraint was violated). And memcg vs global define
the scope. Right now the only way to trigger a memcg oom is to exceed
the memory.max limit. But with bpf oom there will be others, e.g. exceeding a
certain PSI threshold. So you can have different constraints violated
within the same scope.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc
  2025-11-11 19:13     ` Roman Gushchin
@ 2025-11-12  7:50       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2025-11-12  7:50 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

On Tue 11-11-25 11:13:04, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 27-10-25 16:21:56, Roman Gushchin wrote:
> >> Introduce bpf_out_of_memory() bpf kfunc, which allows to declare
> >> an out of memory events and trigger the corresponding kernel OOM
> >> handling mechanism.
> >> 
> >> It takes a trusted memcg pointer (or NULL for system-wide OOMs)
> >> as an argument, as well as the page order.
> >> 
> >> If the BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK flag is not set, only one OOM
> >> can be declared and handled in the system at once, so if the function
> >> is called in parallel to another OOM handling, it bails out with -EBUSY.
> >> This mode is suited for global OOM's: any concurrent OOMs will likely
> >> do the job and release some memory. In a blocking mode (which is
> >> suited for memcg OOMs) the execution will wait on the oom_lock mutex.
> >
> > Rather than relying on BPF_OOM_FLAGS_WAIT_ON_OOM_LOCK would it make
> > sense to take the oom_lock based on the oc->memcg so that this is
> > completely transparent to specific oom bpf handlers?
> 
> Idk, I don't have a super-strong opinion here, but giving the user the
> flexibility seems to be more future-proof. E.g. if we split oom lock
> so that we can have competing OOMs in different parts of the memcg tree,
> will we change the behavior?

The point I've tried to make is that this OOM invocation is no different
from the global one from the locking perspective. Adding an external
flag to control the behavior might be slightly more flexible but it adds
a new element. Unless there is a very strong reason for that I would go
with the existing locking model.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers
  2025-11-11 19:17     ` Roman Gushchin
@ 2025-11-12  7:52       ` Michal Hocko
  0 siblings, 0 replies; 43+ messages in thread
From: Michal Hocko @ 2025-11-12  7:52 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
	Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
	Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
	Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo

On Tue 11-11-25 11:17:48, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Mon 27-10-25 16:21:57, Roman Gushchin wrote:
> >> Currently there is a hard-coded list of possible oom constraints:
> >> NONE, CPUSET, MEMORY_POLICY & MEMCG. Add a new one: CONSTRAINT_BPF.
> >> Also, add an ability to specify a custom constraint name
> >> when calling bpf_out_of_memory(). If an empty string is passed
> >> as an argument, CONSTRAINT_BPF is displayed.
> >
> > Constrain is meant to define the scope of the oom handler but to me it
> > seems like you want to specify the oom handler and (ab)using scope for
> > that. In other words it still makes sense to distinguesh memcg, global,
> > mempolicy wide OOMs with global vs. bpf handler, right?
> 
> I use the word "constraint" as the "reason" why an OOM was declared (in
> other words which constraint was violated). And memcg vs global define
> the scope. Right now the only way to trigger a memcg oom is to exceed
> the memory.max limit. But with bpf oom there will others, e.g. exceed a
> certain PSI threshold. So you can have different constraints violated
> within the same scope.

Please use a different placeholder for that. Current constraints have
well-defined semantics. They are not claiming why the OOM happened but
what the scope of the oom action is (domain if you will). The specific
handler has sufficient knowledge to explain why the OOM killing is
happening and on which domain/scope/constraint.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2025-11-12  7:52 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-27 23:21 [PATCH v2 11/23] mm: introduce BPF kfunc to access memory events Roman Gushchin
2025-10-27 23:21 ` [PATCH v2 12/23] bpf: selftests: selftests for memcg stat kfuncs Roman Gushchin
2025-10-27 23:21 ` [PATCH v2 13/23] mm: introduce bpf_out_of_memory() BPF kfunc Roman Gushchin
2025-10-27 23:57   ` bot+bpf-ci
2025-10-28 16:43     ` Roman Gushchin
2025-11-10  9:46   ` Michal Hocko
2025-11-11 19:13     ` Roman Gushchin
2025-11-12  7:50       ` Michal Hocko
2025-10-27 23:21 ` [PATCH v2 14/23] mm: allow specifying custom oom constraint for BPF triggers Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 15:58     ` Chris Mason
2025-10-28 16:20       ` Roman Gushchin
2025-10-28 16:35         ` Chris Mason
2025-11-10  9:31   ` Michal Hocko
2025-11-11 19:17     ` Roman Gushchin
2025-11-12  7:52       ` Michal Hocko
2025-10-27 23:21 ` [PATCH v2 15/23] mm: introduce bpf_task_is_oom_victim() kfunc Roman Gushchin
2025-10-28 17:32   ` Tejun Heo
2025-10-28 18:09     ` Roman Gushchin
2025-10-28 18:31       ` Tejun Heo
2025-10-27 23:21 ` [PATCH v2 16/23] libbpf: introduce bpf_map__attach_struct_ops_opts() Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 17:07     ` Roman Gushchin
2025-10-28 17:24       ` Andrii Nakryiko
2025-10-27 23:22 ` [PATCH v2 17/23] bpf: selftests: introduce read_cgroup_file() helper Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 16:31     ` Roman Gushchin
2025-10-27 23:22 ` [PATCH v2 18/23] bpf: selftests: BPF OOM handler test Roman Gushchin
2025-10-27 23:22 ` [PATCH v2 19/23] sched: psi: refactor psi_trigger_create() Roman Gushchin
2025-10-27 23:22 ` [PATCH v2 20/23] sched: psi: implement bpf_psi struct ops Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 17:40   ` Tejun Heo
2025-10-28 18:29     ` Roman Gushchin
2025-10-28 18:35       ` Tejun Heo
2025-10-28 19:54         ` Roman Gushchin
2025-10-27 23:22 ` [PATCH v2 21/23] sched: psi: implement bpf_psi_create_trigger() kfunc Roman Gushchin
2025-10-27 23:22 ` [PATCH v2 22/23] bpf: selftests: add config for psi Roman Gushchin
2025-10-27 23:22 ` [PATCH v2 23/23] bpf: selftests: PSI struct ops test Roman Gushchin
2025-10-27 23:48   ` bot+bpf-ci
2025-10-28 17:13     ` Roman Gushchin
2025-10-28 17:30       ` Alexei Starovoitov
2025-11-10  9:48   ` Michal Hocko
2025-11-11 19:03     ` Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).