* [PATCH v2 00/23] mm: BPF OOM
@ 2025-10-27 23:17 Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 01/23] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
` (10 more replies)
0 siblings, 11 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
This patchset adds the ability to customize out-of-memory (OOM)
handling using bpf.
It focuses on two parts:
1) OOM handling policy,
2) PSI-based OOM invocation.
The idea of using bpf to customize OOM handling is not new, but
unlike the previous proposal [1], which augmented the existing task
ranking policy, this one tries to be as generic as possible and
leverage the full power of modern bpf.
It provides a generic interface which is called before the existing OOM
killer code and allows implementing any policy, e.g. picking a victim
task or memory cgroup, or potentially even releasing memory in other
ways, such as deleting tmpfs files (the latter might require some
additional but relatively simple changes).
The past attempt to implement a memory-cgroup aware policy [2] showed
that there are multiple opinions on what the best policy is. As it's
highly workload-dependent and specific to a concrete way of organizing
workloads, the structure of the cgroup tree etc., a customizable
bpf-based implementation is preferable to an in-kernel implementation
with a dozen sysctls.
The second part is related to the fundamental question of when to
declare an OOM event. It's a trade-off between the risk of
unnecessary OOM kills and the associated work losses, and the risk of
infinite thrashing and effective soft lockups. In the last few years
several PSI-based userspace solutions were developed (e.g. OOMd [3] or
systemd-OOMd [4]). The common idea was to use userspace daemons to
implement custom OOM logic as well as rely on PSI monitoring to avoid
stalls. In this scenario the userspace daemon was supposed to handle
the majority of OOMs, while the in-kernel OOM killer worked as a
last-resort measure to guarantee that the system would never deadlock
on memory. But this approach creates additional infrastructure
churn: a userspace OOM daemon is a separate entity which needs to be
deployed, updated and monitored. A completely different pipeline needs
to be built to monitor both types of OOM events and collect the
associated logs. A userspace daemon is also more restricted in terms
of what data is available to it. Implementing a daemon which can work
reliably under heavy memory pressure is tricky as well.
This patchset includes the code, tests and many ideas from the patchset
of JP Kobryn, which implemented bpf kfuncs to provide a faster method
to access memcg data [5].
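For illustration, a minimal policy in the spirit of the selftests in
this series could look as follows (a sketch only; it relies on the
kfuncs and section conventions introduced by the later patches, so
details may differ):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* kfunc added later in the series */
extern int bpf_oom_kill_process(struct oom_control *oc,
                                struct task_struct *task,
                                const char *message__str) __ksym;

/* Trivial policy: kill the allocating task */
SEC("struct_ops.s/handle_out_of_memory")
int BPF_PROG(handle_out_of_memory, struct bpf_oom_ctx *ctx,
             struct oom_control *oc)
{
        struct task_struct *p = bpf_get_current_task_btf();

        /* returns 0 on success and sets oc->bpf_memory_freed */
        if (bpf_oom_kill_process(oc, p, "bpf oom example"))
                return 0;       /* defer to the kernel OOM killer */

        return 1;               /* memory was freed, OOM is handled */
}

SEC(".struct_ops.link")
struct bpf_oom_ops test_bpf_oom = {
        .handle_out_of_memory = (void *)handle_out_of_memory,
        .name = "bpf_test_policy",
};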
[1]: https://lwn.net/ml/linux-kernel/20230810081319.65668-1-zhouchuyi@bytedance.com/
[2]: https://lore.kernel.org/lkml/20171130152824.1591-1-guro@fb.com/
[3]: https://github.com/facebookincubator/oomd
[4]: https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[5]: https://lkml.org/lkml/2025/10/15/1554
---
v2:
1) A single bpf_oom can be attached system-wide and a single bpf_oom per memcg.
(by Alexei Starovoitov)
2) Initial support for attaching struct ops to cgroups (Martin KaFai Lau,
Andrii Nakryiko and others)
3) bpf memcontrol kfuncs enhancements and tests (co-developed by JP Kobryn)
4) Many small-ish fixes and cleanups (suggested by Andrew Morton, Suren Baghdasaryan,
Andrii Nakryiko and Kumar Kartikeya Dwivedi)
5) bpf_out_of_memory() is taking u64 flags instead of bool wait_on_oom_lock
(suggested by Kumar Kartikeya Dwivedi)
6) bpf_get_mem_cgroup() got KF_RCU flag (suggested by Kumar Kartikeya Dwivedi)
7) cgroup online and offline callbacks for bpf_psi, cgroup offline for bpf_oom
v1:
1) Both OOM and PSI parts are now implemented using bpf struct ops,
providing a path for future extensions (suggested by Kumar Kartikeya Dwivedi,
Song Liu and Matt Bobrowski)
2) It's possible to create PSI triggers from BPF, no need for an additional
userspace agent. (suggested by Suren Baghdasaryan)
Also there is now a callback for the cgroup release event.
3) Added an ability to block on oom_lock instead of bailing out (suggested by Michal Hocko)
4) Added bpf_task_is_oom_victim (suggested by Michal Hocko)
5) PSI callbacks are scheduled using a separate workqueue (suggested by Suren Baghdasaryan)
RFC:
https://lwn.net/ml/all/20250428033617.3797686-1-roman.gushchin@linux.dev/
JP Kobryn (3):
mm: introduce BPF kfunc to access memory events
bpf: selftests: selftests for memcg stat kfuncs
bpf: selftests: add config for psi
Roman Gushchin (20):
bpf: move bpf_struct_ops_link into bpf.h
bpf: initial support for attaching struct ops to cgroups
bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
mm: declare memcg_page_state_output() in memcontrol.h
mm: introduce BPF struct ops for OOM handling
mm: introduce bpf_oom_kill_process() bpf kfunc
mm: introduce BPF kfuncs to deal with memcg pointers
mm: introduce bpf_get_root_mem_cgroup() BPF kfunc
mm: introduce BPF kfuncs to access memcg statistics and events
mm: introduce bpf_out_of_memory() BPF kfunc
mm: allow specifying custom oom constraint for BPF triggers
mm: introduce bpf_task_is_oom_victim() kfunc
libbpf: introduce bpf_map__attach_struct_ops_opts()
bpf: selftests: introduce read_cgroup_file() helper
bpf: selftests: BPF OOM handler test
sched: psi: refactor psi_trigger_create()
sched: psi: implement bpf_psi struct ops
sched: psi: implement bpf_psi_create_trigger() kfunc
bpf: selftests: PSI struct ops test
include/linux/bpf.h | 7 +
include/linux/bpf_oom.h | 74 ++++
include/linux/bpf_psi.h | 87 ++++
include/linux/cgroup.h | 4 +
include/linux/memcontrol.h | 12 +-
include/linux/oom.h | 17 +
include/linux/psi.h | 21 +-
include/linux/psi_types.h | 72 +++-
kernel/bpf/bpf_struct_ops.c | 19 +-
kernel/bpf/cgroup.c | 3 +
kernel/bpf/verifier.c | 5 +
kernel/cgroup/cgroup.c | 14 +-
kernel/sched/bpf_psi.c | 396 ++++++++++++++++++
kernel/sched/build_utility.c | 4 +
kernel/sched/psi.c | 130 ++++--
mm/Makefile | 4 +
mm/bpf_memcontrol.c | 176 ++++++++
mm/bpf_oom.c | 272 ++++++++++++
mm/memcontrol-v1.h | 1 -
mm/memcontrol.c | 4 +-
mm/oom_kill.c | 203 ++++++++-
tools/lib/bpf/bpf.c | 8 +
tools/lib/bpf/libbpf.c | 18 +-
tools/lib/bpf/libbpf.h | 14 +
tools/lib/bpf/libbpf.map | 1 +
tools/testing/selftests/bpf/cgroup_helpers.c | 39 ++
tools/testing/selftests/bpf/cgroup_helpers.h | 2 +
.../testing/selftests/bpf/cgroup_iter_memcg.h | 18 +
tools/testing/selftests/bpf/config | 1 +
.../bpf/prog_tests/cgroup_iter_memcg.c | 223 ++++++++++
.../selftests/bpf/prog_tests/test_oom.c | 249 +++++++++++
.../selftests/bpf/prog_tests/test_psi.c | 238 +++++++++++
.../selftests/bpf/progs/cgroup_iter_memcg.c | 42 ++
tools/testing/selftests/bpf/progs/test_oom.c | 118 ++++++
tools/testing/selftests/bpf/progs/test_psi.c | 82 ++++
35 files changed, 2512 insertions(+), 66 deletions(-)
create mode 100644 include/linux/bpf_oom.h
create mode 100644 include/linux/bpf_psi.h
create mode 100644 kernel/sched/bpf_psi.c
create mode 100644 mm/bpf_memcontrol.c
create mode 100644 mm/bpf_oom.c
create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_oom.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/test_psi.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
create mode 100644 tools/testing/selftests/bpf/progs/test_oom.c
create mode 100644 tools/testing/selftests/bpf/progs/test_psi.c
--
2.51.0
* [PATCH v2 01/23] bpf: move bpf_struct_ops_link into bpf.h
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
` (9 subsequent siblings)
10 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
Move struct bpf_struct_ops_link's definition into bpf.h,
where the other custom bpf link definitions live.
It's necessary to access its members from outside of the generic
bpf_struct_ops implementation, which will be done by subsequent
patches in the series.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/bpf.h | 6 ++++++
kernel/bpf/bpf_struct_ops.c | 6 ------
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index a47d67db3be5..eae907218188 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1845,6 +1845,12 @@ struct bpf_raw_tp_link {
u64 cookie;
};
+struct bpf_struct_ops_link {
+ struct bpf_link link;
+ struct bpf_map __rcu *map;
+ wait_queue_head_t wait_hup;
+};
+
struct bpf_link_primer {
struct bpf_link *link;
struct file *file;
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index a41e6730edcf..45cc5ee19dc2 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -55,12 +55,6 @@ struct bpf_struct_ops_map {
struct bpf_struct_ops_value kvalue;
};
-struct bpf_struct_ops_link {
- struct bpf_link link;
- struct bpf_map __rcu *map;
- wait_queue_head_t wait_hup;
-};
-
static DEFINE_MUTEX(update_mutex);
#define VALUE_PREFIX "bpf_struct_ops_"
--
2.51.0
* [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 01/23] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
` (4 more replies)
2025-10-27 23:17 ` [PATCH v2 03/23] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
` (8 subsequent siblings)
10 siblings, 5 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
When a struct ops is being attached and a bpf link is created,
allow passing a cgroup fd using bpf attr, so that the struct ops
can be attached to a cgroup instead of globally.
The attached struct ops doesn't hold a reference to the cgroup;
it only preserves the cgroup id.
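For reference, userspace attachment to a cgroup might look roughly
like this (a sketch: it uses the bpf_map__attach_struct_ops_opts()
helper added later in this series; the opts type and field names here
are assumptions, not the actual API):

#include <fcntl.h>
#include <unistd.h>
#include <bpf/libbpf.h>

static struct bpf_link *attach_to_cgroup(struct bpf_map *ops_map,
                                         const char *cgroup_path)
{
        /* assumed opts struct, mirroring attr->link_create.cgroup.relative_fd */
        LIBBPF_OPTS(bpf_struct_ops_opts, opts);
        struct bpf_link *link;
        int cg_fd;

        cg_fd = open(cgroup_path, O_RDONLY | O_DIRECTORY);
        if (cg_fd < 0)
                return NULL;

        opts.relative_fd = cg_fd;       /* attach to this cgroup, not globally */
        link = bpf_map__attach_struct_ops_opts(ops_map, &opts);
        close(cg_fd);
        return link;
}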
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/bpf.h | 1 +
kernel/bpf/bpf_struct_ops.c | 13 +++++++++++++
2 files changed, 14 insertions(+)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index eae907218188..7205b813e25f 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1849,6 +1849,7 @@ struct bpf_struct_ops_link {
struct bpf_link link;
struct bpf_map __rcu *map;
wait_queue_head_t wait_hup;
+ u64 cgroup_id;
};
struct bpf_link_primer {
diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
index 45cc5ee19dc2..58664779a2b6 100644
--- a/kernel/bpf/bpf_struct_ops.c
+++ b/kernel/bpf/bpf_struct_ops.c
@@ -13,6 +13,7 @@
#include <linux/btf_ids.h>
#include <linux/rcupdate_wait.h>
#include <linux/poll.h>
+#include <linux/cgroup.h>
struct bpf_struct_ops_value {
struct bpf_struct_ops_common_value common;
@@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
}
bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
attr->link_create.attach_type);
+#ifdef CONFIG_CGROUPS
+ if (attr->link_create.cgroup.relative_fd) {
+ struct cgroup *cgrp;
+
+ cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
+ if (IS_ERR(cgrp))
+ return PTR_ERR(cgrp);
+
+ link->cgroup_id = cgroup_id(cgrp);
+ cgroup_put(cgrp);
+ }
+#endif /* CONFIG_CGROUPS */
err = bpf_link_prime(&link->link, &link_primer);
if (err)
--
2.51.0
* [PATCH v2 03/23] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 01/23] bpf: move bpf_struct_ops_link into bpf.h Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 04/23] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
` (7 subsequent siblings)
10 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
Struct oom_control is used to describe the OOM context.
Its memcg field defines the scope of the OOM: it's NULL for global
OOMs and a valid memcg pointer for memcg-scoped OOMs.
Teach the bpf verifier to recognize it as a trusted-or-NULL pointer.
This provides the bpf OOM handler with a trusted memcg pointer,
which, for example, is required for iterating the memcg's subtree.
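On the BPF side this means the handler must NULL-check oc->memcg
before passing it to kfuncs taking trusted arguments; a sketch:

SEC("struct_ops.s/handle_out_of_memory")
int BPF_PROG(handle_oom, struct bpf_oom_ctx *ctx, struct oom_control *oc)
{
        struct mem_cgroup *memcg = oc->memcg;

        if (!memcg)             /* NULL means a global (system-wide) OOM */
                return 0;

        /* memcg is a trusted pointer here, e.g. usable as the root
         * of a memcg subtree walk */
        return 0;
}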
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
---
kernel/bpf/verifier.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 542e23fb19c7..811275419be3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -7104,6 +7104,10 @@ BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct) {
struct file *vm_file;
};
+BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control) {
+ struct mem_cgroup *memcg;
+};
+
static bool type_is_rcu(struct bpf_verifier_env *env,
struct bpf_reg_state *reg,
const char *field_name, u32 btf_id)
@@ -7146,6 +7150,7 @@ static bool type_is_trusted_or_null(struct bpf_verifier_env *env,
BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct socket));
BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct dentry));
BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct vm_area_struct));
+ BTF_TYPE_EMIT(BTF_TYPE_SAFE_TRUSTED_OR_NULL(struct oom_control));
return btf_nested_type_is_trusted(&env->log, reg, field_name, btf_id,
"__safe_trusted_or_null");
--
2.51.0
* [PATCH v2 04/23] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (2 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 03/23] bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-31 8:32 ` Michal Hocko
2025-10-27 23:17 ` [PATCH v2 05/23] mm: declare memcg_page_state_output() in memcontrol.h Roman Gushchin
` (6 subsequent siblings)
10 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/memcontrol.h | 4 ++--
mm/memcontrol.c | 2 --
2 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 873e510d6f8d..9af9ae28afe7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -832,9 +832,9 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
{
return memcg ? cgroup_ino(memcg->css.cgroup) : 0;
}
+#endif
struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino);
-#endif
static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
{
@@ -1331,12 +1331,12 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
{
return 0;
}
+#endif
static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
{
return NULL;
}
-#endif
static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4deda33625f4..5d27cd5372aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3618,7 +3618,6 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
return xa_load(&mem_cgroup_ids, id);
}
-#ifdef CONFIG_SHRINKER_DEBUG
struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
{
struct cgroup *cgrp;
@@ -3639,7 +3638,6 @@ struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
return memcg;
}
-#endif
static void free_mem_cgroup_per_node_info(struct mem_cgroup_per_node *pn)
{
--
2.51.0
* [PATCH v2 05/23] mm: declare memcg_page_state_output() in memcontrol.h
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (3 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 04/23] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-31 8:34 ` Michal Hocko
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
` (5 subsequent siblings)
10 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
To use memcg_page_state_output() in bpf_memcontrol.c, move the
declaration from the v1-specific memcontrol-v1.h to memcontrol.h.
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol-v1.h | 1 -
2 files changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9af9ae28afe7..50d851ff3f27 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -949,6 +949,7 @@ static inline void mod_memcg_page_state(struct page *page,
}
unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
+unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx);
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..a304ad418cdf 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -27,7 +27,6 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
void drain_all_stock(struct mem_cgroup *root_memcg);
unsigned long memcg_events(struct mem_cgroup *memcg, int event);
-unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
int memory_stat_show(struct seq_file *m, void *v);
void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
--
2.51.0
* [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (4 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 05/23] mm: declare memcg_page_state_output() in memcontrol.h Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-27 23:57 ` bot+bpf-ci
` (5 more replies)
2025-10-27 23:17 ` [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
` (4 subsequent siblings)
10 siblings, 6 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
Introduce a bpf struct ops for implementing custom OOM handling
policies.
It's possible to load one bpf_oom_ops for the system and one
bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
cgroup tree is traversed from the OOM'ing memcg up to the root and
corresponding BPF OOM handlers are executed until some memory is
freed. If no memory is freed, the kernel OOM killer is invoked.
The struct ops provides the bpf_handle_out_of_memory() callback,
which is expected to return 1 if it was able to free some memory and 0
otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
field of the oom_control structure, which is expected to be set by
kfuncs suitable for releasing memory. If both are set, the OOM is
considered handled; otherwise the next OOM handler in the chain
(e.g. the BPF OOM handler attached to the parent cgroup or the
in-kernel OOM killer) is executed.
The bpf_handle_out_of_memory() callback program is sleepable to enable
using iterators, e.g. cgroup iterators. The callback receives struct
oom_control as an argument, so it can determine the scope of the OOM
event: whether it is a memcg-wide or system-wide OOM.
The callback is executed just before the kernel victim task selection
algorithm, so all heuristics and sysctls like panic_on_oom and
sysctl_oom_kill_allocating_task are respected.
The BPF OOM struct ops also provides the handle_cgroup_offline()
callback, which is handy for releasing the attached struct ops
when the corresponding cgroup is deleted.
The struct ops also has a name field, which allows defining a
custom name for the implemented policy. It's printed in the OOM report
in the oom_policy=<policy> format. "default" is printed if bpf is not
used or the policy name is not specified.
[ 112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
oom_policy=bpf_test_policy
[ 112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
[ 112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
[ 112.698167] Call Trace:
[ 112.698177] <TASK>
[ 112.698182] dump_stack_lvl+0x4d/0x70
[ 112.698192] dump_header+0x59/0x1c6
[ 112.698199] oom_kill_process.cold+0x8/0xef
[ 112.698206] bpf_oom_kill_process+0x59/0xb0
[ 112.698216] bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
[ 112.698229] bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
[ 112.698236] ? srso_alias_return_thunk+0x5/0xfbef5
[ 112.698240] bpf_handle_oom+0x11a/0x1e0
[ 112.698250] out_of_memory+0xab/0x5c0
[ 112.698258] mem_cgroup_out_of_memory+0xbc/0x110
[ 112.698274] try_charge_memcg+0x4b5/0x7e0
[ 112.698288] charge_memcg+0x2f/0xc0
[ 112.698293] __mem_cgroup_charge+0x30/0xc0
[ 112.698299] do_anonymous_page+0x40f/0xa50
[ 112.698311] __handle_mm_fault+0xbba/0x1140
[ 112.698317] ? srso_alias_return_thunk+0x5/0xfbef5
[ 112.698335] handle_mm_fault+0xe6/0x370
[ 112.698343] do_user_addr_fault+0x211/0x6a0
[ 112.698354] exc_page_fault+0x75/0x1d0
[ 112.698363] asm_exc_page_fault+0x26/0x30
[ 112.698366] RIP: 0033:0x7fa97236db00
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
include/linux/bpf_oom.h | 74 ++++++++++
include/linux/memcontrol.h | 5 +
include/linux/oom.h | 8 ++
mm/Makefile | 3 +
mm/bpf_oom.c | 272 +++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 2 +
mm/oom_kill.c | 22 ++-
7 files changed, 384 insertions(+), 2 deletions(-)
create mode 100644 include/linux/bpf_oom.h
create mode 100644 mm/bpf_oom.c
diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
new file mode 100644
index 000000000000..18c32a5a068b
--- /dev/null
+++ b/include/linux/bpf_oom.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+
+#ifndef __BPF_OOM_H
+#define __BPF_OOM_H
+
+struct oom_control;
+
+#define BPF_OOM_NAME_MAX_LEN 64
+
+struct bpf_oom_ctx {
+ /*
+ * If bpf_oom_ops is attached to a cgroup, id of this cgroup.
+ * 0 otherwise.
+ */
+ u64 cgroup_id;
+};
+
+struct bpf_oom_ops {
+ /**
+ * @handle_out_of_memory: Out of memory bpf handler, called before
+ * the in-kernel OOM killer.
+ * @ctx: Execution context
+ * @oc: OOM control structure
+ *
+ * Should return 1 if some memory was freed up, otherwise
+ * the in-kernel OOM killer is invoked.
+ */
+ int (*handle_out_of_memory)(struct bpf_oom_ctx *ctx, struct oom_control *oc);
+
+ /**
+ * @handle_cgroup_offline: Cgroup offline callback
+ * @ctx: Execution context
+ * @cgroup_id: Id of deleted cgroup
+ *
+ * Called if the cgroup with the attached bpf_oom_ops is deleted.
+ */
+ void (*handle_cgroup_offline)(struct bpf_oom_ctx *ctx, u64 cgroup_id);
+
+ /**
+ * @name: BPF OOM policy name
+ */
+ char name[BPF_OOM_NAME_MAX_LEN];
+};
+
+#ifdef CONFIG_BPF_SYSCALL
+/**
+ * @bpf_handle_oom: handle out of memory condition using bpf
+ * @oc: OOM control structure
+ *
+ * Returns true if some memory was freed.
+ */
+bool bpf_handle_oom(struct oom_control *oc);
+
+
+/**
+ * @bpf_oom_memcg_offline: handle memcg offlining
+ * @memcg: Memory cgroup is offlined
+ *
+ * When a memory cgroup is about to be deleted and there is an
+ * attached BPF OOM structure, it has to be detached.
+ */
+void bpf_oom_memcg_offline(struct mem_cgroup *memcg);
+
+#else /* CONFIG_BPF_SYSCALL */
+static inline bool bpf_handle_oom(struct oom_control *oc)
+{
+ return false;
+}
+
+static inline void bpf_oom_memcg_offline(struct mem_cgroup *memcg) {}
+
+#endif /* CONFIG_BPF_SYSCALL */
+
+#endif /* __BPF_OOM_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 50d851ff3f27..39a6c7c8735b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,6 +29,7 @@ struct obj_cgroup;
struct page;
struct mm_struct;
struct kmem_cache;
+struct bpf_oom_ops;
/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
@@ -226,6 +227,10 @@ struct mem_cgroup {
*/
bool oom_group;
+#ifdef CONFIG_BPF_SYSCALL
+ struct bpf_oom_ops *bpf_oom;
+#endif
+
int swappiness;
/* memory.events and memory.events.local */
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7b02bc1d0a7e..721087952d04 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -51,6 +51,14 @@ struct oom_control {
/* Used to print the constraint info. */
enum oom_constraint constraint;
+
+#ifdef CONFIG_BPF_SYSCALL
+ /* Used by the bpf oom implementation to mark the forward progress */
+ bool bpf_memory_freed;
+
+ /* Policy name */
+ const char *bpf_policy_name;
+#endif
};
extern struct mutex oom_lock;
diff --git a/mm/Makefile b/mm/Makefile
index 21abb3353550..051e88c699af 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
ifdef CONFIG_SWAP
obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
+ifdef CONFIG_BPF_SYSCALL
+obj-y += bpf_oom.o
+endif
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_GUP_TEST) += gup_test.o
obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
new file mode 100644
index 000000000000..c4d09ed9d541
--- /dev/null
+++ b/mm/bpf_oom.c
@@ -0,0 +1,272 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * BPF-driven OOM killer customization
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/bpf.h>
+#include <linux/oom.h>
+#include <linux/bpf_oom.h>
+#include <linux/srcu.h>
+#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+
+DEFINE_STATIC_SRCU(bpf_oom_srcu);
+static struct bpf_oom_ops *system_bpf_oom;
+
+#ifdef CONFIG_MEMCG
+static u64 memcg_cgroup_id(struct mem_cgroup *memcg)
+{
+ return cgroup_id(memcg->css.cgroup);
+}
+
+static struct bpf_oom_ops **bpf_oom_memcg_ops_ptr(struct mem_cgroup *memcg)
+{
+ return &memcg->bpf_oom;
+}
+#else /* CONFIG_MEMCG */
+static u64 memcg_cgroup_id(struct mem_cgroup *memcg)
+{
+ return 0;
+}
+static struct bpf_oom_ops **bpf_oom_memcg_ops_ptr(struct mem_cgroup *memcg)
+{
+ return NULL;
+}
+#endif
+
+static int bpf_ops_handle_oom(struct bpf_oom_ops *bpf_oom_ops,
+ struct mem_cgroup *memcg,
+ struct oom_control *oc)
+{
+ struct bpf_oom_ctx exec_ctx;
+ int ret;
+
+ if (IS_ENABLED(CONFIG_MEMCG) && memcg)
+ exec_ctx.cgroup_id = memcg_cgroup_id(memcg);
+ else
+ exec_ctx.cgroup_id = 0;
+
+ oc->bpf_policy_name = &bpf_oom_ops->name[0];
+ oc->bpf_memory_freed = false;
+ ret = bpf_oom_ops->handle_out_of_memory(&exec_ctx, oc);
+ oc->bpf_policy_name = NULL;
+
+ return ret;
+}
+
+bool bpf_handle_oom(struct oom_control *oc)
+{
+ struct bpf_oom_ops *bpf_oom_ops = NULL;
+ struct mem_cgroup __maybe_unused *memcg;
+ int idx, ret = 0;
+
+ /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
+ idx = srcu_read_lock(&bpf_oom_srcu);
+
+#ifdef CONFIG_MEMCG
+ /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
+ for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
+ bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
+ if (!bpf_oom_ops)
+ continue;
+
+ /* Call BPF OOM handler */
+ ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
+ if (ret && oc->bpf_memory_freed)
+ goto exit;
+ }
+#endif /* CONFIG_MEMCG */
+
+ /*
+ * System-wide OOM or per-memcg BPF OOM handler wasn't successful?
+ * Try system_bpf_oom.
+ */
+ bpf_oom_ops = READ_ONCE(system_bpf_oom);
+ if (!bpf_oom_ops)
+ goto exit;
+
+ /* Call BPF OOM handler */
+ ret = bpf_ops_handle_oom(bpf_oom_ops, NULL, oc);
+exit:
+ srcu_read_unlock(&bpf_oom_srcu, idx);
+ return ret && oc->bpf_memory_freed;
+}
+
+static int __handle_out_of_memory(struct bpf_oom_ctx *exec_ctx,
+ struct oom_control *oc)
+{
+ return 0;
+}
+
+static void __handle_cgroup_offline(struct bpf_oom_ctx *exec_ctx, u64 cgroup_id)
+{
+}
+
+static struct bpf_oom_ops __bpf_oom_ops = {
+ .handle_out_of_memory = __handle_out_of_memory,
+ .handle_cgroup_offline = __handle_cgroup_offline,
+};
+
+static const struct bpf_func_proto *
+bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ return tracing_prog_func_proto(func_id, prog);
+}
+
+static bool bpf_oom_ops_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
+ .get_func_proto = bpf_oom_func_proto,
+ .is_valid_access = bpf_oom_ops_is_valid_access,
+};
+
+static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
+ struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
+ struct bpf_oom_ops *bpf_oom_ops = kdata;
+ struct mem_cgroup *memcg = NULL;
+ int err = 0;
+
+ if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
+ /* Attach to a memory cgroup? */
+ memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
+ if (IS_ERR_OR_NULL(memcg))
+ return PTR_ERR(memcg);
+ bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
+ } else {
+ /* System-wide OOM handler */
+ bpf_oom_ops_ptr = &system_bpf_oom;
+ }
+
+ /* Another struct ops attached? */
+ if (READ_ONCE(*bpf_oom_ops_ptr)) {
+ err = -EBUSY;
+ goto exit;
+ }
+
+ /* Expose bpf_oom_ops structure */
+ WRITE_ONCE(*bpf_oom_ops_ptr, bpf_oom_ops);
+exit:
+ mem_cgroup_put(memcg);
+ return err;
+}
+
+static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
+{
+ struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
+ struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
+ struct bpf_oom_ops *bpf_oom_ops = kdata;
+ struct mem_cgroup *memcg = NULL;
+
+ if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
+ /* Detach from a memory cgroup? */
+ memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
+ if (IS_ERR_OR_NULL(memcg))
+ goto exit;
+ bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
+ } else {
+ /* System-wide OOM handler */
+ bpf_oom_ops_ptr = &system_bpf_oom;
+ }
+
+ /* Hide bpf_oom_ops from new callers */
+ if (!WARN_ON(READ_ONCE(*bpf_oom_ops_ptr) != bpf_oom_ops))
+ WRITE_ONCE(*bpf_oom_ops_ptr, NULL);
+
+ mem_cgroup_put(memcg);
+
+exit:
+ /* Release bpf_oom_ops after a srcu grace period */
+ synchronize_srcu(&bpf_oom_srcu);
+}
+
+#ifdef CONFIG_MEMCG
+void bpf_oom_memcg_offline(struct mem_cgroup *memcg)
+{
+ struct bpf_oom_ops *bpf_oom_ops;
+ struct bpf_oom_ctx exec_ctx;
+ u64 cgrp_id;
+ int idx;
+
+ /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
+ idx = srcu_read_lock(&bpf_oom_srcu);
+
+ bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
+ WRITE_ONCE(memcg->bpf_oom, NULL);
+
+ if (bpf_oom_ops && bpf_oom_ops->handle_cgroup_offline) {
+ cgrp_id = cgroup_id(memcg->css.cgroup);
+ exec_ctx.cgroup_id = cgrp_id;
+ bpf_oom_ops->handle_cgroup_offline(&exec_ctx, cgrp_id);
+ }
+
+ srcu_read_unlock(&bpf_oom_srcu, idx);
+}
+#endif /* CONFIG_MEMCG */
+
+static int bpf_oom_ops_check_member(const struct btf_type *t,
+ const struct btf_member *member,
+ const struct bpf_prog *prog)
+{
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct bpf_oom_ops, handle_out_of_memory):
+ if (!prog)
+ return -EINVAL;
+ break;
+ }
+
+ return 0;
+}
+
+static int bpf_oom_ops_init_member(const struct btf_type *t,
+ const struct btf_member *member,
+ void *kdata, const void *udata)
+{
+ const struct bpf_oom_ops *uops = udata;
+ struct bpf_oom_ops *ops = kdata;
+ u32 moff = __btf_member_bit_offset(t, member) / 8;
+
+ switch (moff) {
+ case offsetof(struct bpf_oom_ops, name):
+ if (uops->name[0])
+ strscpy_pad(ops->name, uops->name, sizeof(ops->name));
+ else
+ strscpy_pad(ops->name, "bpf_defined_policy");
+ return 1;
+ }
+ return 0;
+}
+
+static int bpf_oom_ops_init(struct btf *btf)
+{
+ return 0;
+}
+
+static struct bpf_struct_ops bpf_oom_bpf_ops = {
+ .verifier_ops = &bpf_oom_verifier_ops,
+ .reg = bpf_oom_ops_reg,
+ .unreg = bpf_oom_ops_unreg,
+ .check_member = bpf_oom_ops_check_member,
+ .init_member = bpf_oom_ops_init_member,
+ .init = bpf_oom_ops_init,
+ .name = "bpf_oom_ops",
+ .owner = THIS_MODULE,
+ .cfi_stubs = &__bpf_oom_ops
+};
+
+static int __init bpf_oom_struct_ops_init(void)
+{
+ return register_bpf_struct_ops(&bpf_oom_bpf_ops, bpf_oom_ops);
+}
+late_initcall(bpf_oom_struct_ops_init);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d27cd5372aa..d44c1f293e16 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/seq_buf.h>
#include <linux/sched/isolation.h>
#include <linux/kmemleak.h>
+#include <linux/bpf_oom.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -3885,6 +3886,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
zswap_memcg_offline_cleanup(memcg);
+ bpf_oom_memcg_offline(memcg);
memcg_offline_kmem(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index c145b0feecc1..d05ec0f84087 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -45,6 +45,7 @@
#include <linux/mmu_notifier.h>
#include <linux/cred.h>
#include <linux/nmi.h>
+#include <linux/bpf_oom.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -246,6 +247,15 @@ static const char * const oom_constraint_text[] = {
[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
};
+static const char *oom_policy_name(struct oom_control *oc)
+{
+#ifdef CONFIG_BPF_SYSCALL
+ if (oc->bpf_policy_name)
+ return oc->bpf_policy_name;
+#endif
+ return "default";
+}
+
/*
* Determine the type of allocation constraint.
*/
@@ -458,9 +468,10 @@ static void dump_oom_victim(struct oom_control *oc, struct task_struct *victim)
static void dump_header(struct oom_control *oc)
{
- pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\n",
+ pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), order=%d, oom_score_adj=%hd\noom_policy=%s\n",
current->comm, oc->gfp_mask, &oc->gfp_mask, oc->order,
- current->signal->oom_score_adj);
+ current->signal->oom_score_adj,
+ oom_policy_name(oc));
if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
pr_warn("COMPACTION is disabled!!!\n");
@@ -1167,6 +1178,13 @@ bool out_of_memory(struct oom_control *oc)
return true;
}
+ /*
+ * Let bpf handle the OOM first. If it was able to free up some memory,
+ * bail out. Otherwise fall back to the kernel OOM killer.
+ */
+ if (bpf_handle_oom(oc))
+ return true;
+
select_bad_process(oc);
/* Found nothing?!?! */
if (!oc->chosen) {
--
2.51.0
* [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (5 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-31 9:05 ` Michal Hocko
2025-10-27 23:17 ` [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers Roman Gushchin
` (3 subsequent siblings)
10 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
Introduce the bpf_oom_kill_process() bpf kfunc, which is supposed
to be used by BPF OOM programs. It allows killing a process
in exactly the same way the OOM killer does: using the OOM reaper,
bumping the corresponding memcg and global statistics, respecting
memory.oom.group etc.
On success, it sets oom_control's bpf_memory_freed field to true,
enabling the bpf program to bypass the kernel OOM killer.
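Inside a handler the kfunc can be used along these lines (a sketch;
pick_victim() stands in for hypothetical policy-specific selection):

victim = pick_victim(oc);       /* hypothetical, policy-specific */
if (!victim)
        return 0;               /* defer to the kernel OOM killer */

/* kills like the kernel OOM killer would: dmesg report, OOM reaper,
 * statistics, memory.oom.group handling */
if (bpf_oom_kill_process(oc, victim, "bpf policy kill"))
        return 0;               /* unkillable task etc., defer */

return 1;                       /* oc->bpf_memory_freed is now set */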
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
mm/oom_kill.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 67 insertions(+)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d05ec0f84087..3c86cd755371 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1288,3 +1288,70 @@ SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
return -ENOSYS;
#endif /* CONFIG_MMU */
}
+
+#ifdef CONFIG_BPF_SYSCALL
+
+__bpf_kfunc_start_defs();
+/**
+ * bpf_oom_kill_process - Kill a process as OOM killer
+ * @oc: pointer to oom_control structure, describes OOM context
+ * @task: task to be killed
+ * @message__str: message to print in dmesg
+ *
+ * Kill a process in a way similar to the kernel OOM killer.
+ * This means dump the necessary information to dmesg, adjust memcg
+ * statistics, leverage the oom reaper, respect memory.oom.group etc.
+ *
+ * bpf_oom_kill_process() marks the forward progress by setting
+ * oc->bpf_memory_freed. If the progress was made, the bpf program
+ * is free to decide if the kernel oom killer should be invoked.
+ * Otherwise it's enforced, so that a bad bpf program can't
+ * deadlock the machine on memory.
+ */
+__bpf_kfunc int bpf_oom_kill_process(struct oom_control *oc,
+ struct task_struct *task,
+ const char *message__str)
+{
+ if (oom_unkillable_task(task))
+ return -EPERM;
+
+ /* paired with put_task_struct() in oom_kill_process() */
+ task = tryget_task_struct(task);
+ if (!task)
+ return -EINVAL;
+
+ oc->chosen = task;
+
+ oom_kill_process(oc, message__str);
+
+ oc->chosen = NULL;
+ oc->bpf_memory_freed = true;
+
+ return 0;
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_oom_kfuncs)
+BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_SLEEPABLE | KF_TRUSTED_ARGS)
+BTF_KFUNCS_END(bpf_oom_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_oom_kfuncs,
+};
+
+static int __init bpf_oom_init(void)
+{
+ int err;
+
+ err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &bpf_oom_kfunc_set);
+ if (err)
+ pr_warn("error while registering bpf oom kfuncs: %d", err);
+
+ return err;
+}
+late_initcall(bpf_oom_init);
+
+#endif
--
2.51.0
* [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (6 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
2025-10-28 17:42 ` Tejun Heo
2025-10-27 23:17 ` [PATCH v2 09/23] mm: introduce bpf_get_root_mem_cgroup() BPF kfunc Roman Gushchin
` (2 subsequent siblings)
10 siblings, 2 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
To effectively operate with memory cgroups in BPF there is a need
to convert css pointers to memcg pointers. A simple container_of
cast, which is used in the kernel code, can't be used in BPF because
from the verifier's point of view that's an out-of-bounds memory access.
Introduce helper get/put kfuncs which can be used to get
a refcounted memcg pointer from a css pointer:
- bpf_get_mem_cgroup,
- bpf_put_mem_cgroup.
bpf_get_mem_cgroup() can take both a memcg's css and the corresponding
cgroup's "self" css. This allows it to be used with the existing cgroup
iterator, which iterates over the cgroup tree, not the memcg tree.
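For example, combined with the open-coded css iterator (bpf_for_each
from bpf_experimental.h), a handler can walk a memcg subtree roughly
like this (a sketch, assuming a trusted memcg pointer in "root"):

struct cgroup_subsys_state *pos;
struct mem_cgroup *memcg;

bpf_rcu_read_lock();
bpf_for_each(css, pos, &root->css, BPF_CGROUP_ITER_DESCENDANTS_PRE) {
        memcg = bpf_get_mem_cgroup(pos);
        if (!memcg)
                continue;

        /* ... inspect the memcg, remember a candidate ... */

        bpf_put_mem_cgroup(memcg);
}
bpf_rcu_read_unlock();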
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
mm/Makefile | 1 +
mm/bpf_memcontrol.c | 88 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 89 insertions(+)
create mode 100644 mm/bpf_memcontrol.c
diff --git a/mm/Makefile b/mm/Makefile
index 051e88c699af..2d8f9beb3c71 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
endif
ifdef CONFIG_BPF_SYSCALL
obj-y += bpf_oom.o
+obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
endif
obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
obj-$(CONFIG_GUP_TEST) += gup_test.o
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
new file mode 100644
index 000000000000..1e46097745cf
--- /dev/null
+++ b/mm/bpf_memcontrol.c
@@ -0,0 +1,88 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Memory Controller-related BPF kfuncs and auxiliary code
+ *
+ * Author: Roman Gushchin <roman.gushchin@linux.dev>
+ */
+
+#include <linux/memcontrol.h>
+#include <linux/bpf.h>
+
+__bpf_kfunc_start_defs();
+
+/**
+ * bpf_get_mem_cgroup - Get a reference to a memory cgroup
+ * @css: pointer to the css structure
+ *
+ * Returns a pointer to a mem_cgroup structure after bumping
+ * the corresponding css's reference counter.
+ *
+ * It's fine to pass a css which belongs to any cgroup controller,
+ * e.g. unified hierarchy's main css.
+ *
+ * Implements KF_ACQUIRE semantics.
+ */
+__bpf_kfunc struct mem_cgroup *
+bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
+{
+ struct mem_cgroup *memcg = NULL;
+ bool rcu_unlock = false;
+
+ if (!root_mem_cgroup)
+ return NULL;
+
+ if (root_mem_cgroup->css.ss != css->ss) {
+ struct cgroup *cgroup = css->cgroup;
+ int ssid = root_mem_cgroup->css.ss->id;
+
+ rcu_read_lock();
+ rcu_unlock = true;
+ css = rcu_dereference_raw(cgroup->subsys[ssid]);
+ }
+
+ if (css && css_tryget(css))
+ memcg = container_of(css, struct mem_cgroup, css);
+
+ if (rcu_unlock)
+ rcu_read_unlock();
+
+ return memcg;
+}
+
+/**
+ * bpf_put_mem_cgroup - Put a reference to a memory cgroup
+ * @memcg: memory cgroup to release
+ *
+ * Releases a previously acquired memcg reference.
+ * Implements KF_RELEASE semantics.
+ */
+__bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
+{
+ css_put(&memcg->css);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
+BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
+BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+
+BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
+
+static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
+ .owner = THIS_MODULE,
+ .set = &bpf_memcontrol_kfuncs,
+};
+
+static int __init bpf_memcontrol_init(void)
+{
+ int err;
+
+ err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ &bpf_memcontrol_kfunc_set);
+ if (err)
+ pr_warn("error while registering bpf memcontrol kfuncs: %d", err);
+
+ return err;
+}
+late_initcall(bpf_memcontrol_init);
--
2.51.0
* [PATCH v2 09/23] mm: introduce bpf_get_root_mem_cgroup() BPF kfunc
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (7 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-27 23:17 ` [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events Roman Gushchin
2025-10-31 9:31 ` [PATCH v2 00/23] mm: BPF OOM Michal Hocko
10 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
Introduce a BPF kfunc to get a trusted pointer to the root memory
cgroup. It's very handy to traverse the full memcg tree, e.g.
for handling a system-wide OOM.
It's possible to obtain this pointer by traversing the memcg tree
up from any known memcg, but it's sub-optimal and makes BPF programs
more complex and less efficient.
bpf_get_root_mem_cgroup() has KF_ACQUIRE | KF_RET_NULL semantics;
however, in reality it's not necessary to bump the corresponding
reference counter: the root memory cgroup is immortal and reference
counting is skipped, see css_get(). Once set, root_mem_cgroup is always
a valid memcg pointer. It's safe to call bpf_put_mem_cgroup() on a
pointer obtained with bpf_get_root_mem_cgroup(); it's effectively a no-op.
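A usage sketch:

struct mem_cgroup *root = bpf_get_root_mem_cgroup();

if (!root)
        return 0;               /* root memcg is not set up yet */

/* ... traverse the memcg tree starting from the root ... */

/* effectively a no-op, but keeps the verifier's acquire/release
 * accounting balanced */
bpf_put_mem_cgroup(root);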
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
mm/bpf_memcontrol.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 1e46097745cf..76c342318256 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -10,6 +10,20 @@
__bpf_kfunc_start_defs();
+/**
+ * bpf_get_root_mem_cgroup - Returns a pointer to the root memory cgroup
+ *
+ * The function has KF_ACQUIRE semantics, even though the root memory
+ * cgroup is never destroyed after being created and doesn't require
+ * reference counting. And it's perfectly safe to pass it to
+ * bpf_put_mem_cgroup()
+ */
+__bpf_kfunc struct mem_cgroup *bpf_get_root_mem_cgroup(void)
+{
+ /* css_get() is not needed */
+ return root_mem_cgroup;
+}
+
/**
* bpf_get_mem_cgroup - Get a reference to a memory cgroup
* @css: pointer to the css structure
@@ -64,6 +78,7 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
+BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
--
2.51.0
* [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (8 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 09/23] mm: introduce bpf_get_root_mem_cgroup() BPF kfunc Roman Gushchin
@ 2025-10-27 23:17 ` Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
2025-10-31 9:08 ` Michal Hocko
2025-10-31 9:31 ` [PATCH v2 00/23] mm: BPF OOM Michal Hocko
10 siblings, 2 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-27 23:17 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Roman Gushchin
Introduce BPF kfuncs to conveniently access memcg data:
- bpf_mem_cgroup_vm_events(),
- bpf_mem_cgroup_usage(),
- bpf_mem_cgroup_page_state(),
- bpf_mem_cgroup_flush_stats().
These functions are useful for implementing BPF OOM policies, but
can also be used to accelerate access to memcg data in general.
Reading it through cgroupfs is much more expensive (roughly 5x),
mostly because of the need to convert the data to text and back.
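A usage sketch (the enum values come from vmlinux.h; note the units
question raised in the bot review further down the thread):

unsigned long usage, anon, scans;

/* propagate pending updates up the tree; requires a sleepable context */
bpf_mem_cgroup_flush_stats(memcg);

usage = bpf_mem_cgroup_usage(memcg);
anon  = bpf_mem_cgroup_page_state(memcg, NR_ANON_MAPPED);
scans = bpf_mem_cgroup_vm_events(memcg, PGSCAN_KSWAPD);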
JP Kobryn:
An experiment was setup to compare the performance of a program that
uses the traditional method of reading memory.stat vs a program using
the new kfuncs. The control program opens up the root memory.stat file
and for 1M iterations reads, converts the string values to numeric data,
then seeks back to the beginning. The experimental program sets up the
requisite libbpf objects and for 1M iterations invokes a bpf program
which uses the kfuncs to fetch all available stats for node_stat_item,
memcg_stat_item, and vm_event_item types.
The results showed a significant perf benefit on the experimental side,
outperforming the control side by a margin of 93%. In kernel mode,
elapsed time was reduced by 80%, while in user mode, over 99% of time
was saved.
control: elapsed time
real 0m38.318s
user 0m25.131s
sys 0m13.070s
experiment: elapsed time
real 0m2.789s
user 0m0.187s
sys 0m2.512s
control: perf data
33.43% a.out libc.so.6 [.] __vfscanf_internal
6.88% a.out [kernel.kallsyms] [k] vsnprintf
6.33% a.out libc.so.6 [.] _IO_fgets
5.51% a.out [kernel.kallsyms] [k] format_decode
4.31% a.out libc.so.6 [.] __GI_____strtoull_l_internal
3.78% a.out [kernel.kallsyms] [k] string
3.53% a.out [kernel.kallsyms] [k] number
2.71% a.out libc.so.6 [.] _IO_sputbackc
2.41% a.out [kernel.kallsyms] [k] strlen
1.98% a.out a.out [.] main
1.70% a.out libc.so.6 [.] _IO_getline_info
1.51% a.out libc.so.6 [.] __isoc99_sscanf
1.47% a.out [kernel.kallsyms] [k] memory_stat_format
1.47% a.out [kernel.kallsyms] [k] memcpy_orig
1.41% a.out [kernel.kallsyms] [k] seq_buf_printf
experiment: perf data
10.55% memcgstat bpf_prog_..._query [k] bpf_prog_16aab2f19fa982a7_query
6.90% memcgstat [kernel.kallsyms] [k] memcg_page_state_output
3.55% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
3.12% memcgstat [kernel.kallsyms] [k] memcg_events
2.87% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
2.73% memcgstat [kernel.kallsyms] [k] kmem_cache_free
2.70% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
2.25% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
2.06% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Co-developed-by: JP Kobryn <inwardvessel@gmail.com>
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
---
include/linux/memcontrol.h | 2 ++
mm/bpf_memcontrol.c | 57 +++++++++++++++++++++++++++++++++++++-
2 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 39a6c7c8735b..b9e08dddd7ad 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -953,6 +953,8 @@ static inline void mod_memcg_page_state(struct page *page,
rcu_read_unlock();
}
+unsigned long memcg_events(struct mem_cgroup *memcg, int event);
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
index 76c342318256..387255b8ab88 100644
--- a/mm/bpf_memcontrol.c
+++ b/mm/bpf_memcontrol.c
@@ -75,6 +75,56 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
css_put(&memcg->css);
}
+/**
+ * bpf_mem_cgroup_vm_events - Read memory cgroup's vm event counter
+ * @memcg: memory cgroup
+ * @event: event id
+ *
+ * Allows to read memory cgroup event counters.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_vm_events(struct mem_cgroup *memcg,
+ enum vm_event_item event)
+{
+ return memcg_events(memcg, event);
+}
+
+/**
+ * bpf_mem_cgroup_usage - Read memory cgroup's usage
+ * @memcg: memory cgroup
+ *
+ * Returns current memory cgroup size in bytes.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
+{
+ return page_counter_read(&memcg->memory);
+}
+
+/**
+ * bpf_mem_cgroup_page_state - Read memory cgroup's page state counter
+ * @memcg: memory cgroup
+ * @idx: counter idx
+ *
+ * Allows to read memory cgroup statistics. The output is in bytes.
+ */
+__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
+{
+ if (idx < 0 || idx >= MEMCG_NR_STAT)
+ return (unsigned long)-1;
+
+ return memcg_page_state_output(memcg, idx);
+}
+
+/**
+ * bpf_mem_cgroup_flush_stats - Flush memory cgroup's statistics
+ * @memcg: memory cgroup
+ *
+ * Propagate memory cgroup's statistics up the cgroup tree.
+ */
+__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
+{
+ mem_cgroup_flush_stats(memcg);
+}
+
__bpf_kfunc_end_defs();
BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
@@ -82,6 +132,11 @@ BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_vm_events, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_usage, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state, KF_TRUSTED_ARGS)
+BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_TRUSTED_ARGS | KF_SLEEPABLE)
+
BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
@@ -93,7 +148,7 @@ static int __init bpf_memcontrol_init(void)
{
int err;
- err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
+ err = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
&bpf_memcontrol_kfunc_set);
if (err)
pr_warn("error while registering bpf memcontrol kfuncs: %d", err);
--
2.51.0
* Re: [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events
2025-10-27 23:17 ` [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events Roman Gushchin
@ 2025-10-27 23:48 ` bot+bpf-ci
2025-10-28 16:16 ` Roman Gushchin
2025-10-31 9:08 ` Michal Hocko
1 sibling, 1 reply; 83+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
To: roman.gushchin, akpm
Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
```
commit 96abfe0549713749f00e32bff919a759db284a58
Author: Roman Gushchin <roman.gushchin@linux.dev>
mm: introduce BPF kfuncs to access memcg statistics and events
This patch adds four new BPF kfuncs for accessing memory cgroup statistics
and events, making them available to all BPF program types for implementing
OOM policies and accelerating memcg data access.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 39a6c7c87..b9e08dddd 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
[ ... ]
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> index 76c342318..387255b8a 100644
> --- a/mm/bpf_memcontrol.c
> +++ b/mm/bpf_memcontrol.c
[ ... ]
> @@ -75,6 +75,56 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> css_put(&memcg->css);
> }
>
> +/**
> + * bpf_mem_cgroup_vm_events - Read memory cgroup's vm event counter
> + * @memcg: memory cgroup
> + * @event: event id
> + *
> + * Allows reading memory cgroup event counters.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_vm_events(struct mem_cgroup *memcg,
> + enum vm_event_item event)
> +{
> + return memcg_events(memcg, event);
> +}
> +
> +/**
> + * bpf_mem_cgroup_usage - Read memory cgroup's usage
> + * @memcg: memory cgroup
> + *
> + * Returns current memory cgroup size in bytes.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
> +{
> + return page_counter_read(&memcg->memory);
> +}
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The documentation says this returns bytes, but page_counter_read() returns
pages. In mm/memcontrol-v1.c, the existing mem_cgroup_usage() function's
return value is multiplied by PAGE_SIZE to convert to bytes:
return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
Should this function multiply the result by PAGE_SIZE to match its
documentation, or should the documentation be updated to say it returns
pages?
> +
> +/**
> + * bpf_mem_cgroup_page_state - Read memory cgroup's page state counter
> + * @memcg: memory cgroup
> + * @idx: counter idx
> + *
> + * Allows reading memory cgroup statistics. The output is in bytes.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
> +{
> + if (idx < 0 || idx >= MEMCG_NR_STAT)
> + return (unsigned long)-1;
> +
> + return memcg_page_state_output(memcg, idx);
> +}
[ ... ]
```
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
In-Reply-To-Subject: `mm: introduce BPF kfuncs to access memcg statistics and events`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
@ 2025-10-27 23:48 ` bot+bpf-ci
2025-10-28 15:57 ` Roman Gushchin
2025-10-29 18:01 ` Song Liu
` (3 subsequent siblings)
4 siblings, 1 reply; 83+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
To: roman.gushchin, akpm
Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 2583 bytes --]
```
commit 31d238f36a0a6a56efa48c847cb776e67ba2f079
Author: Roman Gushchin <roman.gushchin@linux.dev>
bpf: initial support for attaching struct ops to cgroups
When a struct ops is being attached and a bpf link is created, allow
passing a cgroup fd using bpf attr, so that the struct ops can be
attached to a cgroup instead of globally. The attached struct ops
doesn't hold a reference to the cgroup; it only preserves the cgroup id.
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index eae907218..7205b813e 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1849,6 +1849,7 @@ struct bpf_struct_ops_link {
> struct bpf_link link;
> struct bpf_map __rcu *map;
> wait_queue_head_t wait_hup;
> + u64 cgroup_id;
> };
>
> struct bpf_link_primer {
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 45cc5ee19..58664779a 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -13,6 +13,7 @@
> #include <linux/btf_ids.h>
> #include <linux/rcupdate_wait.h>
> #include <linux/poll.h>
> +#include <linux/cgroup.h>
>
> struct bpf_struct_ops_value {
> struct bpf_struct_ops_common_value common;
[ ... ]
> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> }
> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
> attr->link_create.attach_type);
> +#ifdef CONFIG_CGROUPS
> + if (attr->link_create.cgroup.relative_fd) {
> + struct cgroup *cgrp;
> +
> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
> + if (IS_ERR(cgrp))
> + return PTR_ERR(cgrp);
^^^^^^^^^^^^^^^^^^^^^^
Can this leak the link and map? In bpf_struct_ops_link_create(), the link
is allocated with kzalloc(GFP_USER) and map is obtained with bpf_map_get()
before this point. If cgroup_get_from_fd() fails, the early return skips
the err_out label which does kfree(link) and bpf_map_put(map).
All other error paths in this function use "err = <error>; goto err_out;"
but this one returns directly.
> +
> + link->cgroup_id = cgroup_id(cgrp);
> + cgroup_put(cgrp);
> + }
> +#endif /* CONFIG_CGROUPS */
> +
> err = bpf_link_prime(&link->link, &link_primer);
> if (err)
> goto err_out;
```
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
In-Reply-To-Subject: `bpf: initial support for attaching struct ops to cgroups`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-27 23:17 ` [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers Roman Gushchin
@ 2025-10-27 23:48 ` bot+bpf-ci
2025-10-28 16:10 ` Roman Gushchin
2025-10-28 17:42 ` Tejun Heo
1 sibling, 1 reply; 83+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:48 UTC (permalink / raw)
To: roman.gushchin, akpm
Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 3089 bytes --]
```
commit ec8e2e1769704b4f69261f36e6aaf035d47c5f0f
Author: Roman Gushchin <roman.gushchin@linux.dev>
mm: introduce BPF kfuncs to deal with memcg pointers
This commit introduces two new BPF kfuncs for converting css pointers
to memcg pointers with proper reference counting: bpf_get_mem_cgroup()
and bpf_put_mem_cgroup().
> diff --git a/mm/Makefile b/mm/Makefile
> index 051e88c69..2d8f9beb3 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
> endif
> ifdef CONFIG_BPF_SYSCALL
> obj-y += bpf_oom.o
> +obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
> endif
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_GUP_TEST) += gup_test.o
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> new file mode 100644
> index 000000000..1e4609774
> --- /dev/null
> +++ b/mm/bpf_memcontrol.c
> @@ -0,0 +1,88 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Memory Controller-related BPF kfuncs and auxiliary code
> + *
> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> + */
> +
> +#include <linux/memcontrol.h>
> +#include <linux/bpf.h>
> +
> +__bpf_kfunc_start_defs();
> +
> +/**
> + * bpf_get_mem_cgroup - Get a reference to a memory cgroup
> + * @css: pointer to the css structure
> + *
> + * Returns a pointer to a mem_cgroup structure after bumping
> + * the corresponding css's reference counter.
> + *
> + * It's fine to pass a css which belongs to any cgroup controller,
> + * e.g. unified hierarchy's main css.
> + *
> + * Implements KF_ACQUIRE semantics.
> + */
> +__bpf_kfunc struct mem_cgroup *
> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
> +{
> + struct mem_cgroup *memcg = NULL;
> + bool rcu_unlock = false;
> +
> + if (!root_mem_cgroup)
> + return NULL;
> +
> + if (root_mem_cgroup->css.ss != css->ss) {
^^^^^^^
Can this dereference a NULL css pointer? The function checks css for
NULL at line 42 with "if (css && css_tryget(css))", which suggests css
can be NULL. If a BPF program calls this function with a NULL css,
wouldn't it crash here before reaching that check?
> + struct cgroup *cgroup = css->cgroup;
> + int ssid = root_mem_cgroup->css.ss->id;
> +
> + rcu_read_lock();
> + rcu_unlock = true;
> + css = rcu_dereference_raw(cgroup->subsys[ssid]);
> + }
> +
> + if (css && css_tryget(css))
> + memcg = container_of(css, struct mem_cgroup, css);
> +
> + if (rcu_unlock)
> + rcu_read_unlock();
> +
> + return memcg;
> +}
[ ... ]
> +__bpf_kfunc_end_defs();
> +
> +BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
> +BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
> +BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
> +
> +BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
[ ... ]
```
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
In-Reply-To-Subject: `mm: introduce BPF kfuncs to deal with memcg pointers`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
@ 2025-10-27 23:57 ` bot+bpf-ci
2025-10-28 17:45 ` Alexei Starovoitov
` (4 subsequent siblings)
5 siblings, 0 replies; 83+ messages in thread
From: bot+bpf-ci @ 2025-10-27 23:57 UTC (permalink / raw)
To: roman.gushchin, akpm
Cc: linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes, andrii,
inwardvessel, linux-mm, cgroups, bpf, martin.lau, song, memxor,
tj, roman.gushchin, ast, andrii, daniel, martin.lau, eddyz87,
yonghong.song, clm, ihor.solodrai
[-- Attachment #1: Type: text/plain, Size: 3214 bytes --]
```
commit 16c40fd36bbe32c63f44388b5613db55c6b27557
Author: Roman Gushchin <roman.gushchin@linux.dev>
mm: introduce BPF struct ops for OOM handling
This patch introduces a BPF struct ops mechanism for custom OOM handling
policies. It allows attaching bpf_oom_ops system-wide or per-memcg, with
handlers executed before the kernel OOM killer.
> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
> new file mode 100644
> index 000000000..c4d09ed9d
> --- /dev/null
> +++ b/mm/bpf_oom.c
[ ... ]
> @@ -0,0 +1,272 @@
[ ... ]
> +static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
> + struct bpf_oom_ops *bpf_oom_ops = kdata;
> + struct mem_cgroup *memcg = NULL;
> +
> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
> + /* Detach from a memory cgroup? */
> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
> + if (IS_ERR_OR_NULL(memcg))
> + goto exit;
> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
> + } else {
> + /* System-wide OOM handler */
> + bpf_oom_ops_ptr = &system_bpf_oom;
> + }
> +
> + /* Hide bpf_oom_ops from new callers */
> + if (!WARN_ON(READ_ONCE(*bpf_oom_ops_ptr) != bpf_oom_ops))
> + WRITE_ONCE(*bpf_oom_ops_ptr, NULL);
> +
> + mem_cgroup_put(memcg);
> +
> +exit:
> + /* Release bpf_oom_ops after a srcu grace period */
> + synchronize_srcu(&bpf_oom_srcu);
> +}
> +
> +#ifdef CONFIG_MEMCG
> +void bpf_oom_memcg_offline(struct mem_cgroup *memcg)
> +{
> + struct bpf_oom_ops *bpf_oom_ops;
> + struct bpf_oom_ctx exec_ctx;
> + u64 cgrp_id;
> + int idx;
> +
> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
> + idx = srcu_read_lock(&bpf_oom_srcu);
> +
> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> + WRITE_ONCE(memcg->bpf_oom, NULL);
Can bpf_oom_ops_unreg() race with bpf_oom_memcg_offline()? If userspace
closes the BPF struct ops link FD while or after the cgroup is being
deleted, both functions could run concurrently on the same memcg:
  Thread 1 (unreg):                     Thread 2 (offline):

  mem_cgroup_get_from_ino()
  [succeeds, cgroup exists]
                                        srcu_read_lock()
                                        READ_ONCE(memcg->bpf_oom) [gets ops]
                                        WRITE_ONCE(memcg->bpf_oom, NULL)
  READ_ONCE(*bpf_oom_ops_ptr)
  [sees NULL]
  WARN_ON() triggers
The WARN_ON in bpf_oom_ops_unreg() assumes *bpf_oom_ops_ptr matches
bpf_oom_ops, but bpf_oom_memcg_offline() can clear it first during
concurrent execution.
> +
> + if (bpf_oom_ops && bpf_oom_ops->handle_cgroup_offline) {
> + cgrp_id = cgroup_id(memcg->css.cgroup);
> + exec_ctx.cgroup_id = cgrp_id;
> + bpf_oom_ops->handle_cgroup_offline(&exec_ctx, cgrp_id);
> + }
> +
> + srcu_read_unlock(&bpf_oom_srcu, idx);
> +}
> +#endif /* CONFIG_MEMCG */
[ ... ]
```
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
In-Reply-To-Subject: `mm: introduce BPF struct ops for OOM handling`
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/18859027430
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-27 23:48 ` bot+bpf-ci
@ 2025-10-28 15:57 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 15:57 UTC (permalink / raw)
To: bot+bpf-ci
Cc: akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes,
andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau, song,
memxor, tj, daniel, eddyz87, yonghong.song, clm, ihor.solodrai
bot+bpf-ci@kernel.org writes:
> ```
> commit 31d238f36a0a6a56efa48c847cb776e67ba2f079
> Author: Roman Gushchin <roman.gushchin@linux.dev>
>
> bpf: initial support for attaching struct ops to cgroups
>
> When a struct ops is being attached and a bpf link is created, allow
> passing a cgroup fd using bpf attr, so that the struct ops can be
> attached to a cgroup instead of globally. The attached struct ops
> doesn't hold a reference to the cgroup; it only preserves the cgroup id.
>
>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>> index eae907218..7205b813e 100644
>> --- a/include/linux/bpf.h
>> +++ b/include/linux/bpf.h
>> @@ -1849,6 +1849,7 @@ struct bpf_struct_ops_link {
>> struct bpf_link link;
>> struct bpf_map __rcu *map;
>> wait_queue_head_t wait_hup;
>> + u64 cgroup_id;
>> };
>>
>> struct bpf_link_primer {
>> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
>> index 45cc5ee19..58664779a 100644
>> --- a/kernel/bpf/bpf_struct_ops.c
>> +++ b/kernel/bpf/bpf_struct_ops.c
>> @@ -13,6 +13,7 @@
>> #include <linux/btf_ids.h>
>> #include <linux/rcupdate_wait.h>
>> #include <linux/poll.h>
>> +#include <linux/cgroup.h>
>>
>> struct bpf_struct_ops_value {
>> struct bpf_struct_ops_common_value common;
>
> [ ... ]
>
>> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>> }
>> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>> attr->link_create.attach_type);
>> +#ifdef CONFIG_CGROUPS
>> + if (attr->link_create.cgroup.relative_fd) {
>> + struct cgroup *cgrp;
>> +
>> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
>> + if (IS_ERR(cgrp))
>> + return PTR_ERR(cgrp);
> ^^^^^^^^^^^^^^^^^^^^^^
>
> Can this leak the link and map? In bpf_struct_ops_link_create(), the link
> is allocated with kzalloc(GFP_USER) and map is obtained with bpf_map_get()
> before this point. If cgroup_get_from_fd() fails, the early return skips
> the err_out label which does kfree(link) and bpf_map_put(map).
>
> All other error paths in this function use "err = <error>; goto err_out;"
> but this one returns directly.
Ok, this is indeed wrong, will fix in the next version.
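For reference, a sketch of the fix the bot suggests, routing the failure
through the existing err_out label:

```
#ifdef CONFIG_CGROUPS
	if (attr->link_create.cgroup.relative_fd) {
		struct cgroup *cgrp;

		cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
		if (IS_ERR(cgrp)) {
			err = PTR_ERR(cgrp);
			goto err_out;	/* frees the link, drops the map ref */
		}

		link->cgroup_id = cgroup_id(cgrp);
		cgroup_put(cgrp);
	}
#endif /* CONFIG_CGROUPS */
```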
Thanks!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-27 23:48 ` bot+bpf-ci
@ 2025-10-28 16:10 ` Roman Gushchin
2025-10-28 17:12 ` Alexei Starovoitov
0 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 16:10 UTC (permalink / raw)
To: bot+bpf-ci
Cc: akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes,
andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau, song,
memxor, tj, daniel, eddyz87, yonghong.song, clm, ihor.solodrai
bot+bpf-ci@kernel.org writes:
> ```
> commit ec8e2e1769704b4f69261f36e6aaf035d47c5f0f
> Author: Roman Gushchin <roman.gushchin@linux.dev>
>
> mm: introduce BPF kfuncs to deal with memcg pointers
>
> This commit introduces two new BPF kfuncs for converting css pointers
> to memcg pointers with proper reference counting: bpf_get_mem_cgroup()
> and bpf_put_mem_cgroup().
>
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 051e88c69..2d8f9beb3 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
>> endif
>> ifdef CONFIG_BPF_SYSCALL
>> obj-y += bpf_oom.o
>> +obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
>> endif
>> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>> obj-$(CONFIG_GUP_TEST) += gup_test.o
>> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
>> new file mode 100644
>> index 000000000..1e4609774
>> --- /dev/null
>> +++ b/mm/bpf_memcontrol.c
>> @@ -0,0 +1,88 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * Memory Controller-related BPF kfuncs and auxiliary code
>> + *
>> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
>> + */
>> +
>> +#include <linux/memcontrol.h>
>> +#include <linux/bpf.h>
>> +
>> +__bpf_kfunc_start_defs();
>> +
>> +/**
>> + * bpf_get_mem_cgroup - Get a reference to a memory cgroup
>> + * @css: pointer to the css structure
>> + *
>> + * Returns a pointer to a mem_cgroup structure after bumping
>> + * the corresponding css's reference counter.
>> + *
>> + * It's fine to pass a css which belongs to any cgroup controller,
>> + * e.g. unified hierarchy's main css.
>> + *
>> + * Implements KF_ACQUIRE semantics.
>> + */
>> +__bpf_kfunc struct mem_cgroup *
>> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
>> +{
>> + struct mem_cgroup *memcg = NULL;
>> + bool rcu_unlock = false;
>> +
>> + if (!root_mem_cgroup)
>> + return NULL;
>> +
>> + if (root_mem_cgroup->css.ss != css->ss) {
> ^^^^^^^
>
> Can this dereference a NULL css pointer? The function checks css for
> NULL at line 42 with "if (css && css_tryget(css))", which suggests css
> can be NULL. If a BPF program calls this function with a NULL css,
> wouldn't it crash here before reaching that check?
Here the css passed as an argument to bpf_get_mem_cgroup() can't be
NULL, because the BPF verifier guarantees that it's a valid css pointer.

However, the result of rcu_dereference_raw(cgroup->subsys[ssid]) can be
NULL, which is why the "if (css && css_tryget(css))" check is required
down below.
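Annotated, the distinction looks like this (the code is the patch's, the
comments mark what the verifier does and doesn't cover):

```
__bpf_kfunc struct mem_cgroup *
bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
{
	struct mem_cgroup *memcg = NULL;
	bool rcu_unlock = false;

	if (!root_mem_cgroup)
		return NULL;

	/* @css itself is verifier-guaranteed to be a valid, non-NULL
	 * pointer (KF_RCU), so no NULL check is needed here */
	if (root_mem_cgroup->css.ss != css->ss) {
		struct cgroup *cgroup = css->cgroup;
		int ssid = root_mem_cgroup->css.ss->id;

		rcu_read_lock();
		rcu_unlock = true;
		/* this lookup CAN return NULL: the memory controller
		 * may not be enabled on @cgroup */
		css = rcu_dereference_raw(cgroup->subsys[ssid]);
	}

	if (css && css_tryget(css))
		memcg = container_of(css, struct mem_cgroup, css);

	if (rcu_unlock)
		rcu_read_unlock();

	return memcg;
}
```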
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events
2025-10-27 23:48 ` bot+bpf-ci
@ 2025-10-28 16:16 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 16:16 UTC (permalink / raw)
To: bot+bpf-ci
Cc: akpm, linux-kernel, ast, surenb, mhocko, shakeel.butt, hannes,
andrii, inwardvessel, linux-mm, cgroups, bpf, martin.lau, song,
memxor, tj, daniel, eddyz87, yonghong.song, clm, ihor.solodrai
bot+bpf-ci@kernel.org writes:
> ```
> commit 96abfe0549713749f00e32bff919a759db284a58
> Author: Roman Gushchin <roman.gushchin@linux.dev>
>
> mm: introduce BPF kfuncs to access memcg statistics and events
>
> This patch adds four new BPF kfuncs for accessing memory cgroup statistics
> and events, making them available to all BPF program types for implementing
> OOM policies and accelerating memcg data access.
>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 39a6c7c87..b9e08dddd 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>
> [ ... ]
>
>> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
>> index 76c342318..387255b8a 100644
>> --- a/mm/bpf_memcontrol.c
>> +++ b/mm/bpf_memcontrol.c
>
> [ ... ]
>
>> @@ -75,6 +75,56 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
>> css_put(&memcg->css);
>> }
>>
>> +/**
>> + * bpf_mem_cgroup_vm_events - Read memory cgroup's vm event counter
>> + * @memcg: memory cgroup
>> + * @event: event id
>> + *
>> + * Allows reading memory cgroup event counters.
>> + */
>> +__bpf_kfunc unsigned long bpf_mem_cgroup_vm_events(struct mem_cgroup *memcg,
>> + enum vm_event_item event)
>> +{
>> + return memcg_events(memcg, event);
>> +}
>> +
>> +/**
>> + * bpf_mem_cgroup_usage - Read memory cgroup's usage
>> + * @memcg: memory cgroup
>> + *
>> + * Returns current memory cgroup size in bytes.
>> + */
>> +__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
>> +{
>> + return page_counter_read(&memcg->memory);
>> +}
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> The documentation says this returns bytes, but page_counter_read() returns
> pages. In mm/memcontrol-v1.c, the existing mem_cgroup_usage() function's
> return value is multiplied by PAGE_SIZE to convert to bytes:
>
> return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
>
> Should this function multiply the result by PAGE_SIZE to match its
> documentation, or should the documentation be updated to say it returns
> pages?
Yep, correct, fixed.
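The likely shape of the fix, assuming the doc comment (bytes) is the
intended contract:

```
__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
{
	/* page_counter_read() returns pages, the documented unit is bytes */
	return page_counter_read(&memcg->memory) * PAGE_SIZE;
}
```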
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-28 16:10 ` Roman Gushchin
@ 2025-10-28 17:12 ` Alexei Starovoitov
2025-10-28 18:03 ` Chris Mason
0 siblings, 1 reply; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-28 17:12 UTC (permalink / raw)
To: Roman Gushchin
Cc: bot+bpf-ci, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, inwardvessel, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Daniel Borkmann, Eduard,
Yonghong Song, Chris Mason, Ihor Solodrai
On Tue, Oct 28, 2025 at 9:11 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> bot+bpf-ci@kernel.org writes:
>
> > ```
> > commit ec8e2e1769704b4f69261f36e6aaf035d47c5f0f
> > Author: Roman Gushchin <roman.gushchin@linux.dev>
> >
> > mm: introduce BPF kfuncs to deal with memcg pointers
> >
> > This commit introduces two new BPF kfuncs for converting css pointers
> > to memcg pointers with proper reference counting: bpf_get_mem_cgroup()
> > and bpf_put_mem_cgroup().
> >
> >> diff --git a/mm/Makefile b/mm/Makefile
> >> index 051e88c69..2d8f9beb3 100644
> >> --- a/mm/Makefile
> >> +++ b/mm/Makefile
> >> @@ -107,6 +107,7 @@ obj-$(CONFIG_MEMCG) += swap_cgroup.o
> >> endif
> >> ifdef CONFIG_BPF_SYSCALL
> >> obj-y += bpf_oom.o
> >> +obj-$(CONFIG_MEMCG) += bpf_memcontrol.o
> >> endif
> >> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> >> obj-$(CONFIG_GUP_TEST) += gup_test.o
> >> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> >> new file mode 100644
> >> index 000000000..1e4609774
> >> --- /dev/null
> >> +++ b/mm/bpf_memcontrol.c
> >> @@ -0,0 +1,88 @@
> >> +// SPDX-License-Identifier: GPL-2.0-or-later
> >> +/*
> >> + * Memory Controller-related BPF kfuncs and auxiliary code
> >> + *
> >> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> >> + */
> >> +
> >> +#include <linux/memcontrol.h>
> >> +#include <linux/bpf.h>
> >> +
> >> +__bpf_kfunc_start_defs();
> >> +
> >> +/**
> >> + * bpf_get_mem_cgroup - Get a reference to a memory cgroup
> >> + * @css: pointer to the css structure
> >> + *
> >> + * Returns a pointer to a mem_cgroup structure after bumping
> >> + * the corresponding css's reference counter.
> >> + *
> >> + * It's fine to pass a css which belongs to any cgroup controller,
> >> + * e.g. unified hierarchy's main css.
> >> + *
> >> + * Implements KF_ACQUIRE semantics.
> >> + */
> >> +__bpf_kfunc struct mem_cgroup *
> >> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
> >> +{
> >> + struct mem_cgroup *memcg = NULL;
> >> + bool rcu_unlock = false;
> >> +
> >> + if (!root_mem_cgroup)
> >> + return NULL;
> >> +
> >> + if (root_mem_cgroup->css.ss != css->ss) {
> > ^^^^^^^
> >
> > Can this dereference a NULL css pointer? The function checks css for
> > NULL at line 42 with "if (css && css_tryget(css))", which suggests css
> > can be NULL. If a BPF program calls this function with a NULL css,
> > wouldn't it crash here before reaching that check?
>
> Here the css passed as an argument to bpf_get_mem_cgroup() can't be
> NULL, because the BPF verifier guarantees that it's a valid css pointer.
>
> However, the result of rcu_dereference_raw(cgroup->subsys[ssid]) can be
> NULL, which is why the "if (css && css_tryget(css))" check is required
> down below.
Yeah. Not sure how feasible it is to teach AI about KF_RCU semantics.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-27 23:17 ` [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
@ 2025-10-28 17:42 ` Tejun Heo
2025-10-28 18:12 ` Roman Gushchin
1 sibling, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-28 17:42 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi
On Mon, Oct 27, 2025 at 04:17:11PM -0700, Roman Gushchin wrote:
> +__bpf_kfunc struct mem_cgroup *
> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
> +{
> + struct mem_cgroup *memcg = NULL;
> + bool rcu_unlock = false;
> +
> + if (!root_mem_cgroup)
> + return NULL;
> +
> + if (root_mem_cgroup->css.ss != css->ss) {
> + struct cgroup *cgroup = css->cgroup;
> + int ssid = root_mem_cgroup->css.ss->id;
> +
> + rcu_read_lock();
> + rcu_unlock = true;
> + css = rcu_dereference_raw(cgroup->subsys[ssid]);
Would it make more sense to use cgroup_e_css()?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
2025-10-27 23:57 ` bot+bpf-ci
@ 2025-10-28 17:45 ` Alexei Starovoitov
2025-10-28 18:42 ` Roman Gushchin
2025-10-28 21:33 ` Song Liu
` (3 subsequent siblings)
5 siblings, 1 reply; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-28 17:45 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, LKML, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, open list:CONTROL GROUP (CGROUP), bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon, Oct 27, 2025 at 4:18 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> + struct bpf_oom_ops *bpf_oom_ops = NULL;
> + struct mem_cgroup __maybe_unused *memcg;
> + int idx, ret = 0;
> +
> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
> + idx = srcu_read_lock(&bpf_oom_srcu);
> +
> +#ifdef CONFIG_MEMCG
> + /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
> + for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> + if (!bpf_oom_ops)
> + continue;
> +
> + /* Call BPF OOM handler */
> + ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
> + if (ret && oc->bpf_memory_freed)
> + goto exit;
> + }
> +#endif /* CONFIG_MEMCG */
> +
> + /*
> + * System-wide OOM or per-memcg BPF OOM handler wasn't successful?
> + * Try system_bpf_oom.
> + */
> + bpf_oom_ops = READ_ONCE(system_bpf_oom);
> + if (!bpf_oom_ops)
> + goto exit;
> +
> + /* Call BPF OOM handler */
> + ret = bpf_ops_handle_oom(bpf_oom_ops, NULL, oc);
> +exit:
> + srcu_read_unlock(&bpf_oom_srcu, idx);
> + return ret && oc->bpf_memory_freed;
> +}
...
> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
> + struct bpf_oom_ops *bpf_oom_ops = kdata;
> + struct mem_cgroup *memcg = NULL;
> + int err = 0;
> +
> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
> + /* Attach to a memory cgroup? */
> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
> + if (IS_ERR_OR_NULL(memcg))
> + return PTR_ERR(memcg);
> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
> + } else {
> + /* System-wide OOM handler */
> + bpf_oom_ops_ptr = &system_bpf_oom;
> + }
I don't like the fallback and special case of cgroup_id == 0.
imo it would be cleaner to require CONFIG_MEMCG for this feature
and only allow attach to a cgroup.
There is always a root cgroup that can be attached to and that
handler will be acting as "system wide" oom handler.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-28 17:12 ` Alexei Starovoitov
@ 2025-10-28 18:03 ` Chris Mason
2025-10-28 18:32 ` Roman Gushchin
0 siblings, 1 reply; 83+ messages in thread
From: Chris Mason @ 2025-10-28 18:03 UTC (permalink / raw)
To: Alexei Starovoitov, Roman Gushchin
Cc: bot+bpf-ci, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, inwardvessel, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Daniel Borkmann, Eduard,
Yonghong Song, Ihor Solodrai
On 10/28/25 1:12 PM, Alexei Starovoitov wrote:
> On Tue, Oct 28, 2025 at 9:11 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> bot+bpf-ci@kernel.org writes:
>>
>>> ```
>>> commit ec8e2e1769704b4f69261f36e6aaf035d47c5f0f
>>> Author: Roman Gushchin <roman.gushchin@linux.dev>
>>> Can this dereference a NULL css pointer? The function checks css for
>>> NULL at line 42 with "if (css && css_tryget(css))", which suggests css
>>> can be NULL. If a BPF program calls this function with a NULL css,
>>> wouldn't it crash here before reaching that check?
>>
>> Here the css passed as an argument to bpf_get_mem_cgroup() can't be
>> NULL, because the BPF verifier guarantees that it's a valid css pointer.
>>
>> However, the result of rcu_dereference_raw(cgroup->subsys[ssid]) can be
>> NULL, which is why the "if (css && css_tryget(css))" check is required
>> down below.
>
> Yeah. Not sure how feasible it is to teach AI about KF_RCU semantics.
I pulled it down locally to try, and with semcode it is properly catching this:
False Positives Eliminated
1. EH-001 NULL dereference - css parameter dereferenced without check
- Why false positive: BPF verifier ensures pointer parameters are
non-NULL. All kernel kfuncs follow the same pattern of not checking
parameters for NULL (css_rstat_updated, css_rstat_flush,
bpf_put_mem_cgroup, etc.). The KF_RET_NULL flag controls return value,
not parameter nullability.
My plan is to just have the prompt read Documentation/bpf/kfuncs.rst,
which Eduard suggested. I'll make a bpf kfuncs pattern and do that.
-chris
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-28 17:42 ` Tejun Heo
@ 2025-10-28 18:12 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 18:12 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi
Tejun Heo <tj@kernel.org> writes:
> On Mon, Oct 27, 2025 at 04:17:11PM -0700, Roman Gushchin wrote:
>> +__bpf_kfunc struct mem_cgroup *
>> +bpf_get_mem_cgroup(struct cgroup_subsys_state *css)
>> +{
>> + struct mem_cgroup *memcg = NULL;
>> + bool rcu_unlock = false;
>> +
>> + if (!root_mem_cgroup)
>> + return NULL;
>> +
>> + if (root_mem_cgroup->css.ss != css->ss) {
>> + struct cgroup *cgroup = css->cgroup;
>> + int ssid = root_mem_cgroup->css.ss->id;
>> +
>> + rcu_read_lock();
>> + rcu_unlock = true;
>> + css = rcu_dereference_raw(cgroup->subsys[ssid]);
>
> Would it make more sense to use cgroup_e_css()?
Good call, will update in the next version.
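A sketch of that direction, assuming cgroup_e_css()'s usual semantics
(it resolves the effective css of a controller for a given cgroup and
takes its own reference):

```
	if (root_mem_cgroup->css.ss != css->ss) {
		/* cgroup_e_css() handles RCU internally and returns the
		 * effective memory css with a reference already held */
		css = cgroup_e_css(css->cgroup, &memory_cgrp_subsys);
		return css ? container_of(css, struct mem_cgroup, css) : NULL;
	}

	if (css_tryget(css))
		return container_of(css, struct mem_cgroup, css);

	return NULL;
```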
Thank you!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 08/23] mm: introduce BPF kfuncs to deal with memcg pointers
2025-10-28 18:03 ` Chris Mason
@ 2025-10-28 18:32 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 18:32 UTC (permalink / raw)
To: Chris Mason
Cc: Alexei Starovoitov, bot+bpf-ci, Andrew Morton, LKML,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, inwardvessel,
linux-mm, open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo, Daniel Borkmann,
Eduard, Yonghong Song, Ihor Solodrai
Chris Mason <clm@meta.com> writes:
> On 10/28/25 1:12 PM, Alexei Starovoitov wrote:
>> On Tue, Oct 28, 2025 at 9:11 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>>
>>> bot+bpf-ci@kernel.org writes:
>>>
>>>> ```
>>>> commit ec8e2e1769704b4f69261f36e6aaf035d47c5f0f
>>>> Author: Roman Gushchin <roman.gushchin@linux.dev>
>>>> Can this dereference a NULL css pointer? The function checks css for
>>>> NULL at line 42 with "if (css && css_tryget(css))", which suggests css
>>>> can be NULL. If a BPF program calls this function with a NULL css,
>>>> wouldn't it crash here before reaching that check?
>>>
>>> Here the css passed as an argument to bpf_get_mem_cgroup() can't be
>>> NULL, because the BPF verifier guarantees that it's a valid css pointer.
>>>
>>> However, the result of rcu_dereference_raw(cgroup->subsys[ssid]) can be
>>> NULL, which is why the "if (css && css_tryget(css))" check is required
>>> down below.
>>
>> Yeah. Not sure how feasible it is to teach AI about KF_RCU semantics.
>
> I pulled it down locally to try and w/semcode it is properly catching this:
>
> False Positives Eliminated
>
> 1. EH-001 NULL dereference - css parameter dereferenced without check
>
> - Why false positive: BPF verifier ensures pointer parameters are
> non-NULL. All kernel kfuncs follow the same pattern of not checking
> parameters for NULL (css_rstat_updated, css_rstat_flush,
> bpf_put_mem_cgroup, etc.). The KF_RET_NULL flag controls return value,
> not parameter nullability.
>
> My plan is to just have the prompt read Documentation/bpf/kfuncs.rst,
> which Eduard suggested. I'll make a bpf kfuncs pattern and do that.
Awesome, thank you!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-28 17:45 ` Alexei Starovoitov
@ 2025-10-28 18:42 ` Roman Gushchin
2025-10-28 22:07 ` Alexei Starovoitov
0 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 18:42 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrew Morton, LKML, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, open list:CONTROL GROUP (CGROUP), bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Mon, Oct 27, 2025 at 4:18 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> +bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> + struct bpf_oom_ops *bpf_oom_ops = NULL;
>> + struct mem_cgroup __maybe_unused *memcg;
>> + int idx, ret = 0;
>> +
>> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
>> + idx = srcu_read_lock(&bpf_oom_srcu);
>> +
>> +#ifdef CONFIG_MEMCG
>> + /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>> + for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
>> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
>> + if (!bpf_oom_ops)
>> + continue;
>> +
>> + /* Call BPF OOM handler */
>> + ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
>> + if (ret && oc->bpf_memory_freed)
>> + goto exit;
>> + }
>> +#endif /* CONFIG_MEMCG */
>> +
>> + /*
>> + * System-wide OOM or per-memcg BPF OOM handler wasn't successful?
>> + * Try system_bpf_oom.
>> + */
>> + bpf_oom_ops = READ_ONCE(system_bpf_oom);
>> + if (!bpf_oom_ops)
>> + goto exit;
>> +
>> + /* Call BPF OOM handler */
>> + ret = bpf_ops_handle_oom(bpf_oom_ops, NULL, oc);
>> +exit:
>> + srcu_read_unlock(&bpf_oom_srcu, idx);
>> + return ret && oc->bpf_memory_freed;
>> +}
>
> ...
>
>> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
>> +{
>> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
>> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
>> + struct bpf_oom_ops *bpf_oom_ops = kdata;
>> + struct mem_cgroup *memcg = NULL;
>> + int err = 0;
>> +
>> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
>> + /* Attach to a memory cgroup? */
>> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
>> + if (IS_ERR_OR_NULL(memcg))
>> + return PTR_ERR(memcg);
>> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
>> + } else {
>> + /* System-wide OOM handler */
>> + bpf_oom_ops_ptr = &system_bpf_oom;
>> + }
>
> I don't like the fallback and special case of cgroup_id == 0.
> imo it would be cleaner to require CONFIG_MEMCG for this feature
> and only allow attach to a cgroup.
> There is always a root cgroup that can be attached to and that
> handler will be acting as "system wide" oom handler.
I thought about it, but then it can't be used on !CONFIG_MEMCG
configurations and also before cgroupfs is mounted, root cgroup
is created etc. This is why system-wide things are often handled in a
special way, e.g. by PSI (grep system_group_pcpu).
I think supporting !CONFIG_MEMCG configurations might be useful for
some very stripped down VM's, for example.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
2025-10-27 23:57 ` bot+bpf-ci
2025-10-28 17:45 ` Alexei Starovoitov
@ 2025-10-28 21:33 ` Song Liu
2025-10-28 23:24 ` Roman Gushchin
2025-10-30 0:20 ` Martin KaFai Lau
` (2 subsequent siblings)
5 siblings, 1 reply; 83+ messages in thread
From: Song Liu @ 2025-10-28 21:33 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon, Oct 27, 2025 at 4:18 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
[...]
> +
> +struct bpf_oom_ops {
> + /**
> + * @handle_out_of_memory: Out of memory bpf handler, called before
> + * the in-kernel OOM killer.
> + * @ctx: Execution context
> + * @oc: OOM control structure
> + *
> + * Should return 1 if some memory was freed up, otherwise
> + * the in-kernel OOM killer is invoked.
> + */
> + int (*handle_out_of_memory)(struct bpf_oom_ctx *ctx, struct oom_control *oc);
> +
> + /**
> + * @handle_cgroup_offline: Cgroup offline callback
> + * @ctx: Execution context
> + * @cgroup_id: Id of deleted cgroup
> + *
> + * Called if the cgroup with the attached bpf_oom_ops is deleted.
> + */
> + void (*handle_cgroup_offline)(struct bpf_oom_ctx *ctx, u64 cgroup_id);
handle_out_of_memory() and handle_cgroup_offline() take a bpf_oom_ctx,
which is just cgroup_id for now. Shall we pass in struct mem_cgroup, which
should be easier to use?
Thanks,
Song
> +
> + /**
> + * @name: BPF OOM policy name
> + */
> + char name[BPF_OOM_NAME_MAX_LEN];
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +/**
> + * @bpf_handle_oom: handle out of memory condition using bpf
> + * @oc: OOM control structure
> + *
> + * Returns true if some memory was freed.
> + */
> +bool bpf_handle_oom(struct oom_control *oc);
> +
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-28 18:42 ` Roman Gushchin
@ 2025-10-28 22:07 ` Alexei Starovoitov
2025-10-28 22:56 ` Roman Gushchin
0 siblings, 1 reply; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-28 22:07 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, LKML, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, open list:CONTROL GROUP (CGROUP), bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Oct 28, 2025 at 11:42 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Mon, Oct 27, 2025 at 4:18 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >>
> >> +bool bpf_handle_oom(struct oom_control *oc)
> >> +{
> >> + struct bpf_oom_ops *bpf_oom_ops = NULL;
> >> + struct mem_cgroup __maybe_unused *memcg;
> >> + int idx, ret = 0;
> >> +
> >> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
> >> + idx = srcu_read_lock(&bpf_oom_srcu);
> >> +
> >> +#ifdef CONFIG_MEMCG
> >> + /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
> >> + for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
> >> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> >> + if (!bpf_oom_ops)
> >> + continue;
> >> +
> >> + /* Call BPF OOM handler */
> >> + ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
> >> + if (ret && oc->bpf_memory_freed)
> >> + goto exit;
> >> + }
> >> +#endif /* CONFIG_MEMCG */
> >> +
> >> + /*
> >> + * System-wide OOM or per-memcg BPF OOM handler wasn't successful?
> >> + * Try system_bpf_oom.
> >> + */
> >> + bpf_oom_ops = READ_ONCE(system_bpf_oom);
> >> + if (!bpf_oom_ops)
> >> + goto exit;
> >> +
> >> + /* Call BPF OOM handler */
> >> + ret = bpf_ops_handle_oom(bpf_oom_ops, NULL, oc);
> >> +exit:
> >> + srcu_read_unlock(&bpf_oom_srcu, idx);
> >> + return ret && oc->bpf_memory_freed;
> >> +}
> >
> > ...
> >
> >> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> >> +{
> >> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
> >> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
> >> + struct bpf_oom_ops *bpf_oom_ops = kdata;
> >> + struct mem_cgroup *memcg = NULL;
> >> + int err = 0;
> >> +
> >> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
> >> + /* Attach to a memory cgroup? */
> >> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
> >> + if (IS_ERR_OR_NULL(memcg))
> >> + return PTR_ERR(memcg);
> >> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
> >> + } else {
> >> + /* System-wide OOM handler */
> >> + bpf_oom_ops_ptr = &system_bpf_oom;
> >> + }
> >
> > I don't like the fallback and special case of cgroup_id == 0.
> > imo it would be cleaner to require CONFIG_MEMCG for this feature
> > and only allow attach to a cgroup.
> > There is always a root cgroup that can be attached to and that
> > handler will be acting as "system wide" oom handler.
>
> I thought about it, but then it can't be used on !CONFIG_MEMCG
> configurations and also before cgroupfs is mounted, root cgroup
> is created etc.
before that bpf isn't viable either, and oom is certainly not an issue.
> This is why system-wide things are often handled in a
> special way, e.g. by PSI (grep system_group_pcpu).
>
> I think supporting !CONFIG_MEMCG configurations might be useful for
> some very stripped down VM's, for example.
I thought I wouldn't need to convince the guy who converted bpf maps
to memcg, which made it pretty much mandatory for the bpf subsystem :)
I think the following is long overdue:
diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
index eb3de35734f0..af60be6d3d41 100644
--- a/kernel/bpf/Kconfig
+++ b/kernel/bpf/Kconfig
@@ -34,6 +34,7 @@ config BPF_SYSCALL
select NET_SOCK_MSG if NET
select NET_XGRESS if NET
select PAGE_POOL if NET
+ depends on MEMCG
default n
With this we can cleanup a ton of code.
Let's not add more hacks just because some weird thing
still wants !MEMCG. If they do, they will survive without bpf.
^ permalink raw reply related [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-28 22:07 ` Alexei Starovoitov
@ 2025-10-28 22:56 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 22:56 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Andrew Morton, LKML, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, open list:CONTROL GROUP (CGROUP), bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Tue, Oct 28, 2025 at 11:42 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>>
>> > On Mon, Oct 27, 2025 at 4:18 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>> >>
>> >> +bool bpf_handle_oom(struct oom_control *oc)
>> >> +{
>> >> + struct bpf_oom_ops *bpf_oom_ops = NULL;
>> >> + struct mem_cgroup __maybe_unused *memcg;
>> >> + int idx, ret = 0;
>> >> +
>> >> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
>> >> + idx = srcu_read_lock(&bpf_oom_srcu);
>> >> +
>> >> +#ifdef CONFIG_MEMCG
>> >> + /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>> >> + for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
>> >> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
>> >> + if (!bpf_oom_ops)
>> >> + continue;
>> >> +
>> >> + /* Call BPF OOM handler */
>> >> + ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
>> >> + if (ret && oc->bpf_memory_freed)
>> >> + goto exit;
>> >> + }
>> >> +#endif /* CONFIG_MEMCG */
>> >> +
>> >> + /*
>> >> + * System-wide OOM or per-memcg BPF OOM handler wasn't successful?
>> >> + * Try system_bpf_oom.
>> >> + */
>> >> + bpf_oom_ops = READ_ONCE(system_bpf_oom);
>> >> + if (!bpf_oom_ops)
>> >> + goto exit;
>> >> +
>> >> + /* Call BPF OOM handler */
>> >> + ret = bpf_ops_handle_oom(bpf_oom_ops, NULL, oc);
>> >> +exit:
>> >> + srcu_read_unlock(&bpf_oom_srcu, idx);
>> >> + return ret && oc->bpf_memory_freed;
>> >> +}
>> >
>> > ...
>> >
>> >> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
>> >> +{
>> >> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
>> >> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
>> >> + struct bpf_oom_ops *bpf_oom_ops = kdata;
>> >> + struct mem_cgroup *memcg = NULL;
>> >> + int err = 0;
>> >> +
>> >> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
>> >> + /* Attach to a memory cgroup? */
>> >> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
>> >> + if (IS_ERR_OR_NULL(memcg))
>> >> + return PTR_ERR(memcg);
>> >> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
>> >> + } else {
>> >> + /* System-wide OOM handler */
>> >> + bpf_oom_ops_ptr = &system_bpf_oom;
>> >> + }
>> >
>> > I don't like the fallback and special case of cgroup_id == 0.
>> > imo it would be cleaner to require CONFIG_MEMCG for this feature
>> > and only allow attach to a cgroup.
>> > There is always a root cgroup that can be attached to and that
>> > handler will be acting as "system wide" oom handler.
>>
>> I thought about it, but then it can't be used on !CONFIG_MEMCG
>> configurations and also before cgroupfs is mounted, root cgroup
>> is created etc.
>
> before that bpf isn't viable either, and oom is certainly not an issue.
>
>> This is why system-wide things are often handled in a
>> special way, e.g. by PSI (grep system_group_pcpu).
>>
>> I think supporting !CONFIG_MEMCG configurations might be useful for
>> some very stripped down VM's, for example.
>
> I thought I wouldn't need to convince the guy who converted bpf maps
> to memcg, which made it pretty much mandatory for the bpf subsystem :)
> I think the following is long overdue:
> diff --git a/kernel/bpf/Kconfig b/kernel/bpf/Kconfig
> index eb3de35734f0..af60be6d3d41 100644
> --- a/kernel/bpf/Kconfig
> +++ b/kernel/bpf/Kconfig
> @@ -34,6 +34,7 @@ config BPF_SYSCALL
> select NET_SOCK_MSG if NET
> select NET_XGRESS if NET
> select PAGE_POOL if NET
> + depends on MEMCG
> default n
>
> With this we can cleanup a ton of code.
> Let's not add more hacks just because some weird thing
> still wants !MEMCG. If they do, they will survive without bpf.
Ok, this is bold, but why not?
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
You're going to land it separately, I guess?
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-28 21:33 ` Song Liu
@ 2025-10-28 23:24 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-28 23:24 UTC (permalink / raw)
To: Song Liu
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo
Song Liu <song@kernel.org> writes:
> On Mon, Oct 27, 2025 at 4:18 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> [...]
>> +
>> +struct bpf_oom_ops {
>> + /**
>> + * @handle_out_of_memory: Out of memory bpf handler, called before
>> + * the in-kernel OOM killer.
>> + * @ctx: Execution context
>> + * @oc: OOM control structure
>> + *
>> + * Should return 1 if some memory was freed up, otherwise
>> + * the in-kernel OOM killer is invoked.
>> + */
>> + int (*handle_out_of_memory)(struct bpf_oom_ctx *ctx, struct oom_control *oc);
>> +
>> + /**
>> + * @handle_cgroup_offline: Cgroup offline callback
>> + * @ctx: Execution context
>> + * @cgroup_id: Id of deleted cgroup
>> + *
>> + * Called if the cgroup with the attached bpf_oom_ops is deleted.
>> + */
>> + void (*handle_cgroup_offline)(struct bpf_oom_ctx *ctx, u64 cgroup_id);
>
> handle_out_of_memory() and handle_cgroup_offline() take a bpf_oom_ctx,
> which is just cgroup_id for now. Shall we pass in struct mem_cgroup, which
> should be easier to use?
I want it to be easy to extend, which is why it's a structure. But I can
pass a memcg pointer instead of cgroup_id, not a problem.
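Something along these lines (hypothetical layout, not from this series):

```
/* hypothetical: extensible context passed to all bpf_oom_ops callbacks */
struct bpf_oom_ctx {
	struct mem_cgroup *memcg;	/* NULL for the system-wide handler */
	/* new fields can be appended without changing callback signatures */
};
```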
Thanks!
>
> Thanks,
> Song
>
>> +
>> + /**
>> + * @name: BPF OOM policy name
>> + */
>> + char name[BPF_OOM_NAME_MAX_LEN];
>> +};
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +/**
>> + * @bpf_handle_oom: handle out of memory condition using bpf
>> + * @oc: OOM control structure
>> + *
>> + * Returns true if some memory was freed.
>> + */
>> +bool bpf_handle_oom(struct oom_control *oc);
>> +
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
@ 2025-10-29 18:01 ` Song Liu
2025-10-29 20:26 ` Roman Gushchin
2025-10-30 17:22 ` Roman Gushchin
2025-10-29 18:14 ` Tejun Heo
` (2 subsequent siblings)
4 siblings, 2 replies; 83+ messages in thread
From: Song Liu @ 2025-10-29 18:01 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon, Oct 27, 2025 at 4:17 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
[...]
> struct bpf_struct_ops_value {
> struct bpf_struct_ops_common_value common;
> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> }
> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
> attr->link_create.attach_type);
> +#ifdef CONFIG_CGROUPS
> + if (attr->link_create.cgroup.relative_fd) {
> + struct cgroup *cgrp;
> +
> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
We should use "target_fd" here, not relative_fd.
Also, 0 is a valid fd, so we cannot use target_fd == 0 to attach to
global memcg.
Thanks,
Song
> + if (IS_ERR(cgrp))
> + return PTR_ERR(cgrp);
> +
> + link->cgroup_id = cgroup_id(cgrp);
> + cgroup_put(cgrp);
> + }
> +#endif /* CONFIG_CGROUPS */
>
> err = bpf_link_prime(&link->link, &link_primer);
> if (err)
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
2025-10-29 18:01 ` Song Liu
@ 2025-10-29 18:14 ` Tejun Heo
2025-10-29 20:25 ` Roman Gushchin
2025-10-29 21:04 ` Song Liu
2025-10-30 0:43 ` Martin KaFai Lau
4 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-29 18:14 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi
Hello,
On Mon, Oct 27, 2025 at 04:17:05PM -0700, Roman Gushchin wrote:
> @@ -1849,6 +1849,7 @@ struct bpf_struct_ops_link {
> struct bpf_link link;
> struct bpf_map __rcu *map;
> wait_queue_head_t wait_hup;
> + u64 cgroup_id;
> };
BTW, for sched_ext sub-sched support, I'm just adding cgroup_id to
struct_ops, which seems to work fine. It'd be nice to align on the same
approach. What are the benefits of doing this through fd?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 18:14 ` Tejun Heo
@ 2025-10-29 20:25 ` Roman Gushchin
2025-10-29 20:36 ` Tejun Heo
0 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-29 20:25 UTC (permalink / raw)
To: Tejun Heo
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi
Tejun Heo <tj@kernel.org> writes:
> Hello,
>
> On Mon, Oct 27, 2025 at 04:17:05PM -0700, Roman Gushchin wrote:
>> @@ -1849,6 +1849,7 @@ struct bpf_struct_ops_link {
>> struct bpf_link link;
>> struct bpf_map __rcu *map;
>> wait_queue_head_t wait_hup;
>> + u64 cgroup_id;
>> };
>
> BTW, for sched_ext sub-sched support, I'm just adding cgroup_id to
> struct_ops, which seems to work fine. It'd be nice to align on the same
> approach. What are the benefits of doing this through fd?
Then you can attach a single struct ops to multiple cgroups (or Idk
sockets or processes or some other objects in the future).
And IMO it's just a more generic solution.
Thanks!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 18:01 ` Song Liu
@ 2025-10-29 20:26 ` Roman Gushchin
2025-10-30 17:22 ` Roman Gushchin
1 sibling, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-29 20:26 UTC (permalink / raw)
To: Song Liu
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo
Song Liu <song@kernel.org> writes:
> On Mon, Oct 27, 2025 at 4:17 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> [...]
>> struct bpf_struct_ops_value {
>> struct bpf_struct_ops_common_value common;
>> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>> }
>> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>> attr->link_create.attach_type);
>> +#ifdef CONFIG_CGROUPS
>> + if (attr->link_create.cgroup.relative_fd) {
>> + struct cgroup *cgrp;
>> +
>> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
>
> We should use "target_fd" here, not relative_fd.
Ok, thanks!
>
> Also, 0 is a valid fd, so we cannot use target_fd == 0 to attach to
> global memcg.
Yep, switching to using root_memcg's fd instead.
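I.e. roughly (sketch; target_fd is the generic attach-target field of
link_create, and userspace would always pass an explicit cgroup fd,
including for the root cgroup):

```
	cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
	if (IS_ERR(cgrp)) {
		err = PTR_ERR(cgrp);
		goto err_out;
	}
```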
Thanks!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 20:25 ` Roman Gushchin
@ 2025-10-29 20:36 ` Tejun Heo
2025-10-29 21:18 ` Song Liu
0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-29 20:36 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi
On Wed, Oct 29, 2025 at 01:25:52PM -0700, Roman Gushchin wrote:
> > BTW, for sched_ext sub-sched support, I'm just adding cgroup_id to
> > struct_ops, which seems to work fine. It'd be nice to align on the same
> > approach. What are the benefits of doing this through fd?
>
> Then you can attach a single struct ops to multiple cgroups (or Idk
> sockets or processes or some other objects in the future).
> And IMO it's just a more generic solution.
I'm not very convinced that sharing a single struct_ops instance across
multiple cgroups would be all that useful. If you map this to normal
userspace programs, a given struct_ops instance is a package of code and all
the global data (maps). ie. it's not like running the same program multiple
times against different targets. It's more akin to running a single program
instance which can handle multiple targets.
Maybe that's useful in some cases, but that program would have to explicitly
distinguish the cgroups that it's attached to. I have a hard time imagining
use cases where a single struct_ops has to service multiple disjoint cgroups
in the hierarchy and it ends up stepping outside of the usual operation
model of cgroups - commonality being expressed through the hierarchical
structure.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
` (2 preceding siblings ...)
2025-10-29 18:14 ` Tejun Heo
@ 2025-10-29 21:04 ` Song Liu
2025-10-30 0:43 ` Martin KaFai Lau
4 siblings, 0 replies; 83+ messages in thread
From: Song Liu @ 2025-10-29 21:04 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon, Oct 27, 2025 at 4:17 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> When a struct ops is being attached and a bpf link is created,
> allow passing a cgroup fd using bpf attr, so that the struct ops
> can be attached to a cgroup instead of globally.
>
> Attached struct ops doesn't hold a reference to the cgroup,
> only preserves cgroup id.
With the current model, when a cgroup is freed, the bpf link still
holds a reference to the struct_ops. Can we make the cgroup hold
a reference to the struct_ops, so that the struct_ops is freed
automatically when the cgroup is freed?
I think the downside is that we will need an API to remove/change
a per-cgroup OOM handler.
Thanks,
Song
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 20:36 ` Tejun Heo
@ 2025-10-29 21:18 ` Song Liu
2025-10-29 21:27 ` Tejun Heo
2025-10-29 21:53 ` Roman Gushchin
0 siblings, 2 replies; 83+ messages in thread
From: Song Liu @ 2025-10-29 21:18 UTC (permalink / raw)
To: Tejun Heo
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi
Hi Tejun,
On Wed, Oct 29, 2025 at 1:36 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Wed, Oct 29, 2025 at 01:25:52PM -0700, Roman Gushchin wrote:
> > > BTW, for sched_ext sub-sched support, I'm just adding cgroup_id to
> > > struct_ops, which seems to work fine. It'd be nice to align on the same
> > > approach. What are the benefits of doing this through fd?
> >
> > Then you can attach a single struct ops to multiple cgroups (or Idk
> > sockets or processes or some other objects in the future).
> > And IMO it's just a more generic solution.
>
> I'm not very convinced that sharing a single struct_ops instance across
> multiple cgroups would be all that useful. If you map this to normal
> userspace programs, a given struct_ops instance is a package of code and all
> the global data (maps). ie. it's not like running the same program multiple
> times against different targets. It's more akin to running a single program
> instance which can handle multiple targets.
>
> Maybe that's useful in some cases, but that program would have to explicitly
> distinguish the cgroups that it's attached to. I have a hard time imagining
> use cases where a single struct_ops has to service multiple disjoint cgroups
> in the hierarchy and it ends up stepping outside of the usual operation
> model of cgroups - commonality being expressed through the hierarchical
> structure.
How about we pass a pointer to mem_cgroup (and/or related pointers)
to all the callbacks in the struct_ops? AFAICT, in-kernel _ops structures like
struct file_operations and struct tcp_congestion_ops use this method. And
we can actually implement struct tcp_congestion_ops in BPF. With the
struct tcp_congestion_ops model, the struct_ops map and the struct_ops
link are both shared among multiple instances (sockets).
With this model, the system admin with root access can load a bunch of
available oom handlers, and users in their container can pick a preferred
oom handler for the sub cgroup. AFAICT, the users in the container can
pick the proper OOM handler without CAP_BPF. Does this sound useful
for some cases?
Thanks,
Song
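The shape Song proposes, restated as a hedged sketch (hypothetical, not
what this series implements): the object pointer goes to every callback,
the way tcp_congestion_ops callbacks receive the socket:
  struct bpf_oom_ops {
          int (*handle_out_of_memory)(struct mem_cgroup *memcg,
                                      struct oom_control *oc);
          void (*handle_cgroup_offline)(struct mem_cgroup *memcg);
          char name[BPF_OOM_NAME_MAX_LEN];
  };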
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 21:18 ` Song Liu
@ 2025-10-29 21:27 ` Tejun Heo
2025-10-29 21:37 ` Song Liu
2025-10-29 21:53 ` Roman Gushchin
1 sibling, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-29 21:27 UTC (permalink / raw)
To: Song Liu
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi
Hello,
On Wed, Oct 29, 2025 at 02:18:00PM -0700, Song Liu wrote:
...
> How about we pass a pointer to mem_cgroup (and/or related pointers)
> to all the callbacks in the struct_ops? AFAICT, in-kernel _ops structures like
> struct file_operations and struct tcp_congestion_ops use this method. And
> we can actually implement struct tcp_congestion_ops in BPF. With the
> struct tcp_congestion_ops model, the struct_ops map and the struct_ops
> link are both shared among multiple instances (sockets).
>
> With this model, the system admin with root access can load a bunch of
> available oom handlers, and users in their container can pick a preferred
> oom handler for the sub cgroup. AFAICT, the users in the container can
> pick the proper OOM handler without CAP_BPF. Does this sound useful
> for some cases?
Doesn't that assume that the programs are more or less stateless? Wouldn't
oom handlers want to track historical information, running averages, which
process expanded the most and so on?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 21:27 ` Tejun Heo
@ 2025-10-29 21:37 ` Song Liu
2025-10-29 21:45 ` Tejun Heo
0 siblings, 1 reply; 83+ messages in thread
From: Song Liu @ 2025-10-29 21:37 UTC (permalink / raw)
To: Tejun Heo
Cc: Song Liu, Roman Gushchin, Andrew Morton, linux-kernel,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, cgroups, bpf, Martin KaFai Lau, Kumar Kartikeya Dwivedi
On Wed, Oct 29, 2025 at 2:27 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 29, 2025 at 02:18:00PM -0700, Song Liu wrote:
> ...
> > How about we pass a pointer to mem_cgroup (and/or related pointers)
> > to all the callbacks in the struct_ops? AFAICT, in-kernel _ops structures like
> > struct file_operations and struct tcp_congestion_ops use this method. And
> > we can actually implement struct tcp_congestion_ops in BPF. With the
> > struct tcp_congestion_ops model, the struct_ops map and the struct_ops
> > link are both shared among multiple instances (sockets).
> >
> > With this model, the system admin with root access can load a bunch of
> > available oom handlers, and users in their container can pick a preferred
> > oom handler for the sub cgroup. AFAICT, the users in the container can
> > pick the proper OOM handler without CAP_BPF. Does this sound useful
> > for some cases?
>
> Doesn't that assume that the programs are more or less stateless? Wouldn't
> oom handlers want to track historical information, running averages, which
> process expanded the most and so on?
Yes, this does mean the program needs to store data in some BPF maps.
Do we have concerns about the performance of BPF maps?
Thanks,
Song
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 21:37 ` Song Liu
@ 2025-10-29 21:45 ` Tejun Heo
2025-10-30 4:32 ` Song Liu
0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-29 21:45 UTC (permalink / raw)
To: Song Liu
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi
Hello,
On Wed, Oct 29, 2025 at 02:37:38PM -0700, Song Liu wrote:
> On Wed, Oct 29, 2025 at 2:27 PM Tejun Heo <tj@kernel.org> wrote:
> > Doesn't that assume that the programs are more or less stateless? Wouldn't
> > oom handlers want to track historical information, running averages, which
> > process expanded the most and so on?
>
> Yes, this does mean the program needs to store data in some BPF maps.
> Do we have concerns about the performance of BPF maps?
It's just a lot more awkward to do and I have a difficult time thinking up
reasons why one would need to do that. If you attach a single struct_ops
instance to one cgroup, you can use global variables, maps, arena to track
what's happening with the cgroup. If you share the same struct_ops across
multiple cgroups, each operation has to scope per-cgroup states. I can see
how that probably makes sense for sockets but cgroups aren't sockets. There
are a lot fewer cgroups and they are organized in a tree.
Thanks.
--
tejun
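A minimal BPF-side sketch of the scoping cost Tejun describes: with one
struct_ops shared across cgroups, handler state must be keyed by cgroup
id rather than kept in globals (map and struct names are hypothetical):
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  struct oom_history {
          __u64 kill_count;
          __u64 last_oom_ns;
  };
  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, __u64);                /* cgroup id */
          __type(value, struct oom_history);
  } per_cgroup_state SEC(".maps");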
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 21:18 ` Song Liu
2025-10-29 21:27 ` Tejun Heo
@ 2025-10-29 21:53 ` Roman Gushchin
2025-10-29 22:43 ` Alexei Starovoitov
1 sibling, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-29 21:53 UTC (permalink / raw)
To: Song Liu
Cc: Tejun Heo, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi
Song Liu <song@kernel.org> writes:
> Hi Tejun,
>
> On Wed, Oct 29, 2025 at 1:36 PM Tejun Heo <tj@kernel.org> wrote:
>>
>> On Wed, Oct 29, 2025 at 01:25:52PM -0700, Roman Gushchin wrote:
>> > > BTW, for sched_ext sub-sched support, I'm just adding cgroup_id to
>> > > struct_ops, which seems to work fine. It'd be nice to align on the same
>> > > approach. What are the benefits of doing this through fd?
>> >
>> > Then you can attach a single struct ops to multiple cgroups (or Idk
>> > sockets or processes or some other objects in the future).
>> > And IMO it's just a more generic solution.
>>
>> I'm not very convinced that sharing a single struct_ops instance across
>> multiple cgroups would be all that useful. If you map this to normal
>> userspace programs, a given struct_ops instance is a package of code and all
>> the global data (maps). ie. it's not like running the same program multiple
>> times against different targets. It's more akin to running a single program
>> instance which can handle multiple targets.
>>
>> Maybe that's useful in some cases, but that program would have to explicitly
>> distinguish the cgroups that it's attached to. I have a hard time imagining
>> use cases where a single struct_ops has to service multiple disjoint cgroups
>> in the hierarchy and it ends up stepping outside of the usual operation
>> model of cgroups - commonality being expressed through the hierarchical
>> structure.
>
> How about we pass a pointer to mem_cgroup (and/or related pointers)
> to all the callbacks in the struct_ops? AFAICT, in-kernel _ops structures like
> struct file_operations and struct tcp_congestion_ops use this method. And
> we can actually implement struct tcp_congestion_ops in BPF. With the
> struct tcp_congestion_ops model, the struct_ops map and the struct_ops
> link are both shared among multiple instances (sockets).
+1 to this.
I agree it might be debatable when it comes to cgroups, but when it comes to
sockets or similar objects, having a separate struct ops per object
isn't really an option.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 21:53 ` Roman Gushchin
@ 2025-10-29 22:43 ` Alexei Starovoitov
2025-10-29 22:53 ` Tejun Heo
0 siblings, 1 reply; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-29 22:43 UTC (permalink / raw)
To: Roman Gushchin
Cc: Song Liu, Tejun Heo, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi
On Wed, Oct 29, 2025 at 2:53 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Song Liu <song@kernel.org> writes:
>
> > Hi Tejun,
> >
> > On Wed, Oct 29, 2025 at 1:36 PM Tejun Heo <tj@kernel.org> wrote:
> >>
> >> On Wed, Oct 29, 2025 at 01:25:52PM -0700, Roman Gushchin wrote:
> >> > > BTW, for sched_ext sub-sched support, I'm just adding cgroup_id to
> >> > > struct_ops, which seems to work fine. It'd be nice to align on the same
> >> > > approach. What are the benefits of doing this through fd?
> >> >
> >> > Then you can attach a single struct ops to multiple cgroups (or Idk
> >> > sockets or processes or some other objects in the future).
> >> > And IMO it's just a more generic solution.
> >>
> >> I'm not very convinced that sharing a single struct_ops instance across
> >> multiple cgroups would be all that useful. If you map this to normal
> >> userspace programs, a given struct_ops instance is a package of code and all
> >> the global data (maps). ie. it's not like running the same program multiple
> >> times against different targets. It's more akin to running a single program
> >> instance which can handle multiple targets.
> >>
> >> Maybe that's useful in some cases, but that program would have to explicitly
> >> distinguish the cgroups that it's attached to. I have a hard time imagining
> >> use cases where a single struct_ops has to service multiple disjoint cgroups
> >> in the hierarchy and it ends up stepping outside of the usual operation
> >> model of cgroups - commonality being expressed through the hierarchical
> >> structure.
> >
> > How about we pass a pointer to mem_cgroup (and/or related pointers)
> > to all the callbacks in the struct_ops? AFAICT, in-kernel _ops structures like
> > struct file_operations and struct tcp_congestion_ops use this method. And
> > we can actually implement struct tcp_congestion_ops in BPF. With the
> > struct tcp_congestion_ops model, the struct_ops map and the struct_ops
> > link are both shared among multiple instances (sockets).
>
> +1 to this.
> I agree it might be debatable when it comes to cgroups, but when it comes to
> sockets or similar objects, having a separate struct ops per object
> isn't really an option.
I think the general bpf philosophy is that load and attach are two
separate steps. For struct-ops it's almost there, but not quite.
struct-ops shouldn't be an exception.
The bpf infra should be able to load a set of progs (aka struct-ops)
and attach it with a link to different entities. Like cgroups.
I think sched-ext should do that too. Even if there is no use case
today for the same sched-ext in two different cgroups.
For bpf-oom I can imagine a use case where container management sw
would pre-load struct-ops and then attach it later to different
containers depending on container configs. These containers might
be peers in the hierarchy, but attaching to their parent won't be
equivalent, since other peers might not need that bpf-oom management.
The "workaround" could be to create another cgroup layer
between parent and container, but that becomes messy, since now
there is a cgroup only for the purpose of attaching bpf-oom to it.
Whether struct-ops link attach is using cgroup_fd or cgroup_id
is debatable. I think FD is cleaner.
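A hedged sketch of the pre-load/attach-later flow Alexei describes, in
libbpf terms; the object and map names are made up, and attach_oom_ops()
is the userspace sketch shown earlier in this thread:
  #include <bpf/libbpf.h>
  static int manager_setup(void)
  {
          struct bpf_object *obj = bpf_object__open_file("bpf_oom.bpf.o", NULL);
          int map_fd;
          if (!obj || bpf_object__load(obj))
                  return -1;
          map_fd = bpf_map__fd(bpf_object__find_map_by_name(obj, "oom_ops"));
          /* later, per container that wants bpf-oom management,
           * reuse the same loaded struct_ops map: */
          if (attach_oom_ops(map_fd, "/sys/fs/cgroup/container-a") < 0 ||
              attach_oom_ops(map_fd, "/sys/fs/cgroup/container-b") < 0)
                  return -1;
          return 0;
  }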
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 22:43 ` Alexei Starovoitov
@ 2025-10-29 22:53 ` Tejun Heo
2025-10-29 23:53 ` Alexei Starovoitov
0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-29 22:53 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Roman Gushchin, Song Liu, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi
Hello,
On Wed, Oct 29, 2025 at 03:43:39PM -0700, Alexei Starovoitov wrote:
...
> I think the general bpf philosophy is that load and attach are two
> separate steps. For struct-ops it's almost there, but not quite.
> struct-ops shouldn't be an exception.
> The bpf infra should be able to load a set of progs (aka struct-ops)
> and attach it with a link to different entities. Like cgroups.
> I think sched-ext should do that too. Even if there is no use case
> today for the same sched-ext in two different cgroups.
I'm not sure it's just that there's no use case.
- How would recursion work with private stacks? Aren't those attached to
each BPF program?
- Wouldn't that also complicate attributing kfunc calls to the handler
instance? If there is one struct_ops per cgroup, the oom kill kfunc can
look that up and then verify that the struct_ops has authority over the
target process. Multiple attachments can work too but that'd require
iterating all attachments, right?
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 22:53 ` Tejun Heo
@ 2025-10-29 23:53 ` Alexei Starovoitov
2025-10-30 0:03 ` Tejun Heo
0 siblings, 1 reply; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-29 23:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Roman Gushchin, Song Liu, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi
On Wed, Oct 29, 2025 at 3:53 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 29, 2025 at 03:43:39PM -0700, Alexei Starovoitov wrote:
> ...
> > I think the general bpf philosophy is that load and attach are two
> > separate steps. For struct-ops it's almost there, but not quite.
> > struct-ops shouldn't be an exception.
> > The bpf infra should be able to load a set of progs (aka struct-ops)
> > and attach it with a link to different entities. Like cgroups.
> > I think sched-ext should do that too. Even if there is no use case
> > today for the same sched-ext in two different cgroups.
>
> I'm not sure it's just that there's no use case.
I think there will be a use case for sched-ext as well,
just the current way the scheds are written is too specific.
There is cgroup local storage, so scheds can certainly
store whatever state there.
Potentially we can improve UX further by utilizing __thread on bpf.c
side in some way.
> - How would recursion work with private stacks? Aren't those attached to
> each BPF program?
yes. private stack is per prog, but why does it matter?
I'm not suggesting that the same prog be attached at different
levels of the cgroup hierarchy, because such a configuration
will indeed trigger recursion prevention logic (with or without private
stack).
But having one logical sched-ext prog set managing tasks
in container A and in container B, where A and B are different cgroups,
makes sense as a use case to me.
DSQs can be cgroup scoped too.
> - Wouldn't that also complicate attributing kfunc calls to the handler
> instance?
you mean the whole prog_assoc stuff?
That's orthogonal. Tracing progs are global, so there is
no perfect place to associate them with. The struct-ops map
is the best we can do today, but ideally it's the run_ctx
that should be per-attachment. Like a cookie.
> If there is one struct_ops per cgroup, the oom kill kfunc can
> look that up and then verify that the struct_ops has authority over the
> target process. Multiple attachments can work too but that'd require
> iterating all attachments, right?
Are you talking about the bpf_oom_kill_process() kfunc from this patch set?
I don't think it needs any changes. The oom context is passed into the prog
and passed along to the kfunc. The cgroup origin doesn't matter.
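A sketch of the cgroup local storage Alexei mentions above;
BPF_MAP_TYPE_CGRP_STORAGE and bpf_cgrp_storage_get() are existing BPF
API, while the state struct and update function are hypothetical:
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  struct sched_state {
          __u64 total_runtime;
  };
  struct {
          __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC);
          __type(key, int);
          __type(value, struct sched_state);
  } cgrp_state SEC(".maps");
  /* called from an ops callback with a trusted struct cgroup pointer */
  static int account_runtime(struct cgroup *cgrp, __u64 delta)
  {
          struct sched_state *st;
          st = bpf_cgrp_storage_get(&cgrp_state, cgrp, NULL,
                                    BPF_LOCAL_STORAGE_GET_F_CREATE);
          if (!st)
                  return -1;
          st->total_runtime += delta;
          return 0;
  }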
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 23:53 ` Alexei Starovoitov
@ 2025-10-30 0:03 ` Tejun Heo
2025-10-30 0:16 ` Alexei Starovoitov
0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-30 0:03 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Roman Gushchin, Song Liu, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi
Hello,
On Wed, Oct 29, 2025 at 04:53:07PM -0700, Alexei Starovoitov wrote:
...
> > - How would recursion work with private stacks? Aren't those attached to
> > each BPF program?
>
> yes. private stack is per prog, but why does it matter?
> I'm not suggesting that the same prog to be attached at different
> levels of the cgroup hierarchy, because such configuration
> will indeed trigger recursion prevention logic (with or without private
> stack).
> But having one logical sched-ext prog set to manage tasks
> in container A and in container B makes sense as a use case to me
> where A and B are different cgroups.
> DSQs can be cgroup scoped too.
I don't know. Maybe, but this is kinda specific and I don't see how this
would be useful in a practical sense. I have nothing against using the
mechanism. I can still enforce the same rules from the scx side. It just looks
unnecessarily over-designed. Maybe consistency with other BPF progs
justifies it.
> > If there is one struct_ops per cgroup, the oom kill kfunc can
> > look that up and then verify that the struct_ops has authority over the
> > target process. Multiple attachments can work too but that'd require
> > iterating all attachments, right?
>
> Are you talking about bpf_oom_kill_process() kfunc from these patch set?
> I don't think it needs any changes. oom context is passed into prog
> and passed along to kfunc. Doesn't matter the cgroup origin.
Oh, if there are other mechanisms to enforce boundaries, it's not a problem,
but I can almost guarantee that, as the framework grows, there will be a need for
kfuncs to identify and verify the callers and handlers communicating with
each other along the hierarchy, requiring recursive calls.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 0:03 ` Tejun Heo
@ 2025-10-30 0:16 ` Alexei Starovoitov
2025-10-30 6:33 ` Yafang Shao
0 siblings, 1 reply; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-30 0:16 UTC (permalink / raw)
To: Tejun Heo
Cc: Roman Gushchin, Song Liu, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi
On Wed, Oct 29, 2025 at 5:03 PM Tejun Heo <tj@kernel.org> wrote:
>
> Oh, if there are other mechanisms to enforce boundaries, it's not a problem,
> but I can almost guarantee that, as the framework grows, there will be a need for
> kfuncs to identify and verify the callers and handlers communicating with
> each other along the hierarchy, requiring recursive calls.
tbh I think it's a combination of sched_ext_ops and bpf infra problem.
All of the scx ops are missing "this" pointer which would have
been there if it was a C++ class.
And "this" should be pointing to an instance of class.
If sched-ext progs are attached to different cgroups, then
every attachment would have been a different instance and
different "this".
Then all kfuncs would effectively be declared as helper
methods within a class. In this case within "struct sched_ext_ops"
as functions that ops callback can call but they will
also have implicit "this" that points back to a particular instance.
Special aux__prog and prog_assoc are not exactly pretty
workarounds for lack of "this".
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
` (2 preceding siblings ...)
2025-10-28 21:33 ` Song Liu
@ 2025-10-30 0:20 ` Martin KaFai Lau
2025-10-30 5:57 ` Yafang Shao
2025-10-31 9:02 ` Michal Hocko
5 siblings, 0 replies; 83+ messages in thread
From: Martin KaFai Lau @ 2025-10-30 0:20 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Andrew Morton
On 10/27/25 4:17 PM, Roman Gushchin wrote:
> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
> new file mode 100644
> index 000000000000..18c32a5a068b
> --- /dev/null
> +++ b/include/linux/bpf_oom.h
> @@ -0,0 +1,74 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_OOM_H
> +#define __BPF_OOM_H
> +
> +struct oom_control;
> +
> +#define BPF_OOM_NAME_MAX_LEN 64
> +
> +struct bpf_oom_ctx {
> + /*
> + * If bpf_oom_ops is attached to a cgroup, id of this cgroup.
> + * 0 otherwise.
> + */
> + u64 cgroup_id;
> +};
A function argument can be added to the ops (e.g. handle_out_of_memory)
in the future. afaict, I don't see how it will disrupt the existing bpf prog
as long as it does not change the ordering of the existing arguments.
If it goes down the 'struct bpf_oom_ctx' abstraction path, all future
new members of 'struct bpf_oom_ctx' will need to be initialized even
if they may not be useful for most of the existing ops.
For the networking use case, I am quite sure the wrapping is unnecessary. I
will leave that as food for thought here.
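The two callback shapes being contrasted, side by side (both lines are
illustrative only):
  /* (a) ctx-wrapping, as in this series; new fields go into
   *     struct bpf_oom_ctx and must always be initialized: */
  int (*handle_out_of_memory)(struct bpf_oom_ctx *ctx, struct oom_control *oc);
  /* (b) plain arguments, per Martin; a new trailing argument can be
   *     appended later without disturbing existing progs: */
  int (*handle_out_of_memory)(struct oom_control *oc, u64 cgroup_id);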
> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
link could be NULL here. "return -EOPNOTSUPP" for the legacy kdata reg
path that does not use the link api.
In the future, we should enforce that a link must be used in
bpf_struct_ops.c, except for a few of the existing struct_ops kernel users.
> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
> + struct bpf_oom_ops *bpf_oom_ops = kdata;
> + struct mem_cgroup *memcg = NULL;
> + int err = 0;
> +
> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
> + /* Attach to a memory cgroup? */
> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
> + if (IS_ERR_OR_NULL(memcg))
> + return PTR_ERR(memcg);
> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
> + } else {
> + /* System-wide OOM handler */
> + bpf_oom_ops_ptr = &system_bpf_oom;
> + }
> +
> + /* Another struct ops attached? */
> + if (READ_ONCE(*bpf_oom_ops_ptr)) {
> + err = -EBUSY;
> + goto exit;
> + }
> +
> + /* Expose bpf_oom_ops structure */
> + WRITE_ONCE(*bpf_oom_ops_ptr, bpf_oom_ops);
> +exit:
> + mem_cgroup_put(memcg);
> + return err;
> +}
> +
> +static void bpf_oom_ops_unreg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
> + struct bpf_oom_ops *bpf_oom_ops = kdata;
> + struct mem_cgroup *memcg = NULL;
> +
> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
> + /* Detach from a memory cgroup? */
> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
> + if (IS_ERR_OR_NULL(memcg))
> + goto exit;
> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
> + } else {
> + /* System-wide OOM handler */
> + bpf_oom_ops_ptr = &system_bpf_oom;
> + }
> +
> + /* Hide bpf_oom_ops from new callers */
> + if (!WARN_ON(READ_ONCE(*bpf_oom_ops_ptr) != bpf_oom_ops))
> + WRITE_ONCE(*bpf_oom_ops_ptr, NULL);
> +
> + mem_cgroup_put(memcg);
> +
> +exit:
> + /* Release bpf_oom_ops after a srcu grace period */
> + synchronize_srcu(&bpf_oom_srcu);
> +}
> +
> +#ifdef CONFIG_MEMCG
> +void bpf_oom_memcg_offline(struct mem_cgroup *memcg)
Is this called when the memcg/cgroup is going away? I think it should also call
bpf_struct_ops_map_link_detach (through link->ops->detach [1]). It will
notify user space, which may be polling on the link fd. This will also call
the bpf_oom_ops_unreg above. (A userspace sketch of the polling side
follows the quoted hunk below.)
[1]
https://lore.kernel.org/all/20240530065946.979330-7-thinker.li@gmail.com/
> +{
> + struct bpf_oom_ops *bpf_oom_ops;
> + struct bpf_oom_ctx exec_ctx;
> + u64 cgrp_id;
> + int idx;
> +
> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
> + idx = srcu_read_lock(&bpf_oom_srcu);
> +
> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> + WRITE_ONCE(memcg->bpf_oom, NULL);
> +
> + if (bpf_oom_ops && bpf_oom_ops->handle_cgroup_offline) {
> + cgrp_id = cgroup_id(memcg->css.cgroup);
> + exec_ctx.cgroup_id = cgrp_id;
> + bpf_oom_ops->handle_cgroup_offline(&exec_ctx, cgrp_id);
> + }
> +
> + srcu_read_unlock(&bpf_oom_srcu, idx);
> +}
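The userspace sketch referenced above: a daemon holding the struct_ops
link fd can poll it and see (E)POLLHUP once the kernel detaches the
link, e.g. because the cgroup went away (assuming the detach
notification behavior referenced in [1]):
  #include <poll.h>
  /* block until the kernel detaches the struct_ops link */
  static void wait_for_detach(int link_fd)
  {
          struct pollfd pfd = { .fd = link_fd, .events = POLLHUP };
          if (poll(&pfd, 1, -1) == 1 && (pfd.revents & POLLHUP)) {
                  /* link detached; clean up or re-attach elsewhere */
          }
  }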
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-27 23:17 ` [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups Roman Gushchin
` (3 preceding siblings ...)
2025-10-29 21:04 ` Song Liu
@ 2025-10-30 0:43 ` Martin KaFai Lau
4 siblings, 0 replies; 83+ messages in thread
From: Martin KaFai Lau @ 2025-10-30 0:43 UTC (permalink / raw)
To: Roman Gushchin
Cc: linux-kernel, Alexei Starovoitov, Suren Baghdasaryan,
Michal Hocko, Shakeel Butt, Johannes Weiner, Andrii Nakryiko,
JP Kobryn, linux-mm, cgroups, bpf, Martin KaFai Lau, Song Liu,
Kumar Kartikeya Dwivedi, Tejun Heo, Andrew Morton
On 10/27/25 4:17 PM, Roman Gushchin wrote:
> When a struct ops is being attached and a bpf link is created,
> allow passing a cgroup fd using bpf attr, so that the struct ops
> can be attached to a cgroup instead of globally.
>
> Attached struct ops doesn't hold a reference to the cgroup,
> only preserves cgroup id.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
> include/linux/bpf.h | 1 +
> kernel/bpf/bpf_struct_ops.c | 13 +++++++++++++
> 2 files changed, 14 insertions(+)
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index eae907218188..7205b813e25f 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -1849,6 +1849,7 @@ struct bpf_struct_ops_link {
> struct bpf_link link;
> struct bpf_map __rcu *map;
> wait_queue_head_t wait_hup;
> + u64 cgroup_id;
> };
>
> struct bpf_link_primer {
> diff --git a/kernel/bpf/bpf_struct_ops.c b/kernel/bpf/bpf_struct_ops.c
> index 45cc5ee19dc2..58664779a2b6 100644
> --- a/kernel/bpf/bpf_struct_ops.c
> +++ b/kernel/bpf/bpf_struct_ops.c
> @@ -13,6 +13,7 @@
> #include <linux/btf_ids.h>
> #include <linux/rcupdate_wait.h>
> #include <linux/poll.h>
> +#include <linux/cgroup.h>
>
> struct bpf_struct_ops_value {
> struct bpf_struct_ops_common_value common;
> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> }
> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
> attr->link_create.attach_type);
> +#ifdef CONFIG_CGROUPS
> + if (attr->link_create.cgroup.relative_fd) {
> + struct cgroup *cgrp;
> +
> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
> + if (IS_ERR(cgrp))
> + return PTR_ERR(cgrp);
> +
> + link->cgroup_id = cgroup_id(cgrp);
Not sure whether storing the cgroup_id or storing the memcg/cgroup pointer is
better here. Regardless, link->cgroup_id should be cleared in
bpf_struct_ops_map_link_detach(). The cgroup_id is probably useful to
bpf_struct_ops_map_link_show_fdinfo(); a sketch follows the quoted hunk.
> + cgroup_put(cgrp);
> + }
> +#endif /* CONFIG_CGROUPS */
>
> err = bpf_link_prime(&link->link, &link_primer);
> if (err)
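A hypothetical sketch of the fdinfo suggestion above; the function name
comes from Martin's comment, the body is invented for illustration:
  static void bpf_struct_ops_map_link_show_fdinfo(const struct bpf_link *link,
                                                  struct seq_file *seq)
  {
          struct bpf_struct_ops_link *st_link =
                  container_of(link, struct bpf_struct_ops_link, link);
          /* existing fdinfo output elided */
          seq_printf(seq, "cgroup_id:\t%llu\n", st_link->cgroup_id);
  }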
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 21:45 ` Tejun Heo
@ 2025-10-30 4:32 ` Song Liu
2025-10-30 16:13 ` Tejun Heo
0 siblings, 1 reply; 83+ messages in thread
From: Song Liu @ 2025-10-30 4:32 UTC (permalink / raw)
To: Tejun Heo
Cc: Song Liu, Roman Gushchin, Andrew Morton, linux-kernel,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, cgroups, bpf, Martin KaFai Lau, Kumar Kartikeya Dwivedi
On Wed, Oct 29, 2025 at 2:45 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 29, 2025 at 02:37:38PM -0700, Song Liu wrote:
> > On Wed, Oct 29, 2025 at 2:27 PM Tejun Heo <tj@kernel.org> wrote:
> > > Doesn't that assume that the programs are more or less stateless? Wouldn't
> > > oom handlers want to track historical information, running averages, which
> > > process expanded the most and so on?
> >
> > Yes, this does mean the program needs to store data in some BPF maps.
> > Do we have concern with the performance of BPF maps?
>
> It's just a lot more awkward to do and I have a difficult time thinking up
> reasons why one would need to do that. If you attach a single struct_ops
> instance to one cgroup, you can use global variables, maps, arena to track
> what's happening with the cgroup. If you share the same struct_ops across
> multiple cgroups, each operation has to scope per-cgroup states. I can see
> how that probably makes sense for sockets but cgroups aren't sockets. There
> are a lot fewer cgroups and they are organized in a tree.
If the use case is to attach a single struct_ops to a single cgroup, the author
of that BPF program can always ignore the memcg parameter and use
global variables, etc. We waste a register in BPF ISA to save the pointer to
memcg, but the JIT may recover that in native instructions.
OTOH, starting without a memcg parameter, it will be impossible to allow
attaching the same struct_ops to different cgroups. I still think it is a valid
use case that the sysadmin loads a set of OOM handlers for users in the
containers to choose from.
Also, a per cgroup oom handler may need to access the memcg information
anyway. Without a dedicated memcg argument, the user needs to fetch it
somewhere else.
Thanks,
Song
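A sketch of that fallback: the handler reads the memcg off the
oom_control it already receives (assuming the v2 callback signature;
victim selection and error handling elided):
  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>
  SEC("struct_ops.s/handle_out_of_memory")
  int BPF_PROG(my_oom_handler, struct bpf_oom_ctx *ctx, struct oom_control *oc)
  {
          struct mem_cgroup *memcg = oc->memcg; /* NULL for system-wide OOM */
          if (!memcg)
                  return 0;
          /* ... inspect the memcg, free something, report success ... */
          return 0;
  }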
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
` (3 preceding siblings ...)
2025-10-30 0:20 ` Martin KaFai Lau
@ 2025-10-30 5:57 ` Yafang Shao
2025-10-30 14:26 ` Roman Gushchin
2025-10-31 9:02 ` Michal Hocko
5 siblings, 1 reply; 83+ messages in thread
From: Yafang Shao @ 2025-10-30 5:57 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue, Oct 28, 2025 at 7:22 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Introduce a bpf struct ops for implementing custom OOM handling
> policies.
>
> It's possible to load one bpf_oom_ops for the system and one
> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> cgroup tree is traversed from the OOM'ing memcg up to the root and
> corresponding BPF OOM handlers are executed until some memory is
> freed. If no memory is freed, the kernel OOM killer is invoked.
>
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which is expected to return 1 if it was able to free some memory and 0
> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory. If both are set, OOM is
> considered handled, otherwise the next OOM handler in the chain
> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
> killer) is executed.
>
> The bpf_handle_out_of_memory() callback program is sleepable to enable
> using iterators, e.g. cgroup iterators. The callback receives struct
> oom_control as an argument, so it can determine the scope of the OOM
> event: if this is a memcg-wide or system-wide OOM.
>
> The callback is executed just before the kernel victim task selection
> algorithm, so all heuristics and sysctls like panic on oom and
> sysctl_oom_kill_allocating_task are respected.
>
> BPF OOM struct ops provides the handle_cgroup_offline() callback
> which is good for releasing struct ops if the corresponding cgroup
> is gone.
>
> The struct ops also has the name field, which allows defining a
> custom name for the implemented policy. It's printed in the OOM report
> in the oom_policy=<policy> format. "default" is printed if bpf is not
> used or policy name is not specified.
>
> [ 112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
> oom_policy=bpf_test_policy
> [ 112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
> [ 112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
> [ 112.698167] Call Trace:
> [ 112.698177] <TASK>
> [ 112.698182] dump_stack_lvl+0x4d/0x70
> [ 112.698192] dump_header+0x59/0x1c6
> [ 112.698199] oom_kill_process.cold+0x8/0xef
> [ 112.698206] bpf_oom_kill_process+0x59/0xb0
> [ 112.698216] bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
> [ 112.698229] bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
> [ 112.698236] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 112.698240] bpf_handle_oom+0x11a/0x1e0
> [ 112.698250] out_of_memory+0xab/0x5c0
> [ 112.698258] mem_cgroup_out_of_memory+0xbc/0x110
> [ 112.698274] try_charge_memcg+0x4b5/0x7e0
> [ 112.698288] charge_memcg+0x2f/0xc0
> [ 112.698293] __mem_cgroup_charge+0x30/0xc0
> [ 112.698299] do_anonymous_page+0x40f/0xa50
> [ 112.698311] __handle_mm_fault+0xbba/0x1140
> [ 112.698317] ? srso_alias_return_thunk+0x5/0xfbef5
> [ 112.698335] handle_mm_fault+0xe6/0x370
> [ 112.698343] do_user_addr_fault+0x211/0x6a0
> [ 112.698354] exc_page_fault+0x75/0x1d0
> [ 112.698363] asm_exc_page_fault+0x26/0x30
> [ 112.698366] RIP: 0033:0x7fa97236db00
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
> include/linux/bpf_oom.h | 74 ++++++++++
> include/linux/memcontrol.h | 5 +
> include/linux/oom.h | 8 ++
> mm/Makefile | 3 +
> mm/bpf_oom.c | 272 +++++++++++++++++++++++++++++++++++++
> mm/memcontrol.c | 2 +
> mm/oom_kill.c | 22 ++-
> 7 files changed, 384 insertions(+), 2 deletions(-)
> create mode 100644 include/linux/bpf_oom.h
> create mode 100644 mm/bpf_oom.c
>
> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
> new file mode 100644
> index 000000000000..18c32a5a068b
> --- /dev/null
> +++ b/include/linux/bpf_oom.h
> @@ -0,0 +1,74 @@
> +/* SPDX-License-Identifier: GPL-2.0+ */
> +
> +#ifndef __BPF_OOM_H
> +#define __BPF_OOM_H
> +
> +struct oom_control;
> +
> +#define BPF_OOM_NAME_MAX_LEN 64
> +
> +struct bpf_oom_ctx {
> + /*
> + * If bpf_oom_ops is attached to a cgroup, id of this cgroup.
> + * 0 otherwise.
> + */
> + u64 cgroup_id;
> +};
> +
> +struct bpf_oom_ops {
> + /**
> + * @handle_out_of_memory: Out of memory bpf handler, called before
> + * the in-kernel OOM killer.
> + * @ctx: Execution context
> + * @oc: OOM control structure
> + *
> + * Should return 1 if some memory was freed up, otherwise
> + * the in-kernel OOM killer is invoked.
> + */
> + int (*handle_out_of_memory)(struct bpf_oom_ctx *ctx, struct oom_control *oc);
> +
> + /**
> + * @handle_cgroup_offline: Cgroup offline callback
> + * @ctx: Execution context
> + * @cgroup_id: Id of deleted cgroup
> + *
> + * Called if the cgroup with the attached bpf_oom_ops is deleted.
> + */
> + void (*handle_cgroup_offline)(struct bpf_oom_ctx *ctx, u64 cgroup_id);
> +
> + /**
> + * @name: BPF OOM policy name
> + */
> + char name[BPF_OOM_NAME_MAX_LEN];
> +};
> +
> +#ifdef CONFIG_BPF_SYSCALL
> +/**
> + * @bpf_handle_oom: handle out of memory condition using bpf
> + * @oc: OOM control structure
> + *
> + * Returns true if some memory was freed.
> + */
> +bool bpf_handle_oom(struct oom_control *oc);
> +
> +
> +/**
> + * @bpf_oom_memcg_offline: handle memcg offlining
> + * @memcg: Memory cgroup is offlined
> + *
> + * When a memory cgroup is about to be deleted and there is an
> + * attached BPF OOM structure, it has to be detached.
> + */
> +void bpf_oom_memcg_offline(struct mem_cgroup *memcg);
> +
> +#else /* CONFIG_BPF_SYSCALL */
> +static inline bool bpf_handle_oom(struct oom_control *oc)
> +{
> + return false;
> +}
> +
> +static inline void bpf_oom_memcg_offline(struct mem_cgroup *memcg) {}
> +
> +#endif /* CONFIG_BPF_SYSCALL */
> +
> +#endif /* __BPF_OOM_H */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 50d851ff3f27..39a6c7c8735b 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -29,6 +29,7 @@ struct obj_cgroup;
> struct page;
> struct mm_struct;
> struct kmem_cache;
> +struct bpf_oom_ops;
>
> /* Cgroup-specific page state, on top of universal node page state */
> enum memcg_stat_item {
> @@ -226,6 +227,10 @@ struct mem_cgroup {
> */
> bool oom_group;
>
> +#ifdef CONFIG_BPF_SYSCALL
> + struct bpf_oom_ops *bpf_oom;
> +#endif
> +
> int swappiness;
>
> /* memory.events and memory.events.local */
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index 7b02bc1d0a7e..721087952d04 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -51,6 +51,14 @@ struct oom_control {
>
> /* Used to print the constraint info. */
> enum oom_constraint constraint;
> +
> +#ifdef CONFIG_BPF_SYSCALL
> + /* Used by the bpf oom implementation to mark the forward progress */
> + bool bpf_memory_freed;
> +
> + /* Policy name */
> + const char *bpf_policy_name;
> +#endif
> };
>
> extern struct mutex oom_lock;
> diff --git a/mm/Makefile b/mm/Makefile
> index 21abb3353550..051e88c699af 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> ifdef CONFIG_SWAP
> obj-$(CONFIG_MEMCG) += swap_cgroup.o
> endif
> +ifdef CONFIG_BPF_SYSCALL
> +obj-y += bpf_oom.o
> +endif
> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
> obj-$(CONFIG_GUP_TEST) += gup_test.o
> obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
> new file mode 100644
> index 000000000000..c4d09ed9d541
> --- /dev/null
> +++ b/mm/bpf_oom.c
> @@ -0,0 +1,272 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * BPF-driven OOM killer customization
> + *
> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
> + */
> +
> +#include <linux/bpf.h>
> +#include <linux/oom.h>
> +#include <linux/bpf_oom.h>
> +#include <linux/srcu.h>
> +#include <linux/cgroup.h>
> +#include <linux/memcontrol.h>
> +
> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
> +static struct bpf_oom_ops *system_bpf_oom;
> +
> +#ifdef CONFIG_MEMCG
> +static u64 memcg_cgroup_id(struct mem_cgroup *memcg)
> +{
> + return cgroup_id(memcg->css.cgroup);
> +}
> +
> +static struct bpf_oom_ops **bpf_oom_memcg_ops_ptr(struct mem_cgroup *memcg)
> +{
> + return &memcg->bpf_oom;
> +}
> +#else /* CONFIG_MEMCG */
> +static u64 memcg_cgroup_id(struct mem_cgroup *memcg)
> +{
> + return 0;
> +}
> +static struct bpf_oom_ops **bpf_oom_memcg_ops_ptr(struct mem_cgroup *memcg)
> +{
> + return NULL;
> +}
> +#endif
> +
> +static int bpf_ops_handle_oom(struct bpf_oom_ops *bpf_oom_ops,
> + struct mem_cgroup *memcg,
> + struct oom_control *oc)
> +{
> + struct bpf_oom_ctx exec_ctx;
> + int ret;
> +
> + if (IS_ENABLED(CONFIG_MEMCG) && memcg)
> + exec_ctx.cgroup_id = memcg_cgroup_id(memcg);
> + else
> + exec_ctx.cgroup_id = 0;
> +
> + oc->bpf_policy_name = &bpf_oom_ops->name[0];
> + oc->bpf_memory_freed = false;
> + ret = bpf_oom_ops->handle_out_of_memory(&exec_ctx, oc);
> + oc->bpf_policy_name = NULL;
> +
> + return ret;
> +}
> +
> +bool bpf_handle_oom(struct oom_control *oc)
> +{
> + struct bpf_oom_ops *bpf_oom_ops = NULL;
> + struct mem_cgroup __maybe_unused *memcg;
> + int idx, ret = 0;
> +
> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
> + idx = srcu_read_lock(&bpf_oom_srcu);
> +
> +#ifdef CONFIG_MEMCG
> + /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
> + for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
> + if (!bpf_oom_ops)
> + continue;
> +
> + /* Call BPF OOM handler */
> + ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
> + if (ret && oc->bpf_memory_freed)
> + goto exit;
> + }
> +#endif /* CONFIG_MEMCG */
> +
> + /*
> + * System-wide OOM or per-memcg BPF OOM handler wasn't successful?
> + * Try system_bpf_oom.
> + */
> + bpf_oom_ops = READ_ONCE(system_bpf_oom);
> + if (!bpf_oom_ops)
> + goto exit;
> +
> + /* Call BPF OOM handler */
> + ret = bpf_ops_handle_oom(bpf_oom_ops, NULL, oc);
> +exit:
> + srcu_read_unlock(&bpf_oom_srcu, idx);
> + return ret && oc->bpf_memory_freed;
> +}
> +
> +static int __handle_out_of_memory(struct bpf_oom_ctx *exec_ctx,
> + struct oom_control *oc)
> +{
> + return 0;
> +}
> +
> +static void __handle_cgroup_offline(struct bpf_oom_ctx *exec_ctx, u64 cgroup_id)
> +{
> +}
> +
> +static struct bpf_oom_ops __bpf_oom_ops = {
> + .handle_out_of_memory = __handle_out_of_memory,
> + .handle_cgroup_offline = __handle_cgroup_offline,
> +};
> +
> +static const struct bpf_func_proto *
> +bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> + return tracing_prog_func_proto(func_id, prog);
> +}
> +
> +static bool bpf_oom_ops_is_valid_access(int off, int size,
> + enum bpf_access_type type,
> + const struct bpf_prog *prog,
> + struct bpf_insn_access_aux *info)
> +{
> + return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
> + .get_func_proto = bpf_oom_func_proto,
> + .is_valid_access = bpf_oom_ops_is_valid_access,
> +};
> +
> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
> +{
> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
> + struct bpf_oom_ops *bpf_oom_ops = kdata;
> + struct mem_cgroup *memcg = NULL;
> + int err = 0;
> +
> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
> + /* Attach to a memory cgroup? */
> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
> + if (IS_ERR_OR_NULL(memcg))
> + return PTR_ERR(memcg);
> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
> + } else {
> + /* System-wide OOM handler */
> + bpf_oom_ops_ptr = &system_bpf_oom;
> + }
> +
> + /* Another struct ops attached? */
> + if (READ_ONCE(*bpf_oom_ops_ptr)) {
> + err = -EBUSY;
> + goto exit;
> + }
> +
> + /* Expose bpf_oom_ops structure */
> + WRITE_ONCE(*bpf_oom_ops_ptr, bpf_oom_ops);
The mechanism for propagating this pointer to child cgroups isn't
clear. Would an explicit installation in every cgroup be required?
This approach seems impractical for production environments, where
cgroups are often created dynamically.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 0:16 ` Alexei Starovoitov
@ 2025-10-30 6:33 ` Yafang Shao
0 siblings, 0 replies; 83+ messages in thread
From: Yafang Shao @ 2025-10-30 6:33 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Tejun Heo, Roman Gushchin, Song Liu, Andrew Morton, LKML,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi
On Thu, Oct 30, 2025 at 8:16 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Oct 29, 2025 at 5:03 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Oh, if there are other mechanisms to enforce boundaries, it's not a problem,
> > but I can almost guarantee as the framework grows, there will be needs for
> > kfuncs to identify and verify the callers and handlers communicating with
> > each other along the hierarchy requiring recursive calls.
>
> tbh I think it's a combination of sched_ext_ops and bpf infra problem.
> All of the scx ops are missing "this" pointer which would have
> been there if it was a C++ class.
> And "this" should be pointing to an instance of class.
> If sched-ext progs are attached to different cgroups, then
> every attachment would have been a different instance and
> different "this".
> Then all kfuncs would effectively be declared as helper
> methods within a class. In this case within "struct sched_ext_ops"
> as functions that ops callback can call but they will
> also have implicit "this" that points back to a particular instance.
>
> Special aux__prog and prog_assoc are not exactly pretty
> workarounds for lack of "this".
>
I also share the concern that supporting the attachment of a single
struct-ops to multiple cgroups appears over-engineered for the current
needs. Given that we do not anticipate a large number of cgroup
attachments in real-world use, implementing such a generalized
mechanism now seems premature. We can always introduce this
functionality later in a backward-compatible manner if concrete use
cases emerge.
That said, if we still decide to move forward with this approach, I
would suggest merging this patch as a standalone change. Doing so
would allow my BPF-THP series to build upon the same mechanism.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-30 5:57 ` Yafang Shao
@ 2025-10-30 14:26 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-30 14:26 UTC (permalink / raw)
To: Yafang Shao
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Yafang Shao <laoar.shao@gmail.com> writes:
> On Tue, Oct 28, 2025 at 7:22 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>>
>> Introduce a bpf struct ops for implementing custom OOM handling
>> policies.
>>
>> It's possible to load one bpf_oom_ops for the system and one
>> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> corresponding BPF OOM handlers are executed until some memory is
>> freed. If no memory is freed, the kernel OOM killer is invoked.
>>
>> The struct ops provides the bpf_handle_out_of_memory() callback,
>> which is expected to return 1 if it was able to free some memory and 0
>> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> field of the oom_control structure, which is expected to be set by
>> kfuncs suitable for releasing memory. If both are set, OOM is
>> considered handled, otherwise the next OOM handler in the chain
>> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
>> killer) is executed.
>>
>> The bpf_handle_out_of_memory() callback program is sleepable to enable
>> using iterators, e.g. cgroup iterators. The callback receives struct
>> oom_control as an argument, so it can determine the scope of the OOM
>> event: if this is a memcg-wide or system-wide OOM.
>>
>> The callback is executed just before the kernel victim task selection
>> algorithm, so all heuristics and sysctls like panic on oom and
>> sysctl_oom_kill_allocating_task are respected.
>>
>> BPF OOM struct ops provides the handle_cgroup_offline() callback
>> which is good for releasing struct ops if the corresponding cgroup
>> is gone.
>>
>> The struct ops also has the name field, which allows defining a
>> custom name for the implemented policy. It's printed in the OOM report
>> in the oom_policy=<policy> format. "default" is printed if bpf is not
>> used or policy name is not specified.
>>
>> [ 112.696676] test_progs invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
>> oom_policy=bpf_test_policy
>> [ 112.698160] CPU: 1 UID: 0 PID: 660 Comm: test_progs Not tainted 6.16.0-00015-gf09eb0d6badc #102 PREEMPT(full)
>> [ 112.698165] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-5.fc42 04/01/2014
>> [ 112.698167] Call Trace:
>> [ 112.698177] <TASK>
>> [ 112.698182] dump_stack_lvl+0x4d/0x70
>> [ 112.698192] dump_header+0x59/0x1c6
>> [ 112.698199] oom_kill_process.cold+0x8/0xef
>> [ 112.698206] bpf_oom_kill_process+0x59/0xb0
>> [ 112.698216] bpf_prog_7ecad0f36a167fd7_test_out_of_memory+0x2be/0x313
>> [ 112.698229] bpf__bpf_oom_ops_handle_out_of_memory+0x47/0xaf
>> [ 112.698236] ? srso_alias_return_thunk+0x5/0xfbef5
>> [ 112.698240] bpf_handle_oom+0x11a/0x1e0
>> [ 112.698250] out_of_memory+0xab/0x5c0
>> [ 112.698258] mem_cgroup_out_of_memory+0xbc/0x110
>> [ 112.698274] try_charge_memcg+0x4b5/0x7e0
>> [ 112.698288] charge_memcg+0x2f/0xc0
>> [ 112.698293] __mem_cgroup_charge+0x30/0xc0
>> [ 112.698299] do_anonymous_page+0x40f/0xa50
>> [ 112.698311] __handle_mm_fault+0xbba/0x1140
>> [ 112.698317] ? srso_alias_return_thunk+0x5/0xfbef5
>> [ 112.698335] handle_mm_fault+0xe6/0x370
>> [ 112.698343] do_user_addr_fault+0x211/0x6a0
>> [ 112.698354] exc_page_fault+0x75/0x1d0
>> [ 112.698363] asm_exc_page_fault+0x26/0x30
>> [ 112.698366] RIP: 0033:0x7fa97236db00
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>> ---
>> include/linux/bpf_oom.h | 74 ++++++++++
>> include/linux/memcontrol.h | 5 +
>> include/linux/oom.h | 8 ++
>> mm/Makefile | 3 +
>> mm/bpf_oom.c | 272 +++++++++++++++++++++++++++++++++++++
>> mm/memcontrol.c | 2 +
>> mm/oom_kill.c | 22 ++-
>> 7 files changed, 384 insertions(+), 2 deletions(-)
>> create mode 100644 include/linux/bpf_oom.h
>> create mode 100644 mm/bpf_oom.c
>>
>> diff --git a/include/linux/bpf_oom.h b/include/linux/bpf_oom.h
>> new file mode 100644
>> index 000000000000..18c32a5a068b
>> --- /dev/null
>> +++ b/include/linux/bpf_oom.h
>> @@ -0,0 +1,74 @@
>> +/* SPDX-License-Identifier: GPL-2.0+ */
>> +
>> +#ifndef __BPF_OOM_H
>> +#define __BPF_OOM_H
>> +
>> +struct oom_control;
>> +
>> +#define BPF_OOM_NAME_MAX_LEN 64
>> +
>> +struct bpf_oom_ctx {
>> + /*
>> + * If bpf_oom_ops is attached to a cgroup, id of this cgroup.
>> + * 0 otherwise.
>> + */
>> + u64 cgroup_id;
>> +};
>> +
>> +struct bpf_oom_ops {
>> + /**
>> + * @handle_out_of_memory: Out of memory bpf handler, called before
>> + * the in-kernel OOM killer.
>> + * @ctx: Execution context
>> + * @oc: OOM control structure
>> + *
>> + * Should return 1 if some memory was freed up, otherwise
>> + * the in-kernel OOM killer is invoked.
>> + */
>> + int (*handle_out_of_memory)(struct bpf_oom_ctx *ctx, struct oom_control *oc);
>> +
>> + /**
>> + * @handle_cgroup_offline: Cgroup offline callback
>> + * @ctx: Execution context
>> + * @cgroup_id: Id of deleted cgroup
>> + *
>> + * Called if the cgroup with the attached bpf_oom_ops is deleted.
>> + */
>> + void (*handle_cgroup_offline)(struct bpf_oom_ctx *ctx, u64 cgroup_id);
>> +
>> + /**
>> + * @name: BPF OOM policy name
>> + */
>> + char name[BPF_OOM_NAME_MAX_LEN];
>> +};
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> +/**
>> + * @bpf_handle_oom: handle out of memory condition using bpf
>> + * @oc: OOM control structure
>> + *
>> + * Returns true if some memory was freed.
>> + */
>> +bool bpf_handle_oom(struct oom_control *oc);
>> +
>> +
>> +/**
>> + * @bpf_oom_memcg_offline: handle memcg offlining
>> + * @memcg: Memory cgroup is offlined
>> + *
>> + * When a memory cgroup is about to be deleted and there is an
>> + * attached BPF OOM structure, it has to be detached.
>> + */
>> +void bpf_oom_memcg_offline(struct mem_cgroup *memcg);
>> +
>> +#else /* CONFIG_BPF_SYSCALL */
>> +static inline bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> + return false;
>> +}
>> +
>> +static inline void bpf_oom_memcg_offline(struct mem_cgroup *memcg) {}
>> +
>> +#endif /* CONFIG_BPF_SYSCALL */
>> +
>> +#endif /* __BPF_OOM_H */
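For readers following along: a minimal BPF-side implementation of this
interface could look roughly like the following. This is a sketch, not
part of the patch; the libbpf section names follow the usual struct_ops
conventions and may differ from the final selftests. The do-nothing
policy simply defers to the in-kernel OOM killer (returning 0 means no
memory was freed):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

SEC("struct_ops/handle_out_of_memory")
int BPF_PROG(noop_handle_oom, struct bpf_oom_ctx *ctx, struct oom_control *oc)
{
	/* Nothing was freed, so the in-kernel OOM killer will run. */
	return 0;
}

SEC(".struct_ops.link")
struct bpf_oom_ops noop_ops = {
	.handle_out_of_memory = (void *)noop_handle_oom,
	.name = "noop_policy",
};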
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 50d851ff3f27..39a6c7c8735b 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -29,6 +29,7 @@ struct obj_cgroup;
>> struct page;
>> struct mm_struct;
>> struct kmem_cache;
>> +struct bpf_oom_ops;
>>
>> /* Cgroup-specific page state, on top of universal node page state */
>> enum memcg_stat_item {
>> @@ -226,6 +227,10 @@ struct mem_cgroup {
>> */
>> bool oom_group;
>>
>> +#ifdef CONFIG_BPF_SYSCALL
>> + struct bpf_oom_ops *bpf_oom;
>> +#endif
>> +
>> int swappiness;
>>
>> /* memory.events and memory.events.local */
>> diff --git a/include/linux/oom.h b/include/linux/oom.h
>> index 7b02bc1d0a7e..721087952d04 100644
>> --- a/include/linux/oom.h
>> +++ b/include/linux/oom.h
>> @@ -51,6 +51,14 @@ struct oom_control {
>>
>> /* Used to print the constraint info. */
>> enum oom_constraint constraint;
>> +
>> +#ifdef CONFIG_BPF_SYSCALL
>> + /* Used by the bpf oom implementation to mark the forward progress */
>> + bool bpf_memory_freed;
>> +
>> + /* Policy name */
>> + const char *bpf_policy_name;
>> +#endif
>> };
>>
>> extern struct mutex oom_lock;
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 21abb3353550..051e88c699af 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -105,6 +105,9 @@ obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>> ifdef CONFIG_SWAP
>> obj-$(CONFIG_MEMCG) += swap_cgroup.o
>> endif
>> +ifdef CONFIG_BPF_SYSCALL
>> +obj-y += bpf_oom.o
>> +endif
>> obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o
>> obj-$(CONFIG_GUP_TEST) += gup_test.o
>> obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o
>> diff --git a/mm/bpf_oom.c b/mm/bpf_oom.c
>> new file mode 100644
>> index 000000000000..c4d09ed9d541
>> --- /dev/null
>> +++ b/mm/bpf_oom.c
>> @@ -0,0 +1,272 @@
>> +// SPDX-License-Identifier: GPL-2.0-or-later
>> +/*
>> + * BPF-driven OOM killer customization
>> + *
>> + * Author: Roman Gushchin <roman.gushchin@linux.dev>
>> + */
>> +
>> +#include <linux/bpf.h>
>> +#include <linux/oom.h>
>> +#include <linux/bpf_oom.h>
>> +#include <linux/srcu.h>
>> +#include <linux/cgroup.h>
>> +#include <linux/memcontrol.h>
>> +
>> +DEFINE_STATIC_SRCU(bpf_oom_srcu);
>> +static struct bpf_oom_ops *system_bpf_oom;
>> +
>> +#ifdef CONFIG_MEMCG
>> +static u64 memcg_cgroup_id(struct mem_cgroup *memcg)
>> +{
>> + return cgroup_id(memcg->css.cgroup);
>> +}
>> +
>> +static struct bpf_oom_ops **bpf_oom_memcg_ops_ptr(struct mem_cgroup *memcg)
>> +{
>> + return &memcg->bpf_oom;
>> +}
>> +#else /* CONFIG_MEMCG */
>> +static u64 memcg_cgroup_id(struct mem_cgroup *memcg)
>> +{
>> + return 0;
>> +}
>> +static struct bpf_oom_ops **bpf_oom_memcg_ops_ptr(struct mem_cgroup *memcg)
>> +{
>> + return NULL;
>> +}
>> +#endif
>> +
>> +static int bpf_ops_handle_oom(struct bpf_oom_ops *bpf_oom_ops,
>> + struct mem_cgroup *memcg,
>> + struct oom_control *oc)
>> +{
>> + struct bpf_oom_ctx exec_ctx;
>> + int ret;
>> +
>> + if (IS_ENABLED(CONFIG_MEMCG) && memcg)
>> + exec_ctx.cgroup_id = memcg_cgroup_id(memcg);
>> + else
>> + exec_ctx.cgroup_id = 0;
>> +
>> + oc->bpf_policy_name = &bpf_oom_ops->name[0];
>> + oc->bpf_memory_freed = false;
>> + ret = bpf_oom_ops->handle_out_of_memory(&exec_ctx, oc);
>> + oc->bpf_policy_name = NULL;
>> +
>> + return ret;
>> +}
>> +
>> +bool bpf_handle_oom(struct oom_control *oc)
>> +{
>> + struct bpf_oom_ops *bpf_oom_ops = NULL;
>> + struct mem_cgroup __maybe_unused *memcg;
>> + int idx, ret = 0;
>> +
>> + /* All bpf_oom_ops structures are protected using bpf_oom_srcu */
>> + idx = srcu_read_lock(&bpf_oom_srcu);
>> +
>> +#ifdef CONFIG_MEMCG
>> + /* Find the nearest bpf_oom_ops traversing the cgroup tree upwards */
>> + for (memcg = oc->memcg; memcg; memcg = parent_mem_cgroup(memcg)) {
>> + bpf_oom_ops = READ_ONCE(memcg->bpf_oom);
>> + if (!bpf_oom_ops)
>> + continue;
>> +
>> + /* Call BPF OOM handler */
>> + ret = bpf_ops_handle_oom(bpf_oom_ops, memcg, oc);
>> + if (ret && oc->bpf_memory_freed)
>> + goto exit;
>> + }
>> +#endif /* CONFIG_MEMCG */
>> +
>> + /*
>> + * System-wide OOM or per-memcg BPF OOM handler wasn't successful?
>> + * Try system_bpf_oom.
>> + */
>> + bpf_oom_ops = READ_ONCE(system_bpf_oom);
>> + if (!bpf_oom_ops)
>> + goto exit;
>> +
>> + /* Call BPF OOM handler */
>> + ret = bpf_ops_handle_oom(bpf_oom_ops, NULL, oc);
>> +exit:
>> + srcu_read_unlock(&bpf_oom_srcu, idx);
>> + return ret && oc->bpf_memory_freed;
>> +}
>> +
>> +static int __handle_out_of_memory(struct bpf_oom_ctx *exec_ctx,
>> + struct oom_control *oc)
>> +{
>> + return 0;
>> +}
>> +
>> +static void __handle_cgroup_offline(struct bpf_oom_ctx *exec_ctx, u64 cgroup_id)
>> +{
>> +}
>> +
>> +static struct bpf_oom_ops __bpf_oom_ops = {
>> + .handle_out_of_memory = __handle_out_of_memory,
>> + .handle_cgroup_offline = __handle_cgroup_offline,
>> +};
>> +
>> +static const struct bpf_func_proto *
>> +bpf_oom_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
>> +{
>> + return tracing_prog_func_proto(func_id, prog);
>> +}
>> +
>> +static bool bpf_oom_ops_is_valid_access(int off, int size,
>> + enum bpf_access_type type,
>> + const struct bpf_prog *prog,
>> + struct bpf_insn_access_aux *info)
>> +{
>> + return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
>> +}
>> +
>> +static const struct bpf_verifier_ops bpf_oom_verifier_ops = {
>> + .get_func_proto = bpf_oom_func_proto,
>> + .is_valid_access = bpf_oom_ops_is_valid_access,
>> +};
>> +
>> +static int bpf_oom_ops_reg(void *kdata, struct bpf_link *link)
>> +{
>> + struct bpf_struct_ops_link *ops_link = container_of(link, struct bpf_struct_ops_link, link);
>> + struct bpf_oom_ops **bpf_oom_ops_ptr = NULL;
>> + struct bpf_oom_ops *bpf_oom_ops = kdata;
>> + struct mem_cgroup *memcg = NULL;
>> + int err = 0;
>> +
>> + if (IS_ENABLED(CONFIG_MEMCG) && ops_link->cgroup_id) {
>> + /* Attach to a memory cgroup? */
>> + memcg = mem_cgroup_get_from_ino(ops_link->cgroup_id);
>> + if (IS_ERR_OR_NULL(memcg))
>> + return PTR_ERR(memcg);
>> + bpf_oom_ops_ptr = bpf_oom_memcg_ops_ptr(memcg);
>> + } else {
>> + /* System-wide OOM handler */
>> + bpf_oom_ops_ptr = &system_bpf_oom;
>> + }
>> +
>> + /* Another struct ops attached? */
>> + if (READ_ONCE(*bpf_oom_ops_ptr)) {
>> + err = -EBUSY;
>> + goto exit;
>> + }
>> +
>> + /* Expose bpf_oom_ops structure */
>> + WRITE_ONCE(*bpf_oom_ops_ptr, bpf_oom_ops);
>
> The mechanism for propagating this pointer to child cgroups isn't
> clear. Would an explicit installation in every cgroup be required?
> This approach seems impractical for production environments, where
> cgroups are often created dynamically.
There is no need to propagate it. Instead, the cgroup tree is traversed
towards the root when an OOM occurs, and the closest bpf_oom_ops is used.
Unlike some other cases of attaching bpf progs to cgroups, OOMs obviously
can't be very frequent, so there is no need to optimize this lookup for
speed.
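For example, with handlers attached like this:

  /sys/fs/cgroup/          (no handler; system_bpf_oom as a fallback)
  `-- workload/            bpf_oom_ops "A"
      `-- job/             (no handler)

an OOM hitting job/ is first offered to "A". Per the loop in
bpf_handle_oom() above, every handler found on the way to the root gets
a chance until one reports that memory was freed; after that the
system-wide handler (if any) is tried, and only then the in-kernel OOM
killer runs.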
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 4:32 ` Song Liu
@ 2025-10-30 16:13 ` Tejun Heo
2025-10-30 17:56 ` Song Liu
0 siblings, 1 reply; 83+ messages in thread
From: Tejun Heo @ 2025-10-30 16:13 UTC (permalink / raw)
To: Song Liu
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi
Hello,
On Wed, Oct 29, 2025 at 09:32:44PM -0700, Song Liu wrote:
> If the use case is to attach a single struct_ops to a single cgroup, the author
> of that BPF program can always ignore the memcg parameter and use
> global variables, etc. We waste a register in BPF ISA to save the pointer to
> memcg, but the JIT may recover that in native instructions.
>
> OTOH, starting without a memcg parameter, it will be impossible to allow
> attaching the same struct_ops to different cgroups. I still think that a
> sysadmin loading a set of OOM handlers for users in the containers to
> choose from is a valid use case.
I find something like that being implemented through struct_ops attaching
rather unlikely. Wouldn't it look more like the following?
- Attach a handler at the parent level which implements different policies.
- Child cgroups pick the desired policy using e.g. cgroup xattrs, and when
an OOM event happens, the OOM handler attached at the parent implements
the requested policy (see the sketch after this list).
- If further customization is desired and supported, it's implemented
through child loading its own OOM handler which operates under the
parent's OOM handler.
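To make the xattr-based selection above concrete, a parent-level handler
could do something along these lines. This is a rough sketch only: the
xattr name, the kill_largest_task() helper, and the exact shape of the
xattr-reading kfunc are assumptions, not anything in this series:

SEC("struct_ops/handle_out_of_memory")
int BPF_PROG(parent_oom, struct bpf_oom_ctx *ctx, struct oom_control *oc)
{
	struct bpf_dynptr val;
	char policy[16] = {};
	struct cgroup *cgrp;

	if (!oc->memcg)			/* system-wide OOM */
		return 0;
	cgrp = oc->memcg->css.cgroup;	/* OOMing sub-hierarchy */

	bpf_dynptr_from_mem(policy, sizeof(policy), 0, &val);
	/* assumed kfunc for reading a cgroup xattr */
	if (bpf_cgroup_read_xattr(cgrp, "user.oom_policy", &val) <= 0)
		return 0;			/* no preference set */
	if (!bpf_strncmp(policy, sizeof(policy), "largest"))
		return kill_largest_task(oc);	/* hypothetical helper */
	return 0;
}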
> Also, a per cgroup oom handler may need to access the memcg information
> anyway. Without a dedicated memcg argument, the user needs to fetch it
> somewhere else.
An OOM handler attached to a cgroup doesn't just need to handle OOM events
in the cgroup itself. It's responsible for the whole sub-hierarchy. ie. It
will need accessors to reach all those memcgs anyway.
Another thing to consider is that the memcg for a given cgroup can change by
the controller being enabled and disabled. There isn't the one permanent
memcg that a given cgroup is associated with.
Thanks.
--
tejun
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-29 18:01 ` Song Liu
2025-10-29 20:26 ` Roman Gushchin
@ 2025-10-30 17:22 ` Roman Gushchin
2025-10-30 18:03 ` Song Liu
1 sibling, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-10-30 17:22 UTC (permalink / raw)
To: Song Liu
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo
Song Liu <song@kernel.org> writes:
> On Mon, Oct 27, 2025 at 4:17 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> [...]
>> struct bpf_struct_ops_value {
>> struct bpf_struct_ops_common_value common;
>> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>> }
>> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>> attr->link_create.attach_type);
>> +#ifdef CONFIG_CGROUPS
>> + if (attr->link_create.cgroup.relative_fd) {
>> + struct cgroup *cgrp;
>> +
>> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
>
> We should use "target_fd" here, not relative_fd.
>
> Also, 0 is a valid fd, so we cannot use target_fd == 0 to attach to
> global memcg.
Yep, but then we need to somehow signal that a cgroup fd was passed,
so that struct ops'es which are not attached to cgroups keep working
as before. And we can't use link_create.attach_type.
Should I use link_create.flags? E.g. something like adding a new flag:
@@ -1224,6 +1224,7 @@ enum bpf_perf_event_type {
#define BPF_F_AFTER (1U << 4)
#define BPF_F_ID (1U << 5)
#define BPF_F_PREORDER (1U << 6)
+#define BPF_F_CGROUP (1U << 7)
#define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
/* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
and then do something like this:
int bpf_struct_ops_link_create(union bpf_attr *attr)
{
	<...>

	if (attr->link_create.flags & BPF_F_CGROUP) {
		struct cgroup *cgrp;

		cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
		if (IS_ERR(cgrp)) {
			err = PTR_ERR(cgrp);
			goto err_out;
		}

		link->cgroup_id = cgroup_id(cgrp);
		cgroup_put(cgrp);
	}
Does it sound right?
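FWIW, with such a flag the userspace side could look roughly like this
(a sketch using the raw syscall; BPF_F_CGROUP is only the value proposed
above and doesn't exist in current headers):

#include <unistd.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

#ifndef BPF_F_CGROUP
#define BPF_F_CGROUP (1U << 7)		/* value proposed above */
#endif

int attach_oom_ops(int st_ops_map_fd, int cgroup_fd)
{
	union bpf_attr attr = {};

	attr.link_create.map_fd = st_ops_map_fd;  /* loaded struct_ops map */
	attr.link_create.attach_type = BPF_STRUCT_OPS;
	attr.link_create.flags = BPF_F_CGROUP;
	attr.link_create.target_fd = cgroup_fd;

	return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
}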
Thanks
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 16:13 ` Tejun Heo
@ 2025-10-30 17:56 ` Song Liu
0 siblings, 0 replies; 83+ messages in thread
From: Song Liu @ 2025-10-30 17:56 UTC (permalink / raw)
To: Tejun Heo
Cc: Song Liu, Roman Gushchin, Andrew Morton, linux-kernel,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, cgroups, bpf, Martin KaFai Lau, Kumar Kartikeya Dwivedi
On Thu, Oct 30, 2025 at 9:14 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 29, 2025 at 09:32:44PM -0700, Song Liu wrote:
> > If the use case is to attach a single struct_ops to a single cgroup, the author
> > of that BPF program can always ignore the memcg parameter and use
> > global variables, etc. We waste a register in BPF ISA to save the pointer to
> > memcg, but the JIT may recover that in native instructions.
> >
> > OTOH, starting without a memcg parameter, it will be impossible to allow
> > attaching the same struct_ops to different cgroups. I still think that a
> > sysadmin loading a set of OOM handlers for users in the containers to
> > choose from is a valid use case.
>
> I find something like that being implemented through struct_ops attaching
> rather unlikely. Wouldn't it look more like the following?
>
> - Attach a handler at the parent level which implements different policies.
>
> - Child cgroups pick the desired policy using e.g. cgroup xattrs and when
> OOM event happens, the OOM handler attached at the parent implements the
> requested policy.
OK, using xattrs is another way to achieve this.
> - If further customization is desired and supported, it's implemented
> through child loading its own OOM handler which operates under the
> parent's OOM handler.
>
> > Also, a per cgroup oom handler may need to access the memcg information
> > anyway. Without a dedicated memcg argument, the user needs to fetch it
> > somewhere else.
>
> An OOM handler attached to a cgroup doesn't just need to handle OOM events
> in the cgroup itself. It's responsible for the whole sub-hierarchy. ie. It
> will need accessors to reach all those memcgs anyway.
>
> Another thing to consider is that the memcg for a given cgroup can change by
> the controller being enabled and disabled. There isn't the one permanent
> memcg that a given cgroup is associated with.
In the current version, bpf_oom_ops is attached to the memcg. As long as
we feed a pointer to memcg to all struct_ops functions, these functions
can be implemented in a stateless way. I think having the option to do
this stateless implementation will help us in the long term.
Thanks,
Song
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 17:22 ` Roman Gushchin
@ 2025-10-30 18:03 ` Song Liu
2025-10-30 18:19 ` Amery Hung
0 siblings, 1 reply; 83+ messages in thread
From: Song Liu @ 2025-10-30 18:03 UTC (permalink / raw)
To: Roman Gushchin
Cc: Song Liu, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo
On Thu, Oct 30, 2025 at 10:22 AM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Song Liu <song@kernel.org> writes:
>
> > On Mon, Oct 27, 2025 at 4:17 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > [...]
> >> struct bpf_struct_ops_value {
> >> struct bpf_struct_ops_common_value common;
> >> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> >> }
> >> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
> >> attr->link_create.attach_type);
> >> +#ifdef CONFIG_CGROUPS
> >> + if (attr->link_create.cgroup.relative_fd) {
> >> + struct cgroup *cgrp;
> >> +
> >> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
> >
> > We should use "target_fd" here, not relative_fd.
> >
> > Also, 0 is a valid fd, so we cannot use target_fd == 0 to attach to
> > global memcg.
>
> Yep, but then we need to somehow signal that a cgroup fd was passed,
> so that struct ops'es which are not attached to cgroups keep working
> as before. And we can't use link_create.attach_type.
>
> Should I use link_create.flags? E.g. something like adding a new flag:
>
> @@ -1224,6 +1224,7 @@ enum bpf_perf_event_type {
> #define BPF_F_AFTER (1U << 4)
> #define BPF_F_ID (1U << 5)
> #define BPF_F_PREORDER (1U << 6)
> +#define BPF_F_CGROUP (1U << 7)
> #define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
>
> /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>
> and then do something like this:
>
> int bpf_struct_ops_link_create(union bpf_attr *attr)
> {
> <...>
> if (attr->link_create.flags & BPF_F_CGROUP) {
> struct cgroup *cgrp;
>
> cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
> if (IS_ERR(cgrp)) {
> err = PTR_ERR(cgrp);
> goto err_out;
> }
>
> link->cgroup_id = cgroup_id(cgrp);
> cgroup_put(cgrp);
> }
>
> Does it sound right?
I believe adding a flag (BPF_F_CGROUP or some other name) is the
right solution for this.
OTOH, I am not sure whether we want to add cgroup fd/id to the
bpf link. I personally prefer the model used by TCP congestion
control: the link attaches the struct_ops to a global list, then each
user picks a struct_ops from the list. But I do agree this might be
overkill for cgroup use cases.
Thanks,
Song
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 18:03 ` Song Liu
@ 2025-10-30 18:19 ` Amery Hung
2025-10-30 19:06 ` Roman Gushchin
0 siblings, 1 reply; 83+ messages in thread
From: Amery Hung @ 2025-10-30 18:19 UTC (permalink / raw)
To: Song Liu
Cc: Roman Gushchin, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo
On Thu, Oct 30, 2025 at 11:09 AM Song Liu <song@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 10:22 AM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> >
> > Song Liu <song@kernel.org> writes:
> >
> > > On Mon, Oct 27, 2025 at 4:17 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> > > [...]
> > >> struct bpf_struct_ops_value {
> > >> struct bpf_struct_ops_common_value common;
> > >> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
> > >> }
> > >> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
> > >> attr->link_create.attach_type);
> > >> +#ifdef CONFIG_CGROUPS
> > >> + if (attr->link_create.cgroup.relative_fd) {
> > >> + struct cgroup *cgrp;
> > >> +
> > >> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
> > >
> > > We should use "target_fd" here, not relative_fd.
> > >
> > > Also, 0 is a valid fd, so we cannot use target_fd == 0 to attach to
> > > global memcg.
> >
> > Yep, but then we need to somehow signal that a cgroup fd was passed,
> > so that struct ops'es which are not attached to cgroups keep working
> > as before. And we can't use link_create.attach_type.
> >
> > Should I use link_create.flags? E.g. something like adding a new flag:
> >
> > @@ -1224,6 +1224,7 @@ enum bpf_perf_event_type {
> > #define BPF_F_AFTER (1U << 4)
> > #define BPF_F_ID (1U << 5)
> > #define BPF_F_PREORDER (1U << 6)
> > +#define BPF_F_CGROUP (1U << 7)
> > #define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
> >
> > /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
> >
> > and then do something like this:
> >
> > int bpf_struct_ops_link_create(union bpf_attr *attr)
> > {
> > <...>
> > if (attr->link_create.flags & BPF_F_CGROUP) {
> > struct cgroup *cgrp;
> >
> > cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
> > if (IS_ERR(cgrp)) {
> > err = PTR_ERR(cgrp);
> > goto err_out;
> > }
> >
> > link->cgroup_id = cgroup_id(cgrp);
> > cgroup_put(cgrp);
> > }
> >
> > Does it sound right?
>
> I believe adding a flag (BPF_F_CGROUP or some other name) is the
> right solution for this.
>
> OTOH, I am not sure whether we want to add cgroup fd/id to the
> bpf link. I personally prefer the model used by TCP congestion
> control: the link attaches the struct_ops to a global list, then each
> user picks a struct_ops from the list. But I do agree this might be
> overkill for cgroup use cases.
+1.
In the TCP congestion control and BPF qdisc models:
During link_create, both add the struct_ops to a list, and the
struct_ops can be indexed by name. The struct_ops is not "active" at
this point.
Then, each has its own interface to 'apply' the struct_ops to a
socket or queue: setsockopt() or netlink.
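(For reference, in the TCP-CC case that per-object 'apply' step is the
usual selection by name, e.g.:

	setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
		   "bpf_cubic", sizeof("bpf_cubic") - 1);

where "bpf_cubic" is the name the struct_ops registered under.)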
But maybe cgroup-related struct_ops are different.
-Amery
>
> Thanks,
> Song
>
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 18:19 ` Amery Hung
@ 2025-10-30 19:06 ` Roman Gushchin
2025-10-30 21:34 ` Song Liu
2025-10-30 22:19 ` bpf_st_ops and cgroups. Was: " Alexei Starovoitov
0 siblings, 2 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-30 19:06 UTC (permalink / raw)
To: Amery Hung
Cc: Song Liu, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo
Amery Hung <ameryhung@gmail.com> writes:
> On Thu, Oct 30, 2025 at 11:09 AM Song Liu <song@kernel.org> wrote:
>>
>> On Thu, Oct 30, 2025 at 10:22 AM Roman Gushchin
>> <roman.gushchin@linux.dev> wrote:
>> >
>> > Song Liu <song@kernel.org> writes:
>> >
>> > > On Mon, Oct 27, 2025 at 4:17 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>> > > [...]
>> > >> struct bpf_struct_ops_value {
>> > >> struct bpf_struct_ops_common_value common;
>> > >> @@ -1359,6 +1360,18 @@ int bpf_struct_ops_link_create(union bpf_attr *attr)
>> > >> }
>> > >> bpf_link_init(&link->link, BPF_LINK_TYPE_STRUCT_OPS, &bpf_struct_ops_map_lops, NULL,
>> > >> attr->link_create.attach_type);
>> > >> +#ifdef CONFIG_CGROUPS
>> > >> + if (attr->link_create.cgroup.relative_fd) {
>> > >> + struct cgroup *cgrp;
>> > >> +
>> > >> + cgrp = cgroup_get_from_fd(attr->link_create.cgroup.relative_fd);
>> > >
>> > > We should use "target_fd" here, not relative_fd.
>> > >
>> > > Also, 0 is a valid fd, so we cannot use target_fd == 0 to attach to
>> > > global memcg.
>> >
>> > Yep, but then we need to somehow signal that a cgroup fd was passed,
>> > so that struct ops'es which are not attached to cgroups keep working
>> > as before. And we can't use link_create.attach_type.
>> >
>> > Should I use link_create.flags? E.g. something like adding a new flag:
>> >
>> > @@ -1224,6 +1224,7 @@ enum bpf_perf_event_type {
>> > #define BPF_F_AFTER (1U << 4)
>> > #define BPF_F_ID (1U << 5)
>> > #define BPF_F_PREORDER (1U << 6)
>> > +#define BPF_F_CGROUP (1U << 7)
>> > #define BPF_F_LINK BPF_F_LINK /* 1 << 13 */
>> >
>> > /* If BPF_F_STRICT_ALIGNMENT is used in BPF_PROG_LOAD command, the
>> >
>> > and then do something like this:
>> >
>> > int bpf_struct_ops_link_create(union bpf_attr *attr)
>> > {
>> > <...>
>> > if (attr->link_create.flags & BPF_F_CGROUP) {
>> > struct cgroup *cgrp;
>> >
>> > cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
>> > if (IS_ERR(cgrp)) {
>> > err = PTR_ERR(cgrp);
>> > goto err_out;
>> > }
>> >
>> > link->cgroup_id = cgroup_id(cgrp);
>> > cgroup_put(cgrp);
>> > }
>> >
>> > Does it sound right?
>>
>> I believe adding a flag (BPF_F_CGROUP or some other name) is the
>> right solution for this.
>>
>> OTOH, I am not sure whether we want to add cgroup fd/id to the
>> bpf link. I personally prefer the model used by TCP congestion
>> control: the link attaches the struct_ops to a global list, then each
>> user picks a struct_ops from the list. But I do agree this might be
>> overkill for cgroup use cases.
>
> +1.
>
> In TCP congestion control and BPF qdisc's model:
>
> During link_create, both adds the struct_ops to a list, and the
> struct_ops can be indexed by name. The struct_ops are not "active" by
> this time.
> Then, each has their own interface to 'apply' the struct_ops to a
> socket or queue: setsockopt() or netlink.
>
> But maybe cgroup-related struct_ops are different.
Both the tcp congestion and qdisc cases are somewhat different because
there already is a way to select between multiple implementations; bpf
just adds another one. In the oom case, that's not true: as of today,
there is only one (global) oom killer. Of course we can create
interfaces to allow a user to make a choice. But the question is: do we
want to create such an interface for the oom case specifically (and later
for each new case separately), or is there a place for some generalization?
Ok, let me summarize the options we discussed here:
1) Make the attachment details (e.g. cgroup_id) part of the struct ops
itself. The attachment happens at reg() time.
+: It's convenient for complex stateful struct ops'es, because a
single entity represents a combination of code and data.
-: No way to attach a single struct ops to multiple entities.
This approach is used by Tejun for the per-cgroup sched_ext prototype.
2) Make the attachment details a part of bpf_link creation. The
attachment still happens at reg() time.
+: A single struct ops can be attached to multiple entities.
-: Implementing stateful struct ops'es is harder and requires passing
an additional argument (some sort of "self") to all callbacks.
I'm using this approach in the bpf oom proposal (a sketch follows below).
3) Move the attachment out of .reg() scope entirely. reg() will register
the implementation system-wide and then some 3rd-party interface
(e.g. cgroupfs) should be used to select the implementation.
+: ?
-: New hard-coded interfaces might be required to enable bpf-driven
kernel customization. The "attachment" code is not shared between
various struct ops cases.
Implementing stateful struct ops'es is harder and requires passing
an additional argument (some sort of "self") to all callbacks.
This approach works well for cases when there is already a selection
of implementations (e.g. tcp congestion mechanisms), and bpf is adding
another one.
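To make the "self" argument in 2) concrete: in this proposal the
execution context carries the attachment identity, so a single struct
ops attached to multiple cgroups can keep per-attachment state in a
regular map keyed by it. A sketch (the map layout is just an example):

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u64);		/* ctx->cgroup_id */
	__type(value, __u32);		/* per-attachment config */
	__uint(max_entries, 128);
} cfg SEC(".maps");

SEC("struct_ops/handle_out_of_memory")
int BPF_PROG(shared_policy, struct bpf_oom_ctx *ctx, struct oom_control *oc)
{
	__u32 *c = bpf_map_lookup_elem(&cfg, &ctx->cgroup_id);

	if (!c)
		return 0;	/* unknown attachment: defer to the kernel */
	/* ... apply the per-cgroup policy *c ... */
	return 0;
}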
I personally lean towards 2), but I can easily implement bpf_oom with 1),
and most likely with 3) too, though it's a bit more complicated.
So I guess we all need to come to an agreement on which way to go.
Thanks!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 19:06 ` Roman Gushchin
@ 2025-10-30 21:34 ` Song Liu
2025-10-30 22:42 ` Martin KaFai Lau
2025-10-30 22:19 ` bpf_st_ops and cgroups. Was: " Alexei Starovoitov
1 sibling, 1 reply; 83+ messages in thread
From: Song Liu @ 2025-10-30 21:34 UTC (permalink / raw)
To: Roman Gushchin
Cc: Amery Hung, Song Liu, Andrew Morton, linux-kernel,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, cgroups, bpf, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
Tejun Heo
Hi Roman,
On Thu, Oct 30, 2025 at 12:07 PM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
[...]
> > In the TCP congestion control and BPF qdisc models:
> >
> > During link_create, both add the struct_ops to a list, and the
> > struct_ops can be indexed by name. The struct_ops is not "active" at
> > this point.
> > Then, each has its own interface to 'apply' the struct_ops to a
> > socket or queue: setsockopt() or netlink.
> >
> > But maybe cgroup-related struct_ops are different.
>
> Both the tcp congestion and qdisc cases are somewhat different because
> there already is a way to select between multiple implementations; bpf
> just adds another one. In the oom case, that's not true: as of today,
> there is only one (global) oom killer. Of course we can create
> interfaces to allow a user to make a choice. But the question is: do we
> want to create such an interface for the oom case specifically (and later
> for each new case separately), or is there a place for some generalization?
Agreed that this approach requires a separate mechanism to attach
the struct_ops to an entity.
> Ok, let me summarize the options we discussed here:
Thanks for the summary!
>
> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
> itself. The attachment is happening at the reg() time.
>
> +: It's convenient for complex stateful struct ops'es, because a
> single entity represents a combination of code and data.
> -: No way to attach a single struct ops to multiple entities.
>
> This approach is used by Tejun for per-cgroup sched_ext prototype.
>
> 2) Make the attachment details a part of bpf_link creation. The
> attachment is still happening at the reg() time.
>
> +: A single struct ops can be attached to multiple entities.
> -: Implementing stateful struct ops'es is harder and requires passing
> an additional argument (some sort of "self") to all callbacks.
> I'm using this approach in the bpf oom proposal.
>
I think both 1) and 2) have the following issue. With cgroup_id in
the struct_ops or the link, the cgroup_id works more like a filter. The
cgroup doesn't hold any reference to the struct_ops. The bpf link
holds the reference to the struct_ops, so we need to keep the
link alive, either by keeping an active fd or by pinning the
link to bpffs. When the cgroup is removed, we need to clean up
the bpf link separately.
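(With libbpf that typically means something like:

	struct bpf_link *link = bpf_map__attach_struct_ops(skel->maps.oom_ops);

	bpf_link__pin(link, "/sys/fs/bpf/oom_policy_link");

so the attachment survives the exit of the loading process; the skeleton
name and pin path are illustrative.)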
> 3) Move the attachment out of .reg() scope entirely. reg() will register
> the implementation system-wide and then some 3rd-party interface
> (e.g. cgroupfs) should be used to select the implementation.
>
> +: ?
> -: New hard-coded interfaces might be required to enable bpf-driven
> kernel customization. The "attachment" code is not shared between
> various struct ops cases.
> Implementing stateful struct ops'es is harder and requires passing
> an additional argument (some sort of "self") to all callbacks.
>
> This approach works well for cases when there is already a selection
> of implementations (e.g. tcp congestion mechanisms), and bpf is adding
> another one.
Another benefit of 3) is that it allows loading an OOM controller in a
kernel module, just like loading a file system in a kernel module. This
is possible with 3) because we paid the cost of adding a new
select/attach interface.
On a semi-separate topic: option 2) enables attaching a BPF program
to a kernel object (a cgroup here, but it could be something else). This
is an interesting idea, and we may find it useful in other cases (attaching
a BPF program to a task_struct, etc.).
Thanks,
Song
^ permalink raw reply [flat|nested] 83+ messages in thread
* bpf_st_ops and cgroups. Was: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 19:06 ` Roman Gushchin
2025-10-30 21:34 ` Song Liu
@ 2025-10-30 22:19 ` Alexei Starovoitov
2025-10-30 23:24 ` Roman Gushchin
1 sibling, 1 reply; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-30 22:19 UTC (permalink / raw)
To: Roman Gushchin
Cc: Amery Hung, Song Liu, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi, Tejun Heo
On Thu, Oct 30, 2025 at 12:06 PM Roman Gushchin
<roman.gushchin@linux.dev> wrote:
>
> Ok, let me summarize the options we discussed here:
>
> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
> itself. The attachment is happening at the reg() time.
>
> +: It's convenient for complex stateful struct ops'es, because a
> single entity represents a combination of code and data.
> -: No way to attach a single struct ops to multiple entities.
>
> This approach is used by Tejun for per-cgroup sched_ext prototype.
It's wrong. It should adopt the bpf_struct_ops_link_create() approach
and use attr->link_create.cgroup.relative_fd to attach.
At that point scx can enforce that it attaches to one cgroup only,
if it simplifies things for sched-ext. That's fine.
But the api must be link based.
Otherwise, a cgroup_id inside the st_ops, passed all the way from the
bpf prog, will not be backward compatible if/when people want to
attach the same sched-ext to multiple cgroups.
> 2) Make the attachment details a part of bpf_link creation. The
> attachment is still happening at the reg() time.
>
> +: A single struct ops can be attached to multiple entities.
> -: Implementing stateful struct ops'es is harder and requires passing
> an additional argument (some sort of "self") to all callbacks.
sched-ext is already suffering from the lack of 'this'.
The current workarounds with prog_assoc and aux__prog are not great.
We should learn from that mistake instead of repeating it with bpf-oom.
As for 'this', I think we should pass
'struct bpf_struct_ops_link *' to all callbacks.
This patch proposes to have cgroup_id in there.
It can be a pointer to cgroup too. This detail we can change later.
We can brainstorm a way to pass 'link *' in run_ctx,
and have an easy way to access it from ops and from kfuncs
that ops will call.
The existing tracing-style bpf_set_run_ctx() should work for bpf-oom,
and 'link *'->cgroup_id->cgrp->memcg will be there for the ops
and for kfuncs, but it doesn't quite work for sched-ext as-is,
which wants run_ctx to be different for sched-ext instances
attached at different levels of the hierarchy.
Maybe an additional bpf_set_run_ctx() while traversing the
hierarchy will do the trick?
Then we might not even need aux_prog and kf_implicit_args that much.
They may be useful on their own, though.
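A rough sketch of that shape, purely illustrative (the wrapper struct
and the call site are hypothetical; bpf_set_run_ctx()/bpf_reset_run_ctx()
are the existing helpers in include/linux/bpf.h):

struct bpf_struct_ops_run_ctx {
	struct bpf_run_ctx run_ctx;
	struct bpf_struct_ops_link *link;	/* the 'this' */
};

static int call_handle_oom(struct bpf_oom_ops *ops,
			   struct bpf_struct_ops_link *link,
			   struct oom_control *oc)
{
	struct bpf_struct_ops_run_ctx ctx = { .link = link };
	struct bpf_run_ctx *old;
	int ret;

	old = bpf_set_run_ctx(&ctx.run_ctx);
	ret = ops->handle_out_of_memory(NULL, oc);	/* exec ctx elided */
	bpf_reset_run_ctx(old);
	/* kfuncs called by the prog can container_of() the current
	 * run_ctx to find the link and, from it, the cgroup/memcg */
	return ret;
}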
> I'm using this approach in the bpf oom proposal.
>
> 3) Move the attachment out of .reg() scope entirely. reg() will register
> the implementation system-wide and then some 3rd-party interface
> (e.g. cgroupfs) should be used to select the implementation.
We went down that road with ioctl-s and subsystem-specific ways to attach.
All of them sucked. link_create is the only acceptable approach
because it returns an FD.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 21:34 ` Song Liu
@ 2025-10-30 22:42 ` Martin KaFai Lau
2025-10-30 23:14 ` Roman Gushchin
2025-10-31 0:05 ` Song Liu
0 siblings, 2 replies; 83+ messages in thread
From: Martin KaFai Lau @ 2025-10-30 22:42 UTC (permalink / raw)
To: Song Liu, Roman Gushchin
Cc: Amery Hung, Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Kumar Kartikeya Dwivedi, Tejun Heo
On 10/30/25 2:34 PM, Song Liu wrote:
> Hi Roman,
>
> On Thu, Oct 30, 2025 at 12:07 PM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
> [...]
>>> In the TCP congestion control and BPF qdisc models:
>>>
>>> During link_create, both add the struct_ops to a list, and the
>>> struct_ops can be indexed by name. The struct_ops is not "active" at
>>> this point.
>>> Then, each has its own interface to 'apply' the struct_ops to a
>>> socket or queue: setsockopt() or netlink.
>>>
>>> But maybe cgroup-related struct_ops are different.
>>
>> Both the tcp congestion and qdisc cases are somewhat different because
>> there already is a way to select between multiple implementations; bpf
>> just adds another one. In the oom case, that's not true: as of today,
>> there is only one (global) oom killer. Of course we can create
>> interfaces to allow a user to make a choice. But the question is: do we
>> want to create such an interface for the oom case specifically (and later
>> for each new case separately), or is there a place for some generalization?
>
> Agreed that this approach requires a separate mechanism to attach
> the struct_ops to an entity.
>
>> Ok, let me summarize the options we discussed here:
>
> Thanks for the summary!
>
>>
>> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
>> itself. The attachment is happening at the reg() time.
>>
>> +: It's convenient for complex stateful struct ops'es, because a
>> single entity represents a combination of code and data.
>> -: No way to attach a single struct ops to multiple entities.
>>
>> This approach is used by Tejun for per-cgroup sched_ext prototype.
>>
>> 2) Make the attachment details a part of bpf_link creation. The
>> attachment is still happening at the reg() time.
>>
>> +: A single struct ops can be attached to multiple entities.
>> -: Implementing stateful struct ops'es is harder and requires passing
>> an additional argument (some sort of "self") to all callbacks.
>> I'm using this approach in the bpf oom proposal.
>>
>
> I think both 1) and 2) have the following issue. With cgroup_id in
> struct_ops or the link, the cgroup_id works more like a filter. The
> cgroup doesn't hold any reference to the struct_ops. The bpf link
> holds the reference to the struct_ops, so we need to keep the
> link alive, either by keeping an active fd, or by pinning the
> link to bpffs. When the cgroup is removed, we need to clean up
> the bpf link separately.
The link can be detached (struct_ops's unreg) by user space.
The link can also be detached by the subsystem (the cgroup) here.
It was requested by scx:
https://lore.kernel.org/all/20240530065946.979330-7-thinker.li@gmail.com/
Not sure if scx has started using it.
>
>> 3) Move the attachment out of .reg() scope entirely. reg() will register
>> the implementation system-wide and then some 3rd-party interface
>> (e.g. cgroupfs) should be used to select the implementation.
>>
>> +: ?
>> -: New hard-coded interfaces might be required to enable bpf-driven
>> kernel customization. The "attachment" code is not shared between
>> various struct ops cases.
>> Implementing stateful struct ops'es is harder and requires passing
>> an additional argument (some sort of "self") to all callbacks.
>>
>> This approach works well for cases when there is already a selection
>> of implementations (e.g. tcp congestion mechanisms), and bpf is adding
>> another one.
>
> Another benefit of 3) is that it allows loading an OOM controller in a
> kernel module, just like loading a file system in a kernel module. This
> is possible with 3) because we paid the cost of adding a new select
> attach interface.
>
> A semi-separate topic, option 2) enables attaching a BPF program
> to a kernel object (a cgroup here, but could be something else). This
> is an interesting idea, and we may find it useful in other cases (attach
> a BPF program to a task_struct, etc.).
Is there a plan for a pure kernel-module oom implementation?
I think the link-to-cgrp support here does not necessarily stop the
later addition of write-to-cgroupfs support if a kernel-module oom is
indeed needed in the future.
imo, cgroup-bpf has an ecosystem around it, so it is sort of special. bpf users
have expectations on how a bpf prog is attached to a cgroup: the introspection,
auto detachment from the cgroup when the link is gone... etc.
If link-to-cgrp is used, I prefer (2). Stay with one way of attaching
to a cgrp. It is also consistent with the current way of attaching a single
bpf prog to a cgroup; it is now attaching a map (a set of bpf progs) to a cgroup.
The individual struct_ops implementation can decide if it should
allow a struct_ops to be attached multiple times.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 22:42 ` Martin KaFai Lau
@ 2025-10-30 23:14 ` Roman Gushchin
2025-10-31 0:05 ` Song Liu
1 sibling, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-30 23:14 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Song Liu, Amery Hung, Andrew Morton, linux-kernel,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, cgroups, bpf, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
Tejun Heo
Martin KaFai Lau <martin.lau@linux.dev> writes:
> On 10/30/25 2:34 PM, Song Liu wrote:
>> Hi Roman,
>> On Thu, Oct 30, 2025 at 12:07 PM Roman Gushchin
>> <roman.gushchin@linux.dev> wrote:
>> [...]
>>>> In the TCP congestion control and BPF qdisc models:
>>>>
>>>> During link_create, both add the struct_ops to a list, and the
>>>> struct_ops can be indexed by name. The struct_ops is not "active" at
>>>> this point.
>>>> Then, each has its own interface to 'apply' the struct_ops to a
>>>> socket or queue: setsockopt() or netlink.
>>>>
>>>> But maybe cgroup-related struct_ops are different.
>>>
>>> Both the tcp congestion and qdisc cases are somewhat different because
>>> there already is a way to select between multiple implementations; bpf
>>> just adds another one. In the oom case, that's not true: as of today,
>>> there is only one (global) oom killer. Of course we can create
>>> interfaces to allow a user to make a choice. But the question is: do we
>>> want to create such an interface for the oom case specifically (and later
>>> for each new case separately), or is there a place for some generalization?
>> Agreed that this approach requires a separate mechanism to attach
>> the struct_ops to an entity.
>>
>>> Ok, let me summarize the options we discussed here:
>> Thanks for the summary!
>>
>>>
>>> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
>>> itself. The attachment is happening at the reg() time.
>>>
>>> +: It's convenient for complex stateful struct ops'es, because a
>>> single entity represents a combination of code and data.
>>> -: No way to attach a single struct ops to multiple entities.
>>>
>>> This approach is used by Tejun for per-cgroup sched_ext prototype.
>>>
>>> 2) Make the attachment details a part of bpf_link creation. The
>>> attachment is still happening at the reg() time.
>>>
>>> +: A single struct ops can be attached to multiple entities.
>>> -: Implementing stateful struct ops'es is harder and requires passing
>>> an additional argument (some sort of "self") to all callbacks.
>>> I'm using this approach in the bpf oom proposal.
>>>
>> I think both 1) and 2) have the following issue. With cgroup_id in
>> struct_ops or the link, the cgroup_id works more like a filter. The
>> cgroup doesn't hold any reference to the struct_ops. The bpf link
>> holds the reference to the struct_ops, so we need to keep the
>> link alive, either by keeping an active fd, or by pinning the
>> link to bpffs. When the cgroup is removed, we need to clean up
>> the bpf link separately.
>
> The link can be detached (struct_ops's unreg) by the user space.
>
> The link can also be detached from the subsystem (cgroup) here.
> It was requested by scx:
> https://lore.kernel.org/all/20240530065946.979330-7-thinker.li@gmail.com/
>
> Not sure if scx has started using it.
>
>>
>>> 3) Move the attachment out of .reg() scope entirely. reg() will register
>>> the implementation system-wide and then some 3rd-party interface
>>> (e.g. cgroupfs) should be used to select the implementation.
>>>
>>> +: ?
>>> -: New hard-coded interfaces might be required to enable bpf-driven
>>> kernel customization. The "attachment" code is not shared between
>>> various struct ops cases.
>>> Implementing stateful struct ops'es is harder and requires passing
>>> an additional argument (some sort of "self") to all callbacks.
>>>
>>> This approach works well for cases when there is already a selection
>>> of implementations (e.g. tcp congestion mechanisms), and bpf is adding
>>> another one.
>> Another benefit of 3) is that it allows loading an OOM controller in
>> a
>> kernel module, just like loading a file system in a kernel module. This
>> is possible with 3) because we paid the cost of adding a new select
>> attach interface.
>> A semi-separate topic, option 2) enables attaching a BPF program
>> to a kernel object (a cgroup here, but could be something else). This
>> is an interesting idea, and we may find it useful in other cases (attach
>> a BPF program to a task_struct, etc.).
Yep, task_struct is an attractive target for something like mm-related
policies (THP, NUMA, memory tiers, etc.).
>
> Is there a plan for a pure kernel-module oom implementation?
I highly doubt it.
> I think the link-to-cgrp support here does not necessarily stop the
> later addition of write-to-cgroupfs support if a kernel-module oom is
> indeed needed in the future.
>
> imo, cgroup-bpf has an ecosystem around it, so it is sort of special. bpf users
> have expectations on how a bpf prog is attached to a cgroup: the introspection,
> auto detachment from the cgroup when the link is gone... etc.
>
> If link-to-cgrp is used, I prefer (2). Stay with one way of attaching
> to a cgrp. It is also consistent with the current way of attaching a single
> bpf prog to a cgroup; it is now attaching a map (a set of bpf progs) to a cgroup.
> The individual struct_ops implementation can decide if it should
> allow a struct_ops to be attached multiple times.
+1
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: bpf_st_ops and cgroups. Was: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 22:19 ` bpf_st_ops and cgroups. Was: " Alexei Starovoitov
@ 2025-10-30 23:24 ` Roman Gushchin
2025-10-31 3:03 ` Yafang Shao
` (2 more replies)
0 siblings, 3 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-10-30 23:24 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: Amery Hung, Song Liu, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi, Tejun Heo
Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> On Thu, Oct 30, 2025 at 12:06 PM Roman Gushchin
> <roman.gushchin@linux.dev> wrote:
>>
>> Ok, let me summarize the options we discussed here:
>>
>> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
>> itself. The attachment is happening at the reg() time.
>>
>> +: It's convenient for complex stateful struct ops'es, because a
>> single entity represents a combination of code and data.
>> -: No way to attach a single struct ops to multiple entities.
>>
>> This approach is used by Tejun for per-cgroup sched_ext prototype.
>
> It's wrong. It should adopt bpf_struct_ops_link_create() approach
> and use attr->link_create.cgroup.relative_fd to attach.
This is basically what I have in v2, but Andrii and Song suggested that
I should use attr->link_create.target_fd instead.
I have a slight preference towards attr->link_create.cgroup.relative_fd
because it makes it clear that the fd is a cgroup fd and potentially opens
up the possibility of e.g. attaching struct_ops to individual tasks and
cgroups, but I'm fine with both options.
Also, as Song pointed out, fd==0 is in theory a valid target, so instead of
using the "if (fd) {...}" check we might need a new flag. Idk if it
really makes sense to complicate the code for it.
Can we please decide on what's best here?
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 22:42 ` Martin KaFai Lau
2025-10-30 23:14 ` Roman Gushchin
@ 2025-10-31 0:05 ` Song Liu
1 sibling, 0 replies; 83+ messages in thread
From: Song Liu @ 2025-10-31 0:05 UTC (permalink / raw)
To: Martin KaFai Lau
Cc: Song Liu, Roman Gushchin, Amery Hung, Andrew Morton, linux-kernel,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, cgroups, bpf, Martin KaFai Lau, Kumar Kartikeya Dwivedi,
Tejun Heo
On Thu, Oct 30, 2025 at 3:42 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
[...]
>
> The link can be detached (struct_ops's unreg) by the user space.
>
> The link can also be detached from the subsystem (cgroup) here.
> It was requested by scx:
> https://lore.kernel.org/all/20240530065946.979330-7-thinker.li@gmail.com/
>
> Not sure if scx has started using it.
I see. User space can poll the link fd and get notified when the
cgroup is removed.
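(i.e., something along these lines; treating the detach notification as
a plain poll() wakeup is my reading of the series linked above:

	struct pollfd pfd = { .fd = link_fd };	/* POLLHUP needs no events bit */

	poll(&pfd, 1, -1);	/* returns once the link is detached */

after which user space can clean up the pinned link.)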
> >
> >> 3) Move the attachment out of .reg() scope entirely. reg() will register
> >> the implementation system-wide and then some 3rd-party interface
> >> (e.g. cgroupfs) should be used to select the implementation.
> >>
> >> +: ?
> >> -: New hard-coded interfaces might be required to enable bpf-driven
> >> kernel customization. The "attachment" code is not shared between
> >> various struct ops cases.
> >> Implementing stateful struct ops'es is harder and requires passing
> >> an additional argument (some sort of "self") to all callbacks.
> >>
> >> This approach works well for cases when there is already a selection
> >> of implementations (e.g. tcp congestion mechanisms), and bpf is adding
> >> another one.
> >
> > Another benefit of 3) is that it allows loading an OOM controller in a
> > kernel module, just like loading a file system in a kernel module. This
> > is possible with 3) because we paid the cost of adding a new select
> > attach interface.
> >
> > A semi-separate topic, option 2) enables attaching a BPF program
> > to a kernel object (a cgroup here, but could be something else). This
> > is an interesting idea, and we may find it useful in other cases (attach
> > a BPF program to a task_struct, etc.).
>
> Is there a plan for a pure kernel-module oom implementation?
> I think the link-to-cgrp support here does not necessarily stop the
> later addition of write-to-cgroupfs support if a kernel-module oom is
> indeed needed in the future.
I am not aware of use cases for writing OOM handlers in modules. Also
agreed that adding the attach-to-cgroup link doesn't stop us from using
modules in the future.
Thanks,
Song
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: bpf_st_ops and cgroups. Was: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 23:24 ` Roman Gushchin
@ 2025-10-31 3:03 ` Yafang Shao
2025-10-31 6:14 ` Song Liu
2025-10-31 17:37 ` Alexei Starovoitov
2 siblings, 0 replies; 83+ messages in thread
From: Yafang Shao @ 2025-10-31 3:03 UTC (permalink / raw)
To: Roman Gushchin
Cc: Alexei Starovoitov, Amery Hung, Song Liu, Andrew Morton, LKML,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi, Tejun Heo
On Fri, Oct 31, 2025 at 7:30 AM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Thu, Oct 30, 2025 at 12:06 PM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> >>
> >> Ok, let me summarize the options we discussed here:
> >>
> >> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
> >> itself. The attachment is happening at the reg() time.
> >>
> >> +: It's convenient for complex stateful struct ops'es, because a
> >> single entity represents a combination of code and data.
> >> -: No way to attach a single struct ops to multiple entities.
> >>
> >> This approach is used by Tejun for per-cgroup sched_ext prototype.
> >
> > It's wrong. It should adopt bpf_struct_ops_link_create() approach
> > and use attr->link_create.cgroup.relative_fd to attach.
>
> This is basically what I have in v2, but Andrii and Song suggested that
> I should use attr->link_create.target_fd instead.
>
> I have a slight preference towards attr->link_create.cgroup.relative_fd
> because it makes it clear that fd is a cgroup fd and potentially opens
> a possibility to e.g. attach struct_ops to individual tasks and
> cgroups, but I'm fine with both options.
>
> Also, as Song pointed out, fd==0 is in theory a valid target, so instead of
> using the "if (fd) {...}" check we might need a new flag.
I recall that Linus has reminded the BPF subsystem not to use `if
(fd)` to check for a valid fd. We should avoid repeating this mistake.
The proper solution is to add a new flag to indicate whether an fd is
valid.
> Idk if it
> really makes sense to complicate the code for it.
>
> Can we, please, decide on what's best here?
>
It seems the only way for us to learn is through practice—even if that
means making mistakes first ;-)
I can imagine a key benefit of a single struct-ops-to-multiple-cgroups
model is the ability to pre-load all required policies. This allows
users the flexibility to attach them on demand, while completely
avoiding the complex lifecycle management of individual links—a major
practical pain point.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: bpf_st_ops and cgroups. Was: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 23:24 ` Roman Gushchin
2025-10-31 3:03 ` Yafang Shao
@ 2025-10-31 6:14 ` Song Liu
2025-10-31 11:35 ` Yafang Shao
2025-10-31 17:37 ` Alexei Starovoitov
2 siblings, 1 reply; 83+ messages in thread
From: Song Liu @ 2025-10-31 6:14 UTC (permalink / raw)
To: Roman Gushchin
Cc: Alexei Starovoitov, Amery Hung, Song Liu, Andrew Morton, LKML,
Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi, Tejun Heo
On Thu, Oct 30, 2025 at 4:24 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Thu, Oct 30, 2025 at 12:06 PM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> >>
> >> Ok, let me summarize the options we discussed here:
> >>
> >> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
> >> itself. The attachment is happening at the reg() time.
> >>
> >> +: It's convenient for complex stateful struct ops'es, because a
> >> single entity represents a combination of code and data.
> >> -: No way to attach a single struct ops to multiple entities.
> >>
> >> This approach is used by Tejun for per-cgroup sched_ext prototype.
> >
> > It's wrong. It should adopt bpf_struct_ops_link_create() approach
> > and use attr->link_create.cgroup.relative_fd to attach.
>
> This is basically what I have in v2, but Andrii and Song suggested that
> I should use attr->link_create.target_fd instead.
>
> I have a slight preference towards attr->link_create.cgroup.relative_fd
> because it makes it clear that fd is a cgroup fd and potentially opens
> a possibility to e.g. attach struct_ops to individual tasks and
> cgroups, but I'm fine with both options.
relative_fd and relative_id have a specific meaning. When multiple
programs are attached to the same object (cgroup, socket, etc.),
relative_fd and relative_id (together with BPF_F_BEFORE and
BPF_F_AFTER) are used to specify the order of execution.
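(For example, with the mprog-style cgroup attach API, ordering is
expressed roughly like this; treat the exact opts layout as an
approximation:

	LIBBPF_OPTS(bpf_prog_attach_opts, opts,
		.flags = BPF_F_BEFORE,
		.relative_fd = already_attached_prog_fd,
	);

	err = bpf_prog_attach_opts(new_prog_fd, cgroup_fd,
				   BPF_CGROUP_INET_EGRESS, &opts);

which is why reusing relative_fd as the attach target would conflict
with its existing meaning.)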
>
> Also, as Song pointed out, fd==0 is in theory a valid target, so instead of
> using the "if (fd) {...}" check we might need a new flag. Idk if it
> really makes sense to complicate the code for it.
>
> Can we, please, decide on what's best here?
How about we add a new attach_type BPF_STRUCT_OPS_CGROUP?
Thanks,
Song
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 04/23] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
2025-10-27 23:17 ` [PATCH v2 04/23] mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG Roman Gushchin
@ 2025-10-31 8:32 ` Michal Hocko
0 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2025-10-31 8:32 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon 27-10-25 16:17:07, Roman Gushchin wrote:
> mem_cgroup_get_from_ino() can be reused by the BPF OOM implementation,
> but currently depends on CONFIG_SHRINKER_DEBUG. Remove this dependency.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/memcontrol.h | 4 ++--
> mm/memcontrol.c | 2 --
> 2 files changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 873e510d6f8d..9af9ae28afe7 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -832,9 +832,9 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
> {
> return memcg ? cgroup_ino(memcg->css.cgroup) : 0;
> }
> +#endif
>
> struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino);
> -#endif
>
> static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
> {
> @@ -1331,12 +1331,12 @@ static inline unsigned long mem_cgroup_ino(struct mem_cgroup *memcg)
> {
> return 0;
> }
> +#endif
>
> static inline struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
> {
> return NULL;
> }
> -#endif
>
> static inline struct mem_cgroup *mem_cgroup_from_seq(struct seq_file *m)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 4deda33625f4..5d27cd5372aa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3618,7 +3618,6 @@ struct mem_cgroup *mem_cgroup_from_id(unsigned short id)
> return xa_load(&mem_cgroup_ids, id);
> }
>
> -#ifdef CONFIG_SHRINKER_DEBUG
> struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
> {
> struct cgroup *cgrp;
> @@ -3639,7 +3638,6 @@ struct mem_cgroup *mem_cgroup_get_from_ino(unsigned long ino)
>
> return memcg;
> }
> -#endif
>
> static void free_mem_cgroup_per_node_info(struct mem_cgroup_per_node *pn)
> {
> --
> 2.51.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 05/23] mm: declare memcg_page_state_output() in memcontrol.h
2025-10-27 23:17 ` [PATCH v2 05/23] mm: declare memcg_page_state_output() in memcontrol.h Roman Gushchin
@ 2025-10-31 8:34 ` Michal Hocko
0 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2025-10-31 8:34 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon 27-10-25 16:17:08, Roman Gushchin wrote:
> To use memcg_page_state_output() in bpf_memcontrol.c move the
> declaration from v1-specific memcontrol-v1.h to memcontrol.h.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/memcontrol.h | 1 +
> mm/memcontrol-v1.h | 1 -
> 2 files changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 9af9ae28afe7..50d851ff3f27 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -949,6 +949,7 @@ static inline void mod_memcg_page_state(struct page *page,
> }
>
> unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
> +unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
> unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
> unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> enum node_stat_item idx);
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 6358464bb416..a304ad418cdf 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -27,7 +27,6 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
> void drain_all_stock(struct mem_cgroup *root_memcg);
>
> unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> -unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
> int memory_stat_show(struct seq_file *m, void *v);
>
> void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
> --
> 2.51.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-27 23:17 ` [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling Roman Gushchin
` (4 preceding siblings ...)
2025-10-30 5:57 ` Yafang Shao
@ 2025-10-31 9:02 ` Michal Hocko
2025-11-02 21:36 ` Roman Gushchin
5 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2025-10-31 9:02 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon 27-10-25 16:17:09, Roman Gushchin wrote:
> Introduce a bpf struct ops for implementing custom OOM handling
> policies.
>
> It's possible to load one bpf_oom_ops for the system and one
> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> cgroup tree is traversed from the OOM'ing memcg up to the root and
> corresponding BPF OOM handlers are executed until some memory is
> freed. If no memory is freed, the kernel OOM killer is invoked.
Do you have any usecase in mind where parent memcg oom handler decides
to not kill or cannot kill anything and hand over upwards in the
hierarchy?
> The struct ops provides the bpf_handle_out_of_memory() callback,
> which is expected to return 1 if it was able to free some memory and 0
> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> field of the oom_control structure, which is expected to be set by
> kfuncs suitable for releasing memory. If both are set, OOM is
> considered handled, otherwise the next OOM handler in the chain
> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
> killer) is executed.
Could you explain why do we need both? Why is not bpf_memory_freed
return value sufficient?
> The bpf_handle_out_of_memory() callback program is sleepable to enable
> using iterators, e.g. cgroup iterators. The callback receives struct
> oom_control as an argument, so it can determine the scope of the OOM
> event: if this is a memcg-wide or system-wide OOM.
This could be tricky because it might introduce a subtle and hard to
debug lock dependency chain. lock(a); allocation() -> oom -> lock(a).
Sleepable locks should be only allowed in trylock mode.
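To illustrate the chain (a hypothetical sketch, lock_a standing for any
sleepable lock the allocating task may already hold):

	mutex_lock(&lock_a);
	ptr = kmalloc(size, GFP_KERNEL);	/* -> out_of_memory() */
	/* the sleepable bpf handler runs and does: */
	mutex_lock(&lock_a);			/* deadlock */

The only safe pattern inside the handler is:

	if (mutex_trylock(&lock_a)) {
		/* do the work */
		mutex_unlock(&lock_a);
	}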
> The callback is executed just before the kernel victim task selection
> algorithm, so all heuristics and sysctls like panic on oom,
> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> are respected.
I guess you meant to say and sysctl_panic_on_oom.
> BPF OOM struct ops provides the handle_cgroup_offline() callback
> which is good for releasing struct ops if the corresponding cgroup
> is gone.
What kind of synchronization is expected between handle_cgroup_offline
and bpf_handle_out_of_memory?
> The struct ops also has the name field, which allows to define a
> custom name for the implemented policy. It's printed in the OOM report
> in the oom_policy=<policy> format. "default" is printed if bpf is not
> used or policy name is not specified.
oom_handler seems like a better fit but nothing I would insist on. Also
I would just print it if there is an actual handler so that existing
users who do not use bpf oom killers do not need to change their
parsers.
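Something like (a sketch only; the field name is an assumption):

	if (oc->bpf_policy_name)	/* hypothetical field */
		pr_info("oom_policy=%s\n", oc->bpf_policy_name);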
Other than that this looks reasonable to me.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc
2025-10-27 23:17 ` [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc Roman Gushchin
@ 2025-10-31 9:05 ` Michal Hocko
2025-11-02 21:09 ` Roman Gushchin
0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2025-10-31 9:05 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon 27-10-25 16:17:10, Roman Gushchin wrote:
> Introduce bpf_oom_kill_process() bpf kfunc, which is supposed
> to be used by BPF OOM programs. It allows to kill a process
> in exactly the same way the OOM killer does: using the OOM reaper,
> bumping corresponding memcg and global statistics, respecting
> memory.oom.group etc.
>
> On success, it sets oom_control's bpf_memory_freed field to true,
> enabling the bpf program to bypass the kernel OOM killer.
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
LGTM
Just a minor question
> + /* paired with put_task_struct() in oom_kill_process() */
> + task = tryget_task_struct(task);
Any reason this is not a plain get_task_struct?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events
2025-10-27 23:17 ` [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events Roman Gushchin
2025-10-27 23:48 ` bot+bpf-ci
@ 2025-10-31 9:08 ` Michal Hocko
1 sibling, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2025-10-31 9:08 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon 27-10-25 16:17:13, Roman Gushchin wrote:
> Introduce BPF kfuncs to conveniently access memcg data:
> - bpf_mem_cgroup_vm_events(),
> - bpf_mem_cgroup_usage(),
> - bpf_mem_cgroup_page_state(),
> - bpf_mem_cgroup_flush_stats().
>
> These functions are useful for implementing BPF OOM policies, but
> also can be used to accelerate access to the memcg data. Reading
> it through cgroupfs is much more expensive, roughly 5x, mostly
> because of the need to convert the data into the text and back.
>
> JP Kobryn:
> An experiment was setup to compare the performance of a program that
> uses the traditional method of reading memory.stat vs a program using
> the new kfuncs. The control program opens up the root memory.stat file
> and for 1M iterations reads, converts the string values to numeric data,
> then seeks back to the beginning. The experimental program sets up the
> requisite libbpf objects and for 1M iterations invokes a bpf program
> which uses the kfuncs to fetch all available stats for node_stat_item,
> memcg_stat_item, and vm_event_item types.
>
> The results showed a significant perf benefit on the experimental side,
> outperforming the control side by a margin of 93%. In kernel mode,
> elapsed time was reduced by 80%, while in user mode, over 99% of time
> was saved.
>
> control: elapsed time
> real 0m38.318s
> user 0m25.131s
> sys 0m13.070s
>
> experiment: elapsed time
> real 0m2.789s
> user 0m0.187s
> sys 0m2.512s
>
> control: perf data
> 33.43% a.out libc.so.6 [.] __vfscanf_internal
> 6.88% a.out [kernel.kallsyms] [k] vsnprintf
> 6.33% a.out libc.so.6 [.] _IO_fgets
> 5.51% a.out [kernel.kallsyms] [k] format_decode
> 4.31% a.out libc.so.6 [.] __GI_____strtoull_l_internal
> 3.78% a.out [kernel.kallsyms] [k] string
> 3.53% a.out [kernel.kallsyms] [k] number
> 2.71% a.out libc.so.6 [.] _IO_sputbackc
> 2.41% a.out [kernel.kallsyms] [k] strlen
> 1.98% a.out a.out [.] main
> 1.70% a.out libc.so.6 [.] _IO_getline_info
> 1.51% a.out libc.so.6 [.] __isoc99_sscanf
> 1.47% a.out [kernel.kallsyms] [k] memory_stat_format
> 1.47% a.out [kernel.kallsyms] [k] memcpy_orig
> 1.41% a.out [kernel.kallsyms] [k] seq_buf_printf
>
> experiment: perf data
> 10.55% memcgstat bpf_prog_..._query [k] bpf_prog_16aab2f19fa982a7_query
> 6.90% memcgstat [kernel.kallsyms] [k] memcg_page_state_output
> 3.55% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
> 3.12% memcgstat [kernel.kallsyms] [k] memcg_events
> 2.87% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
> 2.73% memcgstat [kernel.kallsyms] [k] kmem_cache_free
> 2.70% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
> 2.25% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
> 2.06% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> Co-developed-by: JP Kobryn <inwardvessel@gmail.com>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
> ---
> include/linux/memcontrol.h | 2 ++
> mm/bpf_memcontrol.c | 57 +++++++++++++++++++++++++++++++++++++-
> 2 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 39a6c7c8735b..b9e08dddd7ad 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -953,6 +953,8 @@ static inline void mod_memcg_page_state(struct page *page,
> rcu_read_unlock();
> }
>
> +unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
> unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx);
> unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
> unsigned long lruvec_page_state(struct lruvec *lruvec, enum node_stat_item idx);
> diff --git a/mm/bpf_memcontrol.c b/mm/bpf_memcontrol.c
> index 76c342318256..387255b8ab88 100644
> --- a/mm/bpf_memcontrol.c
> +++ b/mm/bpf_memcontrol.c
> @@ -75,6 +75,56 @@ __bpf_kfunc void bpf_put_mem_cgroup(struct mem_cgroup *memcg)
> css_put(&memcg->css);
> }
>
> +/**
> + * bpf_mem_cgroup_vm_events - Read memory cgroup's vm event counter
> + * @memcg: memory cgroup
> + * @event: event id
> + *
> + * Allows to read memory cgroup event counters.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_vm_events(struct mem_cgroup *memcg,
> + enum vm_event_item event)
> +{
> + return memcg_events(memcg, event);
> +}
> +
> +/**
> + * bpf_mem_cgroup_usage - Read memory cgroup's usage
> + * @memcg: memory cgroup
> + *
> + * Returns current memory cgroup size in bytes.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_usage(struct mem_cgroup *memcg)
> +{
> + return page_counter_read(&memcg->memory);
> +}
> +
> +/**
> + * bpf_mem_cgroup_page_state - Read memory cgroup's page state counter
> + * @memcg: memory cgroup
> + * @idx: counter idx
> + *
> + * Allows to read memory cgroup statistics. The output is in bytes.
> + */
> +__bpf_kfunc unsigned long bpf_mem_cgroup_page_state(struct mem_cgroup *memcg, int idx)
> +{
> + if (idx < 0 || idx >= MEMCG_NR_STAT)
> + return (unsigned long)-1;
> +
> + return memcg_page_state_output(memcg, idx);
> +}
> +
> +/**
> + * bpf_mem_cgroup_flush_stats - Flush memory cgroup's statistics
> + * @memcg: memory cgroup
> + *
> + * Propagate memory cgroup's statistics up the cgroup tree.
> + */
> +__bpf_kfunc void bpf_mem_cgroup_flush_stats(struct mem_cgroup *memcg)
> +{
> + mem_cgroup_flush_stats(memcg);
> +}
> +
> __bpf_kfunc_end_defs();
>
> BTF_KFUNCS_START(bpf_memcontrol_kfuncs)
> @@ -82,6 +132,11 @@ BTF_ID_FLAGS(func, bpf_get_root_mem_cgroup, KF_ACQUIRE | KF_RET_NULL)
> BTF_ID_FLAGS(func, bpf_get_mem_cgroup, KF_ACQUIRE | KF_RET_NULL | KF_RCU)
> BTF_ID_FLAGS(func, bpf_put_mem_cgroup, KF_RELEASE)
>
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_vm_events, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_usage, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_page_state, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, bpf_mem_cgroup_flush_stats, KF_TRUSTED_ARGS | KF_SLEEPABLE)
> +
> BTF_KFUNCS_END(bpf_memcontrol_kfuncs)
>
> static const struct btf_kfunc_id_set bpf_memcontrol_kfunc_set = {
> @@ -93,7 +148,7 @@ static int __init bpf_memcontrol_init(void)
> {
> int err;
>
> - err = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> + err = register_btf_kfunc_id_set(BPF_PROG_TYPE_UNSPEC,
> &bpf_memcontrol_kfunc_set);
> if (err)
> pr_warn("error while registering bpf memcontrol kfuncs: %d", err);
> --
> 2.51.0
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 00/23] mm: BPF OOM
2025-10-27 23:17 [PATCH v2 00/23] mm: BPF OOM Roman Gushchin
` (9 preceding siblings ...)
2025-10-27 23:17 ` [PATCH v2 10/23] mm: introduce BPF kfuncs to access memcg statistics and events Roman Gushchin
@ 2025-10-31 9:31 ` Michal Hocko
2025-10-31 16:48 ` Lance Yang
2025-11-02 20:53 ` Roman Gushchin
10 siblings, 2 replies; 83+ messages in thread
From: Michal Hocko @ 2025-10-31 9:31 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon 27-10-25 16:17:03, Roman Gushchin wrote:
> The second part is related to the fundamental question on when to
> declare the OOM event. It's a trade-off between the risk of
> unnecessary OOM kills and associated work losses and the risk of
> infinite trashing and effective soft lockups. In the last few years
> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> systemd-OOMd [4]). The common idea was to use userspace daemons to
> implement custom OOM logic as well as rely on PSI monitoring to avoid
> stalls. In this scenario the userspace daemon was supposed to handle
> the majority of OOMs, while the in-kernel OOM killer worked as the
> last resort measure to guarantee that the system would never deadlock
> on the memory. But this approach creates additional infrastructure
> churn: userspace OOM daemon is a separate entity which needs to be
> deployed, updated, monitored. A completely different pipeline needs to
> be built to monitor both types of OOM events and collect associated
> logs. A userspace daemon is more restricted in terms on what data is
> available to it. Implementing a daemon which can work reliably under a
> heavy memory pressure in the system is also tricky.
I do not see this part addressed in the series. Am I just missing
something or this will follow up once the initial (plugging to the
existing OOM handling) is merged?
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: bpf_st_ops and cgroups. Was: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-31 6:14 ` Song Liu
@ 2025-10-31 11:35 ` Yafang Shao
0 siblings, 0 replies; 83+ messages in thread
From: Yafang Shao @ 2025-10-31 11:35 UTC (permalink / raw)
To: Song Liu
Cc: Roman Gushchin, Alexei Starovoitov, Amery Hung, Andrew Morton,
LKML, Alexei Starovoitov, Suren Baghdasaryan, Michal Hocko,
Shakeel Butt, Johannes Weiner, Andrii Nakryiko, JP Kobryn,
linux-mm, open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi, Tejun Heo
On Fri, Oct 31, 2025 at 2:14 PM Song Liu <song@kernel.org> wrote:
>
> On Thu, Oct 30, 2025 at 4:24 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
> >
> > Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
> >
> > > On Thu, Oct 30, 2025 at 12:06 PM Roman Gushchin
> > > <roman.gushchin@linux.dev> wrote:
> > >>
> > >> Ok, let me summarize the options we discussed here:
> > >>
> > >> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
> > >> itself. The attachment is happening at the reg() time.
> > >>
> > >> +: It's convenient for complex stateful struct ops'es, because a
> > >> single entity represents a combination of code and data.
> > >> -: No way to attach a single struct ops to multiple entities.
> > >>
> > >> This approach is used by Tejun for per-cgroup sched_ext prototype.
> > >
> > > It's wrong. It should adopt bpf_struct_ops_link_create() approach
> > > and use attr->link_create.cgroup.relative_fd to attach.
> >
> > This is basically what I have in v2, but Andrii and Song suggested that
> > I should use attr->link_create.target_fd instead.
> >
> > I have a slight preference towards attr->link_create.cgroup.relative_fd
> > because it makes it clear that fd is a cgroup fd and potentially opens
> > a possibility to e.g. attach struct_ops to individual tasks and
> > cgroups, but I'm fine with both options.
>
> relative_fd and relative_id have specific meaning. When multiple
> programs are attached to the same object (cgroup, socket, etc.),
> relative_fd and relative_id (together with BPF_F_BEFORE and
> BPF_F_AFTER) are used to specify the order of execution.
>
> >
> > Also, as Song pointed out, fd==0 is in theory a valid target, so instead of
> > using the "if (fd) {...}" check we might need a new flag. Idk if it
> > really makes sense to complicate the code for it.
> >
> > Can we, please, decide on what's best here?
>
> How about we add a new attach_type BPF_STRUCT_OPS_CGROUP?
I'm concerned that defining a unique BPF_STRUCT_OPS_XXX for each type
might lead to a maintainability challenge. To keep the design clean
and forward-looking, we might want to consider a more generic
abstraction that could easily accommodate other kernel structures
(task_struct, mm_struct, etc.) without duplication.
--
Regards
Yafang
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 00/23] mm: BPF OOM
2025-10-31 9:31 ` [PATCH v2 00/23] mm: BPF OOM Michal Hocko
@ 2025-10-31 16:48 ` Lance Yang
2025-11-02 20:53 ` Roman Gushchin
1 sibling, 0 replies; 83+ messages in thread
From: Lance Yang @ 2025-10-31 16:48 UTC (permalink / raw)
To: mhocko
Cc: akpm, andrii, ast, bpf, cgroups, hannes, inwardvessel,
linux-kernel, linux-mm, martin.lau, memxor, roman.gushchin,
shakeel.butt, song, surenb, tj, Lance Yang
From: Lance Yang <lance.yang@linux.dev>
On Fri, 31 Oct 2025 10:31:36 +0100, Michal Hocko wrote:
> On Mon 27-10-25 16:17:03, Roman Gushchin wrote:
> > The second part is related to the fundamental question on when to
> > declare the OOM event. It's a trade-off between the risk of
> > unnecessary OOM kills and associated work losses and the risk of
> > infinite trashing and effective soft lockups. In the last few years
> > several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> > systemd-OOMd [4]). The common idea was to use userspace daemons to
> > implement custom OOM logic as well as rely on PSI monitoring to avoid
> > stalls. In this scenario the userspace daemon was supposed to handle
> > the majority of OOMs, while the in-kernel OOM killer worked as the
> > last resort measure to guarantee that the system would never deadlock
> > on the memory. But this approach creates additional infrastructure
> > churn: userspace OOM daemon is a separate entity which needs to be
> > deployed, updated, monitored. A completely different pipeline needs to
> > be built to monitor both types of OOM events and collect associated
> > logs. A userspace daemon is more restricted in terms on what data is
> > available to it. Implementing a daemon which can work reliably under a
> > heavy memory pressure in the system is also tricky.
>
> I do not see this part addressed in the series. Am I just missing
> something or this will follow up once the initial (plugging to the
> existing OOM handling) is merged?
I noticed that this thread only shows up to patch 10/23. The subsequent
patches (11-23) appear to be missing ...
This might be why we're not seeing the userspace OOM daemon part
addressed. I suspect the relevant code is likely in those subsequent
patches.
Cheers,
Lance
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: bpf_st_ops and cgroups. Was: [PATCH v2 02/23] bpf: initial support for attaching struct ops to cgroups
2025-10-30 23:24 ` Roman Gushchin
2025-10-31 3:03 ` Yafang Shao
2025-10-31 6:14 ` Song Liu
@ 2025-10-31 17:37 ` Alexei Starovoitov
2 siblings, 0 replies; 83+ messages in thread
From: Alexei Starovoitov @ 2025-10-31 17:37 UTC (permalink / raw)
To: Roman Gushchin
Cc: Amery Hung, Song Liu, Andrew Morton, LKML, Alexei Starovoitov,
Suren Baghdasaryan, Michal Hocko, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm,
open list:CONTROL GROUP (CGROUP), bpf, Martin KaFai Lau,
Kumar Kartikeya Dwivedi, Tejun Heo
On Thu, Oct 30, 2025 at 4:24 PM Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Alexei Starovoitov <alexei.starovoitov@gmail.com> writes:
>
> > On Thu, Oct 30, 2025 at 12:06 PM Roman Gushchin
> > <roman.gushchin@linux.dev> wrote:
> >>
> >> Ok, let me summarize the options we discussed here:
> >>
> >> 1) Make the attachment details (e.g. cgroup_id) the part of struct ops
> >> itself. The attachment is happening at the reg() time.
> >>
> >> +: It's convenient for complex stateful struct ops'es, because a
> >> single entity represents a combination of code and data.
> >> -: No way to attach a single struct ops to multiple entities.
> >>
> >> This approach is used by Tejun for per-cgroup sched_ext prototype.
> >
> > It's wrong. It should adopt bpf_struct_ops_link_create() approach
> > and use attr->link_create.cgroup.relative_fd to attach.
>
> This is basically what I have in v2, but Andrii and Song suggested that
> I should use attr->link_create.target_fd instead.
Yes. Of course.
link_create.cgroup.relative_fd actually points to a program.
We will need it if/when we add support for mprog style attach.
> I have a slight preference towards attr->link_create.cgroup.relative_fd
> because it makes it clear that fd is a cgroup fd and potentially opens
> a possibility to e.g. attach struct_ops to individual tasks and
> cgroups, but I'm fine with both options.
yeah. The name is confusing. It's not a cgroup fd.
> Also, as Song pointed out, fd==0 is in theory a valid target, so instead of
> using the "if (fd) {...}" check we might need a new flag. Idk if it
> really makes sense to complicate the code for it.
One option is to cgroup_get_from_fd(attr->link_create.target_fd)
and, if it's not a cgroup, just ignore it in bpf_struct_ops_link_create().
But a new flag like BPF_F_CGROUP_FD may be cleaner?
If we ever attach st_ops to tasks there will be another BPF_F_PID_FD flag?
Or we may try different supported kinds like bpf_fd_probe_obj() does
and not bother with flags.
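Roughly (a hypothetical sketch of the probing option inside
bpf_struct_ops_link_create()):

	struct cgroup *cgrp;

	cgrp = cgroup_get_from_fd(attr->link_create.target_fd);
	if (!IS_ERR(cgrp)) {
		/* cgroup-scoped attachment */
	} else {
		/* try other supported object kinds, or treat fd as unset */
	}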
New attach_type-s are not necessary. The type of st_ops itself
reflects the purpose and hook location.
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 00/23] mm: BPF OOM
2025-10-31 9:31 ` [PATCH v2 00/23] mm: BPF OOM Michal Hocko
2025-10-31 16:48 ` Lance Yang
@ 2025-11-02 20:53 ` Roman Gushchin
2025-11-03 18:18 ` Michal Hocko
1 sibling, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-11-02 20:53 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Michal Hocko <mhocko@suse.com> writes:
> On Mon 27-10-25 16:17:03, Roman Gushchin wrote:
>> The second part is related to the fundamental question on when to
>> declare the OOM event. It's a trade-off between the risk of
>> unnecessary OOM kills and associated work losses and the risk of
>> infinite trashing and effective soft lockups. In the last few years
>> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
>> systemd-OOMd [4]). The common idea was to use userspace daemons to
>> implement custom OOM logic as well as rely on PSI monitoring to avoid
>> stalls. In this scenario the userspace daemon was supposed to handle
>> the majority of OOMs, while the in-kernel OOM killer worked as the
>> last resort measure to guarantee that the system would never deadlock
>> on the memory. But this approach creates additional infrastructure
>> churn: userspace OOM daemon is a separate entity which needs to be
>> deployed, updated, monitored. A completely different pipeline needs to
>> be built to monitor both types of OOM events and collect associated
>> logs. A userspace daemon is more restricted in terms on what data is
>> available to it. Implementing a daemon which can work reliably under a
>> heavy memory pressure in the system is also tricky.
>
> I do not see this part addressed in the series. Am I just missing
> something or this will follow up once the initial (plugging to the
> existing OOM handling) is merged?
Did you receive patches 11-23?
git send-email failed on patch 10, so I had to send the second part separately.
It seems like the second part did arrive for at least some recipients, as I
got feedback on some patches from that part.
In any case, you can find the whole series here:
https://github.com/rgushchin/linux/tree/bpfoom.2
And thank you for reviewing the series!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 07/23] mm: introduce bpf_oom_kill_process() bpf kfunc
2025-10-31 9:05 ` Michal Hocko
@ 2025-11-02 21:09 ` Roman Gushchin
0 siblings, 0 replies; 83+ messages in thread
From: Roman Gushchin @ 2025-11-02 21:09 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Michal Hocko <mhocko@suse.com> writes:
> On Mon 27-10-25 16:17:10, Roman Gushchin wrote:
>> Introduce bpf_oom_kill_process() bpf kfunc, which is supposed
>> to be used by BPF OOM programs. It allows to kill a process
>> in exactly the same way the OOM killer does: using the OOM reaper,
>> bumping corresponding memcg and global statistics, respecting
>> memory.oom.group etc.
>>
>> On success, it sets oom_control's bpf_memory_freed field to true,
>> enabling the bpf program to bypass the kernel OOM killer.
>>
>> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
>
> LGTM
> Just a minor question
>
>> + /* paired with put_task_struct() in oom_kill_process() */
>> + task = tryget_task_struct(task);
>
> Any reason this is not a plain get_task_struct?
Fair enough, get_task_struct() should work too.
Thanks
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-10-31 9:02 ` Michal Hocko
@ 2025-11-02 21:36 ` Roman Gushchin
2025-11-03 19:00 ` Michal Hocko
0 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-11-02 21:36 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Michal Hocko <mhocko@suse.com> writes:
> On Mon 27-10-25 16:17:09, Roman Gushchin wrote:
>> Introduce a bpf struct ops for implementing custom OOM handling
>> policies.
>>
>> It's possible to load one bpf_oom_ops for the system and one
>> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> corresponding BPF OOM handlers are executed until some memory is
>> freed. If no memory is freed, the kernel OOM killer is invoked.
>
> Do you have any usecase in mind where parent memcg oom handler decides
> to not kill or cannot kill anything and hand over upwards in the
> hierarchy?
I believe that in most cases bpf handlers will handle ooms themselves,
but because strictly speaking I don't have control over what bpf
programs do or do not do, the kernel should provide the fallback mechanism.
This is a common practice with bpf, e.g. sched_ext falls back to
CFS/EEVDF in case something is wrong.
Specifically to the OOM case, I believe someone might want to use bpf
programs just for monitoring/collecting some information, without
trying to actually free some memory.
>> The struct ops provides the bpf_handle_out_of_memory() callback,
>> which is expected to return 1 if it was able to free some memory and 0
>> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> field of the oom_control structure, which is expected to be set by
>> kfuncs suitable for releasing memory. If both are set, OOM is
>> considered handled, otherwise the next OOM handler in the chain
>> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
>> killer) is executed.
>
> Could you explain why do we need both? Why is not bpf_memory_freed
> return value sufficient?
Strictly speaking, bpf_memory_freed should be enough, but because
bpf programs have to return an int and there is no additional cost
to add this option (pass to next or in-kernel oom handler), I thought
it's not a bad idea. If you feel strongly otherwise, I can ignore
the return value and rely on bpf_memory_freed only.
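I.e. the effective check is roughly this (a sketch of the semantics
described in the commit message, not the exact kernel code):

	if (bpf_oom_ops->handle_out_of_memory(oc) && oc->bpf_memory_freed)
		return true;	/* handled, stop the chain here */
	/* otherwise try the parent's bpf_oom_ops or the in-kernel killer */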
>
>> The bpf_handle_out_of_memory() callback program is sleepable to enable
>> using iterators, e.g. cgroup iterators. The callback receives struct
>> oom_control as an argument, so it can determine the scope of the OOM
>> event: if this is a memcg-wide or system-wide OOM.
>
> This could be tricky because it might introduce a subtle and hard to
> debug lock dependency chain. lock(a); allocation() -> oom -> lock(a).
> Sleepable locks should be only allowed in trylock mode.
Agree, but it's achieved by controlling the context where oom can be
declared (e.g. in bpf_psi case it's done from a work context).
>
>> The callback is executed just before the kernel victim task selection
>> algorithm, so all heuristics and sysctls like panic on oom,
>> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
>> are respected.
>
> I guess you meant to say and sysctl_panic_on_oom.
Yep, fixed.
>
>> BPF OOM struct ops provides the handle_cgroup_offline() callback
>> which is good for releasing struct ops if the corresponding cgroup
>> is gone.
>
> What kind of synchronization is expected between handle_cgroup_offline
> and bpf_handle_out_of_memory?
You mean from a user's perspective? E.g. can these two callbacks run in
parallel? Currently yes, but it's a good question, I haven't thought
about it, maybe it's better to synchronize them.
Internally both rely on srcu to pin bpf_oom_ops in memory.
>
>> The struct ops also has the name field, which allows to define a
>> custom name for the implemented policy. It's printed in the OOM report
>> in the oom_policy=<policy> format. "default" is printed if bpf is not
>> used or policy name is not specified.
>
> oom_handler seems like a better fit but nothing I would insist on. Also
> I would just print it if there is an actual handler so that existing
> users who do not use bpf oom killers do not need to change their
> parsers.
Sure, works for me too.
>
> Other than that this looks reasonable to me.
Sounds great, thank you for taking a look!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 00/23] mm: BPF OOM
2025-11-02 20:53 ` Roman Gushchin
@ 2025-11-03 18:18 ` Michal Hocko
0 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2025-11-03 18:18 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Sun 02-11-25 12:53:53, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Mon 27-10-25 16:17:03, Roman Gushchin wrote:
> >> The second part is related to the fundamental question on when to
> >> declare the OOM event. It's a trade-off between the risk of
> >> unnecessary OOM kills and associated work losses and the risk of
> >> infinite trashing and effective soft lockups. In the last few years
> >> several PSI-based userspace solutions were developed (e.g. OOMd [3] or
> >> systemd-OOMd [4]). The common idea was to use userspace daemons to
> >> implement custom OOM logic as well as rely on PSI monitoring to avoid
> >> stalls. In this scenario the userspace daemon was supposed to handle
> >> the majority of OOMs, while the in-kernel OOM killer worked as the
> >> last resort measure to guarantee that the system would never deadlock
> >> on the memory. But this approach creates additional infrastructure
> >> churn: userspace OOM daemon is a separate entity which needs to be
> >> deployed, updated, monitored. A completely different pipeline needs to
> >> be built to monitor both types of OOM events and collect associated
> >> logs. A userspace daemon is more restricted in terms on what data is
> >> available to it. Implementing a daemon which can work reliably under a
> >> heavy memory pressure in the system is also tricky.
> >
> > I do not see this part addressed in the series. Am I just missing
> > something or this will follow up once the initial (plugging to the
> > existing OOM handling) is merged?
>
> Did you receive patches 11-23?
OK, I found it. Patches 11-23 are threaded separately (patch 11
with Message-ID: <20251027232206.473085-1-roman.gushchin@linux.dev> doesn't
seem to have an In-Reply-To header) and I had missed them previously. I
will have a look in upcoming days.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-11-02 21:36 ` Roman Gushchin
@ 2025-11-03 19:00 ` Michal Hocko
2025-11-04 1:45 ` Roman Gushchin
0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2025-11-03 19:00 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Mon 27-10-25 16:17:09, Roman Gushchin wrote:
> >> Introduce a bpf struct ops for implementing custom OOM handling
> >> policies.
> >>
> >> It's possible to load one bpf_oom_ops for the system and one
> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
> >> corresponding BPF OOM handlers are executed until some memory is
> >> freed. If no memory is freed, the kernel OOM killer is invoked.
> >
> > Do you have any usecase in mind where parent memcg oom handler decides
> > to not kill or cannot kill anything and hand over upwards in the
> > hierarchy?
>
> I believe that in most cases bpf handlers will handle ooms themselves,
> but because strictly speaking I don't have control over what bpf
> programs do or do not do, the kernel should provide the fallback mechanism.
> This is a common practice with bpf, e.g. sched_ext falls back to
> CFS/EEVDF in case something is wrong.
We do have a fallback mechanism - the kernel oom handling. For that we do
not need to pass to a parent handler. Please note that I am not opposing
this, but I would like to understand the thinking behind it and hopefully
start with a simpler model and then extend it later, rather than go with a more
complex one initially and then corner ourselves with weird side
effects.
> Specifically to the OOM case, I believe someone might want to use bpf
> programs just for monitoring/collecting some information, without
> trying to actually free some memory.
>
> >> The struct ops provides the bpf_handle_out_of_memory() callback,
> >> which is expected to return 1 if it was able to free some memory and 0
> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
> >> field of the oom_control structure, which is expected to be set by
> >> kfuncs suitable for releasing memory. If both are set, OOM is
> >> considered handled, otherwise the next OOM handler in the chain
> >> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
> >> killer) is executed.
> >
> > Could you explain why do we need both? Why is not bpf_memory_freed
> > return value sufficient?
>
> Strictly speaking, bpf_memory_freed should be enough, but because
> bpf programs have to return an int and there is no additional cost
> to add this option (pass to next or in-kernel oom handler), I thought
> it's not a bad idea. If you feel strongly otherwise, I can ignore
> the return value and rely on bpf_memory_freed only.
No, I do not feel strongly one way or the other but I would like to
understand the thinking behind that. My slight preference would be to have
a single return status that clearly describes the intention. If you want
to have a more flexible chaining semantic, then an enum { IGNORED, HANDLED,
PASS_TO_PARENT, ...} would be both more flexible, extensible and easier
to understand.
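Something like (hypothetical, names to taste):

	enum bpf_oom_result {
		BPF_OOM_IGNORED,	/* handler did nothing */
		BPF_OOM_HANDLED,	/* memory was freed, stop here */
		BPF_OOM_PASS_TO_PARENT,	/* defer up the hierarchy */
		BPF_OOM_PASS_TO_KERNEL,	/* defer to the in-kernel OOM killer */
	};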
> >> The bpf_handle_out_of_memory() callback program is sleepable to enable
> >> using iterators, e.g. cgroup iterators. The callback receives struct
> >> oom_control as an argument, so it can determine the scope of the OOM
> >> event: if this is a memcg-wide or system-wide OOM.
> >
> > This could be tricky because it might introduce a subtle and hard to
> > debug lock dependency chain. lock(a); allocation() -> oom -> lock(a).
> > Sleepable locks should be only allowed in trylock mode.
>
> Agree, but it's achieved by controlling the context where oom can be
> declared (e.g. in bpf_psi case it's done from a work context).
but out_of_memory() can run in any sleepable context. So this is a real problem.
> >> The callback is executed just before the kernel victim task selection
> >> algorithm, so all heuristics and sysctls like panic on oom,
> >> sysctl_oom_kill_allocating_task and sysctl_oom_kill_allocating_task
> >> are respected.
> >
> > I guess you meant to say and sysctl_panic_on_oom.
>
> Yep, fixed.
> >
> >> BPF OOM struct ops provides the handle_cgroup_offline() callback
> >> which is good for releasing struct ops if the corresponding cgroup
> >> is gone.
> >
> > What kind of synchronization is expected between handle_cgroup_offline
> > and bpf_handle_out_of_memory?
>
> You mean from a user's perspective?
I mean from the bpf handler writer's POV.
> E.g. can these two callbacks run in
> parallel? Currently yes, but it's a good question, I haven't thought
> about it, maybe it's better to synchronize them.
> Internally both rely on srcu to pin bpf_oom_ops in memory.
This should really be documented.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-11-03 19:00 ` Michal Hocko
@ 2025-11-04 1:45 ` Roman Gushchin
2025-11-04 8:18 ` Michal Hocko
0 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-11-04 1:45 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Michal Hocko <mhocko@suse.com> writes:
> On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>>
>> > On Mon 27-10-25 16:17:09, Roman Gushchin wrote:
>> >> Introduce a bpf struct ops for implementing custom OOM handling
>> >> policies.
>> >>
>> >> It's possible to load one bpf_oom_ops for the system and one
>> >> bpf_oom_ops for every memory cgroup. In case of a memcg OOM, the
>> >> cgroup tree is traversed from the OOM'ing memcg up to the root and
>> >> corresponding BPF OOM handlers are executed until some memory is
>> >> freed. If no memory is freed, the kernel OOM killer is invoked.
>> >
>> > Do you have any usecase in mind where parent memcg oom handler decides
>> > to not kill or cannot kill anything and hand over upwards in the
>> > hierarchy?
>>
>> I believe that in most cases bpf handlers will handle ooms themselves,
>> but because strictly speaking I don't have control over what bpf
>> programs do or do not do, the kernel should provide the fallback mechanism.
>> This is a common practice with bpf, e.g. sched_ext falls back to
>> CFS/EEVDF in case something is wrong.
>
> We do have a fallback mechanism - the kernel oom handling. For that we do
> not need to pass to a parent handler. Please note that I am not opposing
> this, but I would like to understand the thinking behind it and hopefully
> start with a simpler model and then extend it later, rather than go with a more
> complex one initially and then corner ourselves with weird side
> effects.
>
>> Specifically to the OOM case, I believe someone might want to use bpf
>> programs just for monitoring/collecting some information, without
>> trying to actually free some memory.
>>
>> >> The struct ops provides the bpf_handle_out_of_memory() callback,
>> >> which is expected to return 1 if it was able to free some memory and 0
>> >> otherwise. If 1 is returned, the kernel also checks the bpf_memory_freed
>> >> field of the oom_control structure, which is expected to be set by
>> >> kfuncs suitable for releasing memory. If both are set, OOM is
>> >> considered handled, otherwise the next OOM handler in the chain
>> >> (e.g. BPF OOM attached to the parent cgroup or the in-kernel OOM
>> >> killer) is executed.
>> >
>> > Could you explain why do we need both? Why is not bpf_memory_freed
>> > return value sufficient?
>>
>> Strictly speaking, bpf_memory_freed should be enough, but because
>> bpf programs have to return an int and there is no additional cost
>> to add this option (pass to next or in-kernel oom handler), I thought
>> it's not a bad idea. If you feel strongly otherwise, I can ignore
>> the return value and rely on bpf_memory_freed only.
>
> No, I do not feel strongly one way or the other but I would like to
> understand thinking behind that. My slight preference would be to have a
> single return status that clearly describe the intention. If you want to
> have more flexible chaining semantic then an enum { IGNORED, HANDLED,
> PASS_TO_PARENT, ...} would be both more flexible, extensible and easier
> to understand.
The thinking is simple:
1) Most users will have a single global bpf oom policy, which basically
replaces the in-kernel oom killer.
2) If there are standalone containers, they might want to do the same on
their level. And the "host" system doesn't directly control it.
3) If for some reason the inner oom handler fails to free up some
memory, there are two potential fallback options: call the in-kernel oom
killer for that memory cgroup or call an upper level bpf oom killer, if
there is one.
I think the latter is more logical and less surprising. Imagine you're
running multiple containers and some of them implement their own bpf oom
logic and some don't. Why would we treat them differently if their bpf
logic fails?
Re a single return value: I can absolutely specify return values as an
enum, my point is that unlike the kernel code we can't fully trust the
value returned from a bpf program, this is why the second check is in
place.
Can we just ignore the returned value and rely on the freed_memory flag?
Sure, but I don't think it buys us anything.
Also, I have to admit that I don't have an immediate production use case
for nested oom handlers (I'm fine with a global one), but it was asked
by Alexei Starovoitov. And I agree with him that the containerized case
will come up soon, so it's better to think of it in advance.
>> >> The bpf_handle_out_of_memory() callback program is sleepable to enable
>> >> using iterators, e.g. cgroup iterators. The callback receives struct
>> >> oom_control as an argument, so it can determine the scope of the OOM
>> >> event: if this is a memcg-wide or system-wide OOM.
>> >
>> > This could be tricky because it might introduce a subtle and hard to
>> > debug lock dependency chain. lock(a); allocation() -> oom -> lock(a).
>> > Sleepable locks should be only allowed in trylock mode.
>>
>> Agree, but it's achieved by controlling the context where oom can be
>> declared (e.g. in bpf_psi case it's done from a work context).
>
> but out_of_memory() can run in any sleepable context. So this is a real problem.
We need to restrict both:
1) where bpf_out_of_memory() can be called from (already done; as of now
only from the bpf_psi callback, which is safe).
2) which kfuncs are available to bpf oom handlers (only those which are
not trying to grab unsafe locks) - I'll double check it in the next version.
Thank you!
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-11-04 1:45 ` Roman Gushchin
@ 2025-11-04 8:18 ` Michal Hocko
2025-11-04 18:14 ` Roman Gushchin
0 siblings, 1 reply; 83+ messages in thread
From: Michal Hocko @ 2025-11-04 8:18 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Mon 03-11-25 17:45:09, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
[...]
> > No, I do not feel strongly one way or the other but I would like to
> > understand the thinking behind that. My slight preference would be to have
> > a single return status that clearly describes the intention. If you want
> > to have a more flexible chaining semantic, then an enum { IGNORED, HANDLED,
> > PASS_TO_PARENT, ...} would be both more flexible, extensible and easier
> > to understand.
>
> The thinking is simple:
> 1) Most users will have a single global bpf oom policy, which basically
> replaces the in-kernel oom killer.
> 2) If there are standalone containers, they might want to do the same on
> their level. And the "host" system doesn't directly control it.
> 3) If for some reason the inner oom handler fails to free up some
> memory, there are two potential fallback options: call the in-kernel oom
> killer for that memory cgroup or call an upper level bpf oom killer, if
> there is one.
>
> I think the latter is more logical and less surprising. Imagine you're
> running multiple containers and some of them implement their own bpf oom
> logic and some don't. Why would we treat them differently if their bpf
> logic fails?
I think both approaches are valid and it should be the actual handler to
tell what to do next. If the handler would prefer the in-kernel fallback
it should be able to enforce that rather than a potentially unknown bpf
handler up the chain.
> Re a single return value: I can absolutely specify return values as an
> enum, my point is that unlike the kernel code we can't fully trust the
> value returned from a bpf program, this is why the second check is in
> place.
I do not understand this. Could you elaborate? Why can we not trust the
return value but can trust a combination of the return value and a
state stored in a helper structure?
> Can we just ignore the returned value and rely on the freed_memory flag?
I do not think having a single freed_memory flag is more helpful. This
is just a number that cannot say much more than that some memory has been
freed. It is not really important whether and how much memory the bpf
handler believes it has freed. It is much more important to note whether
it believes it is done, needs assistance from a different handler up the
chain, or wants to just pass over to the in-kernel implementation.
> Sure, but I don't think it buys us anything.
>
> Also, I have to admit that I don't have an immediate production use case
> for nested oom handlers (I'm fine with a global one), but it was asked
> by Alexei Starovoitov. And I agree with him that the containerized case
> will come up soon, so it's better to think of it in advance.
I agree it is good to be prepared for that.
> >> >> The bpf_handle_out_of_memory() callback program is sleepable to enable
> >> >> using iterators, e.g. cgroup iterators. The callback receives struct
> >> >> oom_control as an argument, so it can determine the scope of the OOM
> >> >> event: if this is a memcg-wide or system-wide OOM.
> >> >
> >> > This could be tricky because it might introduce a subtle and hard to
> >> > debug lock dependency chain. lock(a); allocation() -> oom -> lock(a).
> >> > Sleepable locks should be only allowed in trylock mode.
> >>
> >> Agree, but it's achieved by controlling the context where oom can be
> >> declared (e.g. in bpf_psi case it's done from a work context).
> >
> > but out_of_memory() can run in any sleepable context. So this is a real problem.
>
> We need to restrict both:
> 1) where bpf_out_of_memory() can be called from (already done; as of now
> only from the bpf_psi callback, which is safe).
> 2) which kfuncs are available to bpf oom handlers (only those which are
> not trying to grab unsafe locks) - I'll double check it in the next version.
OK. All I am trying to say is that the only safe sleepable locks are
trylocks, and that should be documented because I do not think it can be
enforced.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 83+ messages in thread
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-11-04 8:18 ` Michal Hocko
@ 2025-11-04 18:14 ` Roman Gushchin
2025-11-04 19:22 ` Michal Hocko
0 siblings, 1 reply; 83+ messages in thread
From: Roman Gushchin @ 2025-11-04 18:14 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
Michal Hocko <mhocko@suse.com> writes:
> On Mon 03-11-25 17:45:09, Roman Gushchin wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>>
>> > On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > No, I do not feel strongly one way or the other but I would like to
>> > understand the thinking behind that. My slight preference would be to have
>> > a single return status that clearly describes the intention. If you want
>> > to have a more flexible chaining semantic, then an enum { IGNORED, HANDLED,
>> > PASS_TO_PARENT, ...} would be both more flexible, extensible and easier
>> > to understand.
>>
>> The thinking is simple:
>> 1) Most users will have a single global bpf oom policy, which basically
>> replaces the in-kernel oom killer.
>> 2) If there are standalone containers, they might want to do the same on
>> their level. And the "host" system doesn't directly control it.
>> 3) If for some reason the inner oom handler fails to free up some
>> memory, there are two potential fallback options: call the in-kernel oom
>> killer for that memory cgroup or call an upper level bpf oom killer, if
>> there is one.
>>
>> I think the latter is more logical and less surprising. Imagine you're
>> running multiple containers and some of them implement their own bpf oom
>> logic and some don't. Why would we treat them differently if their bpf
>> logic fails?
>
> I think both approaches are valid and it should be the actual handler to
> tell what to do next. If the handler would prefer the in-kernel fallback
> it should be able to enforce that rather than a potentially unknown bpf
> handler up the chain.
The counter-argument is that cgroups are hierarchical and higher level
cgroups should be able to enforce the desired behavior for their
sub-trees. I'm not sure what's more important here and have to think
more about it.
Do you have an example of when it might be important for a container not
to pass to a higher-level bpf handler?
>
>> Re a single return value: I can absolutely specify return values as an
>> enum, my point is that unlike the kernel code we can't fully trust the
>> value returned from a bpf program, this is why the second check is in
>> place.
>
> I do not understand this. Could you elaborate? Why can we not trust the
> return value but can trust a combination of the return value and a
> state stored in a helper structure?
Imagine a bpf program which does nothing and simply returns 1. Imagine
it's loaded as a system-wide oom handler. This will effectively disable
the oom killer and lead to a potential deadlock on memory.
But it's a perfectly valid bpf program.
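Something like this (a sketch; the section name is an assumption):

	SEC("struct_ops.s/handle_out_of_memory")
	int BPF_PROG(noop_oom_handler, struct oom_control *oc)
	{
		return 1;	/* claims success without freeing anything */
	}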
This is something I want to avoid (and it's a common practice with other
bpf programs).
What I also do is rely on the value of the oom_control field, which is
not writable by the bpf program directly, but can be changed
by calling certain helper functions, e.g. bpf_oom_kill_process().
>> Can we just ignore the returned value and rely on the freed_memory flag?
>
> I do not think having a single freed_memory flag is more helpful. This
> is just a number that cannot say much more than that some memory has been
> freed. It is not really important whether and how much memory the bpf
> handler believes it has freed. It is much more important to note whether
> it believes it is done, needs assistance from a different handler up the
> chain, or wants to just pass over to the in-kernel implementation.
Btw in general in a containerized environment a bpf handler knows
nothing about bpf programs up in the cgroup hierarchy... So it only
knows whether it was able to free some memory or not.
>
>> Sure, but I don't think it buys us anything.
>>
>> Also, I have to admit that I don't have an immediate production use case
>> for nested oom handlers (I'm fine with a global one), but it was asked
>> by Alexei Starovoitov. And I agree with him that the containerized case
>> will come up soon, so it's better to think of it in advance.
>
> I agree it is good to be prepared for that.
>
>> >> >> The bpf_handle_out_of_memory() callback program is sleepable to enable
>> >> >> using iterators, e.g. cgroup iterators. The callback receives struct
>> >> >> oom_control as an argument, so it can determine the scope of the OOM
>> >> >> event: if this is a memcg-wide or system-wide OOM.
>> >> >
>> >> > This could be tricky because it might introduce a subtle and hard to
>> >> > debug lock dependency chain. lock(a); allocation() -> oom -> lock(a).
>> >> > Sleepable locks should be only allowed in trylock mode.
>> >>
>> >> Agree, but it's achieved by controlling the context where oom can be
>> >> declared (e.g. in bpf_psi case it's done from a work context).
>> >
>> > but out_of_memory() can run in any sleepable context. So this is a real problem.
>>
>> We need to restrict both:
>> 1) where bpf_out_of_memory() can be called from (already done; as of now
>> only from the bpf_psi callback, which is safe).
>> 2) which kfuncs are available to bpf oom handlers (only those which are
>> not trying to grab unsafe locks) - I'll double check it in the next version.
>
> OK. All I am trying to say is that the only safe sleepable locks are
> trylocks, and that should be documented because I do not think it can be
> enforced.
It can! Not directly, but by controlling which kfuncs/helpers are
available to bpf programs.
I agree with you in principle re locks and the necessary precautions here.
Thanks!
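For reference, restricting the visible kfuncs is typically done by
registering a filtered kfunc id set. A minimal sketch, assuming a
hypothetical bpf_oom_kfunc_filter() that rejects non-OOM programs;
the flags and names here are illustrative, not the patchset's code:

	/* Expose only vetted, lock-safe kfuncs to OOM handlers. A helper
	 * that could block on an unsafe lock is simply never put into
	 * this set, so the verifier rejects any attempt to call it.
	 */
	BTF_KFUNCS_START(bpf_oom_kfunc_ids)
	BTF_ID_FLAGS(func, bpf_oom_kill_process, KF_TRUSTED_ARGS | KF_SLEEPABLE)
	BTF_KFUNCS_END(bpf_oom_kfunc_ids)

	static const struct btf_kfunc_id_set bpf_oom_kfunc_set = {
		.owner	= THIS_MODULE,
		.set	= &bpf_oom_kfunc_ids,
		.filter	= bpf_oom_kfunc_filter,	/* hypothetical per-prog filter */
	};

	static int __init bpf_oom_kfuncs_init(void)
	{
		return register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
						 &bpf_oom_kfunc_set);
	}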
* Re: [PATCH v2 06/23] mm: introduce BPF struct ops for OOM handling
2025-11-04 18:14 ` Roman Gushchin
@ 2025-11-04 19:22 ` Michal Hocko
0 siblings, 0 replies; 83+ messages in thread
From: Michal Hocko @ 2025-11-04 19:22 UTC (permalink / raw)
To: Roman Gushchin
Cc: Andrew Morton, linux-kernel, Alexei Starovoitov,
Suren Baghdasaryan, Shakeel Butt, Johannes Weiner,
Andrii Nakryiko, JP Kobryn, linux-mm, cgroups, bpf,
Martin KaFai Lau, Song Liu, Kumar Kartikeya Dwivedi, Tejun Heo
On Tue 04-11-25 10:14:05, Roman Gushchin wrote:
> Michal Hocko <mhocko@suse.com> writes:
>
> > On Mon 03-11-25 17:45:09, Roman Gushchin wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >>
> >> > On Sun 02-11-25 13:36:25, Roman Gushchin wrote:
> >> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > No, I do not feel strongly one way or the other but I would like to
> >> > understand the thinking behind that. My slight preference would be to
> >> > have a single return status that clearly describes the intention. If
> >> > you want to have more flexible chaining semantics then an enum {
> >> > IGNORED, HANDLED, PASS_TO_PARENT, ...} would be both more flexible,
> >> > extensible and easier to understand.
> >>
> >> The thinking is simple:
> >> 1) Most users will have a single global bpf oom policy, which basically
> >> replaces the in-kernel oom killer.
> >> 2) If there are standalone containers, they might want to do the same on
> >> their level. And the "host" system doesn't directly control it.
> >> 3) If for some reason the inner oom handler fails to free up some
> >> memory, there are two potential fallback options: call the in-kernel oom
> >> killer for that memory cgroup or call an upper level bpf oom killer, if
> >> there is one.
> >>
> >> I think the latter is more logical and less surprising. Imagine you're
> >> running multiple containers and some of them implement their own bpf oom
> >> logic and some don't. Why would we treat them differently if their bpf
> >> logic fails?
> >
> > I think both approaches are valid and it should be up to the actual
> > handler to tell what happens next. If the handler would prefer the
> > in-kernel fallback, it should be able to enforce that rather than defer
> > to a potentially unknown bpf handler up the chain.
>
> The counter-argument is that cgroups are hierarchical and higher-level
> cgroups should be able to enforce the desired behavior for their
> sub-trees. I'm not sure which consideration is more important here and
> have to think more about it.
Right, and they can enforce that through their limits - hence the oom.
> Do you have an example of when it might be important for a container not
> to pass control to a higher-level bpf handler?
Nothing really specific. I'm still trying to wrap my head around what
level of flexibility is necessary here. My initial thought was to just
deal with it in the scope of the bpf handler and fall back to the
kernel implementation if it cannot deal with the situation. Since you
brought it up, you made me think.
I know that we do not provide a userspace-like no-regression policy to
BPF programs, but it would still be good to have a way to add new
fallback policies without breaking existing handlers.
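Something along the lines of the enum floated earlier in this subthread
could leave room for that; all names beyond IGNORED, HANDLED and
PASS_TO_PARENT are illustrative:

	enum bpf_oom_status {
		BPF_OOM_IGNORED,		/* handler did nothing */
		BPF_OOM_HANDLED,		/* memory freed or a kill queued */
		BPF_OOM_PASS_TO_PARENT,		/* defer up the cgroup hierarchy */
		BPF_OOM_KERNEL_FALLBACK,	/* go straight to the in-kernel killer */
		/* new statuses can be appended without breaking old handlers */
	};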
> >> Re a single return value: I can absolutely specify return values as an
> >> enum; my point is that, unlike kernel code, we can't fully trust the
> >> value returned from a bpf program, which is why the second check is in
> >> place.
> >
> > I do not understand this. Could you elaborate? Why can we not trust the
> > return value, yet trust a combination of the return value and a state
> > stored in a helper structure?
>
> Imagine a bpf program which does nothing and simply returns 1. Imagine
> it's loaded as a system-wide oom handler. This will effectively disable
> the oom killer and lead to a potential deadlock on memory.
> But it's a perfectly valid bpf program.
> This is something I want to avoid (and guarding against it is a common
> practice with other bpf programs).
>
> What I do is also rely on the value of an oom_control field which is
> not directly writable by the bpf program, but can be changed by calling
> certain helper functions, e.g. bpf_oom_kill_process().
OK, now I can see your point. You want the trusted, BPF-facing interface
to serve as a line of defense. This makes sense to me. Maybe it would be
good to call that out more explicitly. Something like:
The BPF OOM infrastructure only trusts BPF handlers which use
pre-selected functions to free up memory, e.g. bpf_oom_kill_process().
Those will set an internal state not directly available to the
handlers. A BPF handler's return value is ignored if that state is
not set.
I would rather name this something other than freed_memory, as the
actual memory might be freed asynchronously (e.g. by the oom_reaper) and
this is more about conformity/trust than about physical memory actually
being freed. I do not care much about naming as long as this is clearly
documented, though, including the set of functions that form that
prescribed API.
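A minimal sketch of the double check being described, with hypothetical
field and callback names:

	/* The handler's return value is honored only if a trusted kfunc
	 * (e.g. bpf_oom_kill_process()) has set a flag in oom_control that
	 * bpf programs cannot write directly.
	 */
	static bool bpf_oom_try_handle(struct oom_control *oc)
	{
		int ret;

		oc->bpf_policy_applied = false;		/* hypothetical field */
		ret = bpf_oom_ops->handle_out_of_memory(oc);

		/* a do-nothing handler returning 1 fails this second check */
		return ret && oc->bpf_policy_applied;
	}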
[...]
> > OK. All I am trying to say is that the only safe sleepable locks are
> > trylocks, and that should be documented because I do not think it can
> > be enforced.
>
> It can! Not directly, but by controlling which kfuncs/helpers are
> available to bpf programs.
OK, I see. This is better than relying only on having this documented.
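To illustrate the trylock-only discipline, a kfunc exposed to these
handlers could be written along the following lines; this is a sketch
with illustrative names (only oom_lock is a real kernel symbol):

	__bpf_kfunc int bpf_oom_scan_candidates(struct mem_cgroup *memcg)
	{
		/* Never block on a lock an allocating context might hold;
		 * try it and bail out instead of risking a deadlock.
		 */
		if (!mutex_trylock(&oom_lock))
			return -EBUSY;

		/* ... inspect tasks, record a victim candidate ... */

		mutex_unlock(&oom_lock);
		return 0;
	}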
--
Michal Hocko
SUSE Labs